Fine Tuning + Reasoning + RL | hypedar

22.0

Create a stable RL training framework that rewards adherence to logical structural constraints in the reasoning process, not only the correctness of the final answer. The goal is deeper reasoning in open-source LLMs.

emerging, implementation gap

fine-tuning, rl, reasoning, rlhf

Signals (31)

Distributionally Robust Token Optimization in RLHF

SepSeq: A Training-Free Framework for Long Numerical Sequence Processing in LLMs

CAMO: A Class-Aware Minority-Optimized Ensemble for Robust Language Model Evaluation on Imbalanced Data

Cross-Lingual Transfer and Parameter-Efficient Adaptation in the Turkic Language Family: A Theoretical Framework for Low-Resource Language Models

unslothai/unsloth

The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment

Weakly Supervised Distillation of Hallucination Signals into Transformer Representations

LLM Reasoning with Process Rewards for Outcome-Guided Steps

Cross-Tokenizer LLM Distillation through a Byte-Level Interface

Xpertbench: Expert Level Tasks with Rubrics-Based Evaluation

Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

Decomposing the Delta: What Do Models Actually Learn from Preference Pairs?

LLM-Augmented Knowledge Base Construction For Root Cause Analysis

GRASS: Gradient-based Adaptive Layer-wise Importance Sampling for Memory-efficient Large Language Model Fine-tuning

Small models also found the vulnerabilities that Mythos found

Self-Execution Simulation Improves Coding Models

Sensitivity-Positional Co-Localization in GQA Transformers

Revealing the Learning Dynamics of Long-Context Continual Pre-training

NVIDIA-NeMo/DataDesigner

StaRPO: Stability-Augmented Reinforcement Policy Optimization

Understanding the Nature of Generative AI as Threshold Logic in High-Dimensional Space

Google AI, 3d ago

ConvApparel: Measuring and bridging the realism gap in user simulators

Temporally Phenotyping GLP-1RA Case Reports with Large Language Models: A Textual Time Series Corpus and Risk Modeling

Reinforcement Learning-based Knowledge Distillation with LLM-as-a-Judge

Limits of Difficulty Scaling: Hard Samples Yield Diminishing Returns in GRPO-Tuned SLMs

Consistency-Guided Decoding with Proof-Driven Disambiguation for Three-Way Logical Question Answering

Borges' cartographers and the tacit skill of reading LM output

Memory Dial: A Training Framework for Controllable Memorization in Language Models

Beyond Facts: Benchmarking Distributional Reading Comprehension in Large Language Models

shiyu-coder/Kronos

Blind Refusal: Language Models Refuse to Help Users Evade Unjust, Absurd, and Illegitimate Rules