Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences Paper • 2404.03715 • Published Apr 4, 2024 • 62
Exploratory Preference Optimization: Harnessing Implicit Q*-Approximation for Sample-Efficient RLHF Paper • 2405.21046 • Published May 31, 2024 • 4
Interpretable Preferences via Multi-Objective Reward Modeling and Mixture-of-Experts Paper • 2406.12845 • Published Jun 18, 2024 • 1
Trajectory Bellman Residual Minimization: A Simple Value-Based Method for LLM Reasoning Paper • 2505.15311 • Published May 21, 2025
Breaking the Capability Ceiling of LLM Post-Training by Reintroducing Markov States Paper • 2603.19987 • Published 13 days ago • 9
Understanding Behavior Cloning with Action Quantization Paper • 2603.20538 • Published 12 days ago • 2
Semi-Supervised Reward Modeling via Iterative Self-Training Paper • 2409.06903 • Published Sep 10, 2024 • 1
Towards Understanding the Fragility of Multilingual LLMs against Fine-Tuning Attacks Paper • 2410.18210 • Published Oct 23, 2024
MergeBench: A Benchmark for Merging Domain-Specialized LLMs Paper • 2505.10833 • Published May 16, 2025 • 2
Scalable Data Synthesis for Computer Use Agents with Step-Level Filtering Paper • 2512.10962 • Published Nov 22, 2025 • 3
Localize-and-Stitch: Efficient Model Merging via Sparse Task Arithmetic Paper • 2408.13656 • Published Aug 24, 2024
Beyond 'Aha!': Toward Systematic Meta-Abilities Alignment in Large Reasoning Models Paper • 2505.10554 • Published May 15, 2025 • 120
Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL Paper • 2505.02391 • Published May 5, 2025 • 25
BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation Paper • 2502.03860 • Published Feb 6, 2025 • 25