- WPO: Enhancing RLHF with Weighted Preference Optimization (arXiv:2406.11827)
- Self-Improving Robust Preference Optimization (arXiv:2406.01660)
- Bootstrapping Language Models with DPO Implicit Rewards (arXiv:2406.09760)
- BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM (arXiv:2406.12168)
- Understanding and Diagnosing Deep Reinforcement Learning (arXiv:2406.16979)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (arXiv:2406.18629)
- Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation (arXiv:2406.18676)
- Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning (arXiv:2407.00782)
- Direct Preference Knowledge Distillation for Large Language Models (arXiv:2406.19774)
- Understanding Reference Policies in Direct Preference Optimization (arXiv:2407.13709)
- Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning (arXiv:2407.18248)
- JudgeBench: A Benchmark for Evaluating LLM-based Judges (arXiv:2410.12784)