- WPO: Enhancing RLHF with Weighted Preference Optimization (arXiv:2406.11827)
- Self-Improving Robust Preference Optimization (arXiv:2406.01660)
- Bootstrapping Language Models with DPO Implicit Rewards (arXiv:2406.09760)
- BPO: Supercharging Online Preference Learning by Adhering to the Proximity of Behavior LLM (arXiv:2406.12168)
- Understanding and Diagnosing Deep Reinforcement Learning (arXiv:2406.16979)
- Step-DPO: Step-wise Preference Optimization for Long-chain Reasoning of LLMs (arXiv:2406.18629)
- Understand What LLM Needs: Dual Preference Alignment for Retrieval-Augmented Generation (arXiv:2406.18676)
- Step-Controlled DPO: Leveraging Stepwise Error for Enhanced Mathematical Reasoning (arXiv:2407.00782)
- Direct Preference Knowledge Distillation for Large Language Models (arXiv:2406.19774)
- Understanding Reference Policies in Direct Preference Optimization (arXiv:2407.13709)
- Self-Training with Direct Preference Optimization Improves Chain-of-Thought Reasoning (arXiv:2407.18248)
- JudgeBench: A Benchmark for Evaluating LLM-based Judges (arXiv:2410.12784)