Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

This repository contains Co-rewarding-I-Qwen2.5-3B-MATH, a Qwen2.5-3B model trained with the Co-rewarding-I method on the MATH training set. This work was presented in the paper:

Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

For the official code, training scripts, and further details, please visit the GitHub repository: tmlr-group/Co-rewarding


Co-rewarding Framework

Co-rewarding is a novel self-supervised Reinforcement Learning (RL) framework designed to improve training stability by seeking complementary supervision from multiple views. This approach addresses the common training collapse issue found in single-view self-rewarding methods. Specifically, Co-rewarding is instantiated in two ways:

  1. Co-rewarding-I: A data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions.
  2. Co-rewarding-II: A model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation.

These instantiations introduce different levels of discrepancy, making it harder for training to collapse onto trivial reasoning solutions. Empirically, Co-rewarding trains stably and significantly outperforms other self-rewarding baselines on multiple mathematical reasoning benchmarks.
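
To make the two mechanisms concrete, below is a minimal Python sketch of the underlying ideas. This is not the authors' implementation: the function names, the majority-vote pseudo labels, the binary agreement reward, and the EMA coefficient `tau` are illustrative assumptions; see the official repository for the actual training code.

```python
from collections import Counter

def majority_vote(answers):
    """Most common final answer among sampled rollouts for one question view."""
    return Counter(answers).most_common(1)[0][0]

def co_rewarding_i_rewards(answers_view_a, answers_view_b):
    """Data-side idea (Co-rewarding-I), sketched: each view's rollouts are
    scored by agreement with the majority answer from the *other* (rephrased)
    view, so a view cannot trivially reward its own collapsed outputs."""
    pseudo_label_a = majority_vote(answers_view_b)  # supervision crosses views
    pseudo_label_b = majority_vote(answers_view_a)
    rewards_a = [1.0 if a == pseudo_label_a else 0.0 for a in answers_view_a]
    rewards_b = [1.0 if b == pseudo_label_b else 0.0 for b in answers_view_b]
    return rewards_a, rewards_b

def ema_teacher_update(teacher, student, tau=0.99):
    """Model-side idea (Co-rewarding-II), sketched: the reference teacher is a
    slow exponential moving average of the student's parameters; the teacher's
    pseudo labels then serve as the reward target for the student."""
    for name, p in student.items():
        teacher[name] = tau * teacher[name] + (1.0 - tau) * p
    return teacher

# Example: rollouts on the original question vs. a rephrased version.
r_a, r_b = co_rewarding_i_rewards(["12", "12", "7"], ["12", "12", "12"])
print(r_a, r_b)  # [1.0, 1.0, 0.0] [1.0, 1.0, 1.0]
```

Because the pseudo label for each view comes from the other view, a degenerate policy that emits the same trivial answer regardless of the question no longer receives a self-consistent reward for free, which is the source of the added training stability.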

Checkpoints and Datasets

A comprehensive list of all checkpoints trained using Co-rewarding, including various model sizes and baselines on MATH, DAPO-14k, and OpenRS datasets, can be found in the Checkpoints section of the GitHub repository. The rephrased datasets are also available and linked in the GitHub README.
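
Example Usage

The checkpoint can be loaded with the standard transformers causal-LM API. The snippet below is a minimal sketch; the prompt format and generation settings are illustrative, not the evaluation setup used in the paper.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TMLR-Group-HF/Co-rewarding-I-Qwen2.5-3B-MATH"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # the checkpoint is stored in BF16
    device_map="auto",
)

# Illustrative math prompt; adapt the template to your own setup.
prompt = "Question: If 3x + 5 = 20, what is x?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```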

Citation

If you use our datasets or models, please cite our paper:

@article{zhang2025coreward,
  title={Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models},
  author={Zizhuo Zhang and Jianing Zhu and Xinmu Ge and Zihua Zhao and Zhanke Zhou and Xuan Li and Xiao Feng and Jiangchao Yao and Bo Han},
  journal={arXiv preprint arXiv:2508.00410},
  year={2025}
}