---
license: mit
pipeline_tag: text-generation
library_name: transformers
---

# Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

This repository contains the **CoReward-Qwen2.5-3B** model, a Qwen2.5-3B model trained with the Co-rewarding method on the MATH training set.

This work was presented in the paper [**Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models**](https://huggingface.co/papers/2508.00410).

For the official code repository, training scripts, and further details, please visit [**tmlr-group/Co-rewarding** on GitHub](https://github.com/tmlr-group/Co-rewarding).


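## Quick Start

The model can be loaded with 🤗 Transformers as a standard causal language model. The snippet below is a minimal generation sketch: the repository id, prompt, and sampling settings are illustrative assumptions, not official recommendations from the authors.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repository id for this checkpoint; see the checkpoint list below for the exact name.
model_id = "tmlr-group/CoReward-Qwen2.5-3B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Qwen2.5 models expect a chat-formatted prompt, so apply the chat template.
messages = [
    {"role": "user", "content": "If 3x + 5 = 20, what is x? Show your reasoning step by step."}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Sampling settings here are illustrative, not tuned values from the paper.
output_ids = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.6, top_p=0.95)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```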
## Method Overview

![Co-rewarding Framework](https://github.com/tmlr-group/Co-rewarding/raw/main/figs/Method.png)

**Co-rewarding** is a novel self-supervised reinforcement learning (RL) framework designed to improve training stability by seeking complementary supervision from multiple views. This approach addresses the training collapse commonly observed in single-view self-rewarding methods. Co-rewarding is instantiated in two ways:

1. **Co-rewarding-I**: a data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions.
2. **Co-rewarding-II**: a model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation.

These instantiations introduce different levels of discrepancy, making it harder for training to collapse onto trivial reasoning solutions. Empirically, Co-rewarding trains stably and significantly outperforms other self-rewarding baselines on multiple mathematical reasoning benchmarks. An illustrative sketch of both reward signals is given at the end of this card.

## Checkpoints and Datasets

A comprehensive list of all checkpoints trained with Co-rewarding, covering various model sizes and baselines on the MATH, DAPO-14k, and OpenRS datasets, is available in the [Checkpoints section of the GitHub repository](https://github.com/tmlr-group/Co-rewarding#checkpoints). The rephrased datasets are also linked in the GitHub README.

## Citation

If you use our datasets or models, please cite our paper:

```bibtex
@article{zhang2025coreward,
  title={Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models},
  author={Zizhuo Zhang and Jianing Zhu and Xinmu Ge and Zihua Zhao and Zhanke Zhou and Xuan Li and Xiao Feng and Jiangchao Yao and Bo Han},
  journal={arXiv preprint arXiv:2508.00410},
  year={2025}
}
```
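## Reward Signal Sketch (Illustrative)

For intuition only, the sketch below shows one way the two reward signals described above could be computed. It is a simplified paraphrase, not the official implementation: the function names, the majority-vote pseudo-labeling rule, and the EMA momentum value are assumptions. Refer to the GitHub repository for the actual training code.

```python
from collections import Counter

def majority_vote(answers):
    """Pick the most frequent final answer among sampled rollouts (assumed pseudo-labeling rule)."""
    return Counter(answers).most_common(1)[0][0]

# Co-rewarding-I (data side): rollouts on a rephrased question provide the
# pseudo label that rewards rollouts on the original question, and vice versa.
def co_rewarding_i_rewards(answers_original, answers_rephrased):
    pseudo_from_rephrased = majority_vote(answers_rephrased)
    pseudo_from_original = majority_vote(answers_original)
    rewards_original = [float(a == pseudo_from_rephrased) for a in answers_original]
    rewards_rephrased = [float(a == pseudo_from_original) for a in answers_rephrased]
    return rewards_original, rewards_rephrased

# Co-rewarding-II (model side): a slowly-updated reference teacher supplies the
# pseudo label, and the teacher tracks the student via an EMA-style update.
def co_rewarding_ii_rewards(student_answers, teacher_answers):
    pseudo_label = majority_vote(teacher_answers)
    return [float(a == pseudo_label) for a in student_answers]

def ema_update(teacher_params, student_params, momentum=0.99):
    """Slow teacher update; the momentum value here is an assumption."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher_params, student_params)]
```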