resistz and nielsr (HF Staff) committed

Commit 38b633f · verified · 1 Parent(s): 6604b46

Update model card for CoReward-Qwen2.5-3B: Add metadata, paper link, and fix GitHub URL (#1)


- Update model card for CoReward-Qwen2.5-3B: Add metadata, paper link, and fix GitHub URL (d02e816bfc25d1ba8a013867287132a8ccecc88c)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +34 -6
README.md CHANGED
@@ -1,18 +1,46 @@
  ---
  license: mit
+ pipeline_tag: text-generation
+ library_name: transformers
  ---
- ## CoReward-Qwen2.5-3B

- This is the Qwen2.5-3B model trained by Co-Reward method using MATH training set.
+ # Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

- If you are interested in Co-Reward, you can find more details on our Github Repo [https://github.com/tmlr-group/Co-Reward].
+ This repository contains the **CoReward-Qwen2.5-3B** model, a Qwen2.5-3B model trained using the Co-rewarding method on the MATH training set. This work was presented in the paper:
+
+ [**Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models**](https://huggingface.co/papers/2508.00410)
+
+ For the official code repository, training scripts, and further details, please visit:
+ [**GitHub: tmlr-group/Co-rewarding**](https://github.com/tmlr-group/Co-rewarding)
+
+ <p align="center">
+ <a href="https://arxiv.org/pdf/2508.00410">
+ <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2508.00410-b31b1b?logo=arxiv&logoColor=white" height="20">
+ </a>
+ &nbsp;&nbsp;
+ <a href="https://github.com/tmlr-group/Co-rewarding/stargazers">
+ <img alt="GitHub Stars" src="https://img.shields.io/github/stars/resistzzz/Co-rewarding?style=social" height="20">
+ </a>
+ </p>
+
+ ![Co-rewarding Framework](https://github.com/tmlr-group/Co-rewarding/raw/main/figs/Method.png)
+
+ **Co-rewarding** is a novel self-supervised Reinforcement Learning (RL) framework designed to improve training stability by seeking complementary supervision from multiple views. This approach addresses the common training collapse issue found in single-view self-rewarding methods. Specifically, Co-rewarding is instantiated in two ways:
+ 1. **Co-rewarding-I**: A data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions.
+ 2. **Co-rewarding-II**: A model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation.
+
+ These instantiations introduce different levels of discrepancy, making it harder for the training to collapse on trivial reasoning solutions. Empirically, Co-rewarding demonstrates stable training and significantly outperforms other self-rewarding baselines on multiple mathematical reasoning benchmarks.
+
+ ## Checkpoints and Datasets
+ A comprehensive list of all checkpoints trained using Co-rewarding, including various model sizes and baselines on MATH, DAPO-14k, and OpenRS datasets, can be found in the [Checkpoints section of the GitHub repository](https://github.com/tmlr-group/Co-rewarding#checkpoints). The rephrased datasets are also available and linked in the GitHub README.

  ## Citation
- ```
+ If you use our datasets or models, please cite our paper:
+ ```bibtex
  @article{zhang2025coreward,
- title={Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement},
+ title={Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models},
  author={Zizhuo Zhang and Jianing Zhu and Xinmu Ge and Zihua Zhao and Zhanke Zhou and Xuan Li and Xiao Feng and Jiangchao Yao and Bo Han},
- journal={arXiv preprint arXiv:2508.00410}
+ journal={arXiv preprint arXiv:2508.00410},
  year={2025},
  }
  ```
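
For context on the metadata this commit adds (`library_name: transformers`, `pipeline_tag: text-generation`), here is a minimal loading sketch. It is not part of the committed model card; the Hub repo ID and the prompt are placeholders, not taken from the diff, and `device_map="auto"` assumes the `accelerate` package is installed.

```python
# Minimal sketch: load a CoReward-Qwen2.5-3B checkpoint with transformers
# and run plain text generation, matching pipeline_tag: text-generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/CoReward-Qwen2.5-3B"  # placeholder repo ID; replace with the actual one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # pick the checkpoint's native precision
    device_map="auto",    # requires accelerate; spreads the model over available devices
)

prompt = "What is the sum of the first 10 positive integers? Show your reasoning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```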