resistz and nielsr (HF Staff) committed

Commit 38b633f · verified · 1 Parent(s): 6604b46

Update model card for CoReward-Qwen2.5-3B: Add metadata, paper link, and fix GitHub URL (#1)


- Update model card for CoReward-Qwen2.5-3B: Add metadata, paper link, and fix GitHub URL (d02e816bfc25d1ba8a013867287132a8ccecc88c)


Co-authored-by: Niels Rogge <[email protected]>

Files changed (1)
  1. README.md +34 -6
README.md CHANGED
@@ -1,18 +1,46 @@
  ---
  license: mit
+ pipeline_tag: text-generation
+ library_name: transformers
  ---
- ## CoReward-Qwen2.5-3B

- This is the Qwen2.5-3B model trained by Co-Reward method using MATH training set.
+ # Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models

- If you are interested in Co-Reward, you can find more details on our Github Repo [https://github.com/tmlr-group/Co-Reward].
+ This repository contains the **CoReward-Qwen2.5-3B** model, a Qwen2.5-3B model trained using the Co-rewarding method on the MATH training set. This work was presented in the paper:
+
+ [**Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models**](https://huggingface.co/papers/2508.00410)
+
+ For the official code repository, training scripts, and further details, please visit:
+ [**GitHub: tmlr-group/Co-rewarding**](https://github.com/tmlr-group/Co-rewarding)
+
+ <p align="center">
+ <a href="https://arxiv.org/pdf/2508.00410">
+ <img alt="arXiv" src="https://img.shields.io/badge/arXiv-2508.00410-b31b1b?logo=arxiv&logoColor=white" height="20">
+ </a>
+ &nbsp;&nbsp;
+ <a href="https://github.com/tmlr-group/Co-rewarding/stargazers">
+ <img alt="GitHub Stars" src="https://img.shields.io/github/stars/resistzzz/Co-rewarding?style=social" height="20">
+ </a>
+ </p>
+
+ ![Co-rewarding Framework](https://github.com/tmlr-group/Co-rewarding/raw/main/figs/Method.png)
+
+ **Co-rewarding** is a novel self-supervised Reinforcement Learning (RL) framework designed to improve training stability by seeking complementary supervision from multiple views. This approach addresses the common training collapse issue found in single-view self-rewarding methods. Specifically, Co-rewarding is instantiated in two ways:
+ 1. **Co-rewarding-I**: A data-side instantiation that derives reward signals from contrastive agreement across semantically analogous questions.
+ 2. **Co-rewarding-II**: A model-side instantiation that maintains a slowly-updated reference teacher with pseudo labels to realize self-distillation.
+
+ These instantiations introduce different levels of discrepancy, making it harder for the training to collapse on trivial reasoning solutions. Empirically, Co-rewarding demonstrates stable training and significantly outperforms other self-rewarding baselines on multiple mathematical reasoning benchmarks.
+
+ ## Checkpoints and Datasets
+ A comprehensive list of all checkpoints trained using Co-rewarding, including various model sizes and baselines on MATH, DAPO-14k, and OpenRS datasets, can be found in the [Checkpoints section of the GitHub repository](https://github.com/tmlr-group/Co-rewarding#checkpoints). The rephrased datasets are also available and linked in the GitHub README.

  ## Citation
- ```
+ If you use our datasets or models, please cite our paper:
+ ```bibtex
  @article{zhang2025coreward,
- title={Co-Reward: Self-supervised Reinforcement Learning for Large Language Model Reasoning via Contrastive Agreement},
+ title={Co-rewarding: Stable Self-supervised RL for Eliciting Reasoning in Large Language Models},
  author={Zizhuo Zhang and Jianing Zhu and Xinmu Ge and Zihua Zhao and Zhanke Zhou and Xuan Li and Xiao Feng and Jiangchao Yao and Bo Han},
- journal={arXiv preprint arXiv:2508.00410}
+ journal={arXiv preprint arXiv:2508.00410},
  year={2025},
  }
  ```
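
For context on the metadata this commit adds (`library_name: transformers`, `pipeline_tag: text-generation`), here is a minimal loading sketch. It is not part of the committed model card; the Hub repo ID and the prompt are placeholders, not taken from the diff, and `device_map="auto"` assumes the `accelerate` package is installed.

```python
# Minimal sketch: load a CoReward-Qwen2.5-3B checkpoint with transformers
# and run plain text generation, matching pipeline_tag: text-generation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<org>/CoReward-Qwen2.5-3B"  # placeholder repo ID; replace with the actual one

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",   # pick the checkpoint's native precision
    device_map="auto",    # requires accelerate; spreads the model over available devices
)

prompt = "What is the sum of the first 10 positive integers? Show your reasoning."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```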