ExGRPO: Learning to Reason from Experience
This repository contains the ExGRPO-LUFFY-7B-Continual model, trained with the ExGRPO framework presented in the paper ExGRPO: Learning to Reason from Experience. ExGRPO (Experiential Group Relative Policy Optimization) improves the reasoning ability of large language models (LLMs) by addressing the computational inefficiency and instability of standard on-policy training in Reinforcement Learning from Verifiable Rewards (RLVR).
ExGRPO strategically manages and replays high-value experiences through bucketed prioritization and mixed-policy optimization, enabling more efficient and stable RLVR training.
Code: https://github.com/ElliottYan/LUFFY/tree/main/ExGRPO
Introduction
Existing RLVR methods for reasoning tasks predominantly rely on on-policy optimization, which discards online rollouts after a single update, wasting valuable exploration signals and constraining scalability. We conduct a systematic analysis of experience utility in RLVR and identify question difficulty and trajectory entropy as effective online proxies for assessing experience quality. Building on these insights, we propose ExGRPO, a novel framework that strategically manages and replays high-value experiences through bucketed prioritization and mixed-policy optimization, enabling more efficient and stable RLVR training.
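The experience management idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the `Trajectory` class, the Gaussian-shaped `bucket_weight` centred on a 0.5 correctness rate, and the choice to replay the lowest-entropy trajectory are all assumptions made for the example. Refer to the paper and code repository for the actual procedure.

```python
import math
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Trajectory:
    text: str       # a stored rollout that was verified as correct
    entropy: float  # mean token entropy, assumed to be precomputed


def bucket_by_correctness(stats):
    """Group question ids by rollout correctness rate (a difficulty proxy).

    stats maps question id -> (num_correct, num_rollouts).
    """
    buckets = defaultdict(list)
    for qid, (num_correct, num_rollouts) in stats.items():
        buckets[round(num_correct / num_rollouts, 3)].append(qid)
    return dict(buckets)


def bucket_weight(rate, center=0.5, sigma=1.0):
    """Hypothetical weighting that favours medium-difficulty buckets."""
    return math.exp(-((rate - center) ** 2) / (2 * sigma ** 2))


def sample_questions(buckets, k, rng):
    """Draw k question ids: choose a bucket by weight, then a question uniformly."""
    rates = sorted(buckets)
    weights = [bucket_weight(r) for r in rates]
    picked = []
    for _ in range(k):
        rate = rng.choices(rates, weights=weights, k=1)[0]
        picked.append(rng.choice(buckets[rate]))
    return picked


def select_trajectory(trajectories):
    """Assumption: replay the stored trajectory with the lowest entropy."""
    return min(trajectories, key=lambda t: t.entropy)
```

Sampling is done with replacement here for simplicity, so the same question can appear more than once in a replay batch.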
Key Highlights:
- Experience Value Modeling: Introduces two online proxy metrics, rollout correctness and trajectory entropy, for quantifying the value of RLVR experience.
- ExGRPO Framework: Built on top of GRPO, ExGRPO introduces a systematic experience management mechanism and an experience optimization objective to maximize the benefit of past explorations.
- Generalization and Stability: Demonstrates broad applicability across different backbone models and mitigates training collapse of on-policy RLVR in challenging scenarios.
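Since ExGRPO builds on GRPO, it may help to recall how GRPO scores a group of rollouts for one question: each reward is normalized against the mean and standard deviation of its own group. The sketch below shows only that standard normalization. The function name, the use of the population standard deviation, and the `eps` value are assumptions for illustration, and the ExGRPO experience objective itself is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of one rollout group to group-relative advantages."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes give a learning signal with opposite signs.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group that is entirely correct (or entirely wrong) yields all-zero advantages,
# so such questions contribute no gradient on their own.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
```

The all-zero case is one reason rollout correctness is a useful difficulty proxy: questions that are always or never solved provide little signal under group normalization.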
Getting Started
Installation
You can install dependencies by running the following commands:
conda create -n exgrpo python=3.10
conda activate exgrpo
cd exgrpo
pip install -r requirements.txt
pip install -e .
cd verl
pip install -e .
Note: If you encounter issues caused by the pyairports library, please refer to this hot-fix solution.
For the flash-attn library, we use the v2.7.4.post1 release and recommend installing it from the pre-built wheel. Adjust the wheel to match your CUDA, PyTorch, and Python versions.
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.4.post1/flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.7.4.post1+cu12torch2.4cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
Usage
Data Preparation
First, run the data preparation script to produce the training data in Parquet format.
cd data
python prepare_train.py --dataset_name Elliott/Openr1-Math-46k-8192 --output_file openr1.parquet
Note: Although we use the OpenR1 data, only the question field is used in RLVR. The ExGRPO data processing pipeline does not incorporate the external R1 trajectories during training.
Evaluation
Main Results
Figure: Zero RLVR on Qwen2.5-Math-7B and continual RLVR on LUFFY.
Figure: Zero RLVR on Llama3.1-8B (Base, Instruct), Qwen2.5-Math-1.5B Base, and Qwen2.5-7B Instruct (model extension results).
Released Models
| Model | Hugging Face | Base Model |
|---|---|---|
| ExGRPO-Qwen2.5-Math-7B-Zero | https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-7B-Zero | Qwen2.5-Math-7B |
| ExGRPO-LUFFY-7B-Continual | https://huggingface.co/rzzhan/ExGRPO-LUFFY-7B-Continual | LUFFY-Qwen-Math-7B-Zero |
| ExGRPO-Qwen2.5-7B-Instruct | https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-7B-Instruct | Qwen2.5-7B Instruct |
| ExGRPO-Qwen2.5-Math-1.5B-Zero | https://huggingface.co/rzzhan/ExGRPO-Qwen2.5-Math-1.5B-Zero | Qwen2.5-Math-1.5B |
| ExGRPO-Llama3.1-8B-Zero | https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Zero | Llama3.1-8B |
| ExGRPO-Llama3.1-8B-Instruct | https://huggingface.co/rzzhan/ExGRPO-Llama3.1-8B-Instruct | Llama3.1-8B Instruct |
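
Any of the released checkpoints can be loaded with the Transformers library. The following is a minimal sketch, assuming the checkpoint ships a chat template and that enough GPU or CPU memory is available for a 7B model. The helper names `build_messages` and `generate_answer` and the `max_new_tokens` default are illustrative choices, not part of the release.

```python
MODEL_ID = "rzzhan/ExGRPO-LUFFY-7B-Continual"  # any id from the table above


def build_messages(question):
    """Wrap a question in the chat format expected by apply_chat_template."""
    return [{"role": "user", "content": question}]


def generate_answer(question, model_id=MODEL_ID, max_new_tokens=512):
    """Download the checkpoint (on first use) and generate one response."""
    # Imported lazily so build_messages works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    inputs = tokenizer.apply_chat_template(
        build_messages(question),
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    ).to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    prompt_len = inputs["input_ids"].shape[-1]
    # Decode only the newly generated tokens, not the prompt.
    return tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)


# Example call (downloads the weights):
# print(generate_answer("What is the sum of the first 10 positive integers?"))
```

Reasoning models often produce long outputs, so `max_new_tokens` may need to be raised for harder problems.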
Citation
If you find our model, data, or evaluation code useful, please cite our paper:
@article{zhan2025exgrpo,
  title   = {ExGRPO: Learning to Reason from Experience},
  author  = {Runzhe Zhan and Yafu Li and Zhi Wang and Xiaoye Qu and Dongrui Liu and Jing Shao and Derek F. Wong and Yu Cheng},
  year    = {2025},
  journal = {ArXiv preprint},
  volume  = {2510.02245},
  url     = {https://arxiv.org/abs/2510.02245},
}