# RLVR Training of Apertus-8B with GRPO on the GSM8K Dataset

## Results
Validation accuracy improved from 46.41% to 66.23%.
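
Accuracy here is the standard GSM8K exact-match criterion: the final numeric answer is extracted from the completion and compared against the reference answer stored after the `####` marker in the dataset. Below is a minimal sketch of such a check; the helper names and extraction regex are illustrative, not the exact evaluation code used in this run.

```python
import re


def extract_final_number(text: str) -> str | None:
    """Return the last number appearing in a model completion, if any."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None


def gsm8k_reference(answer_field: str) -> str:
    """GSM8K stores the gold answer after a '####' marker."""
    return answer_field.split("####")[-1].strip().replace(",", "")


def exact_match_reward(completion: str, answer_field: str) -> float:
    """Verifiable reward: 1.0 iff the extracted number matches the gold answer."""
    pred = extract_final_number(completion)
    return float(pred is not None and pred == gsm8k_reference(answer_field))


# Example: reward is 1.0 only when the final number equals the reference.
print(exact_match_reward("... so she has 18 eggs left. The answer is 18.", "#### 18"))
```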
## Compute

Training was performed on a GPU node with 4× NVIDIA H100 (95 GB), running for approximately 5 hours.
## Hyperparameters

| Rollouts | Value |
|---|---|
| num_unique_prompts_rollout | 32 |
| num_samples_per_prompt_rollout | 8 |
| temperature | 0.8 |

| Optimization | Value |
|---|---|
| learning_rate | 3.0e-7 |
| beta | 0.01 |
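
The rollout settings above give 32 × 8 = 256 sampled completions per training step; GRPO computes advantages by normalizing rewards within each group of 8 samples drawn for the same prompt, with beta weighting a KL penalty toward the reference policy. The following is a minimal sketch of that group-normalized advantage computation under these assumptions, not the open-instruct implementation.

```python
import numpy as np

NUM_UNIQUE_PROMPTS = 32   # num_unique_prompts_rollout
SAMPLES_PER_PROMPT = 8    # num_samples_per_prompt_rollout


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO: normalize rewards within each group of samples for the same prompt.

    `rewards` has shape (num_prompts, samples_per_prompt), e.g. (32, 8) here,
    with each entry being the verifiable 0/1 exact-match reward.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


# Example: one rollout step with random 0/1 rewards for 32 prompts x 8 samples.
rng = np.random.default_rng(0)
rewards = rng.integers(0, 2, size=(NUM_UNIQUE_PROMPTS, SAMPLES_PER_PROMPT)).astype(float)
advantages = grpo_advantages(rewards)
print(advantages.shape)  # (32, 8); groups where all samples agree get ~zero advantage
```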
## Notes

- The format reward was not applied, because neither the instruct nor the base model was able to produce a correct answer with it enabled. As a result, the model does not use `<think> </think>` tags.
- Amusing observation: the model has memorized parts of the dataset. In one attempt it answered the question, but because the prompt format was unfamiliar, it went on to recite another question from the same dataset; in another attempt it output HTML code, presumably from the page where it originally saw the question.
## Acknowledgements
This work builds upon and was inspired by the following contributions:
- RLVR: Verifiable Rewards for Reasoning Models — for introducing the verifiable reward framework used in this experiment.
- Allen Institute for AI — Open Instruct — for providing open-source infrastructure for RLHF/RLVR training.
- Apertus Project — for releasing the Apertus-8B base and instruct models used in this work.
## Model tree for ABaroian/Apertus-8B-RLVR-GSM

- Base model: swiss-ai/Apertus-8B-2509
- Finetuned: swiss-ai/Apertus-8B-Instruct-2509