# RLVR Training of Apertus-8B with GRPO on the GSM8K Dataset

## Results
Validation accuracy improved from 46.41% to 66.23%.
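
Accuracy here is the standard GSM8K exact-match criterion: the final numeric answer is extracted from the completion and compared against the reference answer stored after the `####` marker in the dataset. Below is a minimal sketch of such a check; the helper names and extraction regex are illustrative, not the exact evaluation code used in this run.

```python
import re


def extract_final_number(text: str) -> str | None:
    """Return the last number appearing in a model completion, if any."""
    matches = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return matches[-1].replace(",", "") if matches else None


def gsm8k_reference(answer_field: str) -> str:
    """GSM8K stores the gold answer after a '####' marker."""
    return answer_field.split("####")[-1].strip().replace(",", "")


def exact_match_reward(completion: str, answer_field: str) -> float:
    """Verifiable reward: 1.0 iff the extracted number matches the gold answer."""
    pred = extract_final_number(completion)
    return float(pred is not None and pred == gsm8k_reference(answer_field))


# Example: reward is 1.0 only when the final number equals the reference.
print(exact_match_reward("... so she has 18 eggs left. The answer is 18.", "#### 18"))
```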
## Compute

Training was performed on a GPU node with 4× NVIDIA H100 (95 GB), running for approximately 5 hours.
## Hyperparameters

| Rollouts | Value |
|---|---|
| num_unique_prompts_rollout | 32 |
| num_samples_per_prompt_rollout | 8 |
| temperature | 0.8 |

| Optimization | Value |
|---|---|
| learning_rate | 3.0e-7 |
| beta | 0.01 |
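
The rollout settings above give 32 × 8 = 256 sampled completions per training step; GRPO computes advantages by normalizing rewards within each group of 8 samples drawn for the same prompt, with beta weighting a KL penalty toward the reference policy. The following is a minimal sketch of that group-normalized advantage computation under these assumptions, not the open-instruct implementation.

```python
import numpy as np

NUM_UNIQUE_PROMPTS = 32   # num_unique_prompts_rollout
SAMPLES_PER_PROMPT = 8    # num_samples_per_prompt_rollout


def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """GRPO: normalize rewards within each group of samples for the same prompt.

    `rewards` has shape (num_prompts, samples_per_prompt), e.g. (32, 8) here,
    with each entry being the verifiable 0/1 exact-match reward.
    """
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    return (rewards - mean) / (std + eps)


# Example: one rollout step with random 0/1 rewards for 32 prompts x 8 samples.
rng = np.random.default_rng(0)
rewards = rng.integers(0, 2, size=(NUM_UNIQUE_PROMPTS, SAMPLES_PER_PROMPT)).astype(float)
advantages = grpo_advantages(rewards)
print(advantages.shape)  # (32, 8); groups where all samples agree get ~zero advantage
```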
## Notes

- The format reward was not applied, because neither the instruct nor the base model was able to produce a correct answer with it enabled. As a result, the model does not use `<think> </think>` tags.
- Amusing observation: the model has memorized parts of the dataset. In one attempt it answered the question, but because the prompt format was unfamiliar, it went on to recite another question from the same dataset; in another attempt it output HTML code, presumably from the page where it originally saw the question.
## Acknowledgements
This work builds upon and was inspired by the following contributions:
- RLVR: Verifiable Rewards for Reasoning Models — for introducing the verifiable reward framework used in this experiment.
- Allen Institute for AI — Open Instruct — for providing open-source infrastructure for RLHF/RLVR training.
- Apertus Project — for releasing the Apertus-8B base and instruct models used in this work.
## Model tree for ABaroian/Apertus-8B-RLVR-GSM

- Base model: swiss-ai/Apertus-8B-2509
- Finetuned: swiss-ai/Apertus-8B-Instruct-2509