RLVR: Training Apertus 8B with GRPO on the GSM8K Dataset

Results

Validation accuracy improved from 46.41% to 66.23%.

Figure 1. Full validation set accuracy.
Figure 2. Validation set, average tokens used per sequence (capped at 512 tokens).
Figure 3. Training reward.

Compute

Training was performed on a GPU node with 4× NVIDIA H100 (95 GB), running for approximately 5 hours.


Hyperparameters

Rollouts

  • num_unique_prompts_rollout: 32
  • num_samples_per_prompt_rollout: 8
  • temperature: 0.8

Optimization

  • learning_rate: 3.0e-7
  • beta: 0.01
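For reference, GRPO's central step is computing group-relative advantages: the rewards of all rollouts for one prompt are normalized by that group's mean and standard deviation, so no learned value function is needed. Below is a minimal sketch using the rollout size above (8 samples per prompt); the function name and the example rewards are illustrative, not taken from the training code.

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages (GRPO): normalize each rollout's reward
    by the mean and standard deviation of its prompt's group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One prompt, 8 sampled completions (num_samples_per_prompt_rollout = 8);
# the reward is 1.0 for a correct final answer, 0.0 otherwise.
group = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advs = grpo_advantages(group)
```

With a binary correctness reward, correct completions in a group get a positive advantage and incorrect ones a negative advantage, and the advantages sum to zero within the group.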

Notes

  • Note: a format reward was not applied, because neither the instruct model nor the base model was able to produce a correct answer with one in place. As a result, the model does not use <think> </think> tags.
  • Funny observation: the model appears to have memorized the dataset. In one attempt, it answered the question and then, because the prompt format was unfamiliar, began reciting another question from the same dataset; another time it output HTML code, presumably from the page where it originally saw the question.
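Since the reward used here is answer correctness rather than format adherence, a verifiable reward for GSM8K can be computed by comparing the last number in a completion against the gold answer, which in GSM8K follows a "####" marker. A minimal sketch under those assumptions; the function names are illustrative, not from the training code.

```python
import re

# Matches integers and decimals, with an optional leading minus sign.
_NUM = re.compile(r"-?\d+(?:\.\d+)?")

def extract_final_number(text):
    """Return the last number in the text (commas stripped), or None."""
    matches = _NUM.findall(text.replace(",", ""))
    return matches[-1] if matches else None

def correctness_reward(completion, gold_answer):
    """1.0 if the completion's final number matches the gold answer's,
    0.0 otherwise (the binary verifiable reward assumed in this sketch)."""
    pred = extract_final_number(completion)
    gold = extract_final_number(gold_answer)
    return 1.0 if pred is not None and pred == gold else 0.0
```

A reward like this is what makes the task "verifiable": no learned reward model is involved, only string extraction and comparison.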

Acknowledgements

This work builds upon and was inspired by the following contributions:
