DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step120-reward 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step120-actor 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step110-reward 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step110-actor 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step100-reward 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step100-actor 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step90-reward 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step90-actor 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step80-reward 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step80-actor 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step70-reward 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step70-actor 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step60-reward 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step60-actor 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step50-reward 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step50-actor 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step40-reward 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step40-actor 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step30-reward 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step30-actor 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step20-reward 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step20-actor 2B • Updated May 5
DPO-RM/Qwen2.5-Math-1.5B-prime-no_logSoftmax_refRM-beta1-eurus_rl_15k-step10-reward 2B • Updated May 5