Qwen3-1.7B-SFT-RLVR-Dolci-Math (step 400)

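A Qwen3-1.7B-based model whose math reasoning was strengthened with Dolci-RL-Zero-Math-7B, an RLVR dataset from Allen AI's Olmo 3 family. This checkpoint (step 400) is the point where GSM8K, MATH (Hendrycks), and IFEval all improve at once while the loss in coding performance is smallest.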

Model Details

Base model: Qwen/Qwen3-1.7B-Base
SFT init: Qwen3-1.7B-SFT (trained on subsets of allenai/tulu-3-sft-mixture)
RLVR dataset: allenai/Dolci-RL-Zero-Math-7B
Checkpoint: step 400 (out of 200–1000, interval 200)
Evaluation: NeMo Skills (gsm8k, hendrycks_math, minerva_math, human-eval, mbpp, ifeval, ifbench)
Training pipeline & full report: llm-alignment-practice

Benchmark Results

Δ is the change relative to the SFT init.

Benchmark	SFT init	step 400	Δ
GSM8K	81.35	82.03	▲0.68
MATH (Hendrycks)	63.08	63.62	▲0.54
MATH (Minerva)	23.90	20.59	▼3.31
HumanEval (base)	62.80	61.59	▼1.21
HumanEval (plus)	55.49	53.05	▼2.44
MBPP (base)	69.31	69.31	—
MBPP (plus)	58.20	58.20	—
IFEval (avg)	54.01	56.00	▲1.99
IFEval (prompt_strict)	46.21	49.54	▲3.33
IFEval (inst_strict)	58.03	60.31	▲2.28
IFEval (prompt_loose)	49.91	51.57	▲1.66
IFEval (inst_loose)	61.87	62.59	▲0.72
IFBench (avg)	12.99	14.01	▲1.02

Bold marks the best score over the full training range (steps 200–1000). Step 400 records the best score of the whole range on GSM8K and IFEval (prompt_strict / inst_strict / prompt_loose / avg), and MBPP stays exactly at the SFT init level.
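The Δ column can be reproduced from the two score columns. The sketch below recomputes it for a few rows using values copied from the table; the names `scores` and `format_delta` are illustrative only and are not part of the evaluation pipeline.

```python
# Scores copied from the benchmark table: (SFT init, step 400).
scores = {
    "GSM8K": (81.35, 82.03),
    "MATH (Hendrycks)": (63.08, 63.62),
    "MATH (Minerva)": (23.90, 20.59),
    "HumanEval (plus)": (55.49, 53.05),
    "MBPP (base)": (69.31, 69.31),
    "IFEval (avg)": (54.01, 56.00),
}


def format_delta(sft: float, step: float) -> str:
    """Return the change relative to SFT init using the table's arrow notation."""
    delta = round(step - sft, 2)  # round away floating-point noise
    if delta > 0:
        return f"▲{delta:.2f}"
    if delta < 0:
        return f"▼{-delta:.2f}"
    return "unchanged"


deltas = {name: format_delta(sft, step) for name, (sft, step) in scores.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta}")
```

Rounding to two decimals before formatting keeps the output aligned with the table, since subtracting the raw floats leaves small binary rounding errors.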

Intended Use

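Math- and reasoning-focused tasks (GSM8K- and MATH-style problems)
Cases where you also want a modest gain in instruction following (a property of the Dolci dataset)
Cases where you want to strengthen math on a small RLVR budget without significantly degrading general capabilities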

Comparison with Sibling Models

Comparison of step 400 with other RLVR variants trained from the same SFT init (each cell shows the score and its Δ vs. the SFT init):

Benchmark	Dolci-Math (step 400)	RLVR-Math (step 300)	RLVR-IF (step 2700)
GSM8K	82.03 ▲0.68	82.79 ▲1.44	74.07 ▼7.28
MATH (Hendrycks)	63.62 ▲0.54	61.82 ▼1.26	—
HumanEval (base)	61.59 ▼1.21	65.85 ▲3.05	56.71 ▼6.09
IFEval (avg)	56.00 ▲1.99	55.89 ▲1.88	73.98 ▲19.97

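Dolci-Math leads RLVR-Math (RLVR-MATH + RLVR-GSM) on MATH (Hendrycks), 63.62 vs. 61.82, while RLVR-Math leads on coding (HumanEval base 65.85 vs. 61.59). If maximizing IFEval is the goal, the RLVR-IF model is recommended.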
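The recommendation above amounts to picking the variant with the highest score on the benchmark you care about. A minimal sketch, using only the numbers from the comparison table; `comparison` and `best_variant` are hypothetical names introduced for illustration, and the checkpoints compared are the ones listed in the table header (steps 400, 300, and 2700).

```python
# Scores copied from the comparison table; None marks a cell with no reported value.
comparison = {
    "GSM8K": {"Dolci-Math": 82.03, "RLVR-Math": 82.79, "RLVR-IF": 74.07},
    "MATH (Hendrycks)": {"Dolci-Math": 63.62, "RLVR-Math": 61.82, "RLVR-IF": None},
    "HumanEval (base)": {"Dolci-Math": 61.59, "RLVR-Math": 65.85, "RLVR-IF": 56.71},
    "IFEval (avg)": {"Dolci-Math": 56.00, "RLVR-Math": 55.89, "RLVR-IF": 73.98},
}


def best_variant(benchmark: str) -> str:
    """Return the variant with the highest reported score on one benchmark."""
    reported = {
        model: score
        for model, score in comparison[benchmark].items()
        if score is not None  # skip variants without a reported score
    }
    return max(reported, key=reported.get)


for benchmark in comparison:
    print(f"{benchmark}: {best_variant(benchmark)}")
```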

Limitations

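MATH (Minerva): the largest drop, at ▼3.31. The RLVR reward may not align well with Minerva-style answer formats.
HumanEval: a consistent drop (plus ▼2.44). The drop grows as training continues (▼3.66 at step 1000), so step 400 is the point where the coding loss is still under control.
Absolute performance is limited by the 1.7B model scale; not recommended for hard reasoning tasks.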
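One way to picture the format-mismatch hypothesis in the Minerva item above: a verifiable reward can only score what its answer extractor manages to parse. The sketch below is a hypothetical strict reward, not the reward used to train this model. The `\boxed{...}` convention and the helpers `extract_boxed` and `strict_reward` are assumptions made purely for illustration.

```python
import re


def extract_boxed(text: str):
    # Return the content of the last \boxed{...} span, or None if there is none.
    # Nested braces are not handled in this simplified sketch.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1].strip() if matches else None


def strict_reward(response: str, reference: str) -> float:
    # Hypothetical reward: 1.0 only when the boxed answer matches the reference string exactly.
    answer = extract_boxed(response)
    return 1.0 if answer is not None and answer == reference.strip() else 0.0


# A boxed answer that matches the reference is rewarded.
print(strict_reward(r"So the total is \boxed{42}.", "42"))  # 1.0
# The same answer stated in prose is not parsed, so it earns nothing.
print(strict_reward("The final answer is 42.", "42"))  # 0.0
# An equivalent value in a different notation also fails an exact string match.
print(strict_reward(r"\boxed{0.5}", r"\frac{1}{2}"))  # 0.0
```

Under a reward like this, a policy is pushed toward one answer format, which could hurt on a benchmark whose grader expects a different one. Whether that explains the Minerva drop here is not established by the reported numbers.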

See the training report for a detailed per-step analysis and the comparison experiment (RLVR-MATH + RLVR-GSM).

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ny1031/Qwen3-1.7B-SFT-RLVR-Dolci-Math"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" requires the accelerate package.
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")

messages = [{"role": "user", "content": "Find all integer solutions to x^2 + y^2 = 25."}]
# Build the chat prompt; return_dict=True also returns the attention mask.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))

License

MIT
