Qwen3-1.7B-Base + Polaris RL (PLAIN SGD, step 300, peak val)
This is the peak-validation checkpoint (step 300/1550) of a plain-SGD RL fine-tune of Qwen/Qwen3-1.7B-Base on POLARIS-Project/Polaris-Dataset-53K. Released as part of an ICML-2026 study on the low-rank structure of SGD vs Adam RL updates for batched-LoRA inference.
Why this checkpoint exists
The motivation is to compare the SVD-compressibility of ΔW = W_ft − W_base between SGD-trained and Adam-trained RL fine-tunes. Existing open RL fine-tunes (POLARIS, Skywork-OR1, DeepCoder, AceReason, ORZ, DAPO) are all Adam-trained, so we needed a same-base, same-recipe SGD counterpart. POLARIS-1.7B-Preview is the upstream Adam-trained reference for this exact base + dataset combination: diff this model's ΔW against POLARIS-1.7B-Preview's ΔW for the head-to-head SGD-vs-Adam compressibility comparison.
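One way to make "SVD-compressibility of ΔW" concrete is the relative Frobenius error left after keeping only the top r singular values of ΔW for a single weight matrix. The sketch below is an illustration under assumptions, not the study's actual metric or code: the helper name `rank_r_relative_error` is hypothetical, and the matrices are small random stand-ins rather than real checkpoint weights.

```python
import numpy as np


def rank_r_relative_error(delta_w: np.ndarray, r: int) -> float:
    """Relative Frobenius error of the best rank-r approximation of delta_w.

    By the Eckart-Young theorem this equals
    sqrt(sum of squared singular values beyond r / sum of all squared singular values).
    """
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    total = float(np.sum(s ** 2))
    if total == 0.0:
        return 0.0  # an all-zero update is trivially compressible
    tail = float(np.sum(s[r:] ** 2))
    return float(np.sqrt(tail / total))


# Toy stand-ins (assumption): a random "base" matrix plus a small rank-2 update.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 48))
u = rng.standard_normal((64, 2))
v = rng.standard_normal((2, 48))
w_ft = w_base + 1e-3 * (u @ v)

delta_w = w_ft - w_base  # ΔW = W_ft − W_base
errors = {r: rank_r_relative_error(delta_w, r) for r in (1, 2, 4)}
for r, err in errors.items():
    print(f"rank {r}: relative error {err:.3e}")
```

For a real comparison, the same function would be applied per layer to this model's ΔW and to the Adam-trained reference's ΔW, and the error-versus-rank curves compared.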
Training recipe
| field | value |
|---|---|
| base model | Qwen/Qwen3-1.7B-Base |
| dataset | POLARIS-Project/Polaris-Dataset-53K (52,779 train / 512 val) |
| algorithm | GRPO (adv_estimator=grpo, use_kl_loss=False, entropy_coeff=0, use_kl_in_reward=False) |
| optimizer | PLAIN SGD — momentum=0.0, nesterov=false, dampening=0.0, weight_decay=0.0 |
| learning rate | 1e-1 (constant) |
| train batch size | 128 (1 grad step per rollout batch) |
| ppo_micro_batch_size_per_gpu | 4 |
| rollout.n | 4 |
| rollout.temperature | 1.0 |
| max_prompt_length | 1024 |
| max_response_length | 8192 |
| epochs at this checkpoint | ~0.73 (step 300 / 412 per epoch) |
| hardware | 4× B200 (179 GB) |
| step time | ~65 s/step |
| trainer | verl (FSDP + vLLM rollout) |
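The derived numbers in the table can be reproduced from the raw settings. The short calculation below is only a sanity check; it assumes the trailing partial batch of each epoch is dropped (floor division) and that the ~65 s/step figure holds throughout, ignoring validation overhead.

```python
# Values copied from the recipe table above.
train_examples = 52_779
train_batch_size = 128
rollouts_per_prompt = 4   # rollout.n
checkpoint_step = 300
final_step = 1550
seconds_per_step = 65     # approximate, from the table

# Assumption: the incomplete last batch of each epoch is dropped.
steps_per_epoch = train_examples // train_batch_size
responses_per_step = train_batch_size * rollouts_per_prompt
epochs_at_checkpoint = checkpoint_step / steps_per_epoch
epochs_at_final = final_step / steps_per_epoch
hours_to_checkpoint = checkpoint_step * seconds_per_step / 3600

print(f"steps per epoch:      {steps_per_epoch}")
print(f"responses per step:   {responses_per_step}")
print(f"epochs at step {checkpoint_step}:  {epochs_at_checkpoint:.2f}")
print(f"epochs at step {final_step}: {epochs_at_final:.2f}")
print(f"approx. hours to step {checkpoint_step}: {hours_to_checkpoint:.1f}")
```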
The "PLAIN SGD" choice is scientifically load-bearing: every claim about SGD update compressibility relies on each step being exactly the learning-rate-scaled first-order gradient, with no momentum, dampening, or weight-decay terms mixed in.
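To see why the zeroed optimizer fields matter, the sketch below writes out a general SGD step on plain Python floats and shows that it reduces to `p - lr * g` when momentum, dampening, and weight decay are all 0. It assumes the update follows the form documented for `torch.optim.SGD`; the function `sgd_step` is a hypothetical scalar illustration, not the verl training code.

```python
def sgd_step(params, grads, lr, momentum=0.0, dampening=0.0,
             weight_decay=0.0, nesterov=False, buffers=None):
    """One SGD step on lists of floats; returns (new_params, new_buffers)."""
    new_params, new_buffers = [], []
    for i, (p, g) in enumerate(zip(params, grads)):
        d = g + weight_decay * p              # weight decay adds a term proportional to p
        if momentum != 0.0:
            if buffers is None:
                buf = d                       # first step: buffer starts at the gradient
            else:
                buf = momentum * buffers[i] + (1.0 - dampening) * d
            new_buffers.append(buf)
            d = d + momentum * buf if nesterov else buf
        new_params.append(p - lr * d)         # with all extras at 0 this is p - lr * g
    return new_params, (new_buffers if momentum != 0.0 else None)


# Plain SGD as in the recipe table: lr=1e-1, everything else zero.
params = [1.0, -2.0]
grads = [0.5, 0.25]
new_params, state = sgd_step(params, grads, lr=1e-1)

# The per-parameter change is exactly -lr * grad; no optimizer state is kept.
deltas = [n - p for n, p in zip(new_params, params)]
print(deltas, state)
```

Because no buffer is carried between steps, ΔW after T steps is just the sum of the T scaled gradients, which is what the compressibility analysis relies on.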
LR exploration (relevant context)
Plain SGD on Qwen3-Base + math RL has a narrow stable LR window:
- lr=1e-2: gradient signal too weak; training stalled.
- lr=2e-1: catastrophic policy collapse (response_length → 8192 cap, rewards pinned at -1).
- lr=1: instant collapse from step 1.
- lr=1e-1 ← this run: stable-but-slow regime where val acc actually rose from 0.
Results
| metric | value |
|---|---|
| baseline val acc (step 0) | 0.0% |
| val acc at this ckpt (step 300) | 14.29% ← peak across the run |
| val acc at step 550 | 13.7% |
| val acc at step 700 | 2.7% |
| val acc at step 1500+ | 0% (collapsed) |
Validation accuracy peaked at 14.29% at step 300, drifted to 13.7% by step 550, dropped to 2.7% by step 700, and was back at 0% by step 1500. This pattern is consistent with momentum-free SGD drifting on noisy advantages. This checkpoint is therefore the one to use for downstream work or analysis; for the final (collapsed) checkpoint, see Sinestro38/qwen3-1p7b-sgd-polaris-step1550-final.
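The "noisy advantages" come from GRPO's group-relative estimate, which in this run is computed from only rollout.n = 4 samples per prompt. The sketch below shows the usual form of that estimate: each reward is centered and scaled by its own group's statistics. It is an illustration under assumptions: the ±1 reward values, the use of the population standard deviation, and the `eps` value are not confirmed by this card, and `group_relative_advantages` is a hypothetical helper rather than verl's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale the rewards of one prompt's rollout group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; a sample std is also plausible
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, four rollouts, one of which is scored correct (assumed rewards of +1 / -1).
mixed = group_relative_advantages([1.0, -1.0, -1.0, -1.0])

# A group where every rollout gets the same reward.
collapsed = group_relative_advantages([-1.0, -1.0, -1.0, -1.0])

print(mixed)
print(collapsed)
```

Under this normalization, a group whose rollouts all receive the same reward yields all-zero advantages, so it contributes nothing to the policy gradient. With four samples per group, a single changed reward also shifts every advantage in the group substantially.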
Use
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
m = AutoModelForCausalLM.from_pretrained(
    "Sinestro38/qwen3-1p7b-sgd-polaris-step300-best-val",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Sinestro38/qwen3-1p7b-sgd-polaris-step300-best-val")
# Math problems work best with the boxed-answer suffix
prompt = "Find all integer solutions to x^2 + y^2 = 25. Let's think step by step and output the final answer within \\boxed{}."
msgs = [{"role": "user", "content": prompt}]
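# return_dict=True returns a dict-like BatchEncoding holding input_ids and attention_mask
inputs = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True, return_dict=True, return_tensors="pt").to("cuda")
out = m.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))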
Caveats
- This is a base model (Qwen3-1.7B-Base) fine-tuned with RL, with no SFT step beforehand. The Polaris template asks for \boxed{}, but the model learned to state answers in a prose-style "Answer: X" format instead; that format matches verl's default Minerva regex and therefore scores as correct.
- Trained on math only; no code, no general instruction-following data.
- ΔW magnitude is small relative to the base weights (a feature for the compressibility study, not a bug).
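Because the model tends to write "Answer: X" rather than \boxed{X}, downstream code that parses its output may want to accept both forms. The helper below is a hypothetical sketch for that purpose: the patterns are illustrative assumptions and are not verl's Minerva regex, and the boxed pattern deliberately ignores nested braces such as \boxed{\frac{1}{2}}.

```python
import re

# Illustrative patterns (assumptions), not the scorer used during training.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")   # flat \boxed{...} only, no nested braces
ANSWER = re.compile(r"Answer:\s*(.+)")       # captures the rest of the line


def extract_final_answer(text):
    """Return the last boxed answer, else the last 'Answer: X' line, else None."""
    boxed = BOXED.findall(text)
    if boxed:
        return boxed[-1].strip()
    answers = ANSWER.findall(text)
    if answers:
        return answers[-1].strip().rstrip(".")
    return None


samples = [
    "We check each case.\nAnswer: 12.",
    "Therefore the result is \\boxed{7}.",
    "I am not sure.",
]
parsed = [extract_final_answer(s) for s in samples]
print(parsed)
```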
Citation context
Work in progress: this model accompanies a paper on plain-SGD RL update compressibility for batched-LoRA serving, being submitted to ICML 2026.
Related models in this study:
- Sinestro38/qwen3-1p7b-sgd-polaris-step1550-final: same run, final (collapsed) checkpoint
- Sinestro38/dsr1-qwen7b-sgd-polaris-step100-best-val: same recipe, scaled to 7B with DS-R1-Distill base
- Sinestro38/dsr1-qwen7b-sgd-polaris-step412-final: 7B final