Qwen3-1.7B-Base + Polaris RL (PLAIN SGD, step 300, peak val)

This is the peak-validation checkpoint (step 300/1550) of a plain-SGD RL fine-tune of Qwen/Qwen3-1.7B-Base on POLARIS-Project/Polaris-Dataset-53K. It is released as part of an ICML-2026 study on the low-rank structure of SGD vs Adam RL updates for batched-LoRA inference.

Why this checkpoint exists

The motivation is to compare the SVD-compressibility of ΔW = W_ft − W_base between SGD-trained and Adam-trained RL fine-tunes. Existing open RL fine-tunes (POLARIS, Skywork-OR1, DeepCoder, AceReason, ORZ, DAPO) are all Adam-trained, so we needed an SGD counterpart with the same base and the same recipe. POLARIS-1.7B-Preview is the upstream Adam-trained reference for this exact base + dataset combination: diff this model's ΔW against POLARIS-1.7B-Preview's ΔW for the head-to-head SGD-vs-Adam compressibility comparison.
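One way to make "SVD-compressibility" concrete is the fraction of ΔW's squared Frobenius norm captured by its top-r singular values. The sketch below is illustrative only: the metric, the helper name `rank_r_energy`, and the toy matrices are assumptions, not necessarily what the study uses.

```python
import numpy as np

def rank_r_energy(delta_w: np.ndarray, r: int) -> float:
    """Fraction of ||delta_w||_F^2 captured by the top-r singular values."""
    s = np.linalg.svd(delta_w, compute_uv=False)  # singular values, descending
    total = float(np.sum(s ** 2))
    if total == 0.0:
        return 1.0  # a zero update is trivially compressible
    return float(np.sum(s[:r] ** 2) / total)

# Toy stand-ins for a single weight matrix (hypothetical shapes and values).
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 48))
low_rank_update = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 48))
w_ft = w_base + 1e-3 * low_rank_update

delta_w = w_ft - w_base  # ΔW = W_ft − W_base
for r in (1, 2, 4, 8):
    print(f"rank {r}: energy captured = {rank_r_energy(delta_w, r):.4f}")
```

For real checkpoints the same function would be applied per weight matrix, after subtracting the base model's tensor from the fine-tuned one with matching parameter names.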

Training recipe

field	value
base model	`Qwen/Qwen3-1.7B-Base`
dataset	`POLARIS-Project/Polaris-Dataset-53K` (52,779 train / 512 val)
algorithm	GRPO (`adv_estimator=grpo`, `use_kl_loss=False`, `entropy_coeff=0`, `use_kl_in_reward=False`)
optimizer	PLAIN SGD — `momentum=0.0`, `nesterov=false`, `dampening=0.0`, `weight_decay=0.0`
learning rate	`1e-1` (constant)
train batch size	128 (1 grad step per rollout batch)
ppo_micro_batch_size_per_gpu	4
rollout.n	4
rollout.temperature	1.0
max_prompt_length	1024
max_response_length	8192
epochs at this checkpoint	~0.73 (step 300 / 412 per epoch)
hardware	4× B200 (179 GB)
step time	~65 s/step
trainer	verl (FSDP + vLLM rollout)
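The bookkeeping figures in the table can be re-derived from the dataset size, batch size, and step time. This is a quick arithmetic check, assuming the partial last batch of each epoch is dropped (which is what reproduces the 412 steps per epoch quoted above) and treating the ~65 s/step figure as exact, so the wall-clock estimate is rough.

```python
train_examples = 52_779       # Polaris-Dataset-53K train split
train_batch_size = 128        # prompts per rollout batch (1 grad step each)
rollouts_per_prompt = 4       # rollout.n
checkpoint_step = 300
seconds_per_step = 65         # approximate, from the table

# Assumption: the incomplete final batch of an epoch is dropped.
steps_per_epoch = train_examples // train_batch_size
epochs_at_ckpt = checkpoint_step / steps_per_epoch
responses_per_step = train_batch_size * rollouts_per_prompt
hours_to_ckpt = checkpoint_step * seconds_per_step / 3600

print(f"steps per epoch:      {steps_per_epoch}")
print(f"epochs at checkpoint: {epochs_at_ckpt:.2f}")
print(f"responses per step:   {responses_per_step}")
print(f"approx. hours to ckpt: {hours_to_ckpt:.1f}")
```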

The "PLAIN SGD" choice is scientifically load-bearing — every claim about SGD update compressibility relies on the update being the pure first-order gradient.
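To spell out why the zeroed hyperparameters matter, here is the general SGD step in the form documented for PyTorch's `torch.optim.SGD` (non-Nesterov case), followed by its reduction for this run. It is a sketch in exact arithmetic: g_t stands for whatever gradient the trainer hands to the optimizer at step t, and the derivation does not model gradient clipping or mixed-precision rounding.

```latex
% General step: momentum \mu, dampening \tau, weight decay \lambda, learning rate \eta
\begin{aligned}
\tilde g_t &= g_t + \lambda W_t, \\
b_t        &= \mu\, b_{t-1} + (1-\tau)\,\tilde g_t, \\
W_{t+1}    &= W_t - \eta\, b_t .
\end{aligned}

% This run: \mu = \tau = \lambda = 0, hence b_t = g_t and
W_{t+1} = W_t - \eta\, g_t
\quad\Longrightarrow\quad
\Delta W = W_T - W_0 = -\eta \sum_{t=0}^{T-1} g_t ,
\qquad \eta = 10^{-1},\; T = 300 .
```

Under these assumptions ΔW is just a scaled sum of per-step gradients, with no optimizer state mixed in, which is the property the compressibility claims depend on.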

LR exploration (relevant context)

Plain SGD on Qwen3-Base + math RL has a narrow stable LR window:

lr=1e-2: gradient signal too weak; training stalled.
lr=1e-1 ← this run: stable-but-slow regime in which val acc rose from 0.
lr=2e-1: catastrophic policy collapse (response_length → 8192 cap, rewards pinned at -1).
lr=1: instant collapse from step 1.

Results

metric	value
baseline val acc (step 0)	0.0%
val acc at this ckpt (step 300)	14.29% ← peak across the run
val acc at step 550	13.7%
val acc at step 700	2.7%
val acc at step 1500+	0% (collapsed)

The val acc trajectory peaked at step 300 (14.29%), stayed close to that level through step 550 (13.7%), fell to 2.7% by step 700, and had collapsed back to 0% by step 1500. This pattern is consistent with momentum-free SGD drifting on noisy advantage estimates. This checkpoint is therefore the one to use for downstream work or analysis; for the final-epoch (collapsed) checkpoint see Sinestro38/qwen3-1p7b-sgd-polaris-step1550-final.
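The checkpoint choice amounts to taking the argmax over the validation curve. A minimal sketch using only the numeric values reported in the table above (the dict name `val_acc` is illustrative, and the real run logged more evaluation points than are listed here):

```python
# Validation accuracy (%) at the steps reported in the results table.
val_acc = {0: 0.0, 300: 14.29, 550: 13.7, 700: 2.7}

# Peak-validation checkpoint: the step with the highest accuracy.
best_step = max(val_acc, key=val_acc.get)
print(f"best step: {best_step} ({val_acc[best_step]}%)")

# How far each later evaluation fell below the peak.
for step in sorted(s for s in val_acc if s > best_step):
    drop = val_acc[best_step] - val_acc[step]
    print(f"step {step}: {drop:.2f} points below peak")
```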

Use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

m = AutoModelForCausalLM.from_pretrained(
    "Sinestro38/qwen3-1p7b-sgd-polaris-step300-best-val",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Sinestro38/qwen3-1p7b-sgd-polaris-step300-best-val")

# Math problems work best with the boxed-answer suffix
prompt = "Find all integer solutions to x^2 + y^2 = 25. Let's think step by step and output the final answer within \\boxed{}."
msgs = [{"role": "user", "content": prompt}]
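# return_dict=True also returns the attention mask, which generate() expects
inputs = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True, return_tensors="pt", return_dict=True).to("cuda")
out = m.generate(**inputs, max_new_tokens=2048, do_sample=True, temperature=0.6)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))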

Caveats

This is a base model (Qwen3-1.7B-Base) fine-tuned directly with RL, with no SFT stage beforehand. Although the Polaris template asks for \boxed{}, the model learned to state its final answer in the prose-style "Answer: X" format, which verl's default Minerva regex scores as correct.
Trained on math only; no code, no general instruction-following data.
ΔW is small in magnitude relative to the base weights (a feature for the compressibility study, not a bug).
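Because the model may answer with either "Answer: X" or \boxed{X}, downstream code that parses generations should probably accept both. The helper below is a hypothetical sketch: the patterns are illustrative and are not verl's actual Minerva regex, and the boxed pattern does not handle nested braces such as \boxed{\frac{1}{2}}.

```python
import re
from typing import Optional

# Illustrative patterns only; not the scorer used during training.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")   # no nested braces
ANSWER = re.compile(r"Answer:\s*(.+)")       # captures to end of line

def extract_answer(text: str) -> Optional[str]:
    """Return the last boxed answer if present, else the last 'Answer:' line."""
    boxed = BOXED.findall(text)
    if boxed:
        return boxed[-1].strip()
    answers = ANSWER.findall(text)
    if answers:
        return answers[-1].strip()
    return None

print(extract_answer("We get x = 3.\nAnswer: 3"))
print(extract_answer(r"Therefore the result is \boxed{42}."))
print(extract_answer("No final answer was given."))
```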

Citation context

Work in progress: a paper on plain-SGD RL update compressibility for batched-LoRA serving is being submitted to ICML 2026.

Related models in this study:

Sinestro38/qwen3-1p7b-sgd-polaris-step1550-final — same run, final (collapsed) checkpoint
Sinestro38/dsr1-qwen7b-sgd-polaris-step100-best-val — same recipe, scaled to 7B with DS-R1-Distill base
Sinestro38/dsr1-qwen7b-sgd-polaris-step412-final — 7B final
