Qwen3-1.7B-Base + Polaris RL (PLAIN SGD, step 300, peak val)

This is the peak-validation checkpoint (step 300 of 1,550) of a plain-SGD RL fine-tune of Qwen/Qwen3-1.7B-Base on POLARIS-Project/Polaris-Dataset-53K. It is released as part of an ICML 2026 study on the low-rank structure of SGD vs Adam RL updates for batched-LoRA inference.

Why this checkpoint exists

The goal is to compare the SVD-compressibility of ΔW = W_ft − W_base between SGD-trained and Adam-trained RL fine-tunes. Existing open RL fine-tunes (POLARIS, Skywork-OR1, DeepCoder, AceReason, ORZ, DAPO) are all Adam-trained, so we needed an SGD counterpart with the same base and the same recipe. POLARIS-1.7B-Preview is the upstream Adam-trained reference for this exact base + dataset combination: diff this model's ΔW against POLARIS-1.7B-Preview's ΔW for the head-to-head SGD-vs-Adam compressibility comparison.
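As a rough illustration of what "SVD-compressibility" of ΔW can mean, the sketch below computes the relative Frobenius error left after truncating ΔW to rank r. It is a minimal sketch, not the study's actual metric: the function name `rank_r_relative_error`, the choice of the Frobenius norm, and the toy matrices are all assumptions made for illustration.

```python
import numpy as np

def rank_r_relative_error(w_base, w_ft, r):
    """Relative Frobenius error of the best rank-r approximation of w_ft - w_base."""
    delta = w_ft - w_base
    # Singular values come back sorted in descending order.
    s = np.linalg.svd(delta, compute_uv=False)
    total = np.sqrt(np.sum(s ** 2))
    if total == 0.0:
        return 0.0
    # Eckart-Young: the best rank-r error is the norm of the discarded tail.
    tail = np.sqrt(np.sum(s[r:] ** 2))
    return float(tail / total)

# Toy data (hypothetical): a base matrix plus an exactly rank-2 update.
rng = np.random.default_rng(0)
w_base = rng.standard_normal((64, 32))
low_rank = rng.standard_normal((64, 2)) @ rng.standard_normal((2, 32))
w_ft = w_base + 0.01 * low_rank

err_r0 = rank_r_relative_error(w_base, w_ft, 0)  # nothing kept
err_r2 = rank_r_relative_error(w_base, w_ft, 2)  # full update recovered
print(err_r0, err_r2)
```

For real checkpoints one would presumably apply this to each 2-D weight matrix of the two state dicts, cast to float32 first, and compare the error-vs-rank curves of the SGD and Adam fine-tunes.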

Training recipe

field                          value
base model                     Qwen/Qwen3-1.7B-Base
dataset                        POLARIS-Project/Polaris-Dataset-53K (52,779 train / 512 val)
algorithm                      GRPO (adv_estimator=grpo, use_kl_loss=False, entropy_coeff=0, use_kl_in_reward=False)
optimizer                      PLAIN SGD (momentum=0.0, nesterov=false, dampening=0.0, weight_decay=0.0)
learning rate                  1e-1 (constant)
train batch size               128 (1 grad step per rollout batch)
ppo_micro_batch_size_per_gpu   4
rollout.n                      4
rollout.temperature            1.0
max_prompt_length              1024
max_response_length            8192
epochs at this checkpoint      ~0.73 (step 300 / 412 per epoch)
hardware                       4× B200 (179 GB)
step time                      ~65 s/step
trainer                        verl (FSDP + vLLM rollout)
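The epoch figure in the table follows from the dataset size and batch size. The short calculation below reproduces it, assuming the incomplete final batch of each epoch is dropped (which is what the stated 412 steps per epoch suggests); the variable names are illustrative.

```python
train_examples = 52_779      # train split size from the table
train_batch_size = 128       # prompts per rollout batch
rollouts_per_prompt = 4      # rollout.n
checkpoint_step = 300

# Assumption: the incomplete final batch of each epoch is dropped.
steps_per_epoch = train_examples // train_batch_size
epochs_at_checkpoint = checkpoint_step / steps_per_epoch
responses_per_step = train_batch_size * rollouts_per_prompt

print(steps_per_epoch)                  # 412
print(round(epochs_at_checkpoint, 2))   # 0.73
print(responses_per_step)               # 512
```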

The "PLAIN SGD" choice is scientifically load-bearing — every claim about SGD update compressibility relies on the update being the pure first-order gradient.
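To spell out why momentum, dampening, and weight decay all have to be zero: with a constant learning rate, the plain-SGD recursion telescopes, so ΔW is a scaled sum of the gradients applied during training (up to numerical precision). The symbols η (learning rate), g_t (the gradient handed to the optimizer at step t, after any clipping), and T (number of steps) are notation introduced here, not taken from the training code.

```latex
W_{t+1} = W_t - \eta\, g_t
\quad\Longrightarrow\quad
\Delta W \;=\; W_T - W_0 \;=\; -\eta \sum_{t=0}^{T-1} g_t
```

For this checkpoint η = 0.1 and T = 300. With momentum, weight decay, or Adam-style per-coordinate rescaling, ΔW would no longer reduce to a plain sum of gradients.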

LR exploration (relevant context)

Plain SGD on Qwen3-Base + math RL has a narrow stable LR window:

  • lr=1e-2: gradient signal too weak — stalled.
  • lr=2e-1: catastrophic policy collapse (response_length → 8192 cap, rewards pinned at -1).
  • lr=1: instant collapse from step 1.
  • lr=1e-1 ← this run: stable-but-slow regime where val acc actually rose from 0.
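One way to see why a policy whose rewards are pinned at a single value is hard to recover under GRPO: in the common formulation, each prompt's rollouts (4 here) are normalized against their own group mean and standard deviation, so a group with identical rewards produces zero advantage. The sketch below assumes a population standard deviation, a small `eps`, and a +1/−1 reward scale; the function name is hypothetical, and verl's implementation may differ in these details.

```python
import numpy as np

def group_normalized_advantages(rewards, eps=1e-6):
    """GRPO-style advantages for the rollouts of one prompt (illustrative)."""
    rewards = np.asarray(rewards, dtype=np.float64)
    # Normalize against the group's own statistics.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# One correct rollout out of four: a non-zero learning signal.
mixed = group_normalized_advantages([1.0, -1.0, -1.0, -1.0])

# All four rollouts fail: every advantage is zero, so this prompt
# contributes no policy gradient.
pinned = group_normalized_advantages([-1.0, -1.0, -1.0, -1.0])

print(mixed)
print(pinned)
```

With only four rollouts per prompt, the group statistics are themselves noisy estimates, which is relevant to the narrow stable learning-rate window described above.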

Results

metric                            value
baseline val acc (step 0)         0.0%
val acc at this ckpt (step 300)   14.29% ← peak across the run
val acc at step 550               13.7%
val acc at step 700               2.7%
val acc at step 1500+             0% (collapsed)

Validation accuracy peaked at step 300, drifted down slowly (13.7% at step 550), then collapsed to 0% in the second half of training, a textbook case of SGD-without-momentum drift on noisy advantages. This checkpoint is therefore the one to use for downstream work or analysis. For the collapsed final-epoch checkpoint, see Sinestro38/qwen3-1p7b-sgd-polaris-step1550-final.

Use

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

m = AutoModelForCausalLM.from_pretrained(
    "Sinestro38/qwen3-1p7b-sgd-polaris-step300-best-val",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Sinestro38/qwen3-1p7b-sgd-polaris-step300-best-val")

# Math problems work best with the boxed-answer suffix
prompt = "Find all integer solutions to x^2 + y^2 = 25. Let's think step by step and output the final answer within \\boxed{}."
msgs = [{"role": "user", "content": prompt}]
ids = tok.apply_chat_template(msgs, tokenize=True, add_generation_prompt=True, return_tensors="pt").to("cuda")
out = m.generate(ids, max_new_tokens=2048, do_sample=True, temperature=0.6)
print(tok.decode(out[0][ids.shape[1]:], skip_special_tokens=True))

Caveats

  • This is a base model (Qwen3-1.7B-Base) fine-tuned directly with RL, with no SFT stage beforehand. It learned to state final answers in the prose-style "Answer: X" format rather than in \boxed{}: the Polaris template asks for \boxed{}, but "Answer:" patterns match verl's default Minerva regex and therefore score correctly.
  • Trained on math only; no code, no general instruction-following data.
  • ΔW is small in magnitude relative to the base weights (a feature for the compressibility study, not a bug).
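Because the model may finish with either "Answer: X" or \boxed{X}, downstream scoring code probably needs to accept both. The sketch below is one hedged way to do that; the regexes and the function name `extract_final_answer` are illustrative assumptions and are not verl's actual Minerva regex.

```python
import re

# Illustrative patterns only. The boxed pattern does not handle nested braces.
BOXED_RE = re.compile(r"\\boxed\{([^{}]*)\}")
ANSWER_RE = re.compile(r"Answer:\s*(.+)")

def extract_final_answer(text):
    """Return the last boxed answer if present, else the last 'Answer:' line."""
    boxed = BOXED_RE.findall(text)
    if boxed:
        return boxed[-1].strip()
    prose = ANSWER_RE.findall(text)
    if prose:
        return prose[-1].strip()
    return None

print(extract_final_answer("We get x = 3.\nAnswer: 3"))
print(extract_final_answer("So the result is \\boxed{7}."))
```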

Citation context

Work in progress — being submitted to ICML 2026 with a paper on plain-SGD RL update compressibility for batched-LoRA serving.

Related models in this study:
