# Deimos-A4
Satellite Class · Qwen3.5-4B · 4.7B · Apache 2.0

## Overview

A 4B reasoning specialist with a terse, internal chain-of-thought: ~60% fewer tokens, ~36% faster, and +40 pt average accuracy on hard math versus the Qwen3.5-4B base model in a recommended-config head-to-head.
Built on Qwen/Qwen3.5-4B. Internally the model emits compact, fragment-style reasoning inside `<think>...</think>` blocks, then expands into a clean, professional response. The user-facing output never contains the compact internal style; compression lives entirely inside the `<think>` trace.
Trained via length-biased rejection sampling: 4,338 verified shortest-correct traces curated from a self-distillation of the native concise-reasoning predecessor (Deimos-A1). The "shortest correct" filter teaches the model to drop filler while preserving logical structure.
Use this model for: hard math (AIME, MATH-hard, MATH-500), multi-step proofs, and long algebraic chains, anywhere the base model hits its 4096-token reasoning ceiling.

Use the base model instead for: general knowledge recall (MMLU), strict instruction-following format rules, and casual chat.
## Specifications

| Spec | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Parameters | 4.7B |
| Weights | BF16 (9.3 GB) |
| License | Apache 2.0 |
## Usage
For best results, use the locked config below, derived via a coordinate-descent sweep on minerva_math500 (math_verify metric). The chat template auto-prepends `<think>`; the model emits compact reasoning, then `</think>`, followed by the formal answer.
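The sweep is straightforward to reproduce in spirit. A minimal sketch of the coordinate-descent loop (the grid values here are illustrative, and `score` is a hypothetical evaluator wrapping a minerva_math500 run at a given config):

```python
# Illustrative sweep grid; not the exact grid used for the locked config.
AXES = {
    "temperature": [0.0, 0.3, 0.6, 1.0],
    "repetition_penalty": [1.00, 1.05, 1.10, 1.15],
    "presence_penalty": [0.0, 0.5, 1.0, 1.5],
    "max_new_tokens": [4096, 8192],
}

def sweep(score, rounds=2):
    """Coordinate descent: optimize one axis at a time, holding the rest fixed."""
    cfg = {axis: vals[0] for axis, vals in AXES.items()}
    for _ in range(rounds):
        for axis, vals in AXES.items():
            # Keep the value of this axis that scores best given current cfg.
            cfg[axis] = max(vals, key=lambda v: score({**cfg, axis: v}))
    return cfg
```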
### Thinking Mode: Per-Task Routing
The chat template supports a runtime toggle, `enable_thinking`, that controls whether the model emits a `<think>` reasoning trace before its formal answer. Sweep results at the locked config (limit=50 each):
| Task | Thinking ON | Thinking OFF | Recommended |
|---|---|---|---|
| Hard math (math_hard, math500, AIME) | +0.40 avg vs base | — | ON (default) |
| gsm8k flex (elementary math) | 0.86 | 0.80 | ON (+6pt) |
| gsm8k strict | 0.30 | 0.30 | either |
| IFEval prompt_strict | 0.40 | 0.44 | OFF (+4pt) |
| IFEval prompt_loose | 0.52 | 0.56 | OFF (+4pt) |
| IFEval inst_strict | 0.54 | 0.58 | OFF (+4pt) |
| IFEval inst_loose | 0.63 | 0.67 | OFF (+4pt) |
| MMLU-Pro (knowledge MCQ) | 0.54 | 0.54 | either |
Routing rule: default to thinking ON for math, code, and open-ended reasoning. Switch to thinking OFF for strict-format instruction-following (length constraints, no-comma rules, exact-letter casing, etc.); the model's compact reasoning style fights those rules when the trace counts toward output formatting.
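At request time the rule reduces to a one-line toggle. A minimal sketch (the task labels are our own, not part of the model or template API):

```python
# Tasks that benefit from a <think> trace, per the sweep table above.
THINKING_ON = {"math", "code", "open_ended_reasoning"}

def template_kwargs(task: str) -> dict:
    """Per-task kwargs to pass through to tok.apply_chat_template(...)."""
    return {"enable_thinking": task in THINKING_ON}
```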
<p style="margin-top: 24px;">Quick start with <code>transformers</code> — math/reasoning (thinking ON, default):</p>
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Michael-Kozu/Deimos-A4", torch_dtype="auto", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Michael-Kozu/Deimos-A4")

msgs = [{"role": "user", "content": "A train leaves at 3pm at 60mph. Another leaves at 4pm at 80mph. When does the second catch the first?"}]

inputs = tok.apply_chat_template(
    msgs, tokenize=True, return_tensors="pt", add_generation_prompt=True
).to(model.device)

# Locked recommended config (math/reasoning):
out = model.generate(
    inputs,
    max_new_tokens=8192,
    do_sample=False,           # greedy (temperature=0)
    repetition_penalty=1.10,
)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```
<p style="margin-top: 24px;">For strict instruction-following (thinking OFF):</p>
```python
inputs = tok.apply_chat_template(
    msgs, tokenize=True, return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,  # routes around the <think> trace
).to(model.device)
```

(With `transformers`, template variables like `enable_thinking` are passed directly to `apply_chat_template`; the `chat_template_kwargs` wrapper below is the vLLM server's per-request equivalent.)
<p style="margin-top: 24px;">vLLM serving — pass <code>chat_template_kwargs</code> per request via <code>extra_body</code>:</p>
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.chat.completions.create(
    model="a4",
    messages=[{"role": "user", "content": "Answer in exactly 3 sentences..."}],
    temperature=0.0,
    max_tokens=8192,
    extra_body={"chat_template_kwargs": {"enable_thinking": False},
                "repetition_penalty": 1.10,
                "presence_penalty": 1.5},
)
```
<p style="margin-top: 24px;"><strong>Example output</strong> — the model reasons in compact fragments, replies in full prose:</p>
```text
<think>
Train1 departs 3pm, 60mph.
Train2 departs 4pm, 80mph.
Speed diff: 80 - 60 = 20mph.
Head start: 1 hour * 60mph = 60 miles.
Catch time: 60 miles / 20mph = 3 hours.
3pm + 3 hours = 6pm.
</think>
The second train catches the first at 6pm.

Step 1: Set up the scenario
- Train 1 departs at 3:00 PM at 60 mph
- Train 2 departs at 4:00 PM at 80 mph (1 hour later)

Step 2: Calculate the head start distance ...
```
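If you log raw generations, you may want to separate the trace from the answer. A minimal sketch, assuming the single `</think>` boundary that the training filter enforces:

```python
def split_think(text: str) -> tuple[str, str]:
    """Split a raw generation into (think_trace, final_answer)."""
    think, sep, answer = text.partition("</think>")
    if not sep:  # no trace emitted (e.g. enable_thinking=False)
        return "", text.strip()
    return think.replace("<think>", "", 1).strip(), answer.strip()
```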
## Quantizations
BF16 weights ship in this repo (9.3 GB). GGUF and quantized formats are forthcoming.
## Training Details
A4 is built on its predecessor, Deimos-A1, a 4B model that already produces native concise `<think>` traces (Quark v1 SFT of Qwen3.5-4B). A4 takes A1's compression style and tightens it via a length-biased rejection-sampling pass.
### Pipeline
- Prompts: 5,202 across math (gsm8k-style), MCQ (ARC-style science), IFEval-style instructions, and code (humanevalplus + mbppplus)
- Generation: A1 emits N=8 candidates per prompt via vLLM (temperature=1.0, top-p=0.95, top-k=20)
- Filter (4-stage verification, sketched below):
  - Structure: exactly one `</think>` plus non-empty post-think output
  - Correctness: per-task grader (numeric for math, letter for MCQ, IFEval rule grader, subprocess test for code)
  - Length bias: among correct candidates, keep the shortest think trace
  - Result: 4,338 verified gold examples (median think 103 chars for math/code)
- SFT: LoRA r=64, alpha=64, LR 3e-5, 1 epoch, batch=2 with accum=8 (effective 16), packing=False (preserves the `<think>` boundary)
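A minimal sketch of the filter's selection logic (the `grade` callable is a hypothetical stand-in for the per-task verifiers):

```python
def pick_gold(candidates: list[str], grade) -> str | None:
    """Keep the shortest-think candidate that passes all gates."""
    verified = []
    for text in candidates:
        # Structure gate: exactly one </think>, non-empty post-think output.
        if text.count("</think>") != 1:
            continue
        think, _, answer = text.partition("</think>")
        if not answer.strip():
            continue
        # Correctness gate: numeric / letter / rule / subprocess grader.
        if not grade(answer):
            continue
        verified.append((len(think), text))
    # Length bias: among correct candidates, keep the shortest think trace.
    return min(verified)[1] if verified else None
```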
### Training Config
```yaml
base_model: Deimos-A1 merged BF16 (Qwen3.5-4B + Quark v1 SFT)
adapter: lora
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
sequence_len: 4096
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
num_train_epochs: 1
learning_rate: 3e-5
lr_scheduler: cosine
warmup_ratio: 0.05
optim: adamw_torch
weight_decay: 0.01
bf16: true
packing: false  # preserves <think> boundary in attention
gradient_checkpointing: unsloth
```
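For reference, the adapter settings above expressed as a `peft` LoraConfig (a sketch; the surrounding trainer code is omitted):

```python
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```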
## Benchmarks
Both models were benchmarked with lm-eval-harness against a vLLM server at limit=50. A4 uses its locked recommended config (temp=0.0, rep_penalty=1.10, presence_penalty=1.5, max_tokens=8192); base uses the standard production budget (temp=0.6, max_tokens=4096), an apples-to-apples comparison in real-world deployment terms. Alibaba does not publish AIME / leaderboard_math_hard / minerva_math500 numbers for Qwen3.5-4B; the base column is our reproduction with an identical harness.
Note on Qwen's max-effort settings: we attempted Qwen's official recommended config (temp=1.0, presence=1.5, max_tokens=32768) on the base model. The bench reached 94% completion in 5 hours before crashing on a server-disconnect timeout; up to that partial checkpoint, base trended toward similar accuracy at 2.5× the wall-clock. We default to the standard 4K-budget comparison here, which is the more honest "ship config vs ship config" measurement for production deployments.
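To reproduce the harness runs, something along these lines should work (a hedged sketch via lm-eval's Python API; the exact `model_args` and task names depend on your harness version, and `a4` matches the served model name in the vLLM example above):

```python
import lm_eval

# Assumes a vLLM OpenAI-compatible server at localhost:8000 serving "a4".
results = lm_eval.simple_evaluate(
    model="local-chat-completions",
    model_args="model=a4,base_url=http://localhost:8000/v1/chat/completions",
    tasks=["minerva_math500"],
    limit=50,
)
print(results["results"])
```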
### Token Efficiency & Speed
Per-problem output size on the hard math suite (median characters across 50 problems per task; reduction = 1 − A4/base):
| Task | A4 chars | Base chars | Reduction |
|---|---|---|---|
| aime24 | 4,414 | 9,366 | −53% |
| aime25 | 3,003 | 8,928 | −66% |
| algebra_hard | 1,648 | 3,508 | −53% |
| counting_and_prob_hard | 2,598 | 7,984 | −67% |
| geometry_hard | 3,164 | 9,135 | −65% |
| intermediate_algebra_hard | 5,014 | 7,884 | −36% |
| num_theory_hard | 1,872 | 7,510 | −75% |
| prealgebra_hard | 1,668 | 7,126 | −77% |
| precalculus_hard | 4,244 | 8,600 | −51% |
| minerva_math500 | 1,410 | 3,792 | −63% |
Wall-clock per problem (vLLM, 4 concurrent requests):
| Batch | A4 avg time | Base avg time | Time reduction |
|---|---|---|---|
| math suite (350 reqs) | ~18 s | ~28 s | −36% |
| AIME (60 reqs) | ~16 s | ~30 s | −47% |
### Accuracy
| Hard tier (depth-limited) | Qwen3.5-4B base | Deimos-A4 (locked) | Δ |
|---|---|---|---|
| leaderboard_math_hard avg | 0.263 | 0.660 | +0.397 |
| ↳ algebra_hard | 0.46 | 0.84 | +0.38 |
| ↳ counting/prob_hard | 0.22 | 0.68 | +0.46 |
| ↳ geometry_hard | 0.20 | 0.48 | +0.28 |
| ↳ intermediate_algebra_hard | 0.12 | 0.54 | +0.42 |
| ↳ num_theory_hard | 0.40 | 0.90 | +0.50 |
| ↳ prealgebra_hard | 0.36 | 0.74 | +0.38 |
| ↳ precalculus_hard | 0.08 | 0.44 | +0.36 |
| minerva_math500 (math_verify) | 0.380 | 0.900 | +0.520 |
| AIME 2024 | 0.000 | 0.033 | +0.033 |
| AIME 2025 | 0.000 | 0.067 | +0.067 |
Note on AIME scores: the exact_match grader expects answers in `\boxed{}`; both A4 and base sometimes write the correct answer in prose without the box, scoring 0 on that metric. Spot-checks confirm base often arrives at the right answer but is graded down. The robust apples-to-apples comparisons here are leaderboard_math_hard (which uses an extraction filter) and minerva_math500 with math_verify (sympy normalization).
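A lenient extractor illustrates the gap (a sketch; the official graders use stricter extraction plus sympy normalization, and this regex ignores nested braces):

```python
import re

def extract_answer(text: str) -> str | None:
    # Prefer \boxed{...}; fall back to the last number in the reply,
    # which is where the unboxed prose answers land.
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    if m:
        return m.group(1).strip()
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None
```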
Easy tier (knowledge-bound; A4 trades general recall for reasoning-depth compression):
| Easy tier | Deimos-A4 | Notes |
|---|---|---|
| gsm8k_cot strict | 0.60 | elementary arithmetic — base > A4 here |
| gsm8k_cot flex | 0.82 | |
| mmlu_pro | 0.587 | knowledge-bound — base > A4 |
| ifeval prompt_loose | 0.58 | format rules conflict with concise style |
| ifeval prompt_strict | 0.50 | |
## Limitations & License
A4 is a specialist, not a base replacement. Known limitations:
- Knowledge regression vs base. mmlu_pro and ifeval drop because compression doesn't help where the answer comes from pretraining recall or strict format rules. Use the base model for those.
- Compact fragments must stay internal. The model is trained to put concise reasoning inside `<think>` only; final user-facing prose stays clean. If you strip the chat template or force generation outside the think block, results may regress.
- Best on hard math. AIME 2024/25 scores are still low in absolute terms (3-7%), as expected for a 4B model; the proof here is the relative gain over base, consistent at +28-50 pt across math_hard subjects.
- 4B pretraining ceiling. The model cannot exceed Qwen3.5-4B's underlying knowledge; it only extends the reasoning-depth budget. For higher capability, see the upcoming Europa model line (9B base).
This model is released under the Apache 2.0 License, inherited from the Qwen3.5-4B base.