# Deimos-A4
Satellite Class · Qwen3.5-4B · 4.7B · Apache 2.0

## Overview

A 4B reasoning specialist with a terse, internal chain-of-thought: ~60% fewer tokens, ~36% faster, and +40 pt average accuracy on hard math versus the Qwen3.5-4B base model in a recommended-config head-to-head.
Built on Qwen/Qwen3.5-4B. Internally the model emits compact, fragment-style reasoning inside `<think>...</think>` blocks, then expands into a clean, professional response. The user-facing output never contains the compact internal style; compression lives entirely inside the `<think>` trace.
Trained via length-biased rejection sampling: 4,338 verified shortest-correct traces curated from a self-distillation of the native concise-reasoning predecessor (Deimos-A1). The "shortest correct" filter teaches the model to drop filler while preserving logical structure.
Use this model for: hard math (AIME, MATH-hard, MATH-500), multi-step proofs, and long algebraic chains, anywhere the base model hits its 4096-token reasoning ceiling.

Use the base model instead for: general knowledge recall (MMLU), strict instruction-following format rules, and casual chat.
## Specifications

| Spec | Value |
|---|---|
| Base model | Qwen/Qwen3.5-4B |
| Parameters | 4.7B |
| Weights | BF16 (9.3 GB) |
| License | Apache 2.0 |
## Usage
For best results, use the locked config below, derived via a coordinate-descent sweep on minerva_math500 (math_verify metric). The chat template auto-prepends `<think>`; the model emits compact reasoning, then `</think>`, followed by the formal answer.
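The sweep is straightforward to reproduce in spirit. A minimal sketch of the coordinate-descent loop (the grid values here are illustrative, and `score` is a hypothetical evaluator wrapping a minerva_math500 run at a given config):

```python
# Illustrative sweep grid; not the exact grid used for the locked config.
AXES = {
    "temperature": [0.0, 0.3, 0.6, 1.0],
    "repetition_penalty": [1.00, 1.05, 1.10, 1.15],
    "presence_penalty": [0.0, 0.5, 1.0, 1.5],
    "max_new_tokens": [4096, 8192],
}

def sweep(score, rounds=2):
    """Coordinate descent: optimize one axis at a time, holding the rest fixed."""
    cfg = {axis: vals[0] for axis, vals in AXES.items()}
    for _ in range(rounds):
        for axis, vals in AXES.items():
            # Keep the value of this axis that scores best given current cfg.
            cfg[axis] = max(vals, key=lambda v: score({**cfg, axis: v}))
    return cfg
```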
### Thinking Mode: Per-Task Routing
The chat template supports a runtime toggle, `enable_thinking`, that controls whether the model emits a `<think>` reasoning trace before its formal answer. Sweep results at the locked config (limit=50 each):
| Task | Thinking ON | Thinking OFF | Recommended |
|---|---|---|---|
| Hard math (math_hard, math500, AIME) | +0.40 avg vs base | — | ON (default) |
| gsm8k flex (elementary math) | 0.86 | 0.80 | ON (+6pt) |
| gsm8k strict | 0.30 | 0.30 | either |
| IFEval prompt_strict | 0.40 | 0.44 | OFF (+4pt) |
| IFEval prompt_loose | 0.52 | 0.56 | OFF (+4pt) |
| IFEval inst_strict | 0.54 | 0.58 | OFF (+4pt) |
| IFEval inst_loose | 0.63 | 0.67 | OFF (+4pt) |
| MMLU-Pro (knowledge MCQ) | 0.54 | 0.54 | either |
Routing rule: default to thinking ON for math, code, and open-ended reasoning. Switch to thinking OFF for strict-format instruction-following (length constraints, no-comma rules, exact-letter casing, etc.); the model's compact reasoning style fights those rules when the trace counts toward output formatting.
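At request time the rule reduces to a one-line toggle. A minimal sketch (the task labels are our own, not part of the model or template API):

```python
# Tasks that benefit from a <think> trace, per the sweep table above.
THINKING_ON = {"math", "code", "open_ended_reasoning"}

def template_kwargs(task: str) -> dict:
    """Per-task kwargs to pass through to tok.apply_chat_template(...)."""
    return {"enable_thinking": task in THINKING_ON}
```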
<p style="margin-top: 24px;">Quick start with <code>transformers</code> — math/reasoning (thinking ON, default):</p>
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Michael-Kozu/Deimos-A4", torch_dtype="auto", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Michael-Kozu/Deimos-A4")

msgs = [{"role": "user", "content": "A train leaves at 3pm at 60mph. Another leaves at 4pm at 80mph. When does the second catch the first?"}]

inputs = tok.apply_chat_template(
    msgs, tokenize=True, return_tensors="pt", add_generation_prompt=True
).to(model.device)

# Locked recommended config (math/reasoning):
out = model.generate(
    inputs,
    max_new_tokens=8192,
    do_sample=False,           # greedy (temperature=0)
    repetition_penalty=1.10,
)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```
<p style="margin-top: 24px;">For strict instruction-following (thinking OFF):</p>
```python
inputs = tok.apply_chat_template(
    msgs, tokenize=True, return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,  # routes around the <think> trace
).to(model.device)
```

(With `transformers`, template variables like `enable_thinking` are passed directly to `apply_chat_template`; the `chat_template_kwargs` wrapper below is the vLLM server's per-request equivalent.)
<p style="margin-top: 24px;">vLLM serving — pass <code>chat_template_kwargs</code> per request via <code>extra_body</code>:</p>
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.chat.completions.create(
    model="a4",
    messages=[{"role": "user", "content": "Answer in exactly 3 sentences..."}],
    temperature=0.0,
    max_tokens=8192,
    extra_body={"chat_template_kwargs": {"enable_thinking": False},
                "repetition_penalty": 1.10,
                "presence_penalty": 1.5},
)
```
<p style="margin-top: 24px;"><strong>Example output</strong> — the model reasons in compact fragments, replies in full prose:</p>
```text
<think>
Train1 departs 3pm, 60mph.
Train2 departs 4pm, 80mph.
Speed diff: 80 - 60 = 20mph.
Head start: 1 hour * 60mph = 60 miles.
Catch time: 60 miles / 20mph = 3 hours.
3pm + 3 hours = 6pm.
</think>
The second train catches the first at 6pm.

Step 1: Set up the scenario
- Train 1 departs at 3:00 PM at 60 mph
- Train 2 departs at 4:00 PM at 80 mph (1 hour later)

Step 2: Calculate the head start distance ...
```
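If you log raw generations, you may want to separate the trace from the answer. A minimal sketch, assuming the single `</think>` boundary that the training filter enforces:

```python
def split_think(text: str) -> tuple[str, str]:
    """Split a raw generation into (think_trace, final_answer)."""
    think, sep, answer = text.partition("</think>")
    if not sep:  # no trace emitted (e.g. enable_thinking=False)
        return "", text.strip()
    return think.replace("<think>", "", 1).strip(), answer.strip()
```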
## Quantizations
BF16 weights ship in this repo (9.3 GB). GGUF and quantized formats are forthcoming.
## Training Details
A4 is built on its predecessor, Deimos-A1, a 4B model that already produces native concise `<think>` traces (Quark v1 SFT of Qwen3.5-4B). A4 takes A1's compression style and tightens it via a length-biased rejection-sampling pass.
### Pipeline
- Prompts: 5,202 across math (gsm8k-style), MCQ (ARC-style science), IFEval-style instructions, and code (humanevalplus + mbppplus)
- Generation: A1 emits N=8 candidates per prompt via vLLM (temperature=1.0, top-p=0.95, top-k=20)
- Filter (4-stage verification, sketched below):
  - Structure: exactly one `</think>` plus non-empty post-think output
  - Correctness: per-task grader (numeric for math, letter for MCQ, IFEval rule grader, subprocess test for code)
  - Length bias: among correct candidates, keep the shortest think trace
  - Result: 4,338 verified gold examples (median think 103 chars for math/code)
- SFT: LoRA r=64, alpha=64, LR 3e-5, 1 epoch, batch=2 with accum=8 (effective 16), packing=False (preserves the `<think>` boundary)
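A minimal sketch of the filter's selection logic (the `grade` callable is a hypothetical stand-in for the per-task verifiers):

```python
def pick_gold(candidates: list[str], grade) -> str | None:
    """Keep the shortest-think candidate that passes all gates."""
    verified = []
    for text in candidates:
        # Structure gate: exactly one </think>, non-empty post-think output.
        if text.count("</think>") != 1:
            continue
        think, _, answer = text.partition("</think>")
        if not answer.strip():
            continue
        # Correctness gate: numeric / letter / rule / subprocess grader.
        if not grade(answer):
            continue
        verified.append((len(think), text))
    # Length bias: among correct candidates, keep the shortest think trace.
    return min(verified)[1] if verified else None
```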
### Training Config
```yaml
base_model: Deimos-A1 merged BF16 (Qwen3.5-4B + Quark v1 SFT)
adapter: lora
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
sequence_len: 4096
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
num_train_epochs: 1
learning_rate: 3e-5
lr_scheduler: cosine
warmup_ratio: 0.05
optim: adamw_torch
weight_decay: 0.01
bf16: true
packing: false  # preserves <think> boundary in attention
gradient_checkpointing: unsloth
```
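For reference, the adapter settings above expressed as a `peft` LoraConfig (a sketch; the surrounding trainer code is omitted):

```python
from peft import LoraConfig

lora_cfg = LoraConfig(
    r=64,
    lora_alpha=64,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```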
## Benchmarks
Both models were benchmarked with lm-eval-harness against a vLLM server at limit=50. A4 uses its locked recommended config (temp=0.0, rep_penalty=1.10, presence_penalty=1.5, max_tokens=8192); base uses the standard production budget (temp=0.6, max_tokens=4096), an apples-to-apples comparison in real-world deployment terms. Alibaba does not publish AIME / leaderboard_math_hard / minerva_math500 numbers for Qwen3.5-4B; the base column is our reproduction with an identical harness.
Note on Qwen's max-effort settings: we attempted Qwen's official recommended config (temp=1.0, presence=1.5, max_tokens=32768) on the base model. The bench reached 94% completion in 5 hours before crashing on a server-disconnect timeout; up to that partial checkpoint, base trended toward similar accuracy at 2.5× the wall-clock. We default to the standard 4K-budget comparison here, which is the more honest "ship config vs ship config" measurement for production deployments.
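To reproduce the harness runs, something along these lines should work (a hedged sketch via lm-eval's Python API; the exact `model_args` and task names depend on your harness version, and `a4` matches the served model name in the vLLM example above):

```python
import lm_eval

# Assumes a vLLM OpenAI-compatible server at localhost:8000 serving "a4".
results = lm_eval.simple_evaluate(
    model="local-chat-completions",
    model_args="model=a4,base_url=http://localhost:8000/v1/chat/completions",
    tasks=["minerva_math500"],
    limit=50,
)
print(results["results"])
```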
### Token Efficiency & Speed
Per-problem output size on the hard math suite (median characters across 50 problems per task; reduction = 1 − A4/base):
| Task | A4 chars | Base chars | Reduction |
|---|---|---|---|
| aime24 | 4,414 | 9,366 | −53% |
| aime25 | 3,003 | 8,928 | −66% |
| algebra_hard | 1,648 | 3,508 | −53% |
| counting_and_prob_hard | 2,598 | 7,984 | −67% |
| geometry_hard | 3,164 | 9,135 | −65% |
| intermediate_algebra_hard | 5,014 | 7,884 | −36% |
| num_theory_hard | 1,872 | 7,510 | −75% |
| prealgebra_hard | 1,668 | 7,126 | −77% |
| precalculus_hard | 4,244 | 8,600 | −51% |
| minerva_math500 | 1,410 | 3,792 | −63% |
Wall-clock per problem (vLLM, 4 concurrent requests):
| Batch | A4 avg time | Base avg time | Time reduction |
|---|---|---|---|
| math suite (350 reqs) | ~18 s | ~28 s | −36% |
| AIME (60 reqs) | ~16 s | ~30 s | −47% |
### Accuracy
| Hard tier (depth-limited) | Qwen3.5-4B base | Deimos-A4 (locked) | Δ |
|---|---|---|---|
| leaderboard_math_hard avg | 0.263 | 0.660 | +0.397 |
| ↳ algebra_hard | 0.46 | 0.84 | +0.38 |
| ↳ counting/prob_hard | 0.22 | 0.68 | +0.46 |
| ↳ geometry_hard | 0.20 | 0.48 | +0.28 |
| ↳ intermediate_algebra_hard | 0.12 | 0.54 | +0.42 |
| ↳ num_theory_hard | 0.40 | 0.90 | +0.50 |
| ↳ prealgebra_hard | 0.36 | 0.74 | +0.38 |
| ↳ precalculus_hard | 0.08 | 0.44 | +0.36 |
| minerva_math500 (math_verify) | 0.380 | 0.900 | +0.520 |
| AIME 2024 | 0.000 | 0.033 | +0.033 |
| AIME 2025 | 0.000 | 0.067 | +0.067 |
Note on AIME scores: the exact_match grader expects answers in `\boxed{}`; both A4 and base sometimes write the correct answer in prose without the box, scoring 0 on that metric. Spot-checks confirm base often arrives at the right answer but is graded down. The robust apples-to-apples comparisons here are leaderboard_math_hard (which uses an extraction filter) and minerva_math500 with math_verify (sympy normalization).
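A lenient extractor illustrates the gap (a sketch; the official graders use stricter extraction plus sympy normalization, and this regex ignores nested braces):

```python
import re

def extract_answer(text: str) -> str | None:
    # Prefer \boxed{...}; fall back to the last number in the reply,
    # which is where the unboxed prose answers land.
    m = re.search(r"\\boxed\{([^{}]*)\}", text)
    if m:
        return m.group(1).strip()
    nums = re.findall(r"-?\d+(?:\.\d+)?", text)
    return nums[-1] if nums else None
```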
Easy tier (knowledge-bound; A4 trades general recall for reasoning-depth compression):
| Easy tier | Deimos-A4 | Notes |
|---|---|---|
| gsm8k_cot strict | 0.60 | elementary arithmetic — base > A4 here |
| gsm8k_cot flex | 0.82 | |
| mmlu_pro | 0.587 | knowledge-bound — base > A4 |
| ifeval prompt_loose | 0.58 | format rules conflict with concise style |
| ifeval prompt_strict | 0.50 | |
## Limitations & License
A4 is a specialist, not a base replacement. Known limitations:
- Knowledge regression vs base. mmlu_pro and ifeval drop because compression doesn't help where the answer comes from pretraining recall or strict format rules. Use the base model for those.
- Compact fragments must stay internal. The model is trained to put concise reasoning inside `<think>` only; final user-facing prose stays clean. If you strip the chat template or force generation outside the think block, results may regress.
- Best on hard math. AIME 2024/25 scores are still low in absolute terms (3-7%), as expected for a 4B model; the proof here is the relative gain over base, consistent at +28-50 pt across math_hard subjects.
- 4B pretraining ceiling. The model cannot exceed Qwen3.5-4B's underlying knowledge; it only extends the reasoning-depth budget. For higher capability, see the upcoming Europa model line (9B base).
This model is released under the Apache 2.0 License, inherited from the Qwen3.5-4B base.