
Deimos-A4

Satellite Class · Qwen3.5-4B · 4.7B · Apache 2.0

Overview

01

A 4B reasoning specialist with a terse, internal chain-of-thought. ~60% fewer tokens · ~36% faster · +40 pt average accuracy on hard math vs the Qwen3.5-4B base model, measured in a recommended-config head-to-head.

Built on Qwen/Qwen3.5-4B. Internally the model emits compact, fragment-style reasoning inside <think>...</think> blocks, then expands to a clean, professional response. The user-facing output never contains the compact internal style — compression lives entirely inside the <think> trace.

Trained via length-biased rejection sampling: 4,338 verified shortest-correct traces curated from a self-distillation of its native concise-reasoning predecessor (Deimos-A1). The "shortest correct" filter teaches the model to drop filler while preserving logical structure.

Use this model for: hard math (AIME, MATH-hard, MATH-500), multi-step proofs, long algebraic chains; anywhere the base model hits its 4096-token reasoning ceiling.
Skip it and use the base model for: general knowledge recall (MMLU), strict instruction-following format rules, and casual chat.

Specifications

02
Architecture
Type: Causal LM
Params: 4.66B
Base: Qwen3.5-4B
Context: 32,768
Format: Safetensors BF16

Training
Method: LoRA SFT + Merge
Examples: 4,338 shortest-correct
Framework: Unsloth + TRL
Final loss: 0.392 train / 0.418 eval

Usage

03

For best results, use the locked config below — derived via coordinate-descent sweep on minerva_math500 (math_verify metric). The chat template auto-prepends <think>; the model emits compact reasoning then </think> followed by the formal answer.
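The sweep is easy to outline: optimize one sampler knob at a time against the eval score, holding the rest fixed. A minimal sketch, with `evaluate` as a hypothetical stand-in for an lm-eval run on minerva_math500:

def coordinate_descent(grid, evaluate, start):
    """One pass of coordinate descent over sampler parameters."""
    best = dict(start)
    for param, candidates in grid.items():
        scores = {v: evaluate({**best, param: v}) for v in candidates}
        best[param] = max(scores, key=scores.get)   # lock in the best value
    return best

# e.g. grid = {"temperature": [0.0, 0.6, 1.0], "repetition_penalty": [1.0, 1.05, 1.10]}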

Recommended Samplers (locked)
Temperature: 0.0 (greedy)
Top-P: 0.95
Top-K: 20
Min-P: 0.0
Repetition penalty: 1.10
Presence penalty: 1.5
Max tokens: 8192
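The transformers quick start below covers the greedy settings; the full locked config maps one-to-one onto vLLM sampling parameters. A sketch using vLLM's offline API:

from vllm import LLM, SamplingParams

llm = LLM(model="Michael-Kozu/Deimos-A4")
params = SamplingParams(
    temperature=0.0,            # greedy
    top_p=0.95, top_k=20, min_p=0.0,
    repetition_penalty=1.10, presence_penalty=1.5,
    max_tokens=8192,
)
# outputs = llm.chat(msgs, params)   # msgs as in the quick start below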

Thinking Mode — Per-Task Routing

The chat template supports a runtime toggle enable_thinking that controls whether the model emits a <think> reasoning trace before its formal answer. Sweep results at the locked config (limit=50 each):

| Task | Thinking ON | Thinking OFF | Recommended |
|---|---|---|---|
| Hard math (math_hard, math500, AIME) | +0.40 avg vs base | | ON (default) |
| gsm8k flex (elementary math) | 0.86 | 0.80 | ON (+6 pt) |
| gsm8k strict | 0.30 | 0.30 | either |
| IFEval prompt_strict | 0.40 | 0.44 | OFF (+4 pt) |
| IFEval prompt_loose | 0.52 | 0.56 | OFF (+4 pt) |
| IFEval inst_strict | 0.54 | 0.58 | OFF (+4 pt) |
| IFEval inst_loose | 0.63 | 0.67 | OFF (+4 pt) |
| MMLU-Pro (knowledge MCQ) | 0.54 | 0.54 | either |

Routing rule: default to thinking ON for math, code, and open-ended reasoning. Switch to thinking OFF for strict-format instruction-following (length constraints, no-comma rules, exact-letter casing, etc.) — the model's compact reasoning style fights those rules when the trace counts toward output formatting.
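As a concrete sketch, routing can be a one-liner (the task labels here are illustrative, not part of the model API):

def use_thinking(task: str) -> bool:
    """Per-task routing: thinking OFF only for strict-format instruction following."""
    return task not in {"ifeval", "strict_format"}

# Pass the result as enable_thinking= to apply_chat_template (see quick start below).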

<p style="margin-top: 24px;">Quick start with <code>transformers</code> — math/reasoning (thinking ON, default):</p>
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Michael-Kozu/Deimos-A4", torch_dtype="auto", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Michael-Kozu/Deimos-A4")

msgs = [{"role": "user", "content":
    "A train leaves at 3pm at 60mph. Another leaves at 4pm at 80mph. When does the second catch the first?"
}]
inputs = tok.apply_chat_template(
    msgs, tokenize=True, return_tensors="pt", add_generation_prompt=True
).to(model.device)
# Locked recommended config (math/reasoning):
out = model.generate(inputs, max_new_tokens=8192,
                     do_sample=False,             # greedy (temperature=0)
                     repetition_penalty=1.10)
print(tok.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
<p style="margin-top: 24px;">For strict instruction-following (thinking OFF):</p>
inputs = tok.apply_chat_template(
    msgs, tokenize=True, return_tensors="pt",
    add_generation_prompt=True,
    chat_template_kwargs={"enable_thinking": False}   # routes around the <think> trace
).to(model.device)
<p style="margin-top: 24px;">vLLM serving — pass <code>chat_template_kwargs</code> per request via <code>extra_body</code>:</p>
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="x")
resp = client.chat.completions.create(
    model="a4",
    messages=[{"role":"user","content":"Answer in exactly 3 sentences..."}],
    temperature=0.0, max_tokens=8192,
    extra_body={"chat_template_kwargs": {"enable_thinking": False},
                "repetition_penalty": 1.10, "presence_penalty": 1.5}
)
<p style="margin-top: 24px;"><strong>Example output</strong> — the model reasons in compact fragments, replies in full prose:</p>
<think>
Train1 departs 3pm, 60mph.
Train2 departs 4pm, 80mph.
Speed diff: 80 - 60 = 20mph.
Head start: 1 hour * 60mph = 60 miles.
Catch time: 60 miles / 20mph = 3 hours.
3pm + 3 hours = 6pm.
</think>

The second train catches the first at 6pm.

Step 1: Set up the scenario
- Train 1 departs at 3:00 PM at 60 mph
- Train 2 departs at 4:00 PM at 80 mph (1 hour later)

Step 2: Calculate the head start distance ...
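When logging traces or hiding them from end users, a small helper can split a completion at the think boundary (a sketch, assuming the trace format shown above):

def split_trace(completion: str):
    """Separate the compact <think> trace from the user-facing answer."""
    think, sep, answer = completion.partition("</think>")
    if not sep:                       # no trace, e.g. thinking OFF
        return "", completion.strip()
    return think.replace("<think>", "").strip(), answer.strip()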

Quantizations

04

BF16 weights ship in this repo (9.3 GB). GGUF and quantized formats are forthcoming.

Safetensors: 9.3 GB, BF16 (this repo)
GGUF: TBD, coming soon

Training Details

05

A4 is built on the predecessor Deimos-A1 — a 4B model that already produces native concise <think> traces (Quark v1 SFT of Qwen3.5-4B). A4 takes A1's compression style and tightens it via a length-biased rejection sampling pass.

Pipeline

  • Prompts: 5,202 across math (gsm8k-style), MCQ (ARC-style science), IFEval-style instructions, code (humanevalplus + mbppplus)
  • Generation: A1 emits N=8 candidates per prompt via vLLM (temperature=1.0, top-p=0.95, top-k=20)
  • Filter (4-stage verification; see the sketch after this list):
    1. Structure: exactly one </think> + non-empty post-think output
    2. Correctness: per-task grader (numeric for math, letter for MCQ, IFEval rule grader, subprocess test for code)
    3. Length-bias: among correct candidates, keep the shortest think
    4. Result: 4,338 verified gold examples (median think 103 chars for math/code)
  • SFT: LoRA r=64 alpha=64, LR 3e-5, 1 epoch, batch=2 accum=8 (effective 16), packing=False (preserves <think> boundary)
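A minimal sketch of the selection logic (stages 1-3), with `grade` standing in for the per-task grader; an illustration, not the exact training code:

def shortest_correct(candidates, grade):
    """Length-biased rejection sampling: keep the shortest verified trace."""
    valid = []
    for c in candidates:
        think, sep, answer = c.partition("</think>")
        # Stage 1 (structure): exactly one </think> and a non-empty final answer
        if not sep or "</think>" in answer or not answer.strip():
            continue
        # Stage 2 (correctness): task-specific grader (numeric / letter / rules / tests)
        if grade(answer):
            valid.append((len(think), c))
    # Stage 3 (length bias): shortest think trace among the correct candidates
    return min(valid)[1] if valid else None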
Training Loss (1 epoch · 258 steps)
Training Config
Unsloth + TRL SFTTrainer
base_model: Deimos-A1 merged BF16 (Qwen3.5-4B + Quark v1 SFT)
adapter: lora
lora_r: 64
lora_alpha: 64
lora_dropout: 0.0
target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
sequence_len: 4096
per_device_train_batch_size: 2
gradient_accumulation_steps: 8
num_train_epochs: 1
learning_rate: 3e-5
lr_scheduler: cosine
warmup_ratio: 0.05
optim: adamw_torch
weight_decay: 0.01
bf16: true
packing: false  # preserves <think> boundary in attention
gradient_checkpointing: unsloth

Benchmarks

06

Both models were benchmarked with lm-eval-harness against a vLLM server, limit=50 per task. A4 uses its locked recommended config (temp=0.0, rep_penalty=1.10, presence_penalty=1.5, max_tokens=8192). Base uses the standard production budget (temp=0.6, max_tokens=4096): an apples-to-apples comparison in real-world deployment terms. Alibaba does not publish AIME / leaderboard_math_hard / minerva_math500 numbers for Qwen3.5-4B; the base column is our reproduction with an identical harness.

Note on Qwen's max-effort settings: we attempted Qwen's official recommended config (temp=1.0, presence=1.5, max_tokens=32768) on the base model. The bench reached 94% completion in 5 hours before crashing on a server-disconnect timeout; even from that partial run, base trended toward similar accuracy at 2.5× the wall-clock cost. We default to the standard 4K-budget comparison here, which is the more honest "ship config vs ship config" measurement for production deployments.

Token Efficiency & Speed

Per-problem efficiency on the hard math suite (median across 50 problems per task):

| Task | A4 chars | Base chars | Reduction |
|---|---|---|---|
| aime24 | 4,414 | 9,366 | −53% |
| aime25 | 3,003 | 8,928 | −66% |
| algebra_hard | 1,648 | 3,508 | −53% |
| counting_and_prob_hard | 2,598 | 7,984 | −67% |
| geometry_hard | 3,164 | 9,135 | −65% |
| intermediate_algebra_hard | 5,014 | 7,884 | −36% |
| num_theory_hard | 1,872 | 7,510 | −75% |
| prealgebra_hard | 1,668 | 7,126 | −77% |
| precalculus_hard | 4,244 | 8,600 | −51% |
| minerva_math500 | 1,410 | 3,792 | −63% |

Wall-clock per problem (vLLM, 4 concurrent requests):

| Batch | A4 avg time | Base avg time | Time reduction |
|---|---|---|---|
| math suite (350 reqs) | ~18 s | ~28 s | −36% |
| AIME (60 reqs) | ~16 s | ~30 s | −47% |
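For reference, a sketch of reproducing the wall-clock numbers against the vLLM server from the serving example above (4-way concurrency via a semaphore):

import asyncio, time
from openai import AsyncOpenAI

async def bench(prompts):
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="x")
    sem = asyncio.Semaphore(4)                      # 4 concurrent requests
    async def timed(prompt):
        async with sem:
            t0 = time.perf_counter()
            await client.chat.completions.create(
                model="a4", temperature=0.0, max_tokens=8192,
                messages=[{"role": "user", "content": prompt}])
            return time.perf_counter() - t0
    times = await asyncio.gather(*(timed(p) for p in prompts))
    return sum(times) / len(times)                  # avg seconds per problem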

Accuracy

Hard Math accuracy, Deimos-A4 vs Qwen3.5-4B base (chart; full numbers in the table below).
| Hard tier (depth-limited) | Qwen3.5-4B base | Deimos-A4 (locked) | Δ |
|---|---|---|---|
| leaderboard_math_hard avg | 0.263 | 0.660 | +0.397 |
| ↳ algebra_hard | 0.46 | 0.84 | +0.38 |
| ↳ counting/prob_hard | 0.22 | 0.68 | +0.46 |
| ↳ geometry_hard | 0.20 | 0.48 | +0.28 |
| ↳ intermediate_algebra_hard | 0.12 | 0.54 | +0.42 |
| ↳ num_theory_hard | 0.40 | 0.90 | +0.50 |
| ↳ prealgebra_hard | 0.36 | 0.74 | +0.38 |
| ↳ precalculus_hard | 0.08 | 0.44 | +0.36 |
| minerva_math500 (math_verify) | 0.380 | 0.900 | +0.520 |
| AIME 2024 | 0.000 | 0.033 | +0.033 |
| AIME 2025 | 0.000 | 0.067 | +0.067 |

Note on AIME scores: the exact_match grader expects answers in \boxed{}; both A4 and base sometimes write the correct answer in prose without the box, scoring 0 by that metric. Spot-checks confirm base often arrives at the right answer but is graded down. The robust apples-to-apples comparisons here are leaderboard_math_hard (uses extraction filter) and minerva_math500 with math_verify (sympy normalization).
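For illustration, a simplified boxed-answer extractor shows why prose answers score 0 (the regex is ours, not the harness's actual filter):

import re

def extract_boxed(text: str):
    # Content of the last \boxed{...}; nested braces not handled.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", text)
    return matches[-1] if matches else None

print(extract_boxed("The answer is 204."))    # None -> scored 0 even if correct
print(extract_boxed(r"so \boxed{204}."))      # '204' -> compared against gold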

Easy tier (knowledge-bound: A4 trades some general recall for depth compression):

| Easy tier | Deimos-A4 | Notes |
|---|---|---|
| gsm8k_cot strict | 0.60 | elementary arithmetic; base > A4 here |
| gsm8k_cot flex | 0.82 | |
| mmlu_pro | 0.587 | knowledge-bound; base > A4 |
| ifeval prompt_loose | 0.58 | format rules conflict with concise style |
| ifeval prompt_strict | 0.50 | |

Limitations & License

07

A4 is a specialist, not a base replacement. Known limitations:

  • Knowledge regression vs base. mmlu_pro and ifeval drop because compression doesn't help where the answer comes from pretraining recall or strict format rules. Use the base model for those.
  • Compact fragments must stay internal. The model is trained to put concise reasoning inside <think> only — final user-facing prose stays clean. If you strip the chat template or force generation outside the think block, results may regress.
  • Best on hard math. AIME 2024/25 scores are still low in absolute terms (3-7%), as expected at 4B scale; the proof here is the relative gain over base, consistent at +28-50 pt across math_hard subjects.
  • 4B pretraining ceiling. Cannot exceed Qwen3.5-4B's underlying knowledge — only its reasoning depth budget. For higher capability, see the upcoming Europa model line (9B base).

This model is released under the Apache 2.0 License, inherited from the Qwen3.5-4B base.

Kozu AI · Turning the laws of reality into unparalleled creation.