qwen3.5-4b-code-forged

26.6% lower perplexity than the unmodified baseline. Forged from Qwen/Qwen3.5-4B for code tasks.

Not quantized. Not distilled. Structurally reshaped.

The architecture co-evolves with training: heads that contribute to the domain specialize, heads that don't are removed. The result is a model architecturally optimized for its task — like biological synaptic pruning during brain development.

Results

| Metric | Value |
|---|---|
| Base Model | Qwen/Qwen3.5-4B |
| Baseline Perplexity | 3.04 |
| Forged Perplexity | 2.23 |
| Improvement | +26.6% |
| Domain | code |
| Training Data | m-a-p/CodeFeedback-Filtered-Instruction |
| Strategy | experiential_plasticity |
| Pruning Level | 45% |
| Cycles | 3 |
| Steps/Cycle | 500 |

Runs On

| Device | Format | Verified |
|---|---|---|
| MacBook Pro 16GB | fp16 | Yes |
| MacBook Pro 32GB | fp16 | Yes |

These models are designed for consumer hardware. No A100s required. Your MacBook, your gaming PC, your home server.

Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "continuum-ai/qwen3.5-4b-code-forged",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("continuum-ai/qwen3.5-4b-code-forged")

inputs = tokenizer("Write a Python decorator that caches results:", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Forge Your Own

Three commands. Any NVIDIA GPU with 8GB+ VRAM.

```shell
git clone https://github.com/CambrianTech/sentinel-ai && cd sentinel-ai && ./setup.sh
source .venv/bin/activate
python scripts/forge_model.py Qwen/Qwen3.5-4B --domain code
```

The forge script auto-detects your GPU, picks the right memory tier (fp16 / 4-bit NF4), trains with LoRA + AMP, prunes attention heads, defrags, and saves. Progress is observable via status.json.
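A long forge run can be monitored by polling status.json from a separate process. A minimal sketch follows; the keys used below (`cycle`, `step`, `ppl`) are illustrative assumptions, since this card does not document the script's actual schema.

```python
import json
import pathlib
import tempfile

def read_status(path):
    """Return the contents of the forge's status.json, or None if it
    doesn't exist yet. Key names are assumed, not documented."""
    p = pathlib.Path(path)
    return json.loads(p.read_text()) if p.exists() else None

# Demo against a stand-in file rather than a live forge run:
tmp = pathlib.Path(tempfile.mkdtemp()) / "status.json"
tmp.write_text(json.dumps({"cycle": 2, "step": 140, "ppl": 2.31}))

status = read_status(tmp)
print(f"cycle {status['cycle']}, step {status['step']}, ppl {status['ppl']}")
```

In practice you would call `read_status` in a loop (or from a dashboard) against the file the forge script writes into the working directory.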

The Science: Experiential Plasticity

Traditional model compression (quantization, distillation) makes models smaller but worse. Experiential Plasticity makes them smaller AND better.

How It Works

  1. Train on domain-specific data (LoRA + AMP mixed precision)
  2. Measure each attention head's information contribution (entropy-based importance)
  3. Prune the lowest-contributing heads
  4. Retrain on the same domain data — surviving heads specialize and compensate
  5. Defrag — structurally remove dead heads, free VRAM
  6. Repeat — each cycle the model improves on its domain
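Steps 2 and 3 can be sketched in a few lines. The scoring rule below (normalized attention entropy, where peaked heads score high and near-uniform heads score near zero) is an illustrative stand-in; the paper's exact importance metric may differ.

```python
import numpy as np

def head_importance(attn, eps=1e-12):
    """Entropy-based importance for attention heads.

    attn: array of shape (heads, queries, keys), each row a softmax
    distribution over keys. Returns one score per head in [0, 1]:
    1 = sharply peaked (informative), 0 = uniform (redundant)."""
    p = np.clip(attn, eps, 1.0)
    entropy = -(p * np.log(p)).sum(axis=-1).mean(axis=-1)  # (heads,)
    return 1.0 - entropy / np.log(attn.shape[-1])

def select_heads_to_prune(scores, pruning_level=0.45):
    """Indices of the lowest-contributing heads at the card's 45% level."""
    k = int(len(scores) * pruning_level)
    return np.argsort(scores)[:k]

# Toy example: 8 heads, 4 queries, 16 keys; later heads get sharper logits.
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 4, 16)) * np.arange(1, 9)[:, None, None]
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

scores = head_importance(attn)
pruned = select_heads_to_prune(scores, pruning_level=0.45)
print("importance:", np.round(scores, 3))
print("prune heads:", sorted(pruned.tolist()))
```

Steps 4-6 then retrain the surviving heads on the same data and repeat the measurement, so the ranking reflects the specialized model rather than the original one.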

Scaling Law

Larger models harbor more architectural redundancy. Plasticity exploits this — bigger models benefit more:

| Model | Params | Domain | Improvement |
|---|---|---|---|
| Qwen2.5-0.5B | 0.5B | General | -3.2% (too small to prune) |
| Qwen2.5-1.5B | 1.5B | General | +3.0% |
| Qwen2.5-7B | 7.6B | General | +11.8% |
| Qwen3.5-4B | 3.4B | Code | +24.0% |
| Qwen3.5-27B | 23.6B | Code | +3.5% (4-bit, runs in 17GB) |

Domain-specific training amplifies the effect. Qwen3.5-4B on code (+24%) exceeds Qwen2.5-7B on generic text (+11.8%) despite being a smaller model.

Transfer Function

Recovery from iterative pruning follows a measurable exponential decay:

recovery = 1.45 * exp(-0.18 * cycle) - 0.03

This connects transformer optimization to classical control theory — the same mathematics used in electrical engineering and robotics for decades. A PID controller can manage the entire forging process with zero human hyperparameters.
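The fitted curve and the control loop can be made concrete. The PID gains and the idea of feeding predicted recovery back into the prune rate are illustrative assumptions, not values from the paper; only the transfer function itself comes from this card.

```python
import math

def recovery(cycle):
    """Transfer function from the card: recovery = 1.45*exp(-0.18*cycle) - 0.03."""
    return 1.45 * math.exp(-0.18 * cycle) - 0.03

class PID:
    """Textbook PID controller; kp/ki/kd here are illustrative, not tuned."""
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = None

    def update(self, measurement, dt=1.0):
        error = self.setpoint - measurement
        self.integral += error * dt
        derivative = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Hypothetical loop: steer the next cycle's prune rate toward full recovery.
controller = PID(kp=0.5, ki=0.05, kd=0.1, setpoint=1.0)
for cycle in range(1, 4):
    r = recovery(cycle)
    adjustment = controller.update(r)
    print(f"cycle {cycle}: predicted recovery {r:.3f}, prune-rate adjustment {adjustment:+.3f}")
```

Because recovery decays predictably with cycle count, a controller like this can decide when further pruning stops paying off without any human-chosen schedule.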

Continuous Defrag

Traditional pruning masks heads but doesn't free memory. Continuous defrag structurally removes dead heads between cycles:

```
Cycle 1: train (batch=1, 27B, 17.9GB)   -> prune -> defrag -> freed 1.7GB
Cycle 2: train (batch=2, 24.5B, 16.2GB) -> prune -> defrag -> freed 1.7GB  (2x faster)
Cycle 3: train (batch=3, 22B, 14.5GB)   -> prune -> defrag                 (2.8x faster)
```

40% faster total training and a 33% smaller final model.
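The difference between masking and defragging is easy to see on a single projection matrix: masking zeroes rows but keeps them allocated, while defrag slices them out so the tensor actually shrinks. A simplified sketch (the real per-layer weight layout is more involved than the stacked-rows stand-in below):

```python
import numpy as np

def defrag_heads(W_qkv, W_out, head_mask, head_dim):
    """Structurally remove pruned heads instead of just zeroing them.

    W_qkv: (heads*head_dim, d_model) stacked per-head projection rows.
    W_out: (d_model, heads*head_dim) output projection columns.
    head_mask: boolean per head, True = keep. Simplified stand-in layout."""
    keep = np.repeat(head_mask, head_dim)  # expand head mask to row/col level
    return W_qkv[keep], W_out[:, keep]

heads, head_dim, d_model = 8, 4, 32
W_qkv = np.random.randn(heads * head_dim, d_model)
W_out = np.random.randn(d_model, heads * head_dim)
mask = np.array([True, True, False, True, False, True, True, False])  # 3 heads pruned

W_qkv2, W_out2 = defrag_heads(W_qkv, W_out, mask, head_dim)
print(W_qkv.nbytes, "->", W_qkv2.nbytes, "bytes")  # allocation genuinely shrinks
```

The freed memory is what lets later cycles run larger batches, which is where the training speedup in the log above comes from.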

Head Mitosis

Pruning frees slots. Mitosis fills them. When a head is overutilized, it gets cloned into a pruned slot — each copy at 50% gate value to maintain output continuity. After continued training, the clones diverge and specialize, like cell differentiation after biological mitosis. The model grows new specialized capacity exactly where it's needed.
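The split-the-gate trick can be sketched directly. The gate semantics and array layout below are illustrative assumptions; the key invariant, that parent and clone together contribute exactly what the parent did before, is from the description above.

```python
import numpy as np

def mitosis(W_heads, gates, overused, free_slot, noise=1e-3, rng=None):
    """Clone an overutilized head into a freed slot.

    Parent and clone each get half the parent's gate, so the combined
    output is unchanged; tiny weight noise lets the twins diverge and
    specialize under continued training."""
    rng = rng or np.random.default_rng()
    W_heads, gates = W_heads.copy(), gates.copy()
    W_heads[free_slot] = W_heads[overused] + rng.normal(scale=noise, size=W_heads[overused].shape)
    gates[free_slot] = gates[overused] = gates[overused] / 2.0
    return W_heads, gates

W = np.random.randn(8, 4, 32)  # 8 head slots (toy shapes)
g = np.ones(8)
g[3] = 0.0                     # slot 3 was pruned, its gate is zero
W2, g2 = mitosis(W, g, overused=5, free_slot=3)
print("gates after mitosis:", g2)
```

Immediately after the split the two heads are near-identical at half strength; subsequent gradient steps are what differentiate them.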

Read the full paper: Experiential Plasticity: Transformers That Grow Their Own Architecture From Experience

Output Samples

Samples are normally generated by the forged model immediately after forging, with no cherry-picking or post-processing. No generation samples are available for this model.

Forging Metadata

```json
{
  "model": "Qwen/Qwen3.5-4B",
  "domain": "code",
  "strategy": "experiential_plasticity",
  "pruning_level": 0.45,
  "cycles": 3,
  "training_steps": 500,
  "baseline_ppl": 3.0382,
  "final_ppl": 2.2305,
  "improvement_pct": 26.58,
  "forged_at": "2026-03-28T04:48:47-0500",
  "device": "NVIDIA GeForce RTX 5090",
  "tier": "A",
  "load_4bit": false,
  "training_data": "m-a-p/CodeFeedback-Filtered-Instruction",
  "training_method": "LoRA (r=16, alpha=32)",
  "batch_size": 4,
  "grad_accum_steps": 2,
  "seq_len": 256,
  "cycle_results": [
    {
      "cycle": 1,
      "post_prune_ppl": 2.2001,
      "post_train_ppl": 2.2001,
      "improvement_vs_baseline_pct": 27.59
    },
    {
      "cycle": 2,
      "post_prune_ppl": 2.2839,
      "post_train_ppl": 2.2839,
      "improvement_vs_baseline_pct": 24.83
    },
    {
      "cycle": 3,
      "post_prune_ppl": 2.2305,
      "post_train_ppl": 2.2305,
      "improvement_vs_baseline_pct": 26.58
    }
  ],
  "hardware_targets": [
    {
      "device": "MacBook Pro 16GB",
      "format": "fp16",
      "verified": true
    },
    {
      "device": "MacBook Pro 32GB",
      "format": "fp16",
      "verified": true
    }
  ]
}
```

