Five small language model variants trained for 12k steps on a 300M token mixed corpus, answering one question: can the residual stream be used to slightly rewrite the model's own computation while it's running?
Instead of a fixed W_v for every context, a TinyHeadTransformer hypernetwork generates low-rank (LoRA-style) updates to the value projection of each attention head, conditioned on the current residual stream. Each token gets a dynamically adapted value transformation.
The Five Models
- Base GPT: 28.9M params, 139 tok/s, val loss ~3.82
- Matched GPT (+2 layers): 30.5M params, 204 tok/s, val loss ~3.80
- Adaptive GPT: 30.5M params, 38.7 tok/s, val loss ~3.88-3.92
- Diffusion GPT: 28.9M params, 110 tok/s, val loss ~5.0-5.2
- Adaptive Diffusion GPT: 30.5M params, 40.4 tok/s, val loss ~5.0-5.2
For each attention head, a TinyHeadTransformer encodes the head's residual stream slice, mean-pools it to a conditioning vector, then projects into low-rank factors A (d×r) and B (r×d) at rank r=8. The dynamic value update follows LoRA conventions with alpha/r scaling. B is zero-initialized so the adaptive path starts inert and the model begins as a vanilla GPT, which is critical for training stability.
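The per-head mechanism can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the actual code: the encoder is stood in for by a single tanh layer, and all names (W_enc, W_A, W_B) and dimensions are assumptions. It does show the load-bearing parts: mean-pooled conditioning, low-rank A/B generation, alpha/r scaling, and the zero-initialized B that makes the adaptive path start inert.

```python
import numpy as np

d, r, T = 16, 8, 10    # head dim, LoRA rank, sequence length (illustrative)
alpha = 16.0           # LoRA scaling numerator (assumed value)
rng = np.random.default_rng(0)

W_v   = rng.normal(0, 0.02, (d, d))      # static value projection
W_enc = rng.normal(0, 0.02, (d, d))      # stand-in for the TinyHeadTransformer encoder
W_A   = rng.normal(0, 0.02, (d, d * r))  # conditioning vector -> factor A (d x r)
W_B   = np.zeros((d, r * d))             # zero-init: B starts at 0, path is inert

def dynamic_value(x):
    """x: (T, d) residual-stream slice for one head -> (T, d) values."""
    h = np.tanh(x @ W_enc)            # encode the slice
    c = h.mean(axis=0)                # mean-pool to a conditioning vector
    A = (c @ W_A).reshape(d, r)       # low-rank factor A (d x r)
    B = (c @ W_B).reshape(r, d)       # low-rank factor B (r x d), zero at init
    delta = (alpha / r) * (A @ B)     # LoRA-style update with alpha/r scaling
    return x @ (W_v + delta).T        # per-token adapted value transform

x = rng.normal(size=(T, d))
# Because B is zero-initialized, the model starts as a vanilla GPT:
assert np.allclose(dynamic_value(x), x @ W_v.T)
```

Once W_B receives gradient updates, B becomes nonzero and each context produces a different delta, which is exactly the "rewrite its own computation while running" behavior the experiment tests.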
The diffusion variant uses bidirectional attention, RMSNorm, squared ReLU, and a learned timestep embedding.
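Two of these components are easy to pin down concretely. The sketch below shows minimal NumPy versions of RMSNorm and squared ReLU; the epsilon and shapes are illustrative assumptions, not taken from the source.

```python
import numpy as np

def rms_norm(x, gain=1.0, eps=1e-6):
    # Normalize by root-mean-square over the last axis (no mean-centering,
    # unlike LayerNorm), then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def squared_relu(x):
    # max(0, x)^2: smooth at zero, commonly used as an MLP activation.
    return np.maximum(x, 0.0) ** 2

x = np.array([[-1.0, 0.0, 2.0]])
print(squared_relu(x))   # [[0. 0. 4.]]
print(rms_norm(x))       # output has unit RMS per row
```

Bidirectional attention is just the same attention computation with the causal mask removed, which is what lets the diffusion variant condition every position on the full sequence at each denoising step.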