Five small language model variants trained for 12k steps on a 300M token mixed corpus, answering one question: can the residual stream be used to slightly rewrite the model's own computation while it's running?
Instead of a fixed W_v for every context, a TinyHeadTransformer hypernetwork generates low-rank (LoRA-style) updates to the value projection of each attention head, conditioned on the current residual stream. Each token gets a dynamically adapted value transformation.
The Five Models
- Base GPT: 28.9M params, 139 tok/s, val loss ~3.82
- Matched GPT (+2 layers): 30.5M params, 204 tok/s, val loss ~3.80
- Adaptive GPT: 30.5M params, 38.7 tok/s, val loss ~3.88-3.92
- Diffusion GPT: 28.9M params, 110 tok/s, val loss ~5.0-5.2
- Adaptive Diffusion GPT: 30.5M params, 40.4 tok/s, val loss ~5.0-5.2
For each attention head, a TinyHeadTransformer encodes the head's residual stream slice, mean-pools it to a conditioning vector, then projects into low-rank factors A (d×r) and B (r×d) at rank r=8. The dynamic value update follows LoRA conventions with alpha/r scaling. B is zero-initialized so the adaptive path starts inert and the model begins as a vanilla GPT, which is critical for training stability.
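The per-head mechanism can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the actual code: the encoder is stood in for by a single tanh layer, and all names (W_enc, W_A, W_B) and dimensions are assumptions. It does show the load-bearing parts: mean-pooled conditioning, low-rank A/B generation, alpha/r scaling, and the zero-initialized B that makes the adaptive path start inert.

```python
import numpy as np

d, r, T = 16, 8, 10    # head dim, LoRA rank, sequence length (illustrative)
alpha = 16.0           # LoRA scaling numerator (assumed value)
rng = np.random.default_rng(0)

W_v   = rng.normal(0, 0.02, (d, d))      # static value projection
W_enc = rng.normal(0, 0.02, (d, d))      # stand-in for the TinyHeadTransformer encoder
W_A   = rng.normal(0, 0.02, (d, d * r))  # conditioning vector -> factor A (d x r)
W_B   = np.zeros((d, r * d))             # zero-init: B starts at 0, path is inert

def dynamic_value(x):
    """x: (T, d) residual-stream slice for one head -> (T, d) values."""
    h = np.tanh(x @ W_enc)            # encode the slice
    c = h.mean(axis=0)                # mean-pool to a conditioning vector
    A = (c @ W_A).reshape(d, r)       # low-rank factor A (d x r)
    B = (c @ W_B).reshape(r, d)       # low-rank factor B (r x d), zero at init
    delta = (alpha / r) * (A @ B)     # LoRA-style update with alpha/r scaling
    return x @ (W_v + delta).T        # per-token adapted value transform

x = rng.normal(size=(T, d))
# Because B is zero-initialized, the model starts as a vanilla GPT:
assert np.allclose(dynamic_value(x), x @ W_v.T)
```

Once W_B receives gradient updates, B becomes nonzero and each context produces a different delta, which is exactly the "rewrite its own computation while running" behavior the experiment tests.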
The diffusion variant uses bidirectional attention, RMSNorm, squared ReLU, and a learned timestep embedding.
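Two of these components are easy to pin down concretely. The sketch below shows minimal NumPy versions of RMSNorm and squared ReLU; the epsilon and shapes are illustrative assumptions, not taken from the source.

```python
import numpy as np

def rms_norm(x, gain=1.0, eps=1e-6):
    # Normalize by root-mean-square over the last axis (no mean-centering,
    # unlike LayerNorm), then apply a learned gain.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return gain * x / rms

def squared_relu(x):
    # max(0, x)^2: smooth at zero, commonly used as an MLP activation.
    return np.maximum(x, 0.0) ** 2

x = np.array([[-1.0, 0.0, 2.0]])
print(squared_relu(x))   # [[0. 0. 4.]]
print(rms_norm(x))       # output has unit RMS per row
```

Bidirectional attention is just the same attention computation with the causal mask removed, which is what lets the diffusion variant condition every position on the full sequence at each denoising step.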