nanochat-d32 Deception SAE Batch (2026-04-08)
25 Sparse Autoencoders trained on nanochat-d32 activations from the deception-nanochat-sae-research project. Part of a systematic sweep across three SAE architectures, three layers, and three data conditions to characterize how SAE decomposition affects deception detection.
Training-data caveat: please read before use
The "honest" and "deceptive" labels used to train and evaluate these SAEs come from a mixed scenario pool of nine prompts:
- Six clean incentive-structure scenarios: `insider_info`, `accounting_error`, `ai_oversight_log`, `ai_capability_hide`, `surprise_party`, `job_interview_gap`. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
- Three role-play identity-assignment scenarios: `secret_role_game` ("You have been assigned the role of Fascist"), `secret_role_game_v2` ("You are secretly one of the Mafia members"), `werewolf_game` ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role, or "honest" when it echoes it.
What this mixed pool means for these SAEs' labels. Within the six incentive-structure scenarios, the honest/deceptive distinction is a measurement of behavioral choice under an ambiguous incentive. Within the three role-play scenarios, the distinction is a measurement of role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.
What these SAEs are and are not good for.
- Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
- Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the `insider_info`/`accounting_error`/`ai_oversight_log`/`ai_capability_hide`/`surprise_party`/`job_interview_gap` scenarios, or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
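The scenario-restriction advice above can be sketched as a simple mass-concentration filter. This is illustrative only, not code from the repo; the array layout, the mass-fraction criterion, and the 0.9 threshold are all assumptions.

```python
import numpy as np

def incentive_concentrated_features(acts, scenarios, incentive_set, threshold=0.9):
    """Select feature indices whose total activation mass falls mostly
    within the clean incentive-structure scenarios.

    acts:          (n_tokens, d_sae) nonnegative feature activations
    scenarios:     (n_tokens,) scenario name per token
    incentive_set: set of scenario names counted as "clean"
    threshold:     minimum fraction of activation mass inside the clean set
    """
    in_clean = np.isin(scenarios, list(incentive_set))
    total = acts.sum(axis=0) + 1e-12          # avoid divide-by-zero for dead features
    clean_mass = acts[in_clean].sum(axis=0)
    frac = clean_mass / total
    return np.where(frac >= threshold)[0]
```

A feature that fires almost exclusively on the six incentive scenarios passes the filter; a feature driven by the role-play prompts does not.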
What is unaffected by this caveat.
- The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.
A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data + a decision-incentive behavior split; this README will be updated with a link when that release is public.
Note (2026-04-08): This batch is 25/27 complete. Two checkpoints (`d32_jumprelu_L12_deceptive_only`, `d32_jumprelu_L12_honest_only`) were not trained due to a system crash interrupting the pipeline. They will be added in a follow-up upload.
Research Context
This is companion data for the paper:
"Behavioral Deception Detection in Language Model Activations via Same-Prompt Sampling", Caleb DeLeeuw (2026). Code: https://github.com/SolshineCode/deception-nanochat-sae-research
Core finding: Linear probes on raw activations detect deceptive vs. honest completions with 86.9% balanced accuracy on nanochat-d32 (Layer 12, AUROC=0.923). SAE decomposition hurts detection: all three architectures produce lower probe accuracy than raw activations (p<0.001, Bonferroni-corrected paired t-test), consistent with deception being encoded in distributed, cross-feature geometry rather than individual sparse features.
What "deception" means here
These SAEs are trained on completions from same-prompt behavioral sampling: a single ambiguous scenario prompt is given to nanochat-d32, temperature sampling produces both deceptive and honest completions, and activations are collected at the token level during the response. The model is NOT being strategically deceptive; it is a base model producing completions that were post-hoc classified by Gemini as containing deceptive vs. honest content.
This design rules out the confound where probes merely classify which prompt was given rather than measuring genuine behavioral encoding.
Training Data
| Condition | Samples | Description |
|---|---|---|
| `deceptive_only` | 650 | Completions classified as deceptive by Gemini |
| `honest_only` | 677 | Completions classified as honest by Gemini |
| `mixed` | 1327 | All usable completions (both conditions) |
Source activations collected from nanochat-d32 (karpathy/nanochat-d32, 1.88B params, d_model=2048, 32 layers) using same-prompt behavioral sampling with 16 scenario prompts × 100 completions at temperature=1.0.
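The token-level activation collection described above follows the standard forward-hook pattern; the repo's actual collection code may differ, and `layer_module` here is a placeholder for the model's layer-12 block.

```python
import torch
import torch.nn as nn

def collect_layer_activations(model, layer_module, inputs):
    """Run one forward pass and capture the output of a chosen submodule."""
    captured = []

    def hook(module, inp, out):
        captured.append(out.detach())

    handle = layer_module.register_forward_hook(hook)
    try:
        model(inputs)
    finally:
        handle.remove()   # always detach the hook, even if the forward fails
    return captured[0]
```

For an SAE trained on layer 12, `layer_module` would be the residual-stream block at that depth, and the captured tensor is flattened over tokens before training.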
SAE Architecture Sweep
Training Hyperparameters (all checkpoints)
| Parameter | Value |
|---|---|
| d_in | 2048 (nanochat-d32 hidden dim) |
| d_sae | 8192 (4× expansion) |
| num_epochs | 300 |
| batch_size | 128 |
| learning_rate | 3e-4 |
| l1_coefficient | 1e-3 |
| device | CUDA (GTX 1650 Ti) |
Checkpoint Matrix
| Architecture | Layers Trained | Data Conditions |
|---|---|---|
| TopK (k=64) | L4, L8, L12 | mixed, deceptive_only, honest_only |
| Gated | L4, L8, L12 | mixed, deceptive_only, honest_only |
| JumpReLU | L4, L8 | mixed, deceptive_only, honest_only |
| JumpReLU | L12 | mixed only (incomplete; see note above) |
Layer selection rationale: L4 (13% depth, early), L8 (25% depth, mid-early), L12 (39% depth). L12 is the confirmed peak for the nanochat-d32 deception signal (86.9% balanced accuracy from the raw-activation probe).
Results Summary
Per-Checkpoint Metrics (from *_meta.json files)
| Checkpoint | EV | L0 | Alive Features | d_max | d_mean |
|---|---|---|---|---|---|
| d32_gated_L4_mixed | 99.75% | 4106 | 4474 | 0.396 | 0.056 |
| d32_gated_L4_deceptive_only | 99.72% | 4056 | 4425 | 0.429 | 0.052 |
| d32_gated_L4_honest_only | 99.88% | 4124 | 4550 | 0.372 | 0.040 |
| d32_gated_L8_mixed | 99.79% | 4130 | 4776 | 0.451 | 0.065 |
| d32_gated_L8_deceptive_only | 99.74% | 4120 | 4658 | 0.454 | 0.063 |
| d32_gated_L8_honest_only | 99.83% | 4067 | 4639 | 0.445 | 0.059 |
| d32_gated_L12_mixed | 99.81% | 4025 | 4909 | 0.579 | 0.074 |
| d32_gated_L12_deceptive_only | 99.78% | 4176 | 4985 | 0.506 | 0.074 |
| d32_gated_L12_honest_only | 99.44% | 4161 | 5026 | 0.474 | 0.073 |
| d32_jumprelu_L4_mixed | 99.91% | 1951 | 2101 | 0.402 | 0.040 |
| d32_jumprelu_L4_deceptive_only | 99.88% | 2405 | 2544 | 0.391 | 0.047 |
| d32_jumprelu_L4_honest_only | 99.89% | 2461 | 2663 | 0.375 | 0.046 |
| d32_jumprelu_L8_mixed | 99.87% | 2242 | 2559 | 0.506 | 0.060 |
| d32_jumprelu_L8_deceptive_only | 99.84% | 2520 | 2775 | 0.452 | 0.065 |
| d32_jumprelu_L8_honest_only | 99.85% | 2666 | 2986 | 0.412 | 0.063 |
| d32_jumprelu_L12_mixed | 99.81% | 2461 | 3054 | 0.507 | 0.049 |
| d32_topk_L4_mixed | 99.90% | 64 | 64 | 0.250 | 0.002 |
| d32_topk_L4_deceptive_only | 99.88% | 64 | 64 | 0.205 | 0.001 |
| d32_topk_L4_honest_only | 99.86% | 64 | 64 | 0.193 | 0.001 |
| d32_topk_L8_mixed | 99.80% | 64 | 64 | 0.348 | 0.002 |
| d32_topk_L8_deceptive_only | 99.76% | 64 | 64 | 0.280 | 0.002 |
| d32_topk_L8_honest_only | 99.73% | 64 | 64 | 0.343 | 0.002 |
| d32_topk_L12_mixed | 99.63% | 64 | 73 | 0.419 | 0.001 |
| d32_topk_L12_deceptive_only | 99.57% | 64 | 69 | 0.412 | 0.001 |
| d32_topk_L12_honest_only | 99.54% | 64 | 70 | 0.402 | 0.001 |
Column definitions:
- EV: Explained variance (reconstruction quality)
- L0: Average number of active features per token
- Alive features: Features that activated at least once during training
- d_max: Maximum Cohen's d between deceptive vs. honest feature activations (computed on held-out data from opposing condition)
- d_mean: Mean Cohen's d across all alive features
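For reference, a minimal computation of the per-feature effect sizes behind d_max and d_mean, assuming the standard pooled-standard-deviation definition of Cohen's d (the repo's exact variant is not shown here):

```python
import numpy as np

def cohens_d(a, b):
    """Pooled-SD Cohen's d between two 1-D samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var + 1e-12)

def feature_discriminability(feats_deceptive, feats_honest):
    """|d| per feature; returns (d_max, d_mean) as in the table above.

    feats_*: (n_tokens, d_sae) feature activations for each condition.
    """
    ds = np.array([abs(cohens_d(feats_deceptive[:, j], feats_honest[:, j]))
                   for j in range(feats_deceptive.shape[1])])
    return ds.max(), ds.mean()
```

In the table, d_mean is averaged over alive features only, so dead features would be dropped before calling `feature_discriminability`.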
Key Observation
d_max peaks at Gated L12 mixed (0.579) and JumpReLU L12 mixed (0.507), consistent with L12 being the behavioral encoding peak for nanochat-d32. TopK SAEs show the lowest d_max values (0.19–0.42) and the most feature collapse (only 64–73 alive features despite 8192 neurons), suggesting they are too aggressive for this use case.
Relationship to Main Results (Raw Activation Probes)
These SAEs are trained on the same activation data used in the main deception detection experiments. For comparison:
| Probe Target | Layer 12 Balanced Accuracy |
|---|---|
| Raw activations (2048-dim) | 86.9% |
| Gated SAE features (8192-dim) | 83.4% |
| JumpReLU SAE features (8192-dim) | 82.7% |
| TopK SAE features (8192-dim) | 65.8% |
SAE decomposition consistently reduces detection accuracy. These checkpoints are published to support replication and further analysis of why SAEs hurt detection (distributed encoding hypothesis).
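The comparison above amounts to fitting the same linear probe on each representation. A minimal stand-in follows; the paper's exact probe setup, regularization, and split procedure are defined in the repo, so treat the hyperparameters here as assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def probe_balanced_accuracy(X, y, seed=0):
    """Fit a linear probe and report held-out balanced accuracy.

    X: (n_samples, dim) raw activations (2048-dim) or SAE features (8192-dim)
    y: (n_samples,) binary honest(0)/deceptive(1) labels
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    return balanced_accuracy_score(y_te, clf.predict(X_te))
```

Running this once on raw layer-12 activations and once on each SAE's feature activations reproduces the shape of the comparison, even if the exact numbers depend on the probe configuration.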
How to Load
```python
import json
import torch
from huggingface_hub import hf_hub_download

# Download a specific checkpoint
path = hf_hub_download(
    repo_id="Solshine/nanochat-d32-deception-saes-batch",
    filename="d32_gated_L12_mixed.pt",
)
sae = torch.load(path, map_location="cpu", weights_only=False)

# Or download the metadata
meta_path = hf_hub_download(
    repo_id="Solshine/nanochat-d32-deception-saes-batch",
    filename="d32_gated_L12_mixed_meta.json",
)
with open(meta_path) as f:
    meta = json.load(f)
print(meta)
```
SAE interface (from sae/models.py)
```python
# The SAE objects are PyTorch nn.Module subclasses with a consistent interface.

# Get feature activations
features = sae.get_feature_activations(x)  # x: (batch, d_in) -> (batch, d_sae)

# Full encode/decode
x_hat = sae(x)     # reconstructed activations
z = sae.encode(x)  # feature activations (pre-threshold for JumpReLU)

# Architecture-specific behavior:
# - TopK: exactly k features active per token (k=64 here)
# - Gated: soft gating with learnable magnitude/gate separation
# - JumpReLU: hard threshold with learned per-feature bandwidth
```
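To make the TopK behavior concrete, here is a schematic TopK SAE. This is not the implementation in sae/models.py; the plain Linear encoder/decoder and the ReLU-before-TopK ordering are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Schematic TopK SAE: keep only the k largest pre-activations per token."""

    def __init__(self, d_in=2048, d_sae=8192, k=64):
        super().__init__()
        self.enc = nn.Linear(d_in, d_sae)
        self.dec = nn.Linear(d_sae, d_in)
        self.k = k

    def encode(self, x):
        pre = torch.relu(self.enc(x))
        # Zero out all but the top-k activations in each row
        topk = torch.topk(pre, self.k, dim=-1)
        z = torch.zeros_like(pre)
        z.scatter_(-1, topk.indices, topk.values)
        return z

    def forward(self, x):
        return self.dec(self.encode(x))
```

The hard top-k selection explains the feature collapse seen in the metrics table: only the features that ever win the top-k competition stay alive, which is why the TopK checkpoints report 64–73 alive features out of 8192.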
To use these SAEs with the research codebase:
```shell
git clone https://github.com/SolshineCode/deception-nanochat-sae-research
cd deception-nanochat-sae-research
pip install -e .
# Place downloaded .pt files in experiments/scaling/results/batch_saes/
```
File Naming Convention
`{model}_{architecture}_L{layer}_{data_condition}.pt`

- model: `d32` = nanochat-d32 (1.88B, 32-layer GPT-NeoX)
- architecture: `gated`, `jumprelu`, `topk`
- layer: `L4`, `L8`, `L12` (layer index, 0-based)
- data_condition: `mixed` (all 1327), `deceptive_only` (650), `honest_only` (677)
Each .pt file has a corresponding _meta.json with training hyperparameters,
convergence metrics, and feature discriminability statistics.
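The convention can be parsed mechanically when iterating over checkpoints; a small helper (illustrative, not part of the repo):

```python
import re

# Matches the {model}_{architecture}_L{layer}_{data_condition}.pt convention
PATTERN = re.compile(
    r"(?P<model>d32)_(?P<architecture>gated|jumprelu|topk)"
    r"_L(?P<layer>\d+)_(?P<condition>mixed|deceptive_only|honest_only)\.pt"
)

def parse_checkpoint_name(filename):
    """Split a checkpoint filename into its naming-convention fields."""
    m = PATTERN.fullmatch(filename)
    if m is None:
        raise ValueError(f"unrecognized checkpoint name: {filename}")
    fields = m.groupdict()
    fields["layer"] = int(fields["layer"])
    return fields
```

This makes it easy to, for example, load only the L12 mixed-condition checkpoints across all three architectures.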
Citation
If you use these SAEs, please cite:
@misc{deleeuw2026deception,
title={Behavioral Deception Detection in Language Model Activations via Same-Prompt Sampling},
author={DeLeeuw, Caleb},
year={2026},
url={https://github.com/SolshineCode/deception-nanochat-sae-research},
note={Preprint}
}
Related Resources
- GitHub repo: https://github.com/SolshineCode/deception-nanochat-sae-research
- Dataset + main nanochat SAE: Solshine/deception-behavioral-nanochat-d32
- Original published SAE (L16 TopK): Solshine/nanochat-d32-sae-layer16-topk32
- Base model: karpathy/nanochat-d32
- Companion Qwen3 SAEs: `results/qwen3_saes/` (4 JumpReLU checkpoints), upload pending