nanochat-d32 Deception SAE Batch (2026-04-08)
25 Sparse Autoencoders trained on nanochat-d32 activations from the deception-nanochat-sae-research project. Part of a systematic sweep across three SAE architectures, three layers, and three data conditions to characterize how SAE decomposition affects deception detection.
Training-data caveat: please read before use
The "honest" and "deceptive" labels used to train and evaluate these SAEs come from a mixed scenario pool of nine prompts:
- Six clean incentive-structure scenarios: `insider_info`, `accounting_error`, `ai_oversight_log`, `ai_capability_hide`, `surprise_party`, `job_interview_gap`. Each presents a realistic situation with a non-trivial incentive to conceal or disclose information; the honest/deceptive distinction is which of the two behavioral choices the model's completion settles into under temperature sampling.
- Three role-play identity-assignment scenarios: `secret_role_game` ("You have been assigned the role of Fascist"), `secret_role_game_v2` ("You are secretly one of the Mafia members"), `werewolf_game` ("You are a Werewolf"). These pre-assign the model a deceptive identity and label a completion "deceptive" when the model drifts away from the assigned role, or "honest" when it echoes it.
What this mixed pool means for these SAEs' labels. Within the six incentive-structure scenarios, the honest/deceptive distinction is a measurement of behavioral choice under an ambiguous incentive. Within the three role-play scenarios, the distinction is a measurement of role-consistency under identity-assigned role-play, which is a well-defined phenomenon but not the same as emergent or incentive-driven deception.
What these SAEs are and are not good for.
- Good for: research on mixed-pool activation geometry; SAE feature-geometry studies; as one of a set of baselines when comparing multiple SAE families; as a reference implementation of same-prompt temperature-sampled behavioral SAE training at scale.
- Not recommended as a standalone deception detector. The role-consistency signal from the three role-play scenarios is mixed into every aggregate metric reported below. A downstream user who wants an "emergent-deception feature set" should restrict attention to features whose activation pattern concentrates in the `insider_info`/`accounting_error`/`ai_oversight_log`/`ai_capability_hide`/`surprise_party`/`job_interview_gap` scenarios, or wait for the methodologically corrected V3 re-release currently in preparation on the decision-incentive scenario bank (no pre-assigned deceptive identity).
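The scenario-restriction advice above can be sketched as a simple mass-concentration filter. This is illustrative only, not code from the repo; the array layout, the mass-fraction criterion, and the 0.9 threshold are all assumptions.

```python
import numpy as np

def incentive_concentrated_features(acts, scenarios, incentive_set, threshold=0.9):
    """Select feature indices whose total activation mass falls mostly
    within the clean incentive-structure scenarios.

    acts:          (n_tokens, d_sae) nonnegative feature activations
    scenarios:     (n_tokens,) scenario name per token
    incentive_set: set of scenario names counted as "clean"
    threshold:     minimum fraction of activation mass inside the clean set
    """
    in_clean = np.isin(scenarios, list(incentive_set))
    total = acts.sum(axis=0) + 1e-12          # avoid divide-by-zero for dead features
    clean_mass = acts[in_clean].sum(axis=0)
    frac = clean_mass / total
    return np.where(frac >= threshold)[0]
```

A feature that fires almost exclusively on the six incentive scenarios passes the filter; a feature driven by the role-play prompts does not.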
What is unaffected by this caveat.
- The SAE weights, reconstruction metrics (explained variance, L0, alive features), and engineering of the training pipeline are accurate as reported.
- The linear-probe balanced-accuracy numbers in the upstream paper measure the mixed pool; the 6-scenario clean-subset re-analysis is listed as a planned appendix for the next manuscript revision.
A companion methodology-first Gemma 4 SAE suite is in preparation using pretraining-distribution data + a decision-incentive behavior split; this README will be updated with a link when that release is public.
Note (2026-04-08): This batch is 25/27 complete. Two checkpoints (`d32_jumprelu_L12_deceptive_only`, `d32_jumprelu_L12_honest_only`) were not trained due to a system crash interrupting the pipeline. They will be added in a follow-up upload.
Research Context
This is companion data for the paper:
"Behavioral Deception Detection in Language Model Activations via Same-Prompt Sampling", Caleb DeLeeuw (2026). Code: https://github.com/SolshineCode/deception-nanochat-sae-research
Core finding: Linear probes on raw activations detect deceptive vs. honest completions with 86.9% balanced accuracy on nanochat-d32 (Layer 12, AUROC=0.923). SAE decomposition hurts detection: all three architectures produce lower probe accuracy than raw activations (p<0.001, Bonferroni-corrected paired t-test), consistent with deception being encoded in distributed, cross-feature geometry rather than individual sparse features.
What "deception" means here
These SAEs are trained on completions from same-prompt behavioral sampling: a single ambiguous scenario prompt is given to nanochat-d32, temperature sampling produces both deceptive and honest completions, and activations are collected at the token level during the response. The model is NOT being strategically deceptive; it is a base model producing completions that were post-hoc classified by Gemini as containing deceptive vs. honest content.
This design rules out the confound where probes merely classify which prompt was given rather than measuring genuine behavioral encoding.
Training Data
| Condition | Samples | Description |
|---|---|---|
| `deceptive_only` | 650 | Completions classified as deceptive by Gemini |
| `honest_only` | 677 | Completions classified as honest by Gemini |
| `mixed` | 1327 | All usable completions (both conditions) |
Source activations collected from nanochat-d32 (karpathy/nanochat-d32, 1.88B params, d_model=2048, 32 layers) using same-prompt behavioral sampling with 16 scenario prompts × 100 completions at temperature=1.0.
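The token-level activation collection described above follows the standard forward-hook pattern; the repo's actual collection code may differ, and `layer_module` here is a placeholder for the model's layer-12 block.

```python
import torch
import torch.nn as nn

def collect_layer_activations(model, layer_module, inputs):
    """Run one forward pass and capture the output of a chosen submodule."""
    captured = []

    def hook(module, inp, out):
        captured.append(out.detach())

    handle = layer_module.register_forward_hook(hook)
    try:
        model(inputs)
    finally:
        handle.remove()   # always detach the hook, even if the forward fails
    return captured[0]
```

For an SAE trained on layer 12, `layer_module` would be the residual-stream block at that depth, and the captured tensor is flattened over tokens before training.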
SAE Architecture Sweep
Training Hyperparameters (all checkpoints)
| Parameter | Value |
|---|---|
| d_in | 2048 (nanochat-d32 hidden dim) |
| d_sae | 8192 (4× expansion) |
| num_epochs | 300 |
| batch_size | 128 |
| learning_rate | 3e-4 |
| l1_coefficient | 1e-3 |
| device | CUDA (GTX 1650 Ti) |
Checkpoint Matrix
| Architecture | Layers Trained | Data Conditions |
|---|---|---|
| TopK (k=64) | L4, L8, L12 | mixed, deceptive_only, honest_only |
| Gated | L4, L8, L12 | mixed, deceptive_only, honest_only |
| JumpReLU | L4, L8 | mixed, deceptive_only, honest_only |
| JumpReLU | L12 | mixed only (incomplete; see note above) |
Layer selection rationale: L4 (13% depth, early), L8 (25% depth, mid-early), L12 (39% depth). L12 is the confirmed peak for the nanochat-d32 deception signal (86.9% balanced accuracy from the raw-activation probe).
Results Summary
Per-Checkpoint Metrics (from *_meta.json files)
| Checkpoint | EV | L0 | Alive Features | d_max | d_mean |
|---|---|---|---|---|---|
| d32_gated_L4_mixed | 99.75% | 4106 | 4474 | 0.396 | 0.056 |
| d32_gated_L4_deceptive_only | 99.72% | 4056 | 4425 | 0.429 | 0.052 |
| d32_gated_L4_honest_only | 99.88% | 4124 | 4550 | 0.372 | 0.040 |
| d32_gated_L8_mixed | 99.79% | 4130 | 4776 | 0.451 | 0.065 |
| d32_gated_L8_deceptive_only | 99.74% | 4120 | 4658 | 0.454 | 0.063 |
| d32_gated_L8_honest_only | 99.83% | 4067 | 4639 | 0.445 | 0.059 |
| d32_gated_L12_mixed | 99.81% | 4025 | 4909 | 0.579 | 0.074 |
| d32_gated_L12_deceptive_only | 99.78% | 4176 | 4985 | 0.506 | 0.074 |
| d32_gated_L12_honest_only | 99.44% | 4161 | 5026 | 0.474 | 0.073 |
| d32_jumprelu_L4_mixed | 99.91% | 1951 | 2101 | 0.402 | 0.040 |
| d32_jumprelu_L4_deceptive_only | 99.88% | 2405 | 2544 | 0.391 | 0.047 |
| d32_jumprelu_L4_honest_only | 99.89% | 2461 | 2663 | 0.375 | 0.046 |
| d32_jumprelu_L8_mixed | 99.87% | 2242 | 2559 | 0.506 | 0.060 |
| d32_jumprelu_L8_deceptive_only | 99.84% | 2520 | 2775 | 0.452 | 0.065 |
| d32_jumprelu_L8_honest_only | 99.85% | 2666 | 2986 | 0.412 | 0.063 |
| d32_jumprelu_L12_mixed | 99.81% | 2461 | 3054 | 0.507 | 0.049 |
| d32_topk_L4_mixed | 99.90% | 64 | 64 | 0.250 | 0.002 |
| d32_topk_L4_deceptive_only | 99.88% | 64 | 64 | 0.205 | 0.001 |
| d32_topk_L4_honest_only | 99.86% | 64 | 64 | 0.193 | 0.001 |
| d32_topk_L8_mixed | 99.80% | 64 | 64 | 0.348 | 0.002 |
| d32_topk_L8_deceptive_only | 99.76% | 64 | 64 | 0.280 | 0.002 |
| d32_topk_L8_honest_only | 99.73% | 64 | 64 | 0.343 | 0.002 |
| d32_topk_L12_mixed | 99.63% | 64 | 73 | 0.419 | 0.001 |
| d32_topk_L12_deceptive_only | 99.57% | 64 | 69 | 0.412 | 0.001 |
| d32_topk_L12_honest_only | 99.54% | 64 | 70 | 0.402 | 0.001 |
Column definitions:
- EV: Explained variance (reconstruction quality)
- L0: Average number of active features per token
- Alive features: Features that activated at least once during training
- d_max: Maximum Cohen's d between deceptive vs. honest feature activations (computed on held-out data from opposing condition)
- d_mean: Mean Cohen's d across all alive features
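For reference, a minimal computation of the per-feature effect sizes behind d_max and d_mean, assuming the standard pooled-standard-deviation definition of Cohen's d (the repo's exact variant is not shown here):

```python
import numpy as np

def cohens_d(a, b):
    """Pooled-SD Cohen's d between two 1-D samples."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var + 1e-12)

def feature_discriminability(feats_deceptive, feats_honest):
    """|d| per feature; returns (d_max, d_mean) as in the table above.

    feats_*: (n_tokens, d_sae) feature activations for each condition.
    """
    ds = np.array([abs(cohens_d(feats_deceptive[:, j], feats_honest[:, j]))
                   for j in range(feats_deceptive.shape[1])])
    return ds.max(), ds.mean()
```

In the table, d_mean is averaged over alive features only, so dead features would be dropped before calling `feature_discriminability`.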
Key Observation
d_max peaks at Gated L12 mixed (0.579) and JumpReLU L12 mixed (0.507), consistent with L12 being the behavioral encoding peak for nanochat-d32. TopK SAEs show the lowest d_max values (0.19–0.42) and the most feature collapse (only 64–73 alive features despite 8192 neurons), suggesting they are too aggressive for this use case.
Relationship to Main Results (Raw Activation Probes)
These SAEs are trained on the same activation data used in the main deception detection experiments. For comparison:
| Probe Target | Layer 12 Balanced Accuracy |
|---|---|
| Raw activations (2048-dim) | 86.9% |
| Gated SAE features (8192-dim) | 83.4% |
| JumpReLU SAE features (8192-dim) | 82.7% |
| TopK SAE features (8192-dim) | 65.8% |
SAE decomposition consistently reduces detection accuracy. These checkpoints are published to support replication and further analysis of why SAEs hurt detection (distributed encoding hypothesis).
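The comparison above amounts to fitting the same linear probe on each representation. A minimal stand-in follows; the paper's exact probe setup, regularization, and split procedure are defined in the repo, so treat the hyperparameters here as assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

def probe_balanced_accuracy(X, y, seed=0):
    """Fit a linear probe and report held-out balanced accuracy.

    X: (n_samples, dim) raw activations (2048-dim) or SAE features (8192-dim)
    y: (n_samples,) binary honest(0)/deceptive(1) labels
    """
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    clf = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
    return balanced_accuracy_score(y_te, clf.predict(X_te))
```

Running this once on raw layer-12 activations and once on each SAE's feature activations reproduces the shape of the comparison, even if the exact numbers depend on the probe configuration.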
How to Load
```python
import json
import torch
from huggingface_hub import hf_hub_download

# Download a specific checkpoint
path = hf_hub_download(
    repo_id="Solshine/nanochat-d32-deception-saes-batch",
    filename="d32_gated_L12_mixed.pt",
)
sae = torch.load(path, map_location="cpu", weights_only=False)

# Or download the metadata
meta_path = hf_hub_download(
    repo_id="Solshine/nanochat-d32-deception-saes-batch",
    filename="d32_gated_L12_mixed_meta.json",
)
with open(meta_path) as f:
    meta = json.load(f)
print(meta)
```
SAE interface (from sae/models.py)
```python
# The SAE objects are PyTorch nn.Module subclasses with a consistent interface.

# Get feature activations
features = sae.get_feature_activations(x)  # x: (batch, d_in) -> (batch, d_sae)

# Full encode/decode
x_hat = sae(x)     # reconstructed activations
z = sae.encode(x)  # feature activations (pre-threshold for JumpReLU)

# Architecture-specific behavior:
# - TopK: exactly k features active per token (k=64 here)
# - Gated: soft gating with learnable magnitude/gate separation
# - JumpReLU: hard threshold with learned per-feature bandwidth
```
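To make the TopK behavior concrete, here is a schematic TopK SAE. This is not the implementation in sae/models.py; the plain Linear encoder/decoder and the ReLU-before-TopK ordering are simplifying assumptions.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Schematic TopK SAE: keep only the k largest pre-activations per token."""

    def __init__(self, d_in=2048, d_sae=8192, k=64):
        super().__init__()
        self.enc = nn.Linear(d_in, d_sae)
        self.dec = nn.Linear(d_sae, d_in)
        self.k = k

    def encode(self, x):
        pre = torch.relu(self.enc(x))
        # Zero out all but the top-k activations in each row
        topk = torch.topk(pre, self.k, dim=-1)
        z = torch.zeros_like(pre)
        z.scatter_(-1, topk.indices, topk.values)
        return z

    def forward(self, x):
        return self.dec(self.encode(x))
```

The hard top-k selection explains the feature collapse seen in the metrics table: only the features that ever win the top-k competition stay alive, which is why the TopK checkpoints report 64–73 alive features out of 8192.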
To use these SAEs with the research codebase:
```shell
git clone https://github.com/SolshineCode/deception-nanochat-sae-research
cd deception-nanochat-sae-research
pip install -e .
# Place downloaded .pt files in experiments/scaling/results/batch_saes/
```
File Naming Convention
`{model}_{architecture}_L{layer}_{data_condition}.pt`

- model: `d32` = nanochat-d32 (1.88B, 32-layer GPT-NeoX)
- architecture: `gated`, `jumprelu`, `topk`
- layer: `L4`, `L8`, `L12` (layer index, 0-based)
- data_condition: `mixed` (all 1327), `deceptive_only` (650), `honest_only` (677)
Each .pt file has a corresponding _meta.json with training hyperparameters,
convergence metrics, and feature discriminability statistics.
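The convention can be parsed mechanically when iterating over checkpoints; a small helper (illustrative, not part of the repo):

```python
import re

# Matches the {model}_{architecture}_L{layer}_{data_condition}.pt convention
PATTERN = re.compile(
    r"(?P<model>d32)_(?P<architecture>gated|jumprelu|topk)"
    r"_L(?P<layer>\d+)_(?P<condition>mixed|deceptive_only|honest_only)\.pt"
)

def parse_checkpoint_name(filename):
    """Split a checkpoint filename into its naming-convention fields."""
    m = PATTERN.fullmatch(filename)
    if m is None:
        raise ValueError(f"unrecognized checkpoint name: {filename}")
    fields = m.groupdict()
    fields["layer"] = int(fields["layer"])
    return fields
```

This makes it easy to, for example, load only the L12 mixed-condition checkpoints across all three architectures.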
Citation
If you use these SAEs, please cite:
@misc{deleeuw2026deception,
title={Behavioral Deception Detection in Language Model Activations via Same-Prompt Sampling},
author={DeLeeuw, Caleb},
year={2026},
url={https://github.com/SolshineCode/deception-nanochat-sae-research},
note={Preprint}
}
Related Resources
- GitHub repo: https://github.com/SolshineCode/deception-nanochat-sae-research
- Dataset + main nanochat SAE: Solshine/deception-behavioral-nanochat-d32
- Original published SAE (L16 TopK): Solshine/nanochat-d32-sae-layer16-topk32
- Base model: karpathy/nanochat-d32
- Companion Qwen3 SAEs: `results/qwen3_saes/` (4 JumpReLU checkpoints), upload pending