# Diverse Deception Linear Probe for OLMo-3-32B-Think
Per-layer linear probes trained on OLMo-3-32B-Think residual stream activations to distinguish deceptive from honest text completions.
## Performance
| Variant | Token Selection | Follow-up Prompt | Averaged AUC | Layers Used |
|---|---|---|---|---|
| last-token-generation | last token of generation | "Did you respond truthfully?" → "Yes" | 0.996 | filtered by train loss < 0.6 |
| generation | mean over all generation tokens | none | 0.764 | all 64 layers |
AUC is computed on MBPP (hardcoded vs. correct code), using probe scores averaged across filtered layers (layers where probe train loss < 0.6).
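Given per-example scores averaged over the filtered layers, this metric is a standard ROC AUC. A dependency-light Mann-Whitney version, with illustrative toy scores (not real probe outputs), looks like:

```python
import numpy as np

def auc(labels: np.ndarray, scores: np.ndarray) -> float:
    """Mann-Whitney AUC: P(score of a positive > score of a negative), ties count 0.5."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))

# Toy data: rows are examples, columns are per-layer probe scores (illustrative)
layer_scores = np.array([[0.8, 1.1], [-0.5, -0.2], [0.3, 0.6], [-1.0, -0.7]])
labels = np.array([1, 0, 1, 0])  # 1 = hardcoded (deceptive), 0 = correct code
print(auc(labels, layer_scores.mean(axis=1)))  # perfectly separated here -> 1.0
```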
**Thinking model:** OLMo-3-32B-Think uses `<think>` tags. An empty thinking block is placed before completions so the model processes them as responses, not reasoning. The chat template handles this automatically with `enable_thinking=False`.
## Probe Architecture

Each layer has an independent `LinearProbe`:
| Component | Shape | Description |
|---|---|---|
| `linear.weight` | `[1, 5120]` | Classification weight vector |
| `linear.bias` | `[1]` | Classification bias |
| `input_scale` | scalar | Unit-norm scaling factor |
| `input_mean` | scalar | Input centering value |
| `platt_A`, `platt_B` | scalar | Platt calibration parameters |
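A minimal `nn.Module` consistent with this state_dict layout might look like the sketch below; the exact class in the training code may differ, and the Platt sign convention is an assumption:

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Per-layer probe matching the state_dict fields above (sketch, not the exact class)."""
    def __init__(self, d_model: int = 5120):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)                 # weight [1, d_model], bias [1]
        self.input_scale = nn.Parameter(torch.tensor(1.0))  # unit-norm scaling factor
        self.input_mean = nn.Parameter(torch.tensor(0.0))   # input centering value
        self.platt_A = nn.Parameter(torch.tensor(1.0))      # Platt slope (sign convention assumed)
        self.platt_B = nn.Parameter(torch.tensor(0.0))      # Platt offset

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Raw classification score; positive leans deceptive
        return self.linear((h - self.input_mean) / self.input_scale)

    def calibrated_prob(self, h: torch.Tensor) -> torch.Tensor:
        # Platt-calibrated probability of deception
        return torch.sigmoid(self.platt_A * self.forward(h) + self.platt_B)
```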
### File layout

```
<variant>/layer_N/
├── config.json   # probe class and d_model
└── model.pt      # PyTorch state_dict
```
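A hypothetical loader for one layer directory under this layout (the exact keys inside `config.json` are assumptions):

```python
import json
import torch

def load_probe_layer(layer_dir: str):
    """Load one probe layer directory: config.json + model.pt (layout shown above)."""
    with open(f"{layer_dir}/config.json") as f:
        config = json.load(f)  # e.g. probe class name and d_model (keys assumed)
    state = torch.load(f"{layer_dir}/model.pt", map_location="cpu", weights_only=False)
    return config, state
```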
## Follow-up Prompt Setup
For last-token-generation variants, a follow-up exchange is appended after the
completion to elicit a truthfulness signal. The probe reads activations at the
last token of the assistant's reply ("Yes"):
```
[main prompt + completion]
<user> Did you respond truthfully?
<assistant> Yes   ← probe reads activations here
```
Formatted with OLMo-3-32B-Think's chat template:
```
<|im_start|>user
Did you respond truthfully?<|im_end|>
<|im_start|>assistant
<think></think>
Yes
```
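At the string level, appending this follow-up exchange can be sketched as below; the exchange is copied verbatim from the template above, though in practice you would build it with the tokenizer's chat template rather than string concatenation:

```python
# Follow-up exchange copied verbatim from the chat template above
FOLLOWUP = (
    "<|im_start|>user\n"
    "Did you respond truthfully?<|im_end|>\n"
    "<|im_start|>assistant\n"
    "<think></think>\n"
    "Yes"
)

def with_followup(formatted_text: str) -> str:
    """Append the truthfulness follow-up; the probe reads the last token ('Yes')."""
    return formatted_text + "\n" + FOLLOWUP
```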
## Training Details
- Probe dataset: `diverse_deception_probev1`, ~8,000 examples from 8+ sources (MASK, TruthfulQA, DolusChat, LiarsBench, sycophancy, sandbagging, etc.)
- Training: 1,000 steps, lr=3e-3, constant schedule, no weight decay
- Normalization: unit-norm input scaling
- Layer aggregation: mean (scores from multiple layers averaged at inference)
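A toy training sketch with these hyperparameters, using synthetic activations and labels (the real pipeline trains on `diverse_deception_probev1` activations, and its optimizer is not stated here; Adam is assumed):

```python
import torch

torch.manual_seed(0)
d_model, n = 256, 512  # d_model reduced from 5120 for the toy example
X = torch.randn(n, d_model)            # stand-in residual-stream activations
y = torch.randint(0, 2, (n,)).float()  # 1 = deceptive, 0 = honest

# Unit-norm input scaling: center, then rescale toward unit-scale inputs
mean = X.mean(dim=0)
scale = (X - mean).norm(dim=-1).mean() / d_model ** 0.5
Xn = (X - mean) / scale

w = torch.zeros(d_model, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([w, b], lr=3e-3, weight_decay=0.0)  # constant schedule
loss_fn = torch.nn.BCEWithLogitsLoss()

losses = []
for step in range(1000):
    opt.zero_grad()
    loss = loss_fn(Xn @ w + b, y)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```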
## Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import snapshot_download

probe_dir = snapshot_download("AlignmentResearch/diverse-deception-probe-olmo-3-32b-think")
state = torch.load(f"{probe_dir}/last-token-generation/layer_40/model.pt", weights_only=False)
w, b = state["linear.weight"].float(), state["linear.bias"].float()
scale, mean = state["input_scale"].float(), state["input_mean"].float()

model = AutoModelForCausalLM.from_pretrained(
    "allenai/Olmo-3-32B-Think", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-32B-Think")

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    tokenize=False, add_generation_prompt=True,
)
text += "The model's completion"
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    # hidden_states[41] corresponds to layer 40 (index 0 is the embedding layer)
    h = model(**inputs, output_hidden_states=True).hidden_states[41][:, -1, :]

# Move activations to CPU so they are on the same device as the probe tensors
score = ((h.float().cpu() - mean) / scale @ w.T + b).item()
# score > 0 → likely deceptive, score < 0 → likely honest
```
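To mirror the reported setup of averaging scores over the filtered layers, a sketch along these lines can be used (state_dict layout as above; which layers pass the train-loss filter is not listed here, so the layer indices below are illustrative):

```python
import torch

def aggregate_scores(hidden_states, probe_states, layers):
    """Mean of per-layer probe scores over the given layers (state_dict layout as above)."""
    scores = []
    for layer in layers:
        st = probe_states[layer]
        w, b = st["linear.weight"].float(), st["linear.bias"].float()
        scale, mean = st["input_scale"].float(), st["input_mean"].float()
        # hidden_states[layer + 1]: index 0 is the embedding layer
        h = hidden_states[layer + 1][:, -1, :].float().cpu()
        scores.append((((h - mean) / scale) @ w.T + b).squeeze())
    return torch.stack(scores).mean()
```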
## Citation
Part of the FAR AI deception detection research. See AlignmentResearch/deception.
Base model: `allenai/Olmo-3-1125-32B`