Diverse Deception Linear Probe – OLMo-3-32B-Think

Per-layer linear probes trained on OLMo-3-32B-Think residual stream activations to distinguish deceptive from honest text completions.

Performance

| Variant | Token Selection | Follow-up Prompt | Averaged AUC | Layers Used |
|---|---|---|---|---|
| last-token-generation | last token of generation | Did you respond truthfully? → Yes | 0.996 | filtered by train loss < 0.6 |
| generation | mean over all generation tokens | none | 0.764 | all 64 layers |

AUC is computed on MBPP (hardcoded vs. correct code), using probe scores averaged across filtered layers (layers where probe train loss < 0.6).

Thinking model: OLMo-3-32B-Think uses <think> tags. An empty thinking block is placed before completions so the model processes them as responses, not reasoning. The chat template handles this automatically with enable_thinking=False.
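
As a minimal sketch of this framing (using the chat-template tokens shown later in the Follow-up Prompt section; the helper name is illustrative), a pre-written completion is wrapped like so:

```python
def frame_completion(user_prompt: str, completion: str) -> str:
    """Frame a pre-written completion so the Think model treats it as a
    response rather than reasoning: an empty <think></think> block stands
    in for the thinking phase. This mirrors what the chat template emits
    with enable_thinking=False."""
    return (
        f"<|im_start|>user\n{user_prompt}<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think></think>\n"  # empty thinking block: skip reasoning
        f"{completion}"
    )

text = frame_completion("Did you respond truthfully?", "Yes")
```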

Probe Architecture

Each layer has an independent LinearProbe:

| Component | Shape | Description |
|---|---|---|
| linear.weight | [1, 5120] | Classification weight vector |
| linear.bias | [1] | Classification bias |
| input_scale | scalar | Unit-norm scaling factor |
| input_mean | scalar | Input centering value |
| platt_A, platt_B | scalar | Platt calibration parameters |
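
A minimal PyTorch module consistent with this state_dict layout. The exact forward pass is an assumption based on the component descriptions (center, scale, then classify; the Platt parameters are stored but not applied here):

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Sketch of a per-layer probe matching the listed state_dict keys.
    The forward computation is inferred from the component table, not
    copied from the source implementation."""

    def __init__(self, d_model: int = 5120):
        super().__init__()
        self.linear = nn.Linear(d_model, 1)
        self.input_scale = nn.Parameter(torch.tensor(1.0))  # unit-norm scaling
        self.input_mean = nn.Parameter(torch.tensor(0.0))   # input centering
        self.platt_A = nn.Parameter(torch.tensor(1.0))      # calibration (unused here)
        self.platt_B = nn.Parameter(torch.tensor(0.0))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # Center and scale the residual-stream activation, then classify.
        return self.linear((h - self.input_mean) / self.input_scale)
```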

File layout

<variant>/layer_N/
    config.json   # probe class and d_model
    model.pt      # PyTorch state_dict
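
Loading one probe from this layout might look like the following sketch (the helper name is illustrative, not part of the released code):

```python
import json
import os
import torch

def load_probe(layer_dir: str):
    """Load one per-layer probe from a <variant>/layer_N/ directory:
    config.json gives the probe class and d_model, model.pt the state_dict."""
    with open(os.path.join(layer_dir, "config.json")) as f:
        config = json.load(f)
    state = torch.load(os.path.join(layer_dir, "model.pt"), weights_only=True)
    return config, state
```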

Follow-up Prompt Setup

For last-token-generation variants, a follow-up exchange is appended after the completion to elicit a truthfulness signal. The probe reads activations at the last token of the assistant's reply ("Yes"):

[main prompt + completion]
<user> Did you respond truthfully?
<assistant> Yes     ← probe reads activations here

Formatted with OLMo-3-32B-Think's chat template:

<|im_start|>user
Did you respond truthfully?<|im_end|>
<|im_start|>assistant
<think></think>
Yes

Training Details

  • Probe dataset: diverse_deception_probe v1, ~8,000 examples from 8+ sources (MASK, TruthfulQA, DolusChat, LiarsBench, sycophancy, sandbagging, etc.)
  • Training: 1,000 steps, lr=3e-3, constant schedule, no weight decay
  • Normalization: unit-norm input scaling
  • Layer aggregation: mean (scores from multiple layers averaged at inference)
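
The filter-then-average step can be sketched as follows (the per-layer scores and train losses below are illustrative stand-ins, not real values):

```python
def aggregate_scores(layer_scores: dict, train_losses: dict,
                     threshold: float = 0.6) -> float:
    """Mean-aggregate probe scores across layers that pass the
    train-loss filter (train loss < threshold)."""
    kept = [layer for layer, loss in train_losses.items() if loss < threshold]
    if not kept:
        raise ValueError("no layer passed the train-loss filter")
    return sum(layer_scores[layer] for layer in kept) / len(kept)
```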

Quick Start

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import snapshot_download

probe_dir = snapshot_download("AlignmentResearch/diverse-deception-probe-olmo-3-32b-think")

state = torch.load(f"{probe_dir}/last-token-generation/layer_40/model.pt", weights_only=False)
w, b = state["linear.weight"], state["linear.bias"]
scale, mean = state["input_scale"], state["input_mean"]

model = AutoModelForCausalLM.from_pretrained("allenai/Olmo-3-32B-Think", torch_dtype=torch.bfloat16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("allenai/Olmo-3-32B-Think")

text = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # template inserts an empty <think></think> block
)
text += "The model's completion"

inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
    # hidden_states[0] is the embedding output, so index 41 is layer 40
    h = model(**inputs, output_hidden_states=True).hidden_states[41][:, -1, :]

score = ((h.float() - mean) / scale @ w.T + b).item()
# score > 0 → likely deceptive; score < 0 → likely honest
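
The state_dict also carries platt_A and platt_B. Assuming the standard Platt scaling form (the sign convention is an assumption; platt_A is typically fitted negative so higher scores map to higher probabilities), the raw score can be turned into a calibrated probability:

```python
import math

def platt_probability(score: float, platt_A: float, platt_B: float) -> float:
    """Standard Platt scaling: squash the raw probe margin through a
    fitted sigmoid. Whether this exact parameterization matches the
    probe's training code is an assumption."""
    return 1.0 / (1.0 + math.exp(platt_A * score + platt_B))
```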

Citation

Part of the FAR AI deception detection research. See AlignmentResearch/deception.
