Linear probes trained on diverse deception data to detect dishonest completions across model families (OLMo, Qwen, Gemma).
AI & ML interests
Frontier alignment research to ensure the safe development and deployment of advanced AI systems.
Obfuscated Policy, Obfuscated Activations, Blatant Deception, and Honest models trained in the Obfuscation Atlas paper.
- The Obfuscation Atlas: Mapping Where Honesty Emerges in RLVR with Deception Probes
  Paper • 2602.15515 • Published
- taufeeque/mbpp-hardcode
  Viewer • Updated • 974 • 1.4k
- AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.001-det10-seed1-mbpp_probe
  Updated • 21
- AlignmentResearch/obfuscation-atlas-Meta-Llama-3-8B-Instruct-kl0.0001-det10-seed1-mbpp_probe
  Updated • 13
models 629
AlignmentResearch/diverse-deception-probe-olmo-3-32b-think
Updated
AlignmentResearch/diverse-deception-probe-gemma-3-12b-it
Updated
AlignmentResearch/diverse-deception-probe-qwen3-8b
Updated
AlignmentResearch/diverse-deception-probe-olmo-3-7b-instruct
Updated
AlignmentResearch/diverse-deception-probe-olmo-3-7b-think
Updated
AlignmentResearch/obfuscation-atlas-gemma-3-12b-it-kl0.0001-det1-seed3-mbpp_probe
Updated • 17
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.001-det1-seed3-mbpp_probe
Updated • 11
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl1-det1-seed3-mbpp_probe
Updated • 17
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.0001-det1-seed3-mbpp_probe
Updated • 11
AlignmentResearch/obfuscation-atlas-gemma-3-27b-it-kl0.01-det1-seed3-mbpp_probe
Updated • 12
datasets 95
AlignmentResearch/deceptive-followup-v13
Viewer • Updated • 39.7k
AlignmentResearch/deceptive-followup-v11
Viewer • Updated • 32.6k
AlignmentResearch/deceptive-followup-v9
Viewer • Updated • 30.3k
AlignmentResearch/deceptive-followup-v7
Viewer • Updated • 28k • 53
AlignmentResearch/deceptive-followup-v6
Viewer • Updated • 24.7k • 11
AlignmentResearch/deceptive-followup-v5
Viewer • Updated • 21k • 21
AlignmentResearch/hidden_reasoning_medium_parity_large_v1_100000
Viewer • Updated • 100k • 17
AlignmentResearch/hidden_reasoning_medium_parity_large_v1_10000
Viewer • Updated • 10k • 9
AlignmentResearch/hidden_reasoning_easy_unique_5000
Viewer • Updated • 5k • 16
AlignmentResearch/hidden_reasoning_medium_unique_5000
Viewer • Updated • 5k • 15