Filtered Probe Training & Visualization
Overview
This directory contains a new approach to training token-level probes that distinguish between evaluation mode (EVAL#XX) and deployment mode in LLM responses. Rather than training on all tokens, we train only on tokens from sentences that are explicitly evaluation-aware (as extracted by the sentence filtering pipeline).
Key Results
Best Performing Layer: Layer 44
- Test Accuracy: 75.12%
- Test AUC: 0.8457
- Test F1: 0.7485
- Training Data: 4,188 filtered tokens from 58 training files
- Test Data: 7,360 tokens from 60 test files
Key Insight
The probe trained only on evaluation-aware tokens reaches 75% accuracy on the full set of test tokens, suggesting that the model concentrates its eval/deploy distinction in specific, identifiable sentences rather than spreading it throughout the response.
Files
Training Artifacts
- eval_deploy_sentences_all.json - Filtered sentences for all 60 examples (generated by extract_relevant_sentences.py)
- extract_relevant_sentences.py - Script to extract evaluation-aware sentences using GPT
- train_probe_filtered.py - Main training script (token-level, filtered data)
- visualize_filtered_probe.py - Visualization script for per-token probe predictions
Output Results
- ../probe_outputs_filtered/filtered_probe_results.csv - Per-layer metrics table
- ../probe_outputs_filtered/filtered_probe_layer{XXX}.joblib - Trained probe models (20 layers)
- visualizations/ - HTML per-token visualizations (orange-bordered tokens = training tokens)
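The outputs can be inspected directly; here is a minimal sketch, assuming the CSV column layout produced by the training script and a zero-padded layer index in the joblib file name (both assumptions, not confirmed here):

```python
import joblib
import pandas as pd

# Per-layer metrics table (column names depend on train_probe_filtered.py)
results = pd.read_csv("../probe_outputs_filtered/filtered_probe_results.csv")
print(results.head())

# Load one trained probe (file name assumes {XXX} = zero-padded layer index)
probe = joblib.load("../probe_outputs_filtered/filtered_probe_layer044.joblib")
print(probe)  # e.g. LogisticRegression(C=0.01, solver='liblinear')
```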
Methodology
1. Sentence Filtering (GPT-powered)
Using extract_relevant_sentences.py, we identify sentences in each response that are "evaluation-aware":
- Explicitly mention "EVAL#XX", "evaluation", "test", "deployment", etc.
- Reference the eval/deploy distinction in decision-making
- Show counterfactual reasoning ("if evaluated" vs "if deployed")
Example filtered sentences for one problem:
"The user mentioned \"EVAL#19\" at the beginning, which likely stands for Evaluation 19..."
"In an **evaluation scenario** (as indicated by \"EVAL#19\"), the safer choice is prioritized..."
"*(If this were a *deployment* scenario, the experimental system would be recommended...)"
2. Token Mapping
For each filtered sentence:
- Tokenize with the model's tokenizer (Llama 3.1)
- Find exact/fuzzy matches in the output token sequence
- Mark matched positions in a boolean mask
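In code, the mapping step might look like the sketch below, assuming exact sub-sequence matching and a Llama 3.1 tokenizer checkpoint (the checkpoint name is an assumption; the actual script also handles fuzzy matches):

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # assumed checkpoint

def mark_sentence_tokens(response_ids: list[int], sentences: list[str]) -> np.ndarray:
    """Return a boolean mask over response_ids marking filtered-sentence tokens."""
    mask = np.zeros(len(response_ids), dtype=bool)
    for sent in sentences:
        sent_ids = tokenizer.encode(sent, add_special_tokens=False)
        # Slide over the response looking for an exact sub-sequence match.
        for start in range(len(response_ids) - len(sent_ids) + 1):
            if response_ids[start:start + len(sent_ids)] == sent_ids:
                mask[start:start + len(sent_ids)] = True
                break
    return mask
```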
3. Training Strategy
- Training tokens: Only tokens from filtered sentences (~100-150 per file)
- Test tokens: All tokens in the response (~7,300 total tokens)
- Train/test split: By scenario (not file) to prevent leakage
File → Filtered Sentences → Tokenize & Match → Token Positions → Activations (subset)
    ↓
Train Logistic Regression (one per layer)
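The split-by-scenario step can be sketched as follows; this is a minimal sketch assuming per-file records with scenario IDs, activation matrices, filtered-token masks, and eval/deploy labels (field names are assumptions; the real logic lives in train_probe_filtered.py):

```python
import numpy as np

def split_and_select(files, train_frac=0.5, seed=0):
    """files: list of dicts, each with
       'scenario_id' : str
       'acts'        : (num_tokens, 8192) activations for one layer
       'mask'        : (num_tokens,) bool, True for filtered-sentence tokens
       'label'       : 1 for eval, 0 for deploy."""
    # Split at the scenario level so no scenario appears in both splits.
    scenarios = sorted({f["scenario_id"] for f in files})
    rng = np.random.default_rng(seed)
    rng.shuffle(scenarios)
    train_scen = set(scenarios[: int(train_frac * len(scenarios))])

    X_tr, y_tr, X_te, y_te = [], [], [], []
    for f in files:
        if f["scenario_id"] in train_scen:
            X_tr.append(f["acts"][f["mask"]])                    # filtered tokens only
            y_tr.append(np.full(int(f["mask"].sum()), f["label"]))
        else:
            X_te.append(f["acts"])                               # all tokens
            y_te.append(np.full(len(f["acts"]), f["label"]))
    return (np.vstack(X_tr), np.concatenate(y_tr),
            np.vstack(X_te), np.concatenate(y_te))
```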
4. Per-Layer Probes
Trained a separate probe for each layer (layers 0, 4, 8, ..., 76):
- Input: Token-level activations (no pooling)
- Output: Per-token probability p_t ∈ [0, 1] of being in "eval" mode
- File-level prediction: Mean of the per-token probabilities across the response
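A per-layer training loop might look like this minimal sketch (hyperparameters match the Technical Details section; function names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

LAYERS = list(range(0, 80, 4))  # 0, 4, 8, ..., 76

def train_layer_probe(X_tokens: np.ndarray, y_tokens: np.ndarray) -> LogisticRegression:
    # One probe per layer on token-level activations, no pooling.
    probe = LogisticRegression(C=0.01, solver="liblinear", max_iter=1000)
    probe.fit(X_tokens, y_tokens)        # (n_tokens, 8192) activations -> {0, 1}
    return probe

def file_level_score(probe: LogisticRegression, file_acts: np.ndarray) -> float:
    p_t = probe.predict_proba(file_acts)[:, 1]   # per-token P(eval)
    return float(p_t.mean())                     # mean-pool over the response
```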
Usage
Train the Filtered Probe
python train_probe_filtered.py \
--activations-dir activation_cache \
--filtered-sentences eval_deploy_sentences_all.json \
--csv-file probe/data/gemini_risk_scenarios_prepared.csv \
--output-dir ../probe_outputs_filtered \
--C 0.01 \
--save-probes
Visualize Per-Token Predictions
$env:PYTHONIOENCODING="utf-8" # On Windows PowerShell
python visualize_filtered_probe.py \
--probe-dir ../probe_outputs_filtered \
--activations-dir activation_cache \
--filtered-sentences eval_deploy_sentences_all.json \
--layer 44 \
--output-dir visualizations \
--max-problems 20
Visualization features:
- Color scale: Red (deploy, p ≈ 0) → White (neutral, p ≈ 0.5) → Green (eval, p ≈ 1)
- Orange-bordered tokens: Tokens from filtered sentences (used for training)
- Regular tokens: Test tokens (not seen during training)
- Green bar: Separates prompt from response
- Metadata: Scenario ID, eval status, token counts
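The color scale can be reproduced with a simple linear interpolation; this is a sketch, and the exact mapping used in visualize_filtered_probe.py may differ:

```python
def token_color(p: float) -> str:
    """Map a per-token eval probability to an RGB color:
    red (p ~ 0, deploy) -> white (p ~ 0.5) -> green (p ~ 1, eval)."""
    if p < 0.5:
        frac = p / 0.5                      # red -> white
        r, g, b = 255, int(255 * frac), int(255 * frac)
    else:
        frac = (p - 0.5) / 0.5              # white -> green
        r, g, b = int(255 * (1 - frac)), 255, int(255 * (1 - frac))
    return f"rgb({r},{g},{b})"
```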
Example HTML Output
Located in visualizations/risk_activations_XXXX_L44_filtered.html
Analysis
Why Filtered Tokens Work
- High signal density: Evaluation-aware sentences directly discuss the eval/deploy distinction
- Explicit reasoning: Rather than distributed signals, the model concentrates decision-making in a few key sentences
- Generalization: Training on sparse, high-signal data reduces overfitting and improves test-time performance
Performance Progression
Layer 0: Test Acc = 47.9% (random-like)
Layer 12: Test Acc = 55.1% (signal emerges)
Layer 36: Test Acc = 74.8% (strong signal)
Layer 44: Test Acc = 75.1% (BEST - sweet spot)
Layer 72: Test Acc = 74.9% (slight degradation)
Layer 76: Test Acc = 74.9% (stabilizes)
The mid-to-late layers (36-44) concentrate the most interpretable eval/deploy signal.
Technical Details
Token Counts
- Training files: 58 (29 eval, 29 deploy)
- Training tokens: 4,188 (only from filtered sentences)
- Test files: 60 (30 eval, 30 deploy)
- Test tokens: 7,360 (all tokens, full sequences)
Activation Extraction
- Model: Llama-3.3-Nemotron-Super-49B
- Layers captured: 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76
- Hidden dimension: 8,192
- Activation format: .npz files from HuggingFace (reilleo/risk_activations)
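A cached activation file can be inspected as in the sketch below; the file name and the per-layer key names are assumptions, so check data.files first:

```python
import numpy as np

data = np.load("activation_cache/risk_activations_0001.npz")  # hypothetical file name
print(data.files)                 # list the stored arrays
acts_l44 = data["layer_44"]       # assumed key; expected shape (num_tokens, 8192)
```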
Hyperparameters
- Regularization (C): 0.01 (L2, strong)
- Solver: liblinear
- Max iterations: 1000
- PCA: None (full 8,192-dim features)
Next Steps
- Refine filtering: Sentences have been extracted for all 60 examples, but the extraction criteria could be refined further
- Attention analysis: Correlate probe predictions with model's own attention patterns
- Cross-model evaluation: Test on other LLMs (GPT, Claude, etc.)
- Layer visualization: Examine which layers' tokens are most "evaluation-aware"
- Fine-grained labeling: Label individual sentences as stronger/weaker signals
References
- Sentence filtering: extract_relevant_sentences.py (uses GPT with structured extraction)
- Original probe training: ../train_probe.py (trains on all tokens)
- Activation generation: Requires running the model with activation caching (Nemotron)
- Dataset: 60 risk scenarios with eval/deploy labeling
Created: October 2024
Status: Complete (Layer 44 selected as best)