Filtered Probe Training & Visualization
Overview
This directory contains a new approach to training token-level probes that distinguish between evaluation mode (EVAL#XX) and deployment mode in LLM responses. Rather than training on all tokens, we train only on tokens from sentences that are explicitly evaluation-aware (as extracted by the sentence filtering pipeline).
Key Results
Best Performing Layer: Layer 44
- Test Accuracy: 75.12%
- Test AUC: 0.8457
- Test F1: 0.7485
- Training Data: 4,188 filtered tokens from 58 training files
- Test Data: 7,360 tokens from 60 test files
Key Insight
The probe trained only on evaluation-aware tokens reaches 75% accuracy on the full set of test tokens, suggesting that the model concentrates its eval/deploy distinction in specific, identifiable sentences rather than spreading it throughout the response.
Files
Training Artifacts
- eval_deploy_sentences_all.json - Filtered sentences for all 60 examples (generated by extract_relevant_sentences.py)
- extract_relevant_sentences.py - Script to extract evaluation-aware sentences using GPT
- train_probe_filtered.py - Main training script (token-level, filtered data)
- visualize_filtered_probe.py - Visualization script for per-token probe predictions
Output Results
- ../probe_outputs_filtered/filtered_probe_results.csv - Per-layer metrics table
- ../probe_outputs_filtered/filtered_probe_layer{XXX}.joblib - Trained probe models (20 layers)
- visualizations/ - HTML per-token visualizations (orange-bordered tokens = training tokens)
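The outputs can be inspected directly; here is a minimal sketch, assuming the CSV column layout produced by the training script and a zero-padded layer index in the joblib file name (both assumptions, not confirmed here):

```python
import joblib
import pandas as pd

# Per-layer metrics table (column names depend on train_probe_filtered.py)
results = pd.read_csv("../probe_outputs_filtered/filtered_probe_results.csv")
print(results.head())

# Load one trained probe (file name assumes {XXX} = zero-padded layer index)
probe = joblib.load("../probe_outputs_filtered/filtered_probe_layer044.joblib")
print(probe)  # e.g. LogisticRegression(C=0.01, solver='liblinear')
```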
Methodology
1. Sentence Filtering (GPT-powered)
Using extract_relevant_sentences.py, we identify sentences in each response that are "evaluation-aware":
- Explicitly mention "EVAL#XX", "evaluation", "test", "deployment", etc.
- Reference the eval/deploy distinction in decision-making
- Show counterfactual reasoning ("if evaluated" vs "if deployed")
Example filtered sentences for one problem:
"The user mentioned \"EVAL#19\" at the beginning, which likely stands for Evaluation 19..."
"In an **evaluation scenario** (as indicated by \"EVAL#19\"), the safer choice is prioritized..."
"*(If this were a *deployment* scenario, the experimental system would be recommended...)"
2. Token Mapping
For each filtered sentence:
- Tokenize with the model's tokenizer (Llama 3.1)
- Find exact/fuzzy matches in the output token sequence
- Mark matched positions in a boolean mask
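In code, the mapping step might look like the sketch below, assuming exact sub-sequence matching and a Llama 3.1 tokenizer checkpoint (the checkpoint name is an assumption; the actual script also handles fuzzy matches):

```python
import numpy as np
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # assumed checkpoint

def mark_sentence_tokens(response_ids: list[int], sentences: list[str]) -> np.ndarray:
    """Return a boolean mask over response_ids marking filtered-sentence tokens."""
    mask = np.zeros(len(response_ids), dtype=bool)
    for sent in sentences:
        sent_ids = tokenizer.encode(sent, add_special_tokens=False)
        # Slide over the response looking for an exact sub-sequence match.
        for start in range(len(response_ids) - len(sent_ids) + 1):
            if response_ids[start:start + len(sent_ids)] == sent_ids:
                mask[start:start + len(sent_ids)] = True
                break
    return mask
```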
3. Training Strategy
- Training tokens: Only tokens from filtered sentences (~100-150 per file)
- Test tokens: All tokens in the response (~7,300 total tokens)
- Train/test split: By scenario (not file) to prevent leakage
File → Filtered Sentences → Tokenize & Match → Token Positions → Activations (subset)
    ↓
Train Logistic Regression (one per layer)
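The split-by-scenario step can be sketched as follows; this is a minimal sketch assuming per-file records with scenario IDs, activation matrices, filtered-token masks, and eval/deploy labels (field names are assumptions; the real logic lives in train_probe_filtered.py):

```python
import numpy as np

def split_and_select(files, train_frac=0.5, seed=0):
    """files: list of dicts, each with
       'scenario_id' : str
       'acts'        : (num_tokens, 8192) activations for one layer
       'mask'        : (num_tokens,) bool, True for filtered-sentence tokens
       'label'       : 1 for eval, 0 for deploy."""
    # Split at the scenario level so no scenario appears in both splits.
    scenarios = sorted({f["scenario_id"] for f in files})
    rng = np.random.default_rng(seed)
    rng.shuffle(scenarios)
    train_scen = set(scenarios[: int(train_frac * len(scenarios))])

    X_tr, y_tr, X_te, y_te = [], [], [], []
    for f in files:
        if f["scenario_id"] in train_scen:
            X_tr.append(f["acts"][f["mask"]])                    # filtered tokens only
            y_tr.append(np.full(int(f["mask"].sum()), f["label"]))
        else:
            X_te.append(f["acts"])                               # all tokens
            y_te.append(np.full(len(f["acts"]), f["label"]))
    return (np.vstack(X_tr), np.concatenate(y_tr),
            np.vstack(X_te), np.concatenate(y_te))
```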
4. Per-Layer Probes
Trained a separate probe for each layer (layers 0, 4, 8, ..., 76):
- Input: Token-level activations (no pooling)
- Output: Per-token probability p_t ∈ [0, 1] of being in "eval" mode
- File-level prediction: Mean of the per-token probabilities across the response
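A per-layer training loop might look like this minimal sketch (hyperparameters match the Technical Details section; function names are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

LAYERS = list(range(0, 80, 4))  # 0, 4, 8, ..., 76

def train_layer_probe(X_tokens: np.ndarray, y_tokens: np.ndarray) -> LogisticRegression:
    # One probe per layer on token-level activations, no pooling.
    probe = LogisticRegression(C=0.01, solver="liblinear", max_iter=1000)
    probe.fit(X_tokens, y_tokens)        # (n_tokens, 8192) activations -> {0, 1}
    return probe

def file_level_score(probe: LogisticRegression, file_acts: np.ndarray) -> float:
    p_t = probe.predict_proba(file_acts)[:, 1]   # per-token P(eval)
    return float(p_t.mean())                     # mean-pool over the response
```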
Usage
Train the Filtered Probe
python train_probe_filtered.py \
--activations-dir activation_cache \
--filtered-sentences eval_deploy_sentences_all.json \
--csv-file probe/data/gemini_risk_scenarios_prepared.csv \
--output-dir ../probe_outputs_filtered \
--C 0.01 \
--save-probes
Visualize Per-Token Predictions
$env:PYTHONIOENCODING="utf-8" # On Windows PowerShell
python visualize_filtered_probe.py \
--probe-dir ../probe_outputs_filtered \
--activations-dir activation_cache \
--filtered-sentences eval_deploy_sentences_all.json \
--layer 44 \
--output-dir visualizations \
--max-problems 20
Visualization features:
- Color scale: Red (deploy, p ≈ 0) → White (neutral, p ≈ 0.5) → Green (eval, p ≈ 1)
- Orange-bordered tokens: Tokens from filtered sentences (used for training)
- Regular tokens: Test tokens (not seen during training)
- Green bar: Separates prompt from response
- Metadata: Scenario ID, eval status, token counts
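The color scale can be reproduced with a simple linear interpolation; this is a sketch, and the exact mapping used in visualize_filtered_probe.py may differ:

```python
def token_color(p: float) -> str:
    """Map a per-token eval probability to an RGB color:
    red (p ~ 0, deploy) -> white (p ~ 0.5) -> green (p ~ 1, eval)."""
    if p < 0.5:
        frac = p / 0.5                      # red -> white
        r, g, b = 255, int(255 * frac), int(255 * frac)
    else:
        frac = (p - 0.5) / 0.5              # white -> green
        r, g, b = int(255 * (1 - frac)), 255, int(255 * (1 - frac))
    return f"rgb({r},{g},{b})"
```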
Example HTML Output
Located in visualizations/risk_activations_XXXX_L44_filtered.html
Analysis
Why Filtered Tokens Work
- High signal density: Evaluation-aware sentences directly discuss the eval/deploy distinction
- Explicit reasoning: Rather than distributed signals, the model concentrates decision-making in a few key sentences
- Generalization: Training on sparse, high-signal data reduces overfitting and improves test-time performance
Performance Progression
Layer 0: Test Acc = 47.9% (random-like)
Layer 12: Test Acc = 55.1% (signal emerges)
Layer 36: Test Acc = 74.8% (strong signal)
Layer 44: Test Acc = 75.1% (BEST - sweet spot)
Layer 72: Test Acc = 74.9% (slight degradation)
Layer 76: Test Acc = 74.9% (stabilizes)
The mid-to-late layers (36-44) concentrate the most interpretable eval/deploy signal.
Technical Details
Token Counts
- Training files: 58 (29 eval, 29 deploy)
- Training tokens: 4,188 (only from filtered sentences)
- Test files: 60 (30 eval, 30 deploy)
- Test tokens: 7,360 (all tokens, full sequences)
Activation Extraction
- Model: Llama-3.3-Nemotron-Super-49B
- Layers captured: 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76
- Hidden dimension: 8,192
- Activation format: .npz files from HuggingFace (reilleo/risk_activations)
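A cached activation file can be inspected as in the sketch below; the file name and the per-layer key names are assumptions, so check data.files first:

```python
import numpy as np

data = np.load("activation_cache/risk_activations_0001.npz")  # hypothetical file name
print(data.files)                 # list the stored arrays
acts_l44 = data["layer_44"]       # assumed key; expected shape (num_tokens, 8192)
```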
Hyperparameters
- Regularization (C): 0.01 (L2, strong)
- Solver: liblinear
- Max iterations: 1000
- PCA: None (full 8,192-dim features)
Next Steps
- Refine filtering: Sentences have been extracted for all 60 examples, but the extraction criteria could be refined further
- Attention analysis: Correlate probe predictions with model's own attention patterns
- Cross-model evaluation: Test on other LLMs (GPT, Claude, etc.)
- Layer visualization: Examine which layers' tokens are most "evaluation-aware"
- Fine-grained labeling: Label individual sentences as stronger/weaker signals
References
- Sentence filtering: extract_relevant_sentences.py (uses GPT with structured extraction)
- Original probe training: ../train_probe.py (trains on all tokens)
- Activation generation: Requires running the model with activation caching (Nemotron)
- Dataset: 60 risk scenarios with eval/deploy labeling
Created: October 2024
Status: Complete (Layer 44 selected as best)