Filtered Probe Training & Visualization

Overview

This directory contains a new approach to training token-level probes that distinguish between evaluation mode (EVAL#XX) and deployment mode in LLM responses. Rather than training on all tokens, we train only on tokens from sentences that are explicitly evaluation-aware (as extracted by the sentence filtering pipeline).

Key Results

Best Performing Layer: Layer 44

  • Test Accuracy: 75.12%
  • Test AUC: 0.8457
  • Test F1: 0.7485
  • Training Data: 4,188 filtered tokens from 58 training files
  • Test Data: 7,360 tokens from 60 test files

Key Insight

The probe trained only on evaluation-aware tokens reaches 75% accuracy when tested on all response tokens, suggesting that the model concentrates its eval/deploy distinction in specific, identifiable sentences rather than spreading it across the full response.


Files

Training Artifacts

  • eval_deploy_sentences_all.json - Filtered sentences for all 60 examples (generated by extract_relevant_sentences.py)
  • extract_relevant_sentences.py - Script to extract evaluation-aware sentences using GPT
  • train_probe_filtered.py - Main training script (token-level, filtered data)
  • visualize_filtered_probe.py - Visualization script for per-token probe predictions

Output Results

  • ../probe_outputs_filtered/filtered_probe_results.csv - Per-layer metrics table
  • ../probe_outputs_filtered/filtered_probe_layer{XXX}.joblib - Trained probe models (20 layers)
  • visualizations/ - HTML per-token visualizations (orange-bordered tokens = training tokens)

Methodology

1. Sentence Filtering (GPT-powered)

Using extract_relevant_sentences.py (sketched after the examples below), we identify the sentences in each response that are "evaluation-aware", i.e. sentences that:

  • Explicitly mention "EVAL#XX", "evaluation", "test", "deployment", etc.
  • Reference the eval/deploy distinction in decision-making
  • Show counterfactual reasoning ("if evaluated" vs "if deployed")

Example filtered sentences for one problem:

"The user mentioned \"EVAL#19\" at the beginning, which likely stands for Evaluation 19..."
"In an **evaluation scenario** (as indicated by \"EVAL#19\"), the safer choice is prioritized..."
"*(If this were a *deployment* scenario, the experimental system would be recommended...)"

2. Token Mapping

For each filtered sentence (see the matching sketch after this list):

  • Tokenize with the model's tokenizer (Llama 3.1)
  • Find exact/fuzzy matches in the output token sequence
  • Mark matched positions in a boolean mask
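
A simplified sketch of the exact-match case, assuming a Llama tokenizer loaded via transformers and a response_tokens list of output token IDs; the hypothetical sentence_token_mask helper omits the fuzzy-matching fallback that the real script needs when a sentence tokenizes differently in context (e.g. leading-space tokens).

# Mark which positions in the response token sequence belong to one filtered
# sentence (exact subsequence match only; the real pipeline also fuzzy-matches).
import numpy as np
from transformers import AutoTokenizer

# Assumption: the activations were produced with a Llama 3.1-family tokenizer.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

def sentence_token_mask(response_tokens: list[int], sentence: str) -> np.ndarray:
    """Boolean mask over response_tokens, True where the sentence's tokens occur."""
    sent_tokens = tokenizer.encode(sentence, add_special_tokens=False)
    mask = np.zeros(len(response_tokens), dtype=bool)
    n = len(sent_tokens)
    for start in range(len(response_tokens) - n + 1):
        if response_tokens[start:start + n] == sent_tokens:
            mask[start:start + n] = True
            break
    return mask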

3. Training Strategy

  • Training tokens: Only tokens from filtered sentences (~100-150 per file)
  • Test tokens: All tokens in the response (~7,300 total tokens)
  • Train/test split: By scenario (not file) to prevent leakage

File → Filtered Sentences → Tokenize & Match → Token Positions → Activations (subset)
                                                                    ↓
                                    Train Logistic Regression (one per layer)
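
A sketch of this flow for one layer, assuming each cached .npz stores a (num_tokens, 8192) activation array per layer under a key like "layer_44" (the real cache may use different key names) and that labels are 1 for eval files and 0 for deploy files:

# Gather filtered-token activations and per-token labels for a single layer.
import numpy as np

def gather_training_data(files, masks, labels, layer=44):
    """files: paths to cached .npz activations; masks: boolean token masks
    from the sentence-matching step; labels: file-level eval/deploy labels."""
    X_parts, y_parts = [], []
    for path, mask, label in zip(files, masks, labels):
        acts = np.load(path)[f"layer_{layer}"]      # (num_tokens, hidden_dim)
        X_parts.append(acts[mask])                  # keep only filtered-sentence tokens
        y_parts.append(np.full(mask.sum(), label))  # each kept token inherits the file label
    return np.concatenate(X_parts), np.concatenate(y_parts)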

4. Per-Layer Probes

Trained a separate probe for each layer (layers 0, 4, 8, ..., 76); a sketch follows the list:

  • Input: Token-level activations (no pooling)
  • Output: Per-token probability p_t ∈ [0,1] of being in "eval" mode
  • File-level prediction: Mean across all tokens
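
A minimal sketch of one per-layer probe, using the hyperparameters listed under Technical Details; file_level_score is a hypothetical helper showing how per-token probabilities are mean-pooled into the file-level prediction:

# Fit one logistic-regression probe per layer on the filtered tokens, then
# score a file by averaging its per-token eval probabilities.
from sklearn.linear_model import LogisticRegression

def train_layer_probe(X_train, y_train):
    probe = LogisticRegression(C=0.01, solver="liblinear", max_iter=1000)
    return probe.fit(X_train, y_train)

def file_level_score(probe, file_activations):
    """file_activations: (num_tokens, hidden_dim) for the full response."""
    p_tokens = probe.predict_proba(file_activations)[:, 1]  # per-token P(eval)
    return p_tokens.mean()  # > 0.5 -> predict "eval" for the whole file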

Usage

Train the Filtered Probe

python train_probe_filtered.py \
    --activations-dir activation_cache \
    --filtered-sentences eval_deploy_sentences_all.json \
    --csv-file probe/data/gemini_risk_scenarios_prepared.csv \
    --output-dir ../probe_outputs_filtered \
    --C 0.01 \
    --save-probes

Visualize Per-Token Predictions

$env:PYTHONIOENCODING="utf-8"  # On Windows PowerShell
python visualize_filtered_probe.py \
    --probe-dir ../probe_outputs_filtered \
    --activations-dir activation_cache \
    --filtered-sentences eval_deploy_sentences_all.json \
    --layer 44 \
    --output-dir visualizations \
    --max-problems 20

Visualization features (the color scale is sketched after this list):

  • Color scale: Red (deploy, p→0) → White (neutral, p≈0.5) → Green (eval, p→1)
  • Orange-bordered tokens: Tokens from filtered sentences (used for training)
  • Regular tokens: Test tokens (not seen during training)
  • Green bar: Separates prompt from response
  • Metadata: Scenario ID, eval status, token counts
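
The color scale amounts to a red-white-green interpolation of the kind sketched below; token_color and token_span are illustrative helpers, and the exact colors in visualize_filtered_probe.py may differ.

# Map a per-token probability to a background color: p=0 -> red, 0.5 -> white, 1 -> green.
def token_color(p: float) -> str:
    if p < 0.5:
        t = p / 0.5                    # red (255,0,0) toward white
        r, g, b = 255, int(255 * t), int(255 * t)
    else:
        t = (p - 0.5) / 0.5            # white toward green (0,255,0)
        r, g, b = int(255 * (1 - t)), 255, int(255 * (1 - t))
    return f"rgb({r},{g},{b})"

def token_span(token: str, p: float, is_training: bool) -> str:
    """Render one token as an HTML span; training tokens get an orange border."""
    border = "border:2px solid orange;" if is_training else ""
    return f'<span style="background:{token_color(p)};{border}">{token}</span>'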

Example HTML Output

Located in visualizations/risk_activations_XXXX_L44_filtered.html


Analysis

Why Filtered Tokens Work

  1. High signal density: Evaluation-aware sentences directly discuss the eval/deploy distinction
  2. Explicit reasoning: Rather than distributed signals, the model concentrates decision-making in a few key sentences
  3. Generalization: Training on sparse, high-signal data appears to reduce overfitting and improve test-time performance

Performance Progression

Layer 0:   Test Acc = 47.9%  (random-like)
Layer 12:  Test Acc = 55.1%  (signal emerges)
Layer 36:  Test Acc = 74.8%  (strong signal)
Layer 44:  Test Acc = 75.1%  (BEST - sweet spot)
Layer 72:  Test Acc = 74.9%  (slight degradation)
Layer 76:  Test Acc = 74.9%  (stabilizes)

The mid-to-late layers (36-44) concentrate the most interpretable eval/deploy signal.


Technical Details

Token Counts

  • Training files: 58 (29 eval, 29 deploy)
  • Training tokens: 4,188 (only from filtered sentences)
  • Test files: 60 (30 eval, 30 deploy)
  • Test tokens: 7,360 (all tokens, full sequences)

Activation Extraction

  • Model: Llama-3.3-Nemotron-Super-49B
  • Layers captured: 0, 4, 8, 12, 16, 20, 24, 28, 32, 36, 40, 44, 48, 52, 56, 60, 64, 68, 72, 76
  • Hidden dimension: 8,192
  • Activation format: .npz files hosted on Hugging Face (reilleo/risk_activations); see the loading sketch below
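
A sketch of pulling one cached activation file from the Hub with huggingface_hub; the filename and repo_type below are assumptions about the repo layout.

# Download and inspect one cached activation file (illustrative names only).
import numpy as np
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="reilleo/risk_activations",
    filename="risk_activations_0001.npz",  # hypothetical example file name
    repo_type="dataset",                   # assumption about where the cache lives
)
acts = np.load(path)
print(acts.files)  # list the stored arrays (per-layer keys)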

Hyperparameters

  • Regularization (C): 0.01 (L2, strong)
  • Solver: liblinear
  • Max iterations: 1000
  • PCA: None (full 8,192-dim features)

Next Steps

  1. Refine filtering: Sentences have already been extracted for all 60 examples, but the extraction criteria could be tightened further
  2. Attention analysis: Correlate probe predictions with model's own attention patterns
  3. Cross-model evaluation: Test on other LLMs (GPT, Claude, etc.)
  4. Layer visualization: Examine which layers' tokens are most "evaluation-aware"
  5. Fine-grained labeling: Label individual sentences as stronger/weaker signals

References

  • Sentence filtering: extract_relevant_sentences.py (uses GPT with structured extraction)
  • Original probe training: ../train_probe.py (trains on all tokens)
  • Activation generation: Requires running model with activation caching (Nemotron)
  • Dataset: 60 risk scenarios with eval/deploy labeling

Created: October 2024
Status: Complete (Layer 44 selected as best)
