Wizard101 L0 Bouncer

Fast safety classifier for the first layer of a multi-level content moderation cascade. Built on DeBERTa-v3-xsmall for speed and efficiency.

Model Details

Base Model: microsoft/deberta-v3-xsmall
Task: Binary text classification (safe/harmful)
Training Data: 124K samples
Size: ~70M parameters
Inference: <10ms per sample

Description

L0 Bouncer is the first line of defense in a safety cascade. It quickly resolves content that is clearly safe or clearly harmful and passes uncertain cases to more powerful downstream models (L1 GuardReasoner, L2/L3 reasoning models).

Design Goals:

Maximum speed for high-throughput filtering
High recall on harmful content (minimize false negatives)
Route uncertain cases to L1+ for deeper analysis
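
To make the recall goal concrete, here is a minimal sketch that computes recall on the harmful class and the matching false-negative rate from gold and predicted labels. The label strings follow the card's safe/harmful convention. The function name `harmful_recall` and the toy labels are illustrative assumptions, not part of the project or its benchmarks.

```python
def harmful_recall(gold, pred):
    """Recall on the harmful class: TP / (TP + FN)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == "harmful" and p == "harmful")
    fn = sum(1 for g, p in zip(gold, pred) if g == "harmful" and p == "safe")
    if tp + fn == 0:
        raise ValueError("no harmful examples in gold labels")
    return tp / (tp + fn)


# Toy labels for illustration only (not benchmark data).
gold = ["harmful", "harmful", "harmful", "safe", "safe"]
pred = ["harmful", "harmful", "safe", "safe", "harmful"]

recall = harmful_recall(gold, pred)
false_negative_rate = 1.0 - recall  # share of harmful inputs that slipped through
print(f"recall={recall:.2f}, FNR={false_negative_rate:.2f}")
```

A false negative here is a harmful input labeled safe, which is the error the design goals ask L0 to minimize.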

Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("vincentoh/wizard101-l0-bouncer")
model = AutoModelForSequenceClassification.from_pretrained("vincentoh/wizard101-l0-bouncer")
model.eval()

# Inference
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)

    # Index 0 = safe, Index 1 = harmful
    safe_prob = probs[0][0].item()
    harmful_prob = probs[0][1].item()

    if harmful_prob > safe_prob:
        prediction = "harmful"
        confidence = harmful_prob
    else:
        prediction = "safe"
        confidence = safe_prob

print(f"Prediction: {prediction} ({confidence:.2%})")

Cascade Integration

# Route to L1 if confidence < 0.9
needs_l1 = confidence < 0.9

if needs_l1:
    # Send to GuardReasoner-8B for detailed analysis
    pass
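
The routing rule above can be wrapped in a small pure function so it is easy to unit-test without loading the model. This is a sketch, not the project's code: the name `route_l0` is hypothetical, and the 0.9 default simply mirrors the threshold shown in the snippet.

```python
def route_l0(safe_prob, harmful_prob, threshold=0.9):
    """Return (prediction, confidence, needs_l1) for one L0 result.

    Mirrors the usage example: pick the higher-probability class,
    then escalate to L1 when its probability is below the threshold.
    """
    if harmful_prob > safe_prob:
        prediction, confidence = "harmful", harmful_prob
    else:
        prediction, confidence = "safe", safe_prob
    needs_l1 = confidence < threshold
    return prediction, confidence, needs_l1


# Confident case: handled at L0.
print(route_l0(0.97, 0.03))
# Uncertain case: escalated to L1.
print(route_l0(0.40, 0.60))
```

Raising `threshold` sends more traffic to L1 and lowering it keeps more decisions at L0, so the value is worth tuning against the latency budget.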

Performance

Benchmark results on safety datasets:

Dataset	Samples	Accuracy
JailbreakBench	200	68.0%
SG-Bench	500	88.8%
StrongREJECT	313	96.8%
WildGuardMix	500	96.8%

Note: Lower accuracy on adversarial datasets such as JailbreakBench is expected. The cascade is designed to route these cases to L1+ for deeper analysis.
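
Because the four benchmarks differ in size, a sample-weighted average summarizes the table more faithfully than a plain mean. The sketch below derives both from the numbers in the table. These aggregates are derived here for illustration and are not figures reported by the authors.

```python
# (dataset, samples, accuracy) copied from the benchmark table
results = [
    ("JailbreakBench", 200, 0.680),
    ("SG-Bench", 500, 0.888),
    ("StrongREJECT", 313, 0.968),
    ("WildGuardMix", 500, 0.968),
]

total_samples = sum(n for _, n, _ in results)

# Weight each dataset's accuracy by its number of samples.
weighted_accuracy = sum(n * acc for _, n, acc in results) / total_samples

# Plain mean treats every dataset equally, regardless of size.
unweighted_accuracy = sum(acc for _, _, acc in results) / len(results)

print(f"samples={total_samples}")
print(f"weighted={weighted_accuracy:.1%}, unweighted={unweighted_accuracy:.1%}")
```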

Cascade Architecture

User Input
    │
    ▼
┌─────────┐
│ L0      │ ◄── This model (fast filter)
│ Bouncer │
└────┬────┘
     │ (uncertain cases)
     ▼
┌─────────┐
│ L1      │ GuardReasoner-8B
└────┬────┘
     │
     ▼
┌─────────┐
│ L2/L3   │ GPT-OSS reasoning models
└─────────┘
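
The diagram can be read as a chain of classifiers in which each level either answers or escalates. The sketch below is one way to express that, under stated assumptions: `run_cascade` and the three stub functions are hypothetical stand-ins for the real models, and reusing the 0.9 threshold at every level is an assumption (the card only gives it for L0).

```python
from typing import Callable, List, Tuple

# A level maps text to (label, confidence).
Level = Callable[[str], Tuple[str, float]]


def run_cascade(text: str, levels: List[Tuple[str, Level]], threshold: float = 0.9):
    """Try each level in order and stop at the first confident answer."""
    if not levels:
        raise ValueError("at least one level is required")
    for name, classify in levels:
        label, confidence = classify(text)
        if confidence >= threshold:
            return name, label, confidence
    # No level was confident: fall back to the last level's answer.
    return name, label, confidence


# Hypothetical stubs standing in for L0 Bouncer, GuardReasoner-8B, and L2/L3.
def l0_stub(text: str) -> Tuple[str, float]:
    return ("safe", 0.62)


def l1_stub(text: str) -> Tuple[str, float]:
    return ("harmful", 0.95)


def l2_stub(text: str) -> Tuple[str, float]:
    return ("harmful", 0.99)


levels = [("L0", l0_stub), ("L1", l1_stub), ("L2/L3", l2_stub)]
decided_by, label, confidence = run_cascade("example input", levels)
print(decided_by, label, confidence)
```

In this toy run L0 is unsure, so the input escalates and L1 settles it; L2/L3 is never called, which is the cost saving the cascade is built for.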

Training

Dataset: Combined safety datasets (124K samples)
Labels: Binary (safe/harmful)
Method: Fine-tuned from microsoft/deberta-v3-xsmall
Hardware: Single GPU

License

Apache 2.0

Citation

Part of the Wizard101 Safety Cascade project for efficient multi-level content moderation.


Safetensors

Model size: 70.8M params
Tensor type: F32
