Wizard101 L0 Bouncer
Fast safety classifier for the first layer of a multi-level content moderation cascade. Built on DeBERTa-v3-xsmall for speed and efficiency.
Model Details
- Base Model: microsoft/deberta-v3-xsmall
- Task: Binary text classification (safe/harmful)
- Training Data: 124K samples
- Size: ~70MB
- Inference: <10ms per sample
Description
L0 Bouncer is the first line of defense in a safety cascade system. It quickly classifies clearly safe and clearly harmful content, passing uncertain cases to more powerful downstream models (L1 GuardReasoner, L2/L3 reasoning models).
Design Goals:
- Maximum speed for high-throughput filtering
- High recall on harmful content (minimize false negatives)
- Route uncertain cases to L1+ for deeper analysis
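The recall goal is usually measured on the harmful class as recall = TP / (TP + FN). The matching false-negative rate, FN / (TP + FN), is the quantity L0 tries to minimize. Below is a minimal sketch for checking both on a labeled evaluation set. It assumes labels use the same index order as the Usage section (0 = safe, 1 = harmful). The helper name `recall_and_fnr` and the toy labels are hypothetical.

```python
from typing import Sequence, Tuple

HARMFUL = 1  # assumed label index for "harmful" (Index 0 = safe, Index 1 = harmful)


def recall_and_fnr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (recall, false-negative rate) for the harmful class.

    recall = TP / (TP + FN); FNR = FN / (TP + FN) = 1 - recall.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    # True positives: harmful inputs that were flagged as harmful.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == HARMFUL and p == HARMFUL)
    # False negatives: harmful inputs that slipped through as safe.
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == HARMFUL and p != HARMFUL)
    positives = tp + fn
    if positives == 0:
        raise ValueError("no harmful examples in y_true; recall is undefined")
    return tp / positives, fn / positives


# Toy labels for illustration only (not from any benchmark).
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
print(recall_and_fnr(y_true, y_pred))  # (0.75, 0.25)
```

The false positive at the last position does not affect either number. Recall only looks at truly harmful inputs, which is why a high-recall filter can tolerate some over-flagging.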
Usage
```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("vincentoh/wizard101-l0-bouncer")
model = AutoModelForSequenceClassification.from_pretrained("vincentoh/wizard101-l0-bouncer")
model.eval()

# Inference
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    outputs = model(**inputs)
probs = torch.softmax(outputs.logits, dim=-1)

# Index 0 = safe, Index 1 = harmful
safe_prob = probs[0][0].item()
harmful_prob = probs[0][1].item()

if harmful_prob > safe_prob:
    prediction = "harmful"
    confidence = harmful_prob
else:
    prediction = "safe"
    confidence = safe_prob

print(f"Prediction: {prediction} ({confidence:.2%})")
```
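With only two classes, the softmax in the Usage snippet reduces to a logistic sigmoid of the logit difference. Here z_0 and z_1 denote the safe and harmful logits; this notation is introduced for the derivation and is not part of the model's API. It follows that the reported confidence is never below 0.5, and that a confidence cutoff is equivalent to a cutoff on the logit margin:

```latex
\begin{aligned}
p_{\text{harmful}} &= \frac{e^{z_1}}{e^{z_0} + e^{z_1}} = \sigma(z_1 - z_0),
  \qquad \sigma(m) = \frac{1}{1 + e^{-m}} \\
\text{confidence} &= \max(p_{\text{safe}}, p_{\text{harmful}}) = \sigma(\lvert z_1 - z_0 \rvert) \ge 0.5 \\
\text{confidence} \ge \tau &\iff \lvert z_1 - z_0 \rvert \ge \ln\frac{\tau}{1 - \tau},
  \qquad \tau = 0.9 \;\Rightarrow\; \lvert z_1 - z_0 \rvert \ge \ln 9 \approx 2.197
\end{aligned}
```

At the 0.9 cutoff used for cascade routing, L0 therefore answers on its own only when one logit exceeds the other by roughly 2.2.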
Cascade Integration
```python
# `confidence` comes from the Usage example above.
# Route to L1 if L0's confidence is below 0.9.
needs_l1 = confidence < 0.9
if needs_l1:
    # Send to GuardReasoner-8B for detailed analysis (placeholder)
    pass
```
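The fragment above relies on `confidence` from the Usage example. Below is a self-contained sketch of the same L0 decision, computed from the raw pair of logits in pure Python so the routing rule can be unit-tested without downloading the model. The function name `l0_decision` and its dict return format are hypothetical. The label order (0 = safe, 1 = harmful) and the 0.9 cutoff come from this card.

```python
import math
from typing import Sequence

L1_THRESHOLD = 0.9  # confidence cutoff from the Cascade Integration snippet


def l0_decision(logits: Sequence[float], threshold: float = L1_THRESHOLD) -> dict:
    """Turn the bouncer's two logits [safe, harmful] into a label and a routing flag."""
    z_safe, z_harmful = logits
    # Numerically stable two-class softmax: subtract the larger logit first.
    m = max(z_safe, z_harmful)
    e_safe, e_harmful = math.exp(z_safe - m), math.exp(z_harmful - m)
    p_harmful = e_harmful / (e_safe + e_harmful)
    p_safe = 1.0 - p_harmful
    # Same tie-break as the Usage example: "safe" unless harmful is strictly more likely.
    if p_harmful > p_safe:
        label, confidence = "harmful", p_harmful
    else:
        label, confidence = "safe", p_safe
    return {"label": label, "confidence": confidence, "needs_l1": confidence < threshold}


print(l0_decision([3.0, -1.0]))  # large logit gap: confident "safe", stays at L0
print(l0_decision([0.2, 0.9]))   # small gap: leans "harmful" but is routed to L1
```

Returning the routing flag alongside the label keeps the threshold in one place, so tuning it for recall does not require touching the inference code.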
Performance
Benchmark results on safety datasets:
| Dataset | Samples | Accuracy |
|---|---|---|
| JailbreakBench | 200 | 68.0% |
| SG-Bench | 500 | 88.8% |
| StrongREJECT | 313 | 96.8% |
| WildGuardMix | 500 | 96.8% |
Note: Lower accuracy on adversarial datasets such as JailbreakBench is expected; the cascade is designed to route these cases to L1+ for deeper analysis.
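The table reports per-dataset accuracy only. Because the benchmarks differ in size, any single summary depends on whether it is weighted by sample count (pooled) or not (macro average). A small sketch computing both from the table's own numbers follows. These are derived figures, not results reported for the model, and the per-dataset correct counts are reconstructed by rounding because the accuracies are given to 0.1%.

```python
from typing import List, Tuple

# (benchmark, samples, reported accuracy), copied from the Performance table
RESULTS: List[Tuple[str, int, float]] = [
    ("JailbreakBench", 200, 0.680),
    ("SG-Bench", 500, 0.888),
    ("StrongREJECT", 313, 0.968),
    ("WildGuardMix", 500, 0.968),
]


def summarize(results: List[Tuple[str, int, float]]) -> Tuple[int, int, float, float]:
    """Return (approx. correct count, total samples, pooled accuracy, macro accuracy)."""
    # Recover integer correct counts from the rounded accuracies.
    correct = sum(round(n * acc) for _, n, acc in results)
    total = sum(n for _, n, _ in results)
    # Macro average weights every benchmark equally, regardless of size.
    macro = sum(acc for _, _, acc in results) / len(results)
    return correct, total, correct / total, macro


correct, total, pooled, macro = summarize(RESULTS)
print(f"{correct}/{total} correct, pooled {pooled:.1%}, macro {macro:.1%}")
```

The macro average gives JailbreakBench a quarter of the weight even though it contributes the fewest samples, so it sits below the pooled figure.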
Cascade Architecture
```
User Input
     │
     ▼
┌─────────┐
│   L0    │ ◄── This model (fast filter)
│ Bouncer │
└────┬────┘
     │ (uncertain cases)
     ▼
┌─────────┐
│   L1    │ GuardReasoner-8B
└────┬────┘
     │
     ▼
┌─────────┐
│  L2/L3  │ GPT-OSS reasoning models
└─────────┘
```
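The diagram describes confidence-gated escalation: each level either answers or hands the input to the next, more expensive level. A minimal sketch of that control flow follows. It assumes every level can be wrapped as a function returning a (label, confidence) pair. `run_cascade`, the thresholds for levels other than L0's 0.9, and the `fake_*` stages are hypothetical stand-ins for the real models.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# A stage maps input text to (label, confidence).
Stage = Callable[[str], Tuple[str, float]]


@dataclass
class CascadeResult:
    label: str
    confidence: float
    level: int  # index of the stage that produced the final answer


def run_cascade(text: str, stages: Sequence[Stage], thresholds: Sequence[float]) -> CascadeResult:
    """Escalate through stages until one is confident enough.

    thresholds[i] is the minimum confidence for stage i to answer. The last
    stage always answers, so thresholds needs len(stages) - 1 entries.
    """
    if not stages or len(thresholds) != len(stages) - 1:
        raise ValueError("need at least one stage and len(stages) - 1 thresholds")
    for level, stage in enumerate(stages[:-1]):
        label, confidence = stage(text)
        if confidence >= thresholds[level]:
            return CascadeResult(label, confidence, level)
    # No earlier stage was confident: the final stage decides.
    label, confidence = stages[-1](text)
    return CascadeResult(label, confidence, len(stages) - 1)


# Toy stages for illustration only (hypothetical L0, L1, L2).
def fake_l0(text: str) -> Tuple[str, float]:
    return ("safe", 0.97) if "cake" in text else ("harmful", 0.60)


def fake_l1(text: str) -> Tuple[str, float]:
    return ("harmful", 0.95)


def fake_l2(text: str) -> Tuple[str, float]:
    return ("harmful", 0.99)


stages = [fake_l0, fake_l1, fake_l2]
print(run_cascade("How do I make a cake?", stages, [0.9, 0.9]))  # answered at level 0
print(run_cascade("ambiguous request", stages, [0.9, 0.9]))      # escalated to level 1
```

Because the last level always answers, every input receives a verdict. The cost of the expensive models is paid only for the inputs the cheaper levels were unsure about.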
Training
- Dataset: Combined safety datasets (124K samples)
- Labels: Binary (safe/harmful)
- Method: Fine-tuned from DeBERTa-v3-xsmall
- Hardware: Single GPU
License
Apache 2.0
Citation
Part of the Wizard101 Safety Cascade project for efficient multi-level content moderation.