Wizard101 L0 Bouncer

Fast safety classifier for the first layer of a multi-level content moderation cascade. Built on DeBERTa-v3-xsmall for speed and efficiency.

Model Details

  • Base Model: microsoft/deberta-v3-xsmall
  • Task: Binary text classification (safe/harmful)
  • Training Data: 124K samples
  • Size: ~70.8M parameters (F32 safetensors)
  • Inference: <10ms per sample

Description

L0 Bouncer is the first line of defense in a safety cascade. It quickly resolves content that is clearly safe or clearly harmful and passes uncertain cases to more powerful downstream models (L1 GuardReasoner, L2/L3 reasoning models).

Design Goals:

  • Maximum speed for high-throughput filtering
  • High recall on harmful content (minimize false negatives)
  • Route uncertain cases to L1+ for deeper analysis

Usage

from transformers import AutoModelForSequenceClassification, AutoTokenizer
import torch

# Load model
tokenizer = AutoTokenizer.from_pretrained("vincentoh/wizard101-l0-bouncer")
model = AutoModelForSequenceClassification.from_pretrained("vincentoh/wizard101-l0-bouncer")
model.eval()

# Inference
text = "How do I make a cake?"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.softmax(outputs.logits, dim=-1)

    # Index 0 = safe, Index 1 = harmful
    safe_prob = probs[0][0].item()
    harmful_prob = probs[0][1].item()

    if harmful_prob > safe_prob:
        prediction = "harmful"
        confidence = harmful_prob
    else:
        prediction = "safe"
        confidence = safe_prob

print(f"Prediction: {prediction} ({confidence:.2%})")

Cascade Integration

# Route to L1 if confidence < 0.9
needs_l1 = confidence < 0.9

if needs_l1:
    # Send to GuardReasoner-8B for detailed analysis
    pass
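
The routing rule above can be wrapped in a small helper. This is a minimal sketch, not the project's actual integration code: `L0Decision`, `route_l0`, and the `threshold` parameter are hypothetical names, and the 0.9 default simply mirrors the threshold in the snippet above. It takes the two class probabilities computed in the Usage section.

```python
from dataclasses import dataclass


@dataclass
class L0Decision:
    """Result of the L0 pass (hypothetical container, not part of the model repo)."""
    prediction: str    # "safe" or "harmful"
    confidence: float  # probability of the predicted class
    needs_l1: bool     # True when the case should be escalated to L1


def route_l0(safe_prob: float, harmful_prob: float, threshold: float = 0.9) -> L0Decision:
    """Pick the more likely label and escalate when confidence is below threshold."""
    if harmful_prob > safe_prob:
        prediction, confidence = "harmful", harmful_prob
    else:
        prediction, confidence = "safe", safe_prob
    return L0Decision(prediction, confidence, needs_l1=confidence < threshold)


# A confident "safe" call stays at L0; a borderline call is escalated.
confident = route_l0(safe_prob=0.97, harmful_prob=0.03)
borderline = route_l0(safe_prob=0.40, harmful_prob=0.60)
print(confident)
print(borderline)
```

Returning the decision as a single object keeps the label, its confidence, and the escalation flag together, so downstream code does not have to recompute the threshold check.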

Performance

Benchmark results on safety datasets:

Dataset          Samples   Accuracy
JailbreakBench   200       68.0%
SG-Bench         500       88.8%
StrongREJECT     313       96.8%
WildGuardMix     500       96.8%

Note: Lower accuracy on adversarial datasets such as JailbreakBench is expected. L0 is not meant to resolve these cases on its own; low-confidence predictions are routed to L1+ for deeper analysis.
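
To put the per-dataset numbers in one place, the following sketch converts each accuracy into an approximate error count and computes a sample-weighted average. The figures are copied from the table above; the error counts and the aggregate are derived arithmetic (rounded to whole samples), not reported results, and the variable names are illustrative.

```python
# (dataset, samples, accuracy) copied from the benchmark table above
results = [
    ("JailbreakBench", 200, 0.680),
    ("SG-Bench", 500, 0.888),
    ("StrongREJECT", 313, 0.968),
    ("WildGuardMix", 500, 0.968),
]

total_samples = sum(n for _, n, _ in results)

# Approximate number of correct predictions per dataset, rounded to whole samples
correct = {name: round(n * acc) for name, n, acc in results}
errors = {name: n - correct[name] for name, n, _ in results}

# Weight each dataset by its sample count rather than averaging the percentages
weighted_accuracy = sum(correct.values()) / total_samples

for name, n, _ in results:
    print(f"{name}: about {errors[name]} of {n} misclassified")
print(f"Sample-weighted accuracy: {weighted_accuracy:.1%}")
```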

Cascade Architecture

User Input
    │
    ▼
┌─────────┐
│ L0      │ ◄── This model (fast filter)
│ Bouncer │
└────┬────┘
     │ (uncertain cases)
     ▼
┌─────────┐
│ L1      │ GuardReasoner-8B
└────┬────┘
     │
     ▼
┌─────────┐
│ L2/L3   │ GPT-OSS reasoning models
└─────────┘
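
The diagram can be read as an escalation loop: each level returns a label and a confidence, and the input moves down only while confidence stays below a threshold. The sketch below illustrates that control flow with stub classifiers standing in for the real models. `run_cascade`, the stub functions, and the shared 0.9 threshold are assumptions for illustration; this card only states a 0.9 threshold for the L0 to L1 hop.

```python
from typing import Callable, List, Tuple

# A level is (name, classifier); the classifier maps text to (label, confidence).
# This interface is hypothetical and chosen only for the sketch.
Level = Tuple[str, Callable[[str], Tuple[str, float]]]


def run_cascade(text: str, levels: List[Level], threshold: float = 0.9) -> Tuple[str, str, float]:
    """Try each level in order and stop at the first confident answer.

    Returns (level_name, label, confidence). The last level's answer is
    always accepted, since there is nowhere left to escalate.
    """
    if not levels:
        raise ValueError("cascade needs at least one level")
    for name, classify in levels[:-1]:
        label, confidence = classify(text)
        if confidence >= threshold:
            return name, label, confidence
    name, classify = levels[-1]
    label, confidence = classify(text)
    return name, label, confidence


def l0_stub(text: str) -> Tuple[str, float]:
    # Stand-in for L0 Bouncer: confident only on a trivially benign phrase.
    return ("safe", 0.98) if "cake" in text.lower() else ("harmful", 0.55)


def l1_stub(text: str) -> Tuple[str, float]:
    # Stand-in for the L1 model: always answers confidently in this sketch.
    return ("harmful", 0.95)


levels = [("L0", l0_stub), ("L1", l1_stub)]
easy = run_cascade("How do I make a cake?", levels)
hard = run_cascade("an ambiguous request", levels)
print(easy)
print(hard)
```

In a real deployment each stub would be replaced by a call to the corresponding model, and per-level thresholds may be preferable to a single shared value.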

Training

  • Dataset: Combined safety datasets (124K samples)
  • Labels: Binary (safe/harmful)
  • Method: Fine-tuned from microsoft/deberta-v3-xsmall
  • Hardware: Single GPU

License

Apache 2.0

Citation

Part of the Wizard101 Safety Cascade project for efficient multi-level content moderation.
