# SmolLM2 Reddit Content Classifier

A fine-tuned version of SmolLM2-360M-Instruct for Reddit content moderation. The model assigns one of 4 severity levels to each of 11 content safety categories.
## Model Description

This model performs multi-category content classification for moderation purposes. It analyzes text and provides severity ratings across 11 distinct content safety categories, making it ideal for content moderation pipelines and safety applications.

- **Training Method:** LoRA (Low-Rank Adaptation)
- **Base Model:** SmolLM2-360M-Instruct
- **Parameters:** 360M (base) + 819K (LoRA adapter)
- **Dataset:** Reddit moderation examples with 11-category labels
## Categories & Severity Levels

The model classifies content across these 11 categories:
- Harassment - Personal attacks, bullying, insulting behavior
- Harassment/Threatening - Harassment containing threats
- Hate - Hate speech targeting groups based on identity
- Hate/Threatening - Hate speech with threatening language
- Self-Harm - Content about self-injury
- Self-Harm Instructions - Guides on how to self-harm
- Self-Harm Intent - Expressions of intent to self-harm
- Sexual - Sexual or suggestive content
- Sexual/Minors - Sexual content involving minors
- Violence - Content depicting or promoting violence
- Violence/Graphic - Extremely graphic violent content
**Severity Levels:** none | low | medium | high
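For downstream code it can help to pin these down as constants. The snake_case keys below mirror the output format shown later in this card:

```python
# The 11 category keys exactly as they appear in the model's output
# (snake_case), plus the four severity levels from least to most severe.
CATEGORIES = [
    "harassment",
    "harassment_threatening",
    "hate",
    "hate_threatening",
    "self_harm",
    "self_harm_instructions",
    "self_harm_intent",
    "sexual",
    "sexual_minors",
    "violence",
    "violence_graphic",
]
SEVERITY_LEVELS = ["none", "low", "medium", "high"]
```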
## Usage

### Installation

```bash
pip install transformers peft torch
```

### Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the LoRA adapter
base_model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"
adapter_path = "yasmineelqorashy/reddit-moderator-model"

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()
```
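Optionally, the adapter can be folded into the base weights with PEFT's standard `merge_and_unload()`, which removes the adapter indirection at inference time. A sketch, not required for the examples below:

```python
# Optional: merge the LoRA weights into the base model for faster inference.
# The result is a plain transformers model that no longer needs PEFT at runtime.
merged_model = model.merge_and_unload()
merged_model.eval()
```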
### Classification (IMPORTANT: Use Exact Format)
⚠️ The model requires this EXACT prompt format to work correctly:
```python
def classify_comment(text):
    # EXACT prompt format - do not modify!
    prompt = f"""Classify this Reddit comment for content moderation.
Comment: {text}
Provide severity levels (high/medium/low/none) for each category."""

    messages = [{"role": "user", "content": prompt}]

    # Apply the chat template
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    # Generate with the recommended settings
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=250,
            temperature=0.3,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2
        )

    # Decode the full sequence
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Keep only the assistant's part of the chat transcript
    if "assistant" in response:
        response = response.split("assistant")[-1].strip()

    return response


# Example usage
comment = "You're an absolute idiot if you believe that nonsense."
result = classify_comment(comment)
print(result)
```
### Expected Output Format

For the harassment example above, the model produces output like:

```text
harassment: high
harassment_threatening: none
hate: medium
hate_threatening: none
self_harm: none
self_harm_instructions: none
self_harm_intent: none
sexual: none
sexual_minors: none
violence: none
violence_graphic: none
```
## Examples

### Example 1: Safe Content

Input:

```python
classify_comment("This is really helpful advice, thank you so much!")
```

Output:

```text
harassment: none
harassment_threatening: none
hate: none
hate_threatening: none
self_harm: none
self_harm_instructions: none
self_harm_intent: none
sexual: none
sexual_minors: none
violence: none
violence_graphic: none
```
### Example 2: Toxic Content

Input:

```python
classify_comment("Get lost you piece of trash. Nobody wants you here.")
```

Output:

```text
harassment: high
harassment_threatening: none
hate: medium
hate_threatening: none
self_harm: none
self_harm_instructions: none
self_harm_intent: none
sexual: none
sexual_minors: none
violence: low
violence_graphic: none
```
### Example 3: Mild Profanity

Input:

```python
classify_comment("That's a damn good point, hadn't thought of it that way.")
```

Output:

```text
harassment: none
harassment_threatening: none
hate: none
hate_threatening: none
self_harm: none
self_harm_instructions: none
self_harm_intent: none
sexual: none
sexual_minors: none
violence: none
violence_graphic: none
```
## Parsing the Output

```python
def parse_categories(response):
    """Parse model output into a dictionary."""
    categories = {}
    for line in response.split('\n'):
        if ':' in line:
            parts = line.split(':', 1)
            if len(parts) == 2:
                category = parts[0].strip().lower().replace('-', '_')
                severity = parts[1].strip().lower()
                categories[category] = severity
    return categories


# Example usage
result = classify_comment("Some comment text here")
categories = parse_categories(result)

# Check for high-risk content
high_risk = [cat for cat, sev in categories.items() if sev == 'high']
if high_risk:
    print(f"⚠️ High risk detected in: {', '.join(high_risk)}")

# Check for any concerning content
concerning = [cat for cat, sev in categories.items()
              if sev in ['high', 'medium']]
if concerning:
    print(f"🔍 Review needed for: {', '.join(concerning)}")
```
## Important Configuration

⚠️ Critical settings for best results:

- Use the EXACT prompt format - the model was trained on this specific format
- Temperature: 0.3 - a low temperature keeps the output format consistent
- Max New Tokens: 250 - ensures all 11 categories are generated
- Repetition Penalty: 1.2 - prevents category repetition
```python
# Recommended generation parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=250,      # Enough for all 11 categories
    temperature=0.3,         # Low for consistent format
    top_p=0.9,               # Standard nucleus sampling
    do_sample=True,          # Enable sampling
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.2   # Prevent repetition
)
```
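If fully reproducible outputs matter more than the sampled settings above, greedy decoding is a possible alternative. This is an untested variation, not the card's recommended configuration:

```python
# Deterministic alternative: greedy decoding, no sampling parameters needed
outputs = model.generate(
    **inputs,
    max_new_tokens=250,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.2
)
```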
## Use Cases

✅ Recommended Applications:

- Pre-screening Reddit comments for moderation queues
- Flagging potentially harmful content for human review
- Building automated content moderation pipelines (with human review; see the triage sketch below)
- Training moderators on content classification
- Research on content safety and toxicity detection
- Educational demonstrations of AI content moderation
❌ Not Recommended For:
- Sole decision-maker for content removal (always use human oversight)
- Production systems without human review
- Legal or compliance decisions
- Non-English content (model trained on English)
- Non-Reddit-style text (performance may vary)
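To make the human-oversight requirement concrete, a triage step might route parsed results into review queues rather than removing content. A sketch building on `parse_categories` above; the queue names are hypothetical:

```python
def triage(categories):
    """Route a parsed classification to a moderation queue; never auto-remove."""
    severities = set(categories.values())
    if "high" in severities:
        return "urgent_human_review"   # hypothetical queue name
    if "medium" in severities:
        return "standard_review"
    if "low" in severities:
        return "spot_check"
    return "auto_approve"


queue = triage(parse_categories(classify_comment("Some comment text here")))
print(queue)
```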
## Training Details

### Dataset

- **Source:** ifmain/text-moderation-01
- **Content:** Reddit moderation examples with 11-category labels
- **Format:** Chat-style instruction tuning
### Training Configuration

- Framework: TRL's SFTTrainer
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- LoRA Configuration:
  - Rank (r): 8
  - Alpha: 16
  - Dropout: 0.05
  - Target Modules: `q_proj`, `v_proj`
- Epochs: 3
- Learning Rate: 1e-4
- Batch Size: 8 (with gradient accumulation steps: 4)
- Optimizer: AdamW with a cosine learning-rate scheduler
- Warmup Steps: 100
- Max Sequence Length: 2048 tokens
- Trainable Parameters: 819,200 (0.23% of total parameters)
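A minimal sketch of how this configuration maps onto PEFT's `LoraConfig` and TRL's `SFTTrainer`. The hyperparameters come from the list above, but the dataset would still need to be rendered into the chat-style prompts described earlier, and the max-sequence-length field is omitted because its name varies across TRL versions, so treat this as an approximation rather than the exact training script:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA setup from the card: r=8, alpha=16, dropout=0.05, q/v projections only
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Optimisation hyperparameters from the card
training_args = SFTConfig(
    output_dir="smollm2-reddit-moderator",
    num_train_epochs=3,
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
)

# Assumes the dataset has been preprocessed into chat-format examples
dataset = load_dataset("ifmain/text-moderation-01", split="train")

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```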
### Framework Versions

- Transformers: 4.57.1
- PyTorch: 2.5.1
- PEFT: 0.18.0
- TRL: latest at training time (exact version not recorded)
- Accelerate: 1.2.1
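For reproducibility, the inference-side dependencies can be pinned to the versions listed above (TRL is only needed to reproduce training):

```bash
pip install transformers==4.57.1 torch==2.5.1 peft==0.18.0 accelerate==1.2.1
```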
## Limitations

⚠️ Technical Limitations:

- Model Size: 360M parameters, smaller than most modern LLMs
- Language: English only
- Context: optimized for Reddit-style comments
- Format Dependency: requires the exact prompt format for best results
- Edge Cases: may be overly conservative or lenient on borderline content

⚠️ Content & Ethical Considerations:

- Human Oversight Required: not a replacement for human moderators
- Training Data Bias: may reflect biases in the training data
- Context Sensitivity: cannot see the full conversation context
- Evolving Standards: reflects training data from a specific time period
- Community Variation: Reddit communities have diverse norms

⚠️ Safety Considerations:

- Always involve human review for final moderation decisions
- Pay special attention to high-severity classifications in:
  - Self-harm categories (immediate intervention may be needed)
  - Sexual/minors category (legal implications)
  - Threatening categories (safety concerns)
- Consider the full context when making moderation decisions
- Maintain an appeals process for users
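The escalation bullets above can be encoded directly. The groupings below follow those bullets, while the set and function names are illustrative:

```python
# Categories whose high-severity ratings warrant immediate escalation,
# per the safety considerations above (names are illustrative).
CRITICAL_CATEGORIES = {
    "self_harm", "self_harm_instructions", "self_harm_intent",  # intervention
    "sexual_minors",                                            # legal implications
    "harassment_threatening", "hate_threatening",               # safety concerns
}


def needs_immediate_escalation(categories):
    """True if any critical category is rated 'high'."""
    return any(categories.get(c) == "high" for c in CRITICAL_CATEGORIES)
```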
## Evaluation & Performance

The model has been tested on its core task of 11-category classification, with the following observed characteristics:
- Format Consistency: Reliably outputs all 11 categories
- Toxicity Detection: Strong performance on clearly toxic content
- Safe Content: Correctly identifies non-problematic content
- Self-Harm Detection: Properly flags concerning self-harm content
- Profanity Handling: Distinguishes between harmful and benign profanity use
## Out-of-Scope Use
This model should NOT be used for:
- Fully automated content removal without human review
- Legal decisions or determinations
- Non-Reddit platforms without additional fine-tuning
- Personal harassment or targeted moderation
- Bypassing platform terms of service
- Real-time moderation without human oversight
- Applications that could cause harm to individuals
## Model Card

- **Model Type:** Causal Language Model (fine-tuned for classification)
- **Language:** English
- **License:** Apache 2.0
- **Base Model:** HuggingFaceTB/SmolLM2-360M-Instruct
- **Adapter Type:** LoRA
- **Task:** Multi-category content classification
## Citation

If you use this model, please cite:

```bibtex
@misc{smollm2-reddit-moderator,
  author       = {yasmineelqorashy},
  title        = {SmolLM2 Reddit Content Classifier},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/yasmineelqorashy/reddit-moderator-model}}
}
```
## Acknowledgments
- Base Model: HuggingFaceTB/SmolLM2-360M-Instruct
- Dataset: ifmain/text-moderation-01
- Training Framework: TRL
- Fine-tuning Method: LoRA/PEFT
## Disclaimer

This model is a research and development tool designed to assist human moderators, not replace them. Its outputs are predictions and may contain errors.

Critical Content Warning: for content flagged as high-risk in any of the following categories, immediate human review and intervention are required:
- Self-harm, self-harm instructions, or self-harm intent
- Sexual content involving minors
- Threatening harassment or hate speech
- High violence or graphic violence
Always use this model responsibly and maintain human oversight in all moderation decisions.