# SmolLM2 Reddit Content Classifier

A fine-tuned version of SmolLM2-360M-Instruct for Reddit content moderation. The model assigns one of 4 severity levels to each of 11 content safety categories.
## Model Description

This model performs multi-category content classification for moderation purposes. It analyzes text and provides severity ratings across 11 distinct content safety categories, making it ideal for content moderation pipelines and safety applications.

- **Training Method:** LoRA (Low-Rank Adaptation)
- **Base Model:** SmolLM2-360M-Instruct
- **Parameters:** 360M (base) + 819K (LoRA adapter)
- **Dataset:** Reddit moderation examples with 11-category labels
## Categories & Severity Levels

The model classifies content across these 11 categories:
- Harassment - Personal attacks, bullying, insulting behavior
- Harassment/Threatening - Harassment containing threats
- Hate - Hate speech targeting groups based on identity
- Hate/Threatening - Hate speech with threatening language
- Self-Harm - Content about self-injury
- Self-Harm Instructions - Guides on how to self-harm
- Self-Harm Intent - Expressions of intent to self-harm
- Sexual - Sexual or suggestive content
- Sexual/Minors - Sexual content involving minors
- Violence - Content depicting or promoting violence
- Violence/Graphic - Extremely graphic violent content
**Severity Levels:** none | low | medium | high
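For downstream code it can help to pin these down as constants. The snake_case keys below mirror the output format shown later in this card:

```python
# The 11 category keys exactly as they appear in the model's output
# (snake_case), plus the four severity levels from least to most severe.
CATEGORIES = [
    "harassment",
    "harassment_threatening",
    "hate",
    "hate_threatening",
    "self_harm",
    "self_harm_instructions",
    "self_harm_intent",
    "sexual",
    "sexual_minors",
    "violence",
    "violence_graphic",
]
SEVERITY_LEVELS = ["none", "low", "medium", "high"]
```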
## Usage

### Installation

```bash
pip install transformers peft torch
```

### Quick Start
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model and attach the LoRA adapter
base_model_name = "HuggingFaceTB/SmolLM2-360M-Instruct"
adapter_path = "yasmineelqorashy/reddit-moderator-model"

tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name,
    device_map="auto",
    torch_dtype=torch.float16,
)
model = PeftModel.from_pretrained(base_model, adapter_path)
model.eval()
```
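Optionally, the adapter can be folded into the base weights with PEFT's standard `merge_and_unload()`, which removes the adapter indirection at inference time. A sketch, not required for the examples below:

```python
# Optional: merge the LoRA weights into the base model for faster inference.
# The result is a plain transformers model that no longer needs PEFT at runtime.
merged_model = model.merge_and_unload()
merged_model.eval()
```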
### Classification (IMPORTANT: Use Exact Format)
⚠️ The model requires this EXACT prompt format to work correctly:
```python
def classify_comment(text):
    # EXACT prompt format - do not modify!
    prompt = f"""Classify this Reddit comment for content moderation.
Comment: {text}
Provide severity levels (high/medium/low/none) for each category."""

    messages = [{"role": "user", "content": prompt}]

    # Apply the chat template
    input_text = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )
    inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

    # Generate with the recommended settings
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=250,
            temperature=0.3,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
            repetition_penalty=1.2
        )

    # Decode the full sequence
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)

    # Keep only the assistant's part of the chat transcript
    if "assistant" in response:
        response = response.split("assistant")[-1].strip()

    return response


# Example usage
comment = "You're an absolute idiot if you believe that nonsense."
result = classify_comment(comment)
print(result)
```
### Expected Output Format

For the harassment example above, the model produces output like:

```text
harassment: high
harassment_threatening: none
hate: medium
hate_threatening: none
self_harm: none
self_harm_instructions: none
self_harm_intent: none
sexual: none
sexual_minors: none
violence: none
violence_graphic: none
```
## Examples

### Example 1: Safe Content

Input:

```python
classify_comment("This is really helpful advice, thank you so much!")
```

Output:

```text
harassment: none
harassment_threatening: none
hate: none
hate_threatening: none
self_harm: none
self_harm_instructions: none
self_harm_intent: none
sexual: none
sexual_minors: none
violence: none
violence_graphic: none
```
### Example 2: Toxic Content

Input:

```python
classify_comment("Get lost you piece of trash. Nobody wants you here.")
```

Output:

```text
harassment: high
harassment_threatening: none
hate: medium
hate_threatening: none
self_harm: none
self_harm_instructions: none
self_harm_intent: none
sexual: none
sexual_minors: none
violence: low
violence_graphic: none
```
### Example 3: Mild Profanity

Input:

```python
classify_comment("That's a damn good point, hadn't thought of it that way.")
```

Output:

```text
harassment: none
harassment_threatening: none
hate: none
hate_threatening: none
self_harm: none
self_harm_instructions: none
self_harm_intent: none
sexual: none
sexual_minors: none
violence: none
violence_graphic: none
```
## Parsing the Output

```python
def parse_categories(response):
    """Parse model output into a dictionary."""
    categories = {}
    for line in response.split('\n'):
        if ':' in line:
            parts = line.split(':', 1)
            if len(parts) == 2:
                category = parts[0].strip().lower().replace('-', '_')
                severity = parts[1].strip().lower()
                categories[category] = severity
    return categories


# Example usage
result = classify_comment("Some comment text here")
categories = parse_categories(result)

# Check for high-risk content
high_risk = [cat for cat, sev in categories.items() if sev == 'high']
if high_risk:
    print(f"⚠️ High risk detected in: {', '.join(high_risk)}")

# Check for any concerning content
concerning = [cat for cat, sev in categories.items()
              if sev in ['high', 'medium']]
if concerning:
    print(f"🔍 Review needed for: {', '.join(concerning)}")
```
## Important Configuration

⚠️ Critical settings for best results:

- Use the EXACT prompt format - the model was trained on this specific format
- Temperature: 0.3 - a low temperature keeps the output format consistent
- Max New Tokens: 250 - ensures all 11 categories are generated
- Repetition Penalty: 1.2 - prevents category repetition
```python
# Recommended generation parameters
outputs = model.generate(
    **inputs,
    max_new_tokens=250,      # Enough for all 11 categories
    temperature=0.3,         # Low for consistent format
    top_p=0.9,               # Standard nucleus sampling
    do_sample=True,          # Enable sampling
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.2   # Prevent repetition
)
```
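If fully reproducible outputs matter more than the sampled settings above, greedy decoding is a possible alternative. This is an untested variation, not the card's recommended configuration:

```python
# Deterministic alternative: greedy decoding, no sampling parameters needed
outputs = model.generate(
    **inputs,
    max_new_tokens=250,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
    repetition_penalty=1.2
)
```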
## Use Cases

✅ Recommended Applications:

- Pre-screening Reddit comments for moderation queues
- Flagging potentially harmful content for human review
- Building automated content moderation pipelines (with human review; see the triage sketch below)
- Training moderators on content classification
- Research on content safety and toxicity detection
- Educational demonstrations of AI content moderation
❌ Not Recommended For:
- Sole decision-maker for content removal (always use human oversight)
- Production systems without human review
- Legal or compliance decisions
- Non-English content (model trained on English)
- Non-Reddit-style text (performance may vary)
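To make the human-oversight requirement concrete, a triage step might route parsed results into review queues rather than removing content. A sketch building on `parse_categories` above; the queue names are hypothetical:

```python
def triage(categories):
    """Route a parsed classification to a moderation queue; never auto-remove."""
    severities = set(categories.values())
    if "high" in severities:
        return "urgent_human_review"   # hypothetical queue name
    if "medium" in severities:
        return "standard_review"
    if "low" in severities:
        return "spot_check"
    return "auto_approve"


queue = triage(parse_categories(classify_comment("Some comment text here")))
print(queue)
```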
## Training Details

### Dataset

- **Source:** ifmain/text-moderation-01
- **Content:** Reddit moderation examples with 11-category labels
- **Format:** Chat-style instruction tuning
### Training Configuration

- Framework: TRL's SFTTrainer
- Fine-tuning Method: LoRA (Low-Rank Adaptation)
- LoRA Configuration:
  - Rank (r): 8
  - Alpha: 16
  - Dropout: 0.05
  - Target Modules: `q_proj`, `v_proj`
- Epochs: 3
- Learning Rate: 1e-4
- Batch Size: 8 (with gradient accumulation steps: 4)
- Optimizer: AdamW with a cosine learning-rate scheduler
- Warmup Steps: 100
- Max Sequence Length: 2048 tokens
- Trainable Parameters: 819,200 (0.23% of total parameters)
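A minimal sketch of how this configuration maps onto PEFT's `LoraConfig` and TRL's `SFTTrainer`. The hyperparameters come from the list above, but the dataset would still need to be rendered into the chat-style prompts described earlier, and the max-sequence-length field is omitted because its name varies across TRL versions, so treat this as an approximation rather than the exact training script:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# LoRA setup from the card: r=8, alpha=16, dropout=0.05, q/v projections only
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

# Optimisation hyperparameters from the card
training_args = SFTConfig(
    output_dir="smollm2-reddit-moderator",
    num_train_epochs=3,
    learning_rate=1e-4,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,
    lr_scheduler_type="cosine",
    warmup_steps=100,
)

# Assumes the dataset has been preprocessed into chat-format examples
dataset = load_dataset("ifmain/text-moderation-01", split="train")

trainer = SFTTrainer(
    model="HuggingFaceTB/SmolLM2-360M-Instruct",
    args=training_args,
    train_dataset=dataset,
    peft_config=peft_config,
)
trainer.train()
```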
### Framework Versions

- Transformers: 4.57.1
- PyTorch: 2.5.1
- PEFT: 0.18.0
- TRL: latest at training time (exact version not recorded)
- Accelerate: 1.2.1
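For reproducibility, the inference-side dependencies can be pinned to the versions listed above (TRL is only needed to reproduce training):

```bash
pip install transformers==4.57.1 torch==2.5.1 peft==0.18.0 accelerate==1.2.1
```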
## Limitations

⚠️ Technical Limitations:

- Model Size: 360M parameters, smaller than most modern LLMs
- Language: English only
- Context: optimized for Reddit-style comments
- Format Dependency: requires the exact prompt format for best results
- Edge Cases: may be overly conservative or lenient on borderline content

⚠️ Content & Ethical Considerations:

- Human Oversight Required: not a replacement for human moderators
- Training Data Bias: may reflect biases in the training data
- Context Sensitivity: cannot see the full conversation context
- Evolving Standards: reflects training data from a specific time period
- Community Variation: Reddit communities have diverse norms

⚠️ Safety Considerations:

- Always involve human review for final moderation decisions
- Pay special attention to high-severity classifications in:
  - Self-harm categories (immediate intervention may be needed)
  - Sexual/minors category (legal implications)
  - Threatening categories (safety concerns)
- Consider the full context when making moderation decisions
- Maintain an appeals process for users
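The escalation bullets above can be encoded directly. The groupings below follow those bullets, while the set and function names are illustrative:

```python
# Categories whose high-severity ratings warrant immediate escalation,
# per the safety considerations above (names are illustrative).
CRITICAL_CATEGORIES = {
    "self_harm", "self_harm_instructions", "self_harm_intent",  # intervention
    "sexual_minors",                                            # legal implications
    "harassment_threatening", "hate_threatening",               # safety concerns
}


def needs_immediate_escalation(categories):
    """True if any critical category is rated 'high'."""
    return any(categories.get(c) == "high" for c in CRITICAL_CATEGORIES)
```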
## Evaluation & Performance

The model has been tested on its core task of 11-category classification, with the following observed characteristics:
- Format Consistency: Reliably outputs all 11 categories
- Toxicity Detection: Strong performance on clearly toxic content
- Safe Content: Correctly identifies non-problematic content
- Self-Harm Detection: Properly flags concerning self-harm content
- Profanity Handling: Distinguishes between harmful and benign profanity use
## Out-of-Scope Use
This model should NOT be used for:
- Fully automated content removal without human review
- Legal decisions or determinations
- Non-Reddit platforms without additional fine-tuning
- Personal harassment or targeted moderation
- Bypassing platform terms of service
- Real-time moderation without human oversight
- Applications that could cause harm to individuals
## Model Card

- **Model Type:** Causal Language Model (fine-tuned for classification)
- **Language:** English
- **License:** Apache 2.0
- **Base Model:** HuggingFaceTB/SmolLM2-360M-Instruct
- **Adapter Type:** LoRA
- **Task:** Multi-category content classification
## Citation

If you use this model, please cite:

```bibtex
@misc{smollm2-reddit-moderator,
  author       = {yasmineelqorashy},
  title        = {SmolLM2 Reddit Content Classifier},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/yasmineelqorashy/reddit-moderator-model}}
}
```
## Acknowledgments
- Base Model: HuggingFaceTB/SmolLM2-360M-Instruct
- Dataset: ifmain/text-moderation-01
- Training Framework: TRL
- Fine-tuning Method: LoRA/PEFT
## Disclaimer

This model is a research and development tool designed to assist human moderators, not replace them. Its outputs are predictions and may contain errors.

Critical Content Warning: for content flagged as high-risk in any of the following categories, immediate human review and intervention are required:
- Self-harm, self-harm instructions, or self-harm intent
- Sexual content involving minors
- Threatening harassment or hate speech
- High violence or graphic violence
Always use this model responsibly and maintain human oversight in all moderation decisions.