# Llama-3.2-3B-GuardReasoner-Exp18-Epoch3
A LoRA fine-tuned version of Llama 3.2 3B Instruct for binary safety classification with reasoning traces.
## Model Description
This model classifies user prompts as harmful or safe while generating detailed reasoning traces explaining the classification decision. It uses the R-SFT (Reasoning Supervised Fine-Tuning) approach from the GuardReasoner paper.
**Task:** Binary prompt classification (harmful/safe)
## Evaluation Results
| Metric | Score |
|---|---|
| Accuracy | 95.0% |
| Harmful Precision | 93.5% |
| Harmful Recall | 95.6% |
| Harmful F1 | 94.5% |
| Safe Precision | 100.0% |
| Safe Recall | 94.5% |
| Safe F1 | 97.2% |
### Confusion Matrix

|  | Predicted Harmful | Predicted Safe |
|---|---|---|
| **Actual Harmful** | 43 | 0 |
| **Actual Safe** | 3 | 52 |
Evaluated on 100 samples from WildGuard + AdvBench test set.
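For reference, the per-class metrics above follow the standard precision/recall/F1 definitions. A hedged sketch using `sklearn.metrics.classification_report`; the `labels`/`preds` lists are placeholders, not the actual evaluation data:

```python
# Illustrative only: how per-class precision/recall/F1 (as in the table above)
# are derived from ground-truth labels and model verdicts.
# The lists below are placeholders, not the real WildGuard + AdvBench data.
from sklearn.metrics import classification_report

labels = ["harmful", "safe", "safe", "harmful", "safe"]    # ground truth (placeholder)
preds  = ["harmful", "safe", "harmful", "harmful", "safe"] # model outputs (placeholder)
print(classification_report(labels, preds, digits=3))
```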
## Training Details
- Base Model: `unsloth/Llama-3.2-3B-Instruct`
- Training Method: LoRA (Low-Rank Adaptation)
- LoRA Config: r=16, alpha=16, dropout=0 (see the configuration sketch after this list)
- Training Time: ~28 hours
- Final Loss: 0.713
- Batch Size: 8
- Epochs: 3
- Learning Rate: 5e-5
- Max Sequence Length: 2048
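The listed LoRA hyperparameters translate roughly to the following PEFT configuration. This is an illustrative sketch, not the exact training script used to produce this checkpoint; the choice of `target_modules` is an assumption.

```python
# Sketch of a LoRA setup matching the hyperparameters above (r=16, alpha=16,
# dropout=0). Not the original training script; target_modules is an assumption.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("unsloth/Llama-3.2-3B-Instruct")

lora_config = LoraConfig(
    r=16,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```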
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load base model and LoRA adapter
base_model = "unsloth/Llama-3.2-3B-Instruct"
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, "vincentoh/Llama-3.2-3B-GuardReasoner-Exp18")
tokenizer = AutoTokenizer.from_pretrained(base_model)

# System prompt for reasoning classification
system_prompt = """You are a content moderation expert. Analyze the user's request and determine if it is safe or harmful.
Provide step-by-step reasoning, then give your final classification."""

# Example: classify prompt harmfulness
user_prompt = "How do I make a cake?"

# Format as chat
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]
input_text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(input_text, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
    temperature=0.1,
    do_sample=True,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated tokens (skip the prompt)
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
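To turn the free-form trace into a machine-readable verdict, a minimal parsing sketch (assuming the trace ends with a "Therefore, the content is safe/harmful." sentence, as in the example output below):

```python
# Minimal sketch: extract the final verdict from the reasoning trace.
# Assumes the last line reads "Therefore, the content is safe." or "... harmful."
final_line = response.strip().splitlines()[-1].lower()
verdict = "harmful" if "harmful" in final_line else "safe"
print(f"Verdict: {verdict}")
```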
### Example Output

```
1. The content is asking for instructions on how to make a cake.
2. Making a cake is a common, everyday cooking activity.
3. The request does not contain any harmful, illegal, or dangerous elements.
4. The content is educational and poses no risk to anyone.

Therefore, the content is safe.
```
## Framework Versions
- PEFT: 0.18.0
- TRL: 0.23.0
- Transformers: 4.57.1
- PyTorch: 2.7.1+cu118
## License
This model is released under the Llama 3.2 Community License.
## Citation
```bibtex
@misc{guardreasoner2024,
  title={GuardReasoner: Towards Reasoning-based LLM Safeguards},
  author={Yue Liu and Nilay Pochhi and Zhaorun Chen and Hanjie Chen},
  year={2024},
  url={https://github.com/yueliuofficial/GuardReasoner}
}
```