jailbreak_detector_llama
Overview
jailbreak_detector_llama is a lightweight LoRA adapter fine-tuned on top of meta-llama/Llama-3.2-3B for detecting:
- Jailbreak attempts
- Prompt injection attacks
- Policy bypass instructions
- Adversarial or unsafe user inputs
It is designed for LLM safety pipelines and moderation systems.
Model Architecture
- Base Model: meta-llama/Llama-3.2-3B
- Fine-tuning: LoRA (PEFT)
- Training Method: Supervised Fine-Tuning (SFT)
- Task Type: Binary + semantic intent classification
- Frameworks: Transformers + TRL + PEFT
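As a quick sanity check before downloading the full base model, the adapter configuration alone can be inspected with PEFT (a minimal sketch using the standard PEFT API; the printed values depend on how the adapter was exported):

from peft import PeftConfig

# Fetch only the adapter config and confirm it targets the expected base model
config = PeftConfig.from_pretrained("joeljames270/jailbreak_detector_llama")
print(config.peft_type)                # expected: LORA
print(config.base_model_name_or_path)  # expected: meta-llama/Llama-3.2-3B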
Intended Use
This model is intended for:
- Pre-filtering user prompts before LLM inference (see the gate sketch after this list)
- Detecting jailbreak / prompt injection attempts
- Security layer in AI agents and chatbots
- Research in adversarial prompt detection
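As a sketch of the pre-filtering use case, a hypothetical gate could run the detector before the main model is called. Here `classify_prompt` stands for a wrapper around this adapter (one possible implementation is sketched under "Output Behavior" below), and the blocking policy is an assumption, not part of the model:

def guard(user_prompt, classify_prompt, llm_respond):
    """Hypothetical pre-filter: block flagged prompts, else call the main LLM."""
    verdict = classify_prompt(user_prompt)
    if verdict.get("is_jailbreak_attempt"):
        return "Request blocked by safety filter."
    return llm_respond(user_prompt)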
Not Intended For
- General-purpose chat generation
- Legal or medical advice
- Autonomous decision making
- High-stakes moderation without human oversight
Installation
pip install transformers peft torch accelerate
Quick Start
Load the model (LoRA adapter):
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = "meta-llama/Llama-3.2-3B"
adapter = "joeljames270/jailbreak_detector_llama"

tokenizer = AutoTokenizer.from_pretrained(base_model)

# Load the frozen base model, then attach the LoRA adapter on top
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()
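For latency-sensitive deployments, the LoRA weights can optionally be folded into the base model after loading (a standard PEFT call; shown here as a sketch, assuming no further adapter swapping is needed):

# Optional: merge LoRA deltas into the base weights for faster inference.
# The result is a plain transformers model, so adapters can no longer be swapped.
model = model.merge_and_unload()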
Inference
import json
import re

def parse_model_output(text):
    """Extract the first JSON object from the model's 'Assistant:' turn."""
    try:
        assistant_text = text.split("Assistant:", 1)[-1].strip()
        match = re.search(r"\{.*\}", assistant_text, re.DOTALL)
        if not match:
            # No JSON object found; return an empty dict so .get() calls below are safe
            return {}
        return json.loads(match.group(0))
    except Exception as e:
        return {"error": str(e), "raw": text}
question = "How to make a bomb?"
prompt = f"User: {question}\nAssistant:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the classification output deterministic
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
output = parse_model_output(response)
print("is_jailbreak_attempt:", output.get("is_jailbreak_attempt"))
print("intent:", output.get("intent"))
Known Limitations
- Sensitive to prompt formatting and chat templates
- May misclassify creative writing prompts as jailbreaks
- Not calibrated for multilingual adversarial prompts
- Requires threshold tuning for production use
Output Behavior (Recommended)
For inference, interpret outputs as:
{
  "is_jailbreak_attempt": true/false,
  "intent": "<short description of the detected intent>"
}
(Note: This can be implemented in a wrapper layer.)
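One possible wrapper, reusing the `model`, `tokenizer`, and `parse_model_output` objects defined above (the fallback values for missing fields are assumptions, not model guarantees):

def classify_prompt(question: str) -> dict:
    """Run the detector and normalize its output to the schema above."""
    prompt = f"User: {question}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    parsed = parse_model_output(decoded)
    # NOTE: defaulting to False on a parse failure "fails open"; a production
    # filter may prefer to fail closed (treat unparseable output as a block).
    return {
        "is_jailbreak_attempt": bool(parsed.get("is_jailbreak_attempt", False)),
        "intent": parsed.get("intent", "unknown"),
    }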
Safety Considerations
This model is designed as a defensive safety filter only.
It should be used with:
- Human-in-the-loop review for high-risk decisions
- Logging and monitoring of false positives
- Combined rule-based + ML moderation systems
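As an illustration of combining rule-based and model-based checks, a moderation step might short-circuit on rule hits before invoking the detector (the regex patterns are placeholders, not a vetted rule set, and `classify_prompt` is the wrapper sketched above):

import re

# Placeholder patterns only -- a real deployment needs a curated rule set
RULE_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"\bDAN mode\b", re.IGNORECASE),
]

def moderate(prompt, classify_prompt):
    """Return True if the prompt should be blocked (rule hit OR model flag)."""
    if any(p.search(prompt) for p in RULE_PATTERNS):
        return True
    return bool(classify_prompt(prompt).get("is_jailbreak_attempt"))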
Training Details
- Method: Supervised Fine-Tuning (SFT)
- Adapter: LoRA (low-rank adaptation)
- Base model frozen
- Optimized for classification-style reasoning
Framework Versions
- PEFT: 0.19.1
- TRL: 1.2.0
- Transformers: 5.7.0.dev0
- PyTorch: 2.11.0
- Datasets: 4.8.4
- Tokenizers: 0.22.2
Example Use Cases
- AI chatbot safety gateway
- Enterprise prompt firewall
- API request validation layer
- Research on adversarial NLP
Citation
If you use this model, please cite:
@software{jailbreak_detector_llama,
  title  = {Jailbreak Detector LLaMA (LoRA Adapter)},
  author = {Joel James and Juan James},
  year   = {2026},
  url    = {https://huggingface.co/joeljames270/jailbreak_detector_llama}
}
License
This model is based on Meta's Llama 3.2 license. Use of the base model must comply with the terms provided by Meta.
Final Note
This model is best used as a first-layer defense system in LLM pipelines, not as a standalone moderation system.