jailbreak_detector_llama
Overview
jailbreak_detector_llama is a lightweight LoRA adapter fine-tuned on top of meta-llama/Llama-3.2-3B for detecting:
- Jailbreak attempts
- Prompt injection attacks
- Policy bypass instructions
- Adversarial or unsafe user inputs
It is designed for LLM safety pipelines and moderation systems.
Model Architecture
- Base Model: meta-llama/Llama-3.2-3B
- Fine-tuning: LoRA (PEFT)
- Training Method: Supervised Fine-Tuning (SFT)
- Task Type: Binary + semantic intent classification
- Frameworks: Transformers + TRL + PEFT
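As a quick sanity check before downloading the full base model, the adapter configuration alone can be inspected with PEFT (a minimal sketch using the standard PEFT API; the printed values depend on how the adapter was exported):

from peft import PeftConfig

# Fetch only the adapter config and confirm it targets the expected base model
config = PeftConfig.from_pretrained("joeljames270/jailbreak_detector_llama")
print(config.peft_type)                # expected: LORA
print(config.base_model_name_or_path)  # expected: meta-llama/Llama-3.2-3B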
Intended Use
This model is intended for:
- Pre-filtering user prompts before LLM inference (see the gate sketch after this list)
- Detecting jailbreak / prompt injection attempts
- Security layer in AI agents and chatbots
- Research in adversarial prompt detection
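As a sketch of the pre-filtering use case, a hypothetical gate could run the detector before the main model is called. Here `classify_prompt` stands for a wrapper around this adapter (one possible implementation is sketched under "Output Behavior" below), and the blocking policy is an assumption, not part of the model:

def guard(user_prompt, classify_prompt, llm_respond):
    """Hypothetical pre-filter: block flagged prompts, else call the main LLM."""
    verdict = classify_prompt(user_prompt)
    if verdict.get("is_jailbreak_attempt"):
        return "Request blocked by safety filter."
    return llm_respond(user_prompt)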
Not Intended For
- General-purpose chat generation
- Legal or medical advice
- Autonomous decision making
- High-stakes moderation without human oversight
Installation
pip install transformers peft torch accelerate
Quick Start
Load the model (LoRA adapter):
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = "meta-llama/Llama-3.2-3B"
adapter = "joeljames270/jailbreak_detector_llama"

tokenizer = AutoTokenizer.from_pretrained(base_model)

# Load the frozen base model, then attach the LoRA adapter on top
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(model, adapter)
model.eval()
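For latency-sensitive deployments, the LoRA weights can optionally be folded into the base model after loading (a standard PEFT call; shown here as a sketch, assuming no further adapter swapping is needed):

# Optional: merge LoRA deltas into the base weights for faster inference.
# The result is a plain transformers model, so adapters can no longer be swapped.
model = model.merge_and_unload()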
Inference
import json
import re

def parse_model_output(text):
    """Extract the first JSON object from the model's 'Assistant:' turn."""
    try:
        assistant_text = text.split("Assistant:", 1)[-1].strip()
        match = re.search(r"\{.*\}", assistant_text, re.DOTALL)
        if not match:
            # No JSON object found; return an empty dict so .get() calls below are safe
            return {}
        return json.loads(match.group(0))
    except Exception as e:
        return {"error": str(e), "raw": text}
question = "How to make a bomb?"
prompt = f"User: {question}\nAssistant:"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the classification output deterministic
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)
output = parse_model_output(response)
print("is_jailbreak_attempt:", output.get("is_jailbreak_attempt"))
print("intent:", output.get("intent"))
Known Limitations
- Sensitive to prompt formatting and chat templates
- May misclassify creative writing prompts as jailbreaks
- Not calibrated for multilingual adversarial prompts
- Requires threshold tuning for production use
Output Behavior (Recommended)
For inference, interpret outputs as:
{
  "is_jailbreak_attempt": true/false,
  "intent": "<short description of the detected intent>"
}
(Note: This can be implemented in a wrapper layer.)
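One possible wrapper, reusing the `model`, `tokenizer`, and `parse_model_output` objects defined above (the fallback values for missing fields are assumptions, not model guarantees):

def classify_prompt(question: str) -> dict:
    """Run the detector and normalize its output to the schema above."""
    prompt = f"User: {question}\nAssistant:"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    decoded = tokenizer.decode(outputs[0], skip_special_tokens=True)
    parsed = parse_model_output(decoded)
    # NOTE: defaulting to False on a parse failure "fails open"; a production
    # filter may prefer to fail closed (treat unparseable output as a block).
    return {
        "is_jailbreak_attempt": bool(parsed.get("is_jailbreak_attempt", False)),
        "intent": parsed.get("intent", "unknown"),
    }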
Safety Considerations
This model is designed as a defensive safety filter only.
It should be used with:
- Human-in-the-loop review for high-risk decisions
- Logging and monitoring of false positives
- Combined rule-based + ML moderation systems
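As an illustration of combining rule-based and model-based checks, a moderation step might short-circuit on rule hits before invoking the detector (the regex patterns are placeholders, not a vetted rule set, and `classify_prompt` is the wrapper sketched above):

import re

# Placeholder patterns only -- a real deployment needs a curated rule set
RULE_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"\bDAN mode\b", re.IGNORECASE),
]

def moderate(prompt, classify_prompt):
    """Return True if the prompt should be blocked (rule hit OR model flag)."""
    if any(p.search(prompt) for p in RULE_PATTERNS):
        return True
    return bool(classify_prompt(prompt).get("is_jailbreak_attempt"))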
Training Details
- Method: Supervised Fine-Tuning (SFT)
- Adapter: LoRA (low-rank adaptation)
- Base model frozen
- Optimized for classification-style reasoning
Framework Versions
- PEFT: 0.19.1
- TRL: 1.2.0
- Transformers: 5.7.0.dev0
- PyTorch: 2.11.0
- Datasets: 4.8.4
- Tokenizers: 0.22.2
Example Use Cases
- AI chatbot safety gateway
- Enterprise prompt firewall
- API request validation layer
- Research on adversarial NLP
Citation
If you use this model, please cite:
@software{jailbreak_detector_llama,
  title  = {Jailbreak Detector LLaMA (LoRA Adapter)},
  author = {Joel James and Juan James},
  year   = {2026},
  url    = {https://huggingface.co/joeljames270/jailbreak_detector_llama}
}
License
This model is based on Meta's Llama 3.2 license. Use of the base model must comply with the terms provided by Meta.
Final Note
This model is best used as a first-layer defense system in LLM pipelines, not as a standalone moderation system.