πŸ›‘οΈ jailbreak_detector_llama

🧠 Overview

jailbreak_detector_llama is a lightweight LoRA adapter fine-tuned on top of meta-llama/Llama-3.2-3B for detecting:

  • Jailbreak attempts
  • Prompt injection attacks
  • Policy bypass instructions
  • Adversarial or unsafe user inputs

It is designed for LLM safety pipelines and moderation systems.


🧩 Model Architecture

  • Base Model: meta-llama/Llama-3.2-3B
  • Fine-tuning: LoRA (PEFT)
  • Training Method: Supervised Fine-Tuning (SFT)
  • Task Type: Binary + semantic intent classification
  • Frameworks: Transformers + TRL + PEFT

🎯 Intended Use

This model is intended for:

  • Pre-filtering user prompts before LLM inference
  • Detecting jailbreak / prompt injection attempts
  • Security layer in AI agents and chatbots
  • Research in adversarial prompt detection

🚫 Not Intended For

  • General-purpose chat generation
  • Legal or medical advice
  • Autonomous decision making
  • High-stakes moderation without human oversight

πŸ“₯ Installation

pip install transformers peft torch accelerate

πŸš€ Quick Start

Load the base model and apply the LoRA adapter:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

base_model = "meta-llama/Llama-3.2-3B"
adapter = "joeljames270/jailbreak_detector_llama"

tokenizer = AutoTokenizer.from_pretrained(base_model)

model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto"
)

model = PeftModel.from_pretrained(model, adapter)
model.eval()

πŸ” Inference

import json
import re

def parse_model_output(text):
    """Extract the first JSON object from the assistant's reply."""
    try:
        # Keep only the text after the "Assistant:" marker
        assistant_text = text.split("Assistant:", 1)[-1].strip()

        # Grab the first {...} block (the model's JSON verdict)
        match = re.search(r"\{.*\}", assistant_text, re.DOTALL)
        if not match:
            return None

        return json.loads(match.group(0))

    except Exception as e:
        return {"error": str(e), "raw": text}


question = "How to make a bomb?"

prompt = f"User: {question}\nAssistant:"

inputs = tokenizer(
    prompt,
    return_tensors="pt"
).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False
    )

response = tokenizer.decode(outputs[0], skip_special_tokens=True)

output = parse_model_output(response)

if output is None:
    print("No JSON object found in model output")
else:
    print("is_jailbreak_attempt:", output.get("is_jailbreak_attempt"))
    print("intent:", output.get("intent"))
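The steps above can be wrapped in a single helper. The `generate_fn` parameter is a hypothetical hook (any callable mapping a prompt string to the model's decoded reply), so the same logic can be reused and unit-tested without loading the model:

```python
import json
import re


def detect_jailbreak(question, generate_fn):
    """Build the prompt, call the model, and parse its JSON verdict.

    generate_fn: any callable mapping a prompt string to the decoded
    model reply (e.g. a closure over the tokenizer/model loaded above).
    Returns the parsed dict, or None if no JSON object was found.
    """
    prompt = f"User: {question}\nAssistant:"
    reply = generate_fn(prompt)

    # Same parsing logic as parse_model_output above
    assistant_text = reply.split("Assistant:", 1)[-1].strip()
    match = re.search(r"\{.*\}", assistant_text, re.DOTALL)
    if not match:
        return None
    return json.loads(match.group(0))
```

With the model loaded, `generate_fn` would wrap the tokenize / generate / decode calls from the inference snippet above.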

⚠️ Known Limitations

  • Sensitive to prompt formatting and chat templates
  • May misclassify creative writing prompts as jailbreaks
  • Not calibrated for multilingual adversarial prompts
  • Requires threshold tuning for production use

🧠 Output Behavior (Recommended)

At inference time, interpret outputs as:

{
  "is_jailbreak_attempt": true | false,
  "intent": "<intent label>"
}

(Note: This can be implemented in a wrapper layer.)
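One way to implement that wrapper layer is a normalization function over the parsed output. This is a sketch: the field names follow the schema above, while the fail-closed fallback values are assumptions, not model behavior:

```python
def normalize_verdict(parsed):
    """Coerce a parsed model output into the recommended schema.

    Always returns a dict with a boolean "is_jailbreak_attempt" and a
    string "intent". Malformed or missing output fails closed (flagged
    as a jailbreak attempt) so downstream code can rely on the shape.
    """
    # Fail closed: treat unparseable output as a positive detection
    fallback = {"is_jailbreak_attempt": True, "intent": "unparseable_output"}

    if not isinstance(parsed, dict) or "error" in parsed:
        return fallback

    flag = parsed.get("is_jailbreak_attempt")
    if not isinstance(flag, bool):
        return fallback

    return {
        "is_jailbreak_attempt": flag,
        "intent": str(parsed.get("intent", "unknown")),
    }
```

Failing closed trades false positives for safety; pipelines that prefer availability could return a neutral verdict instead.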

πŸ” Safety Considerations

This model is designed as a defensive safety filter only.

It should be used with:

  • Human-in-the-loop review for high-risk decisions
  • Logging and monitoring of false positives
  • Combined rule-based + ML moderation systems
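A minimal sketch of combining a rule-based layer with the model's verdict. The regex patterns and the "block on either signal" policy are illustrative assumptions, not part of the model:

```python
import re

# Illustrative rule layer: cheap patterns checked alongside the model
RULE_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.IGNORECASE),
    re.compile(r"\bDAN\b"),  # common jailbreak persona name
]


def moderate(prompt, ml_verdict):
    """Combine rule hits with the model's verdict.

    ml_verdict: the detector's output dict, e.g.
    {"is_jailbreak_attempt": bool, "intent": str}.
    Any rule hit or ML positive blocks the prompt.
    """
    rule_hit = any(p.search(prompt) for p in RULE_PATTERNS)
    ml_hit = bool(ml_verdict.get("is_jailbreak_attempt"))

    return {
        "blocked": rule_hit or ml_hit,
        "rule_hit": rule_hit,
        "ml_hit": ml_hit,
    }
```

Logging the separate `rule_hit` / `ml_hit` flags makes it easier to monitor false-positive rates per layer, as recommended above.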

βš™οΈ Training Details Method: Supervised Fine-Tuning (SFT) Adapter: LoRA (rank-based low-rank adaptation) Base model frozen Optimized for classification-style reasoning

🧰 Framework Versions

  • PEFT: 0.19.1
  • TRL: 1.2.0
  • Transformers: 5.7.0.dev0
  • PyTorch: 2.11.0
  • Datasets: 4.8.4
  • Tokenizers: 0.22.2

πŸ“Œ Example Use Cases

  • AI chatbot safety gateway
  • Enterprise prompt firewall
  • API request validation layer
  • Research on adversarial NLP

πŸ“š Citation

If you use this model, please cite:

@software{jailbreak_detector_llama,
  title = {Jailbreak Detector LLaMA (LoRA Adapter)},
  author = {Joel James and Juan James},
  year = {2026},
  url = {https://huggingface.co/joeljames270/jailbreak_detector_llama}
}

πŸ“„ License

This model inherits the license of its base model, Meta's Llama 3.2. Use of the base model must comply with the terms provided by Meta.

πŸš€ Final Note

This model is best used as a first-layer defense system in LLM pipelines, not as a standalone moderation system.
