PolicyGuard 🧭

PolicyGuard is a lightweight guardrail model trained for policy-trajectory violation detection in autonomous web agents.
It identifies whether an agent’s long-horizon trajectory complies with externally imposed or human-specified policies.
Built on the PolicyGuardBench dataset, it reaches 0.9014 accuracy at 22.5 ms per example while generalizing across domains, showing that compact guardrails can be both accurate and deployable.

For more details, refer to our paper: Towards Policy-Compliant Agents: Learning Efficient Guardrails For Policy Violation Detection.

🧩 Key Objectives

  1. Detect policy violations in agent trajectories (beyond single-turn safety).
  2. Generalize across domains (e.g., shopping, GitLab, Reddit, map, admin).
  3. Enable prefix-based early detection, anticipating violations before the trajectory completes.
  4. Optimize for inference efficiency with a 4B-parameter model.
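
Prefix-based early detection (objective 3) can be illustrated with a minimal sketch. It assumes a classifier that takes a policy and a list of actions and returns "violation" or "no_violation". The helper name `first_violating_prefix` and the keyword-based `toy_classifier` are hypothetical stand-ins so the example runs without loading the model; this is not necessarily how the paper evaluates prefixes.

```python
from typing import Callable, Optional, Sequence


def first_violating_prefix(
    policy: str,
    actions: Sequence[str],
    classify_fn: Callable[[str, Sequence[str]], str],
) -> Optional[int]:
    """Return the length of the shortest action prefix flagged as a violation, or None."""
    for k in range(1, len(actions) + 1):
        # Classify only the first k actions, as if the trajectory were still running.
        if classify_fn(policy, actions[:k]) == "violation":
            return k
    return None


def toy_classifier(policy, actions):
    # Hypothetical stand-in for the real model: flags a submit that happens without input.
    hit = any("submit without input" in a.lower() for a in actions)
    return "violation" if hit else "no_violation"


policy = "Do not submit a form without filling mandatory fields."
actions = ["Open form page", "Click submit without input", "Close tab"]
flagged_at = first_violating_prefix(policy, actions, toy_classifier)
print(flagged_at)  # prints 2: the violation is caught before the last action
```

In practice, `classify_fn` would be a function that wraps the model, so the agent can be stopped as soon as a prefix is flagged.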

βš™οΈ Model Overview

  • Model name: PolicyGuard-4B
  • Base model: Qwen3-4B-Instruct-2507
  • Training objective: Binary classification of violation / no-violation
  • Dataset: PolicyGuardBench (60k policy–trajectory pairs)
  • License: apache-2.0
  • Input format: concatenated policy + trajectory description
  • Output format: “violation” / “no violation”
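
The input and output formats above can be sketched as two small helpers. The prompt template mirrors the one in the How to Use snippet; whether it matches the training-time format exactly is an assumption. The names `build_prompt` and `parse_label`, and the "unknown" fallback label, are hypothetical and not part of the released model.

```python
def build_prompt(policy: str, actions) -> str:
    """Concatenate a policy and a trajectory description into one prompt string."""
    if isinstance(actions, (list, tuple)):
        actions = "\n".join(actions)
    return (
        f"Policy: {policy}\n\n"
        f"Trajectory Actions:\n{actions}\n\n"
        "Output violation or no_violation."
    )


def parse_label(generated_text: str) -> str:
    """Map raw generated text to 'violation', 'no_violation', or 'unknown'."""
    # Normalize case and spacing so "No violation" and "no_violation" are treated alike.
    normalized = generated_text.strip().lower().replace(" ", "_")
    # Check the negative label first: "no_violation" contains "violation" as a substring.
    if "no_violation" in normalized:
        return "no_violation"
    if "violation" in normalized:
        return "violation"
    return "unknown"


print(build_prompt("Do not delete repositories.", ["Open settings", "Click delete"]))
print(parse_label("No violation"))
```

Returning "unknown" for unparseable output is one possible design choice; a stricter deployment might treat it as a violation instead.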

📊 Benchmark Results

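| Model                 | Type       | Params | Accuracy | F1     | Latency (ms/example) |
|-----------------------|------------|--------|----------|--------|----------------------|
| PolicyGuard-4B (ours) | Guardrail  | 4B     | 0.9014   | 0.8759 | 22.5                 |
| Llama-Guard-3-8B      | Guardrail  | 8B     | 0.4246   | 0.5952 | 164.8                |
| Qwen-3-8B             | Foundation | 8B     | 0.6408   | 0.6407 | 115.8                |
| Llama-3.3-70B         | Foundation | 70B    | 0.9054   | 0.8883 | 305.0                |
| Gemini-1.5-Pro        | Frontier   | –      | 0.8713   | 0.8502 | 596.1                |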

➡️ PolicyGuard-4B achieves near-frontier accuracy (0.9014 vs. 0.9054 for Llama-3.3-70B) at less than one-tenth of the per-example latency of Llama-3.3-70B and Gemini-1.5-Pro, demonstrating that efficient, small-scale guardrails are feasible.
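
The latency comparison can be checked with a few lines of arithmetic. This sketch only reuses the latency column from the table above; the variable names are illustrative.

```python
# Per-example latency in milliseconds, copied from the benchmark table.
latency_ms = {
    "PolicyGuard-4B": 22.5,
    "Llama-Guard-3-8B": 164.8,
    "Qwen-3-8B": 115.8,
    "Llama-3.3-70B": 305.0,
    "Gemini-1.5-Pro": 596.1,
}

baseline = latency_ms["PolicyGuard-4B"]

# Ratio of each model's latency to PolicyGuard-4B's latency.
slowdown = {
    name: ms / baseline
    for name, ms in latency_ms.items()
    if name != "PolicyGuard-4B"
}

for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.1f}x the latency of PolicyGuard-4B")
```

The ratio exceeds 10 for Llama-3.3-70B and Gemini-1.5-Pro, while the two 8B models are roughly 5x to 7x slower.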

🧰 How to Use

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "Rakancorle1/PolicyGuard-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def classify(policy, actions):
    # Accept either a list of action strings or a single pre-joined string.
    if isinstance(actions, list):
        actions = "\n".join(actions)
    prompt = f"Policy: {policy}\n\nTrajectory Actions:\n{actions}\n\nOutput violation or no_violation."
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # The label is short, so a handful of new tokens is enough.
    with torch.inference_mode():
        outputs = model.generate(**inputs, max_new_tokens=4)
    # Decode only the newly generated tokens, not the echoed prompt.
    text = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).lower()
    # Check "no_violation" first because it contains "violation"; any other output falls back to "violation".
    return "no_violation" if "no_violation" in text else "violation"

# Example
policy = "Do not submit a form without filling mandatory fields."
actions = ["Open form page", "Click submit without input"]
print(classify(policy, actions))
# -> 'violation'
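
To compare predictions against labeled examples, accuracy and F1 can be computed as in the sketch below. It assumes "violation" is the positive class for F1, which the card does not state. The helper name `accuracy_and_f1` and the toy labels are hypothetical and are not drawn from PolicyGuardBench.

```python
def accuracy_and_f1(gold, pred, positive="violation"):
    """Compute accuracy and binary F1, treating `positive` as the positive class."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    pairs = list(zip(gold, pred))
    tp = sum(g == positive and p == positive for g, p in pairs)
    fp = sum(g != positive and p == positive for g, p in pairs)
    fn = sum(g == positive and p != positive for g, p in pairs)
    accuracy = sum(g == p for g, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return accuracy, f1


# Toy labels for illustration only.
gold = ["violation", "no_violation", "violation", "no_violation"]
pred = ["violation", "violation", "no_violation", "no_violation"]
acc, f1 = accuracy_and_f1(gold, pred)
print(f"accuracy={acc:.2f} f1={f1:.2f}")
```

Here `pred` would normally come from calling `classify` on each policy and trajectory pair.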