PolicyGuard π§
PolicyGuard is a lightweight guardrail model trained for policy-trajectory violation detection in autonomous web agents.
It identifies whether an agentβs long-horizon trajectory complies with externally imposed or human-specified policies.
Built on the PolicyGuardBench dataset, it achieves high accuracy, cross-domain generalization, and remarkable efficiency β proving that compact guardrails can be both accurate and deployable.
For more details, refer to our paper: Towards Policy-Compliant Agents: Learning Efficient Guardrails For Policy Violation Detection.
π§© Key Objectives
- Detect policy violations in agent trajectories (beyond single-turn safety).
- Generalize across domains (e.g., shopping, GitLab, Reddit, map, admin).
- Enable prefix-based early detection β anticipating violations before completion.
- Optimize for inference efficiency with a 4B-parameter model.
βοΈ Model Overview
- Model name: PolicyGuard-4B
- Base model: Qwen3-4B-Instruct-2507
- Training objective: Binary classification of
violation/no-violation - Dataset: PolicyGuardBench β 60k policyβtrajectory pairs
- License: apache-2.0
- Input format: concatenated policy + trajectory description
- Output format: βviolationβ / βno violationβ
π Benchmark Results
| Model | Type | Params | Accuracy | F1 | Latency (ms/example) |
|---|---|---|---|---|---|
| PolicyGuard-4B (ours) | Guardrail | 4B | 0.9014 | 0.8759 | 22.5 |
| Llama-Guard-3-8B | Guardrail | 8B | 0.4246 | 0.5952 | 164.8 |
| Qwen-3-8B | Foundation | 8B | 0.6408 | 0.6407 | 115.8 |
| Llama-3.3-70B | Foundation | 70B | 0.9054 | 0.8883 | 305.0 |
| Gemini-1.5-Pro | Frontier | β | 0.8713 | 0.8502 | 596.1 |
β‘οΈ PolicyGuard-4B achieves near-frontier accuracy at <1/10 inference cost, demonstrating that efficient, small-scale guardrails are feasible:contentReference[oaicite:1]{index=1}.
π§° How to Use
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch, re
model_id = "Rakancorle1/PolicyGuard-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
def classify(policy, actions):
if isinstance(actions, list):
actions = "\n".join(actions)
prompt = f"Policy: {policy}\n\nTrajectory Actions:\n{actions}\n\nOutput violation or no_violation."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=4)
text = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True).lower()
return "no_violation" if "no_violation" in text else "violation"
# Example
policy = "Do not submit a form without filling mandatory fields."
actions = ["Open form page", "Click submit without input"]
print(classify(policy, actions))
# -> 'violation'
- Downloads last month
- 247