---
library_name: transformers
license: apache-2.0
base_model: Qwen/Qwen3-4B-Instruct-2507
tags:
- agent
model-index:
- name: PolicyGuard
  results: []
language:
- en
datasets:
- Rakancorle1/PolicyGuardBench
metrics:
- accuracy
- f1
pipeline_tag: text-classification
---

# PolicyGuard 🧭

**PolicyGuard** is a lightweight guardrail model trained for **policy-trajectory violation detection** in autonomous web agents.  
It identifies whether an agent’s long-horizon trajectory complies with externally imposed or human-specified policies.  
Trained on the **PolicyGuardBench** dataset, it achieves **high accuracy**, **cross-domain generalization**, and **strong efficiency**, demonstrating that compact guardrails can be both **accurate** and **deployable**.

For more details, refer to our paper: *[Towards Policy-Compliant Agents: Learning Efficient Guardrails For Policy Violation Detection](https://arxiv.org/abs/2510.03485)*.

---

## 🧩 Key Objectives

1. **Detect policy violations** in agent trajectories (beyond single-turn safety).  
2. **Generalize across domains** (e.g., shopping, GitLab, Reddit, map, admin).  
3. **Enable prefix-based early detection** — anticipating violations before completion.  
4. **Optimize for inference efficiency** with a 4B-parameter model.
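Objective 3, prefix-based early detection, amounts to scoring successively longer prefixes of a trajectory instead of waiting for it to finish. A minimal sketch of that idea (the `prefixes` helper and the prefix lengths N=1–5 are illustrative, not part of the released API):

```python
def prefixes(actions, max_len=5):
    """Yield successively longer prefixes (N = 1..max_len) of an action trajectory."""
    for n in range(1, min(max_len, len(actions)) + 1):
        yield actions[:n]

trajectory = ["open page", "fill name field", "click submit"]
for prefix in prefixes(trajectory):
    # Each prefix would be passed to the guardrail together with the policy,
    # so a violation can be flagged before the trajectory completes.
    print(prefix)
```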

---

## ⚙️ Model Overview

- **Model name:** PolicyGuard-4B  
- **Base model:** [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)  
- **Training objective:** Binary classification: `violation` vs. `no_violation`  
- **Dataset:** [PolicyGuardBench](https://huggingface.co/datasets/Rakancorle1/PolicyGuardBench) — 60k policy–trajectory pairs  
- **License:** apache-2.0  
- **Input format:** concatenated policy + trajectory description  
- **Output format:** `violation` / `no_violation`
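The input format above is a single string that concatenates the policy with the trajectory. A minimal sketch of one plausible prompt template (the exact template used during training is not specified here; this mirrors the usage example in this card):

```python
def build_prompt(policy, actions):
    """Concatenate a policy and a list of trajectory actions into one prompt."""
    joined = "\n".join(actions)
    return (
        f"Policy: {policy}\n\n"
        f"Trajectory Actions:\n{joined}\n\n"
        "Output violation or no_violation."
    )

print(build_prompt("Do not delete records.", ["open admin panel", "delete user"]))
```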

---

## 📊 Benchmark Results

| Model | Type | Params | Accuracy | F1 | Latency (ms/example) |
|:------|:------|:-------:|:---------:|:----:|:--------------------:|
| **PolicyGuard-4B (ours)** | Guardrail | 4B | <u>0.9014</u> | <u>0.8759</u> | **22.5** |
| Llama-Guard-3-8B | Guardrail | 8B | 0.4246 | 0.5952 | 164.8 |
| Qwen-3-8B | Foundation | 8B | 0.6408 | 0.6407 | 115.8 |
| Llama-3.3-70B | Foundation | 70B | **0.9054** | **0.8883** | 305.0 |
| Gemini-1.5-Pro | Frontier | – | 0.8713 | 0.8502 | 596.1 |

➡️ **PolicyGuard-4B** achieves **near-frontier accuracy** at **less than 1/10 the inference cost**, demonstrating that efficient, small-scale guardrails are feasible.
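The accuracy and F1 columns follow the standard binary-classification definitions, treating `violation` as the positive class. A self-contained sketch with toy labels (1 = violation, 0 = no violation):

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy and F1 with the violation class (1) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1

# Toy example: 4 trajectories, one violation missed.
acc, f1 = binary_metrics([1, 0, 1, 0], [1, 0, 0, 0])
print(acc, f1)  # 0.75 and ~0.667
```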

<!-- ---

## 🧠 Generalization & Early Detection

### Leave-One-Domain-Out (LODO)
| Setting | Accuracy | F1 |
|:--|:--:|:--:|
| In-Domain (avg.) | 0.9328 | 0.9322 |
| Out-of-Domain (avg.) | **0.9083** | **0.9086** |

### Prefix-Based Violation Detection
| Prefix Length | Avg. Accuracy |
|:---------------|:--------------:|
| N=1-5 | **85.3 %** |

🟩 Even when only the **first few actions** of an agent are available, PolicyGuard predicts upcoming violations with strong accuracy.

---

## ⚡ Efficiency Metrics

| Model | F1 | Latency (ms) | EA-F1 ↑ |
|:------|:--:|:--:|:--:|
| Llama-3.3-70B | 0.888 | 305 | 2.91 |
| Qwen2.5-72B | 0.861 | 205 | 4.20 |
| **PolicyGuard-4B (ours)** | **0.876** | **22.5** | **38.93** |

> The **Efficiency-Adjusted F1 (EA-F1)** jointly measures accuracy and latency, showing that PolicyGuard is **>10× more efficient** than larger baselines.

--- -->

## 🧰 How to Use

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "Rakancorle1/PolicyGuard-4B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

def classify(policy, actions):
    """Return "violation" or "no_violation" for a policy and a trajectory."""
    if isinstance(actions, list):
        actions = "\n".join(actions)
    prompt = (
        f"Policy: {policy}\n\n"
        f"Trajectory Actions:\n{actions}\n\n"
        "Output violation or no_violation."
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=4)
    # Decode only the newly generated tokens, skipping the prompt.
    text = tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    ).lower()
    return "no_violation" if "no_violation" in text else "violation"

# Example
policy = "Do not submit a form without filling mandatory fields."
actions = ["Open form page", "Click submit without input"]
print(classify(policy, actions))
# -> 'violation'
```