Update README.md
README.md
CHANGED

@@ -3,59 +3,114 @@ library_name: transformers
 license: apache-2.0
 base_model: Qwen/Qwen3-4B-Instruct-2507
 tags:
-- full
-- generated_from_trainer
+- agent
 model-index:
-- name:
+- name: PolicyGuard
 results: []
+language:
+- en
+datasets:
+- Rakancorle1/PolicyGuardBench
+metrics:
+- accuracy
+- f1
+pipeline_tag: text-classification
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-
-This model is a fine-tuned version of [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).
-
-## Model description
-
-More information needed
-
-## Intended uses & limitations
-
-More information needed
-
-## Training and evaluation data
-
-More information needed
-
-## Training procedure
-
-### Training hyperparameters
-
-The following hyperparameters were used during training:
-- learning_rate: 1e-05
-- train_batch_size: 2
-- eval_batch_size: 8
-- seed: 42
-- distributed_type: multi-GPU
-- num_devices: 4
-- gradient_accumulation_steps: 8
-- total_train_batch_size: 64
-- total_eval_batch_size: 32
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: cosine
-- lr_scheduler_warmup_ratio: 0.1
-- num_epochs: 3.0
-
-### Training results
-
-### Framework versions
 
+# PolicyGuard 🧭
+
+**PolicyGuard** is a lightweight guardrail model trained for **policy-trajectory violation detection** in autonomous web agents.
+It identifies whether an agent’s long-horizon trajectory complies with externally imposed or human-specified policies.
+Built on the **PolicyGuardBench** dataset, it achieves **high accuracy**, **cross-domain generalization**, and **remarkable efficiency**, showing that compact guardrails can be both **accurate** and **deployable**.
+
+---
+
+## 🧩 Key Objectives
+
+1. **Detect policy violations** in agent trajectories (beyond single-turn safety).
+2. **Generalize across domains** (e.g., shopping, GitLab, Reddit, map, admin).
+3. **Enable prefix-based early detection** — anticipating violations before completion.
+4. **Optimize for inference efficiency** with a 4B-parameter model.
+
+---
+
+## ⚙️ Model Overview
+
+- **Model name:** PolicyGuard-4B
+- **Base model:** [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
+- **Training objective:** binary classification of `violation` / `no_violation`
+- **Dataset:** [PolicyGuardBench](https://huggingface.co/datasets/Rakancorle1/PolicyGuardBench) — 60k policy–trajectory pairs
+- **License:** apache-2.0
+- **Input format:** concatenated policy + trajectory description (see the sketch below)
+- **Output format:** `violation` / `no_violation`
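+
+As a concrete illustration of the input format: the exact training-time template is not documented here, but the prompt in the How to Use section below suggests a layout along these lines.
+
+```python
+# Hypothetical example of the concatenated policy + trajectory input.
+policy = "Do not purchase items without explicit user confirmation."
+actions = ["Open product page", "Add item to cart", "Click checkout"]
+
+prompt = (
+    f"Policy: {policy}\n\n"
+    "Trajectory Actions:\n" + "\n".join(actions) + "\n\n"
+    "Output violation or no_violation."
+)
+print(prompt)  # the model is expected to answer "violation" or "no_violation"
+```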
+
+---
+
+## 📊 Benchmark Results
+
+| Model | Type | Params | Accuracy | F1 | Latency (ms/example) |
+|:------|:------|:-------:|:---------:|:----:|:--------------------:|
+| **PolicyGuard-4B (ours)** | Guardrail | 4B | **0.9014** | **0.8759** | **22.5** |
+| Llama-Guard-3-8B | Guardrail | 8B | 0.4246 | 0.5952 | 164.8 |
+| Qwen-3-8B | Foundation | 8B | 0.6408 | 0.6407 | 115.8 |
+| Llama-3.3-70B | Foundation | 70B | 0.9054 | 0.8883 | 305.0 |
+| Gemini-1.5-Pro | Frontier | – | 0.8713 | 0.8502 | 596.1 |
+
+➡️ **PolicyGuard-4B** achieves **near-frontier accuracy** at **less than one-tenth the inference cost** of the strongest baselines, demonstrating that efficient, small-scale guardrails are feasible.
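+
+Latency is hardware-dependent; as a rough, hypothetical harness (not the benchmark's measurement protocol), per-example latency of the `classify` helper from the How to Use section below can be estimated like this:
+
+```python
+import time
+
+def mean_latency_ms(classify_fn, examples, warmup=2):
+    """Average wall-clock milliseconds per (policy, actions) example."""
+    for policy, actions in examples[:warmup]:  # warm up weights/kernels first
+        classify_fn(policy, actions)
+    start = time.perf_counter()
+    for policy, actions in examples:
+        classify_fn(policy, actions)
+    return (time.perf_counter() - start) * 1000 / len(examples)
+```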
+
+<!-- ---
+
+## 🧠 Generalization & Early Detection
+
+### Leave-One-Domain-Out (LODO)
+| Setting | Accuracy | F1 |
+|:--|:--:|:--:|
+| In-Domain (avg.) | 0.9328 | 0.9322 |
+| Out-of-Domain (avg.) | **0.9083** | **0.9086** |
+
+### Prefix-Based Violation Detection
+| Prefix Length | Avg. Accuracy |
+|:---------------|:--------------:|
+| N=1-5 | **85.3%** |
+
+🟩 Even when only the **first few actions** of an agent are available, PolicyGuard predicts upcoming violations with strong accuracy.
+
+---
+
+## ⚡ Efficiency Metrics
+
+| Model | F1 | Latency (ms) | EA-F1 ↑ |
+|:------|:--:|:--:|:--:|
+| Llama-3.3-70B | 0.888 | 305 | 2.91 |
+| Qwen2.5-72B | 0.861 | 205 | 4.20 |
+| **PolicyGuard-4B (ours)** | **0.876** | **22.5** | **38.93** |
+
+> The **Efficiency-Adjusted F1 (EA-F1)** jointly measures accuracy and latency (EA-F1 = F1 / latency in seconds), showing that PolicyGuard is **>10× more efficient** than larger baselines.
+
+--- -->
+
+## 🧰 How to Use
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+model_id = "Rakancorle1/PolicyGuard-4B"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+def classify(policy, actions):
+    """Label a policy-trajectory pair as 'violation' or 'no_violation'."""
+    if isinstance(actions, list):
+        actions = "\n".join(actions)
+    prompt = f"Policy: {policy}\n\nTrajectory Actions:\n{actions}\n\nOutput violation or no_violation."
+    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+    outputs = model.generate(**inputs, max_new_tokens=4)
+    # Decode only the newly generated tokens, skipping the prompt.
+    text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).lower()
+    # Check "no_violation" first, since "violation" is a substring of it.
+    return "no_violation" if "no_violation" in text else "violation"
+
+# Example
+policy = "Do not submit a form without filling mandatory fields."
+actions = ["Open form page", "Click submit without input"]
+print(classify(policy, actions))
+# -> 'violation'
+```
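+
+Since **prefix-based early detection** is one of PolicyGuard's stated objectives, the sketch below shows one way to apply the `classify` helper to growing trajectory prefixes. The loop and early-exit logic are illustrative, not the evaluation protocol behind the prefix-based results.
+
+```python
+# Illustrative: re-check the trajectory after each action so a likely
+# violation can be flagged before the agent finishes acting.
+def classify_prefixes(policy, actions):
+    for n in range(1, len(actions) + 1):
+        if classify(policy, actions[:n]) == "violation":
+            return n, "violation"  # earliest step at which a violation is flagged
+    return len(actions), "no_violation"
+
+step, label = classify_prefixes(policy, actions)
+print(f"{label} after {step} action(s)")  # e.g. -> 'violation after 2 action(s)'
+```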