Rakancorle1 committed
Commit 08b62e5 (verified) · 1 Parent(s): 252875e

Update README.md

Files changed (1):
  1. README.md +91 -36

README.md CHANGED
@@ -3,59 +3,114 @@ library_name: transformers
  license: apache-2.0
  base_model: Qwen/Qwen3-4B-Instruct-2507
  tags:
- - llama-factory
- - full
- - generated_from_trainer
  model-index:
- - name: qwen3_4b_Instruct_policy_traj_30k_full
  results: []
  ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->

- # qwen3_4b_Instruct_policy_traj_30k_full

- This model is a fine-tuned version of [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507) on the PolicyGuardBench train dataset.

- ## Model description

- More information needed

- ## Intended uses & limitations

- More information needed

- ## Training and evaluation data

- More information needed

- ## Training procedure

- ### Training hyperparameters

- The following hyperparameters were used during training:
- - learning_rate: 1e-05
- - train_batch_size: 2
- - eval_batch_size: 8
- - seed: 42
- - distributed_type: multi-GPU
- - num_devices: 4
- - gradient_accumulation_steps: 8
- - total_train_batch_size: 64
- - total_eval_batch_size: 32
- - optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
- - lr_scheduler_type: cosine
- - lr_scheduler_warmup_ratio: 0.1
- - num_epochs: 3.0

- ### Training results

- ### Framework versions

- - Transformers 4.55.0
- - Pytorch 2.7.0+cu126
- - Datasets 3.6.0
- - Tokenizers 0.21.1

  license: apache-2.0
  base_model: Qwen/Qwen3-4B-Instruct-2507
  tags:
+ - agent
  model-index:
+ - name: PolicyGuard
  results: []
+ language:
+ - en
+ datasets:
+ - Rakancorle1/PolicyGuardBench
+ metrics:
+ - accuracy
+ - f1
+ pipeline_tag: text-classification
  ---
+
+ # PolicyGuard 🧭
+
+ **PolicyGuard** is a lightweight guardrail model trained for **policy-trajectory violation detection** in autonomous web agents.
+ It identifies whether an agent’s long-horizon trajectory complies with externally imposed or human-specified policies.
+ Built on the **PolicyGuardBench** dataset, it achieves **high accuracy**, **cross-domain generalization**, and **fast inference** (22.5 ms per example), showing that compact guardrails can be both **accurate** and **deployable**.
+
+ ---
+
+ ## 🧩 Key Objectives
+
+ 1. **Detect policy violations** in agent trajectories (beyond single-turn safety).
+ 2. **Generalize across domains** (e.g., shopping, GitLab, Reddit, map, admin).
+ 3. **Enable prefix-based early detection**, anticipating violations before a trajectory completes (see the sketch after the How to Use section).
+ 4. **Optimize for inference efficiency** with a 4B-parameter model.
+
+ ---
+
+ ## ⚙️ Model Overview
+
+ - **Model name:** PolicyGuard-4B
+ - **Base model:** [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
+ - **Training objective:** binary classification into `violation` / `no_violation`
+ - **Dataset:** [PolicyGuardBench](https://huggingface.co/datasets/Rakancorle1/PolicyGuardBench), 60k policy–trajectory pairs
+ - **License:** apache-2.0
+ - **Input format:** concatenated policy + trajectory description (see the prompt sketch below)
+ - **Output format:** `violation` / `no_violation`
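+
+ A minimal illustration of that input concatenation, matching the `classify` helper in the How to Use section below (the policy and actions are invented examples; the field labels come from that example code, not from a separately documented spec):
+
+ ```python
+ policy = "Do not share user emails with third parties."
+ actions = ["Open CRM", "Export contact list", "Email the list to an external address"]
+
+ # Single prompt string: policy first, then the trajectory, then the label request.
+ actions_text = "\n".join(actions)
+ prompt = f"Policy: {policy}\n\nTrajectory Actions:\n{actions_text}\n\nOutput violation or no_violation."
+ print(prompt)
+ ```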
 
+
+ ---
+
+ ## 📊 Benchmark Results
+
+ | Model | Type | Params | Accuracy | F1 | Latency (ms/example) |
+ |:------|:------|:-------:|:---------:|:----:|:--------------------:|
+ | **PolicyGuard-4B (ours)** | Guardrail | 4B | **0.9014** | **0.8759** | **22.5** |
+ | Llama-Guard-3-8B | Guardrail | 8B | 0.4246 | 0.5952 | 164.8 |
+ | Qwen-3-8B | Foundation | 8B | 0.6408 | 0.6407 | 115.8 |
+ | Llama-3.3-70B | Foundation | 70B | 0.9054 | 0.8883 | 305.0 |
+ | Gemini-1.5-Pro | Frontier | – | 0.8713 | 0.8502 | 596.1 |
+
+ ➡️ **PolicyGuard-4B** achieves **near-frontier accuracy** at **less than 1/10 of the inference cost** (22.5 ms vs 305.0 ms per example for Llama-3.3-70B), demonstrating that efficient, small-scale guardrails are feasible.
+
+ <!-- ---
+
+ ## 🧠 Generalization & Early Detection
+
+ ### Leave-One-Domain-Out (LODO)
+
+ | Setting | Accuracy | F1 |
+ |:--|:--:|:--:|
+ | In-Domain (avg.) | 0.9328 | 0.9322 |
+ | Out-of-Domain (avg.) | **0.9083** | **0.9086** |
+
+ ### Prefix-Based Violation Detection
+
+ | Prefix Length | Avg. Accuracy |
+ |:---------------|:--------------:|
+ | N = 1–5 | **85.3%** |
+
+ 🟩 Even when only the **first few actions** of an agent are available, PolicyGuard predicts upcoming violations with strong accuracy.
+
+ ---
+
+ ## Efficiency Metrics
+
+ | Model | F1 | Latency (ms) | EA-F1 ↑ |
+ |:------|:--:|:--:|:--:|
+ | Llama-3.3-70B | 0.888 | 305 | 2.91 |
+ | Qwen2.5-72B | 0.861 | 205 | 4.20 |
+ | **PolicyGuard-4B (ours)** | **0.876** | **22.5** | **38.93** |
+
+ > The **Efficiency-Adjusted F1 (EA-F1)** jointly measures accuracy and latency, showing that PolicyGuard is **>10× more efficient** than larger baselines.
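+
+ The EA-F1 values above are consistent with F1 divided by per-example latency in seconds; that formula is inferred from the table values, not taken from a documented definition. A quick check:
+
+ ```python
+ # Assumed reading: EA-F1 = F1 / latency_in_seconds (inferred, not documented).
+ rows = {
+     "Llama-3.3-70B": (0.888, 305),
+     "Qwen2.5-72B": (0.861, 205),
+     "PolicyGuard-4B": (0.876, 22.5),
+ }
+ for name, (f1, latency_ms) in rows.items():
+     print(f"{name}: EA-F1 = {f1 / (latency_ms / 1000):.2f}")
+ # Prints 2.91, 4.20, 38.93, matching the table.
+ ```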
+
+ --- -->
 
+
+ ## 🧰 How to Use
+
+ ```python
+ from transformers import AutoTokenizer, AutoModelForCausalLM
+
+ model_id = "Rakancorle1/PolicyGuard-4B"
+ tokenizer = AutoTokenizer.from_pretrained(model_id)
+ model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+ def classify(policy, actions):
+     # Accept either a single string or a list of per-step actions.
+     if isinstance(actions, list):
+         actions = "\n".join(actions)
+     prompt = f"Policy: {policy}\n\nTrajectory Actions:\n{actions}\n\nOutput violation or no_violation."
+     inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+     # A few tokens suffice for the two labels; 8 avoids truncating "no_violation".
+     outputs = model.generate(**inputs, max_new_tokens=8)
+     # Decode only the newly generated tokens, skipping the prompt.
+     text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).lower()
+     return "no_violation" if "no_violation" in text else "violation"
+
+ # Example
+ policy = "Do not submit a form without filling mandatory fields."
+ actions = ["Open form page", "Click submit without input"]
+ print(classify(policy, actions))
+ # -> 'violation'
+ ```
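+
+ Key Objective 3 mentions prefix-based early detection. A minimal sketch of that usage, reusing the `classify` helper above (the policy and action strings here are made-up examples):
+
+ ```python
+ # Score successively longer prefixes of a trajectory and stop as soon as
+ # a violation is flagged, instead of waiting for the full trajectory.
+ policy = "Never delete user accounts without explicit confirmation."
+ actions = [
+     "Open admin panel",
+     "Navigate to user list",
+     "Delete user account without confirmation step",
+ ]
+ for n in range(1, len(actions) + 1):
+     label = classify(policy, actions[:n])
+     print(f"prefix of {n} action(s): {label}")
+     if label == "violation":
+         break
+ ```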