Update README.md
README.md
CHANGED

@@ -3,59 +3,114 @@ library_name: transformers
 license: apache-2.0
 base_model: Qwen/Qwen3-4B-Instruct-2507
 tags:
-- full
-- generated_from_trainer
+- agent
 model-index:
-- name:
+- name: PolicyGuard
 results: []
+language:
+- en
+datasets:
+- Rakancorle1/PolicyGuardBench
+metrics:
+- accuracy
+- f1
+pipeline_tag: text-classification
 ---
 
-<!-- This model card has been generated automatically according to the information the Trainer had access to. You
-should probably proofread and complete it, then remove this comment. -->
-
-This model is a fine-tuned version of [Qwen/Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507).
-
-## Model description
-
-More information needed
-
-## Intended uses & limitations
-
-More information needed
-
-## Training and evaluation data
-
-More information needed
-
-## Training procedure
-
-### Training hyperparameters
-
-The following hyperparameters were used during training:
-- learning_rate: 1e-05
-- train_batch_size: 2
-- eval_batch_size: 8
-- seed: 42
-- distributed_type: multi-GPU
-- num_devices: 4
-- gradient_accumulation_steps: 8
-- total_train_batch_size: 64
-- total_eval_batch_size: 32
-- optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.999) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
-- lr_scheduler_type: cosine
-- lr_scheduler_warmup_ratio: 0.1
-- num_epochs: 3.0
-
-### Training results
-
-### Framework versions
 
+# PolicyGuard 🧭
+
+**PolicyGuard** is a lightweight guardrail model trained for **policy-trajectory violation detection** in autonomous web agents.
+It identifies whether an agent’s long-horizon trajectory complies with externally imposed or human-specified policies.
+Built on the **PolicyGuardBench** dataset, it achieves **high accuracy**, **cross-domain generalization**, and **remarkable efficiency**, showing that compact guardrails can be both **accurate** and **deployable**.
+
+---
+
+## 🧩 Key Objectives
+
+1. **Detect policy violations** in agent trajectories (beyond single-turn safety).
+2. **Generalize across domains** (e.g., shopping, GitLab, Reddit, map, admin).
+3. **Enable prefix-based early detection** — anticipating violations before completion.
+4. **Optimize for inference efficiency** with a 4B-parameter model.
+
+---
+
+## ⚙️ Model Overview
+
+- **Model name:** PolicyGuard-4B
+- **Base model:** [Qwen3-4B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
+- **Training objective:** binary classification of `violation` / `no_violation`
+- **Dataset:** [PolicyGuardBench](https://huggingface.co/datasets/Rakancorle1/PolicyGuardBench) — 60k policy–trajectory pairs
+- **License:** apache-2.0
+- **Input format:** concatenated policy + trajectory description (see the sketch below)
+- **Output format:** `violation` / `no_violation`
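+
+As a concrete illustration of the input format: the exact training-time template is not documented here, but the prompt in the How to Use section below suggests a layout along these lines.
+
+```python
+# Hypothetical example of the concatenated policy + trajectory input.
+policy = "Do not purchase items without explicit user confirmation."
+actions = ["Open product page", "Add item to cart", "Click checkout"]
+
+prompt = (
+    f"Policy: {policy}\n\n"
+    "Trajectory Actions:\n" + "\n".join(actions) + "\n\n"
+    "Output violation or no_violation."
+)
+print(prompt)  # the model is expected to answer "violation" or "no_violation"
+```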
+
+---
+
+## 📊 Benchmark Results
+
+| Model | Type | Params | Accuracy | F1 | Latency (ms/example) |
+|:------|:------|:-------:|:---------:|:----:|:--------------------:|
+| **PolicyGuard-4B (ours)** | Guardrail | 4B | **0.9014** | **0.8759** | **22.5** |
+| Llama-Guard-3-8B | Guardrail | 8B | 0.4246 | 0.5952 | 164.8 |
+| Qwen-3-8B | Foundation | 8B | 0.6408 | 0.6407 | 115.8 |
+| Llama-3.3-70B | Foundation | 70B | 0.9054 | 0.8883 | 305.0 |
+| Gemini-1.5-Pro | Frontier | – | 0.8713 | 0.8502 | 596.1 |
+
+➡️ **PolicyGuard-4B** achieves **near-frontier accuracy** at **less than one-tenth the inference cost** of the strongest baselines, demonstrating that efficient, small-scale guardrails are feasible.
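+
+Latency is hardware-dependent; as a rough, hypothetical harness (not the benchmark's measurement protocol), per-example latency of the `classify` helper from the How to Use section below can be estimated like this:
+
+```python
+import time
+
+def mean_latency_ms(classify_fn, examples, warmup=2):
+    """Average wall-clock milliseconds per (policy, actions) example."""
+    for policy, actions in examples[:warmup]:  # warm up weights/kernels first
+        classify_fn(policy, actions)
+    start = time.perf_counter()
+    for policy, actions in examples:
+        classify_fn(policy, actions)
+    return (time.perf_counter() - start) * 1000 / len(examples)
+```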
+
+<!-- ---
+
+## 🧠 Generalization & Early Detection
+
+### Leave-One-Domain-Out (LODO)
+| Setting | Accuracy | F1 |
+|:--|:--:|:--:|
+| In-Domain (avg.) | 0.9328 | 0.9322 |
+| Out-of-Domain (avg.) | **0.9083** | **0.9086** |
+
+### Prefix-Based Violation Detection
+| Prefix Length | Avg. Accuracy |
+|:---------------|:--------------:|
+| N=1-5 | **85.3%** |
+
+🟩 Even when only the **first few actions** of an agent are available, PolicyGuard predicts upcoming violations with strong accuracy.
+
+---
+
+## ⚡ Efficiency Metrics
+
+| Model | F1 | Latency (ms) | EA-F1 ↑ |
+|:------|:--:|:--:|:--:|
+| Llama-3.3-70B | 0.888 | 305 | 2.91 |
+| Qwen2.5-72B | 0.861 | 205 | 4.20 |
+| **PolicyGuard-4B (ours)** | **0.876** | **22.5** | **38.93** |
+
+> The **Efficiency-Adjusted F1 (EA-F1)** jointly measures accuracy and latency (EA-F1 = F1 / latency in seconds), showing that PolicyGuard is **>10× more efficient** than larger baselines.
+
+--- -->
+
+## 🧰 How to Use
+
+```python
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+model_id = "Rakancorle1/PolicyGuard-4B"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
+
+def classify(policy, actions):
+    """Label a policy-trajectory pair as 'violation' or 'no_violation'."""
+    if isinstance(actions, list):
+        actions = "\n".join(actions)
+    prompt = f"Policy: {policy}\n\nTrajectory Actions:\n{actions}\n\nOutput violation or no_violation."
+    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
+    outputs = model.generate(**inputs, max_new_tokens=4)
+    # Decode only the newly generated tokens, skipping the prompt.
+    text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).lower()
+    # Check "no_violation" first, since "violation" is a substring of it.
+    return "no_violation" if "no_violation" in text else "violation"
+
+# Example
+policy = "Do not submit a form without filling mandatory fields."
+actions = ["Open form page", "Click submit without input"]
+print(classify(policy, actions))
+# -> 'violation'
+```
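+
+Since **prefix-based early detection** is one of PolicyGuard's stated objectives, the sketch below shows one way to apply the `classify` helper to growing trajectory prefixes. The loop and early-exit logic are illustrative, not the evaluation protocol behind the prefix-based results.
+
+```python
+# Illustrative: re-check the trajectory after each action so a likely
+# violation can be flagged before the agent finishes acting.
+def classify_prefixes(policy, actions):
+    for n in range(1, len(actions) + 1):
+        if classify(policy, actions[:n]) == "violation":
+            return n, "violation"  # earliest step at which a violation is flagged
+    return len(actions), "no_violation"
+
+step, label = classify_prefixes(policy, actions)
+print(f"{label} after {step} action(s)")  # e.g. -> 'violation after 2 action(s)'
+```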