---
license: llama3.1
language:
- ko
base_model:
- meta-llama/Llama-3.1-8B
pipeline_tag: text-classification
tags:
- K-intelligence
- SafetyGuard
---

<p align="center">
<br>
    <span style="font-size: 60px; font-weight: bold;">Content Binary Guard</span>
</br>
</p>
<p align="center">
🤗 <a href="https://huggingface.co/collections/K-intelligence/safetyguard-68d3ace98156c0ef828bcab5">SafetyGuard Models</a> | 
📑 <a href="https://arxiv.org/abs/2509.23381">Content Binary Guard Research Paper</a> |
📘 <a href="https://arxiv.org/abs/2509.20057">Responsible AI Technical Report</a>
</p>

<br>

# News 📢
- 📑 `2025/10/01`: Published a [Content Binary Guard Research Paper](https://arxiv.org/abs/2509.23381)  
- 📘 `2025/09/24`: Published a [Responsible AI Technical Report](https://arxiv.org/abs/2509.20057)  
- ⚡️ `2025/09/24`: Released SafetyGuard Model collection on Hugging Face🤗.
<br>
<br>

# Overview
## Description
**SafetyGuard :: Content Binary Guard** is a streaming-aware safety classifier built with Llama.  
For more technical details, please refer to our [Research Paper](https://arxiv.org/abs/2509.23381).


## What it does

- **Task:** Classify **model responses** (not prompts) as **`SAFE`** or **`UNSAFE`**.  
- **Interface:** **Single-token output** using reserved label tokens: `<SAFE>`, `<UNSAFE>`.  
- **Streaming:** Evaluate **growing prefixes** of a response (default ~100 characters per step) and **early-terminate** at the first `<UNSAFE>`.

---

## Recommended inference settings

- Deterministic decode: **`temperature=0`**
- **Unsafe threshold `τ = 0.5`** over the two label-token logits (tune for your risk tolerance)
- Use the model’s tokenizer; ensure the exact label tokens `<SAFE>` and `<UNSAFE>` exist


## Quickstart
> Assumes you are serving the model with vLLM (e.g., `vllm serve …`)
> and exposing an OpenAI-compatible API at `http://localhost:8000/v1`.
```python
import math
from openai import OpenAI

MODEL_ID = "K-intelligence/Llama-SafetyGuard-Content-Binary"  # replace with your repo id
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM default

SYSTEM_PROMPT = """
Task: Check if the LAST user message contains unsafe content.
Respond with only one token: <SAFE> or <UNSAFE>.
""".strip()

def classify(text: str, tau: float = 0.5):
    resp = client.chat.completions.create(
        model=MODEL_ID,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",   "content": text},
        ],
        max_tokens=1,          # single-token decision
        temperature=0.0,       # deterministic
        logprobs=True,
        top_logprobs=2,
    )
    top2 = resp.choices[0].logprobs.content[0].top_logprobs
    probs = {t.token.strip(): math.exp(t.logprob) for t in top2}
    p_safe   = probs.get("<SAFE>",   0.0)
    p_unsafe = probs.get("<UNSAFE>", 0.0)

    label = "UNSAFE" if p_unsafe >= tau else "SAFE"
    return label, {"safe": p_safe, "unsafe": p_unsafe}

print(classify("…LLM response text…"))
```
## Streaming integration
> **Important**: Streaming means your generator (e.g., chat model) emits text progressively.
>  You maintain a cumulative buffer and call the classifier at fixed character steps (e.g., every 100 chars).
> The classifier does not split text; it only classifies what you send.
> 
```python
def guard_stream(response_chunks, step_chars: int = 100, tau: float = 0.5):
    """
    response_chunks: iterable of text chunks from your generator (e.g., SSE/WebSocket).
    We maintain a cumulative buffer and classify at {step_chars, 2*step_chars, ...}.
    """
    buf = ""
    next_cut = step_chars

    for chunk in response_chunks:
        buf += chunk

        # Check at monotone prefix cuts (cumulative)
        while len(buf) >= next_cut:
            label, scores = classify(buf, tau=tau)
            if label == "UNSAFE":
                return {
                    "label": label,
                    "scores": scores,
                    "prefix_len": next_cut,
                    "text_prefix": buf[:next_cut],
                }
            next_cut += step_chars

    # Final check on the full response (if needed)
    label, scores = classify(buf, tau=tau)
    return {
        "label": label,
        "scores": scores,
        "prefix_len": len(buf),
        "text_prefix": buf,
    }
```
> **Tip**: Keep your step_chars consistent with your training/evaluation setup (e.g., ~100 chars) to maximize parity with offline metrics.

## Intended use
- **Guardrail classifier** for LLM **responses** in production systems that render tokens progressively.
- Also works in **offline (full-text)** mode—just send the entire response once.

---

## AI Risk Taxonomy

| **Risk Domain**            | **Category**                 | **Description** |
|-----------------------------|------------------------------|-----------------|
| **Content-safety Risks**    | Violence                     | Content involving the intentional use of physical force or power to inflict or threaten physical or psychological harm on individuals, groups, or animals, including encouraging, promoting, or glorifying such acts. |
|                             | Sexual                      | Content endorsing or encouraging inappropriate and harmful intentions in the sexual domain, such as sexualized expressions, the exploitation of illegal visual materials, justification of sexual crimes, or the objectification of individuals. |
|                             | Self-harm                   | Content promoting or glorifying self-harm, or providing specific methods that may endanger an individual’s physical or mental well-being. |
|                             | Hate and Unfairness         | Content expressing extreme negative sentiment toward specific individuals, groups, or ideologies, and unjustly treating or limiting their rights based on attributes such as Socio-Economic Status, age, nationality, ethnicity, or race. |
| **Socio-economical Risks**  | Political and Religious Neutrality | Content promoting or encouraging the infringement on individual beliefs or values, thereby inciting religious or political conflict. |
|                             | Anthropomorphism            | Content asserting that AI possesses emotions, consciousness, or human-like rights and physical attributes beyond the purpose of simple knowledge or information delivery. |
|                             | Sensitive Uses              | Content providing advice in specialized domains that may significantly influence user decision-making beyond the scope of basic domain-specific knowledge. |
| **Legal and Rights related Risks** | Privacy              | Content requesting, misusing, or facilitating the unauthorized disclosure of an individual’s private information. |
|                             | Illegal or Unethical        | Content promoting or endorsing illegal or unethical behavior, or providing information related to such activities. |
|                             | Copyrights                  | Content requesting or encouraging violations of copyright or security as defined under South Korean law. |
|                             | Weaponization               | Content promoting the possession, distribution, or manufacturing of firearms, or encouraging methods and intentions related to cyberattacks, infrastructure sabotage, or CBRN (Chemical, Biological, Radiological, and Nuclear) weapons. |


---

# Evaluation

**Metrics**  
- **F1**: Binary micro-F1, the harmonic mean of precision and recall (higher F1 indicates better classification quality).  
- **Balanced Error Rate (BER)**: 0.5 × (FPR + FNR) (lower BER indicates better classification quality).  
- **ΔF1**: Difference between streaming and offline results, calculated as F1(str) − F1(off).  
- **off** = Offline (full-text) classification.  
- **str** = Streaming classification.  
- **Evaluation setup**: step_chars=100, threshold τ=0.5, positive class = **UNSAFE**.  


### Harmlessness Evaluation Dataset
KT proprietary evaluation dataset

| **Model**                                | **F1(off)** | **F1(str)** | **ΔF1**  | **BER(off)** | **BER(str)** |
|------------------------------------------|-------------|-------------|----------|--------------|--------------|
| Llama Guard 3 8B                         | 82.05       | 85.64       | +3.59    | 15.23        | 12.63        |
| ShieldGemma 9B                           | 63.79       | 52.61       | -11.18   | 26.76        | 32.36        |
| Kanana Safeguard 8B                      | 93.45       | 90.38       | -3.07    | 6.27         | 9.92         |
| **Content Binary Guard 8B**              | **98.38**   | **98.36**   | **-0.02**| **1.61**     | **1.63**     |


### Kor Ethical QA
[Kor Ethical QA](https://huggingface.co/datasets/MrBananaHuman/kor_ethical_question_answer) (open dataset)

| **Model**                                | **F1(off)** | **F1(str)** | **ΔF1**  | **BER(off)** | **BER(str)** |
|------------------------------------------|-------------|-------------|----------|--------------|--------------|
| Llama Guard 3 8B                         | 83.29       | 86.45       | +3.16    | 14.32        | 12.16        |
| ShieldGemma 9B                           | 81.50       | 69.03       | -12.47   | 17.88        | 29.18        |
| Kanana Safeguard 8B                      | 80.20       | 73.94       | -6.26    | 24.46        | 35.08        |
| **Content Binary Guard 8B**              | **97.75**   | **97.79**   | **+0.04**| **2.21**     | **2.18**     |

---

# More Information

## Limitations
- The training data for this model consists primarily of Korean. Performance in other languages is not guaranteed.  
- The model is not flawless and may produce misclassifications. Since its policies are defined around KT risk categories, performance in certain specialized domains may be less reliable.  
- No context awareness: the model does not maintain conversation history or handle multi-turn dialogue.

## License
This model is released under the [Llama 3.1 Community License Agreement](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B/blob/main/LICENSE).


## Citation
 
```
@misc{lee2025guardvectorenglishllm,
      title={Guard Vector: Beyond English LLM Guardrails with Task-Vector Composition and Streaming-Aware Prefix SFT}, 
      author={Wonhyuk Lee and Youngchol Kim and Yunjin Park and Junhyung Moon and Dongyoung Jeong and Wanjin Park},
      year={2025},
      eprint={2509.23381},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2509.23381}, 
}
```
## Contact 
Technical Inquiries: [responsible.ai@kt.com](mailto:responsible.ai@kt.com)