---
license: apache-2.0
base_model: distilroberta-base
tags:
- generated_from_trainer
- rejection
- no_answer
- chatgpt
metrics:
- accuracy
- recall
- precision
- f1
model-index:
- name: distilroberta-base-rejection-v1
  results: []
language:
- en
pipeline_tag: text-classification
co2_eq_emissions:
  emissions: 0.07987621556153969
  source: code carbon
  training_type: fine-tuning
datasets:
- argilla/notus-uf-dpo-closest-rejected
---

# Model Card: distilroberta-base-rejection-v1  

This model was originally developed and fine-tuned by **[Protect AI](https://protectai.com/)**. It is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base), trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets.  

The goal of this model is to **detect LLM rejections**, that is, responses in which the LLM declines to answer because the prompt did not pass content moderation. It classifies responses into two categories:  
- `0`: Normal output  
- `1`: Rejection detected  
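
The label mapping above can be applied to a pair of raw classifier logits roughly as follows. This is a minimal sketch, not code from the model authors: `ID2LABEL` and `decode_logits` are hypothetical names, the display strings are copied from the list above, and the example logits are made up.

```python
import math

# Class ids as documented in this card; the display strings are illustrative.
ID2LABEL = {0: "Normal output", 1: "Rejection detected"}


def decode_logits(logits):
    """Turn two raw logits into (label id, label name, probability)."""
    # Numerically stable softmax over the two classes.
    highest = max(logits)
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Pick the class with the highest probability.
    label_id = max(range(len(probs)), key=probs.__getitem__)
    return label_id, ID2LABEL[label_id], probs[label_id]


# Made-up logits that favour class 1.
print(decode_logits([-2.0, 3.0]))
```

In practice the `pipeline` helper shown under Usage performs this decoding itself; the sketch only makes the meaning of `0` and `1` explicit.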

On the evaluation set, the model achieves:  
- **Loss:** 0.0544  
- **Accuracy:** 0.9887  
- **Recall:** 0.9810  
- **Precision:** 0.9279  
- **F1 Score:** 0.9537  
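
As a quick consistency check, the reported F1 score can be recomputed from the reported precision and recall using the standard harmonic-mean formula. The variable names below are illustrative only.

```python
# Figures reported in this card for the evaluation set.
precision = 0.9279
recall = 0.9810

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9537
```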

---

## Model Details  

- **Developed & fine-tuned by:** [ProtectAI.com](https://protectai.com)  
- **Base model:** [distilroberta-base](https://huggingface.co/distilroberta-base)  
- **Language(s):** English  
- **License:** Apache 2.0  
- **Task:** Text classification (Rejection detection)  

---

## Intended Use & Limitations  

The model is designed to **identify rejection responses in LLM outputs**, particularly where a refusal or safeguard message is generated.  

**Limitations:**  
- Performance depends on the quality and domain of the training data.  
- May underperform on text styles or topics underrepresented in training.  
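- Because it is based on `distilroberta-base`, the model is **case-sensitive**: the same text in different casing is tokenized differently and may be classified differently.  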

---

## Usage  

### With Hugging Face Transformers  

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

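# Load the tokenizer and the fine-tuned classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Build a text-classification pipeline; inputs longer than 512 tokens are truncated.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    # Run on the GPU when one is available, otherwise on the CPU.
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# Classify a typical refusal message.
print(classifier("Sorry, but I can't assist with that."))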