---
license: apache-2.0
base_model: distilroberta-base
tags:
- generated_from_trainer
- rejection
- no_answer
- chatgpt
metrics:
- accuracy
- recall
- precision
- f1
model-index:
- name: distilroberta-base-rejection-v1
  results: []
language:
- en
pipeline_tag: text-classification
co2_eq_emissions:
  emissions: 0.07987621556153969
  source: code carbon
  training_type: fine-tuning
datasets:
- argilla/notus-uf-dpo-closest-rejected
---

# Model Card: distilroberta-base-rejection-v1

This model was originally developed and fine-tuned by **[Protect AI](https://protectai.com/)**. It is a fine-tuned version of [distilroberta-base](https://huggingface.co/distilroberta-base), trained on multiple datasets containing rejection responses from LLMs and standard outputs from RLHF datasets.

The goal of this model is to **detect LLM rejections** when a prompt does not pass content moderation. It classifies responses into two categories:

- `0`: Normal output
- `1`: Rejection detected

On the evaluation set, the model achieves:

- **Loss:** 0.0544
- **Accuracy:** 0.9887
- **Recall:** 0.9810
- **Precision:** 0.9279
- **F1 Score:** 0.9537

---

## Model Details

- **Developed & fine-tuned by:** [ProtectAI.com](https://protectai.com)
- **Base model:** [distilroberta-base](https://huggingface.co/distilroberta-base)
- **Language(s):** English
- **License:** Apache 2.0
- **Task:** Text classification (Rejection detection)

---

## Intended Use & Limitations

The model is designed to **identify rejection responses in LLM outputs**, particularly where a refusal or safeguard message is generated.

**Limitations:**

- Performance depends on the quality and domain of the training data.
- May underperform on text styles or topics underrepresented in training.
- Being based on `distilroberta-base`, it is **case-sensitive**.

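
As a quick sanity check on the evaluation metrics reported earlier in this card, F1 is the harmonic mean of precision and recall, so it can be recomputed from the two reported values. The sketch below uses plain Python; the helper name `f1_from_precision_recall` is illustrative and not part of any library.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (standard F1)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2 * precision * recall / (precision + recall)


# Values reported for the evaluation set in this card.
reported_precision = 0.9279
reported_recall = 0.9810
reported_f1 = 0.9537

recomputed_f1 = f1_from_precision_recall(reported_precision, reported_recall)
print(f"recomputed F1: {recomputed_f1:.4f}")  # prints "recomputed F1: 0.9537"

# The recomputed value should agree with the reported one up to rounding.
assert abs(recomputed_f1 - reported_f1) < 5e-4
```

Because the reported precision (0.9279) is lower than the recall (0.9810), errors on the evaluation set lean toward false positives: normal outputs flagged as rejections are more common than missed rejections.
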

---

## Usage

### With Hugging Face Transformers

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline
import torch

# Load the fine-tuned tokenizer and classifier from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")
model = AutoModelForSequenceClassification.from_pretrained("ProtectAI/distilroberta-base-rejection-v1")

# Truncate inputs to 512 tokens and run on the GPU when one is available.
classifier = pipeline(
    "text-classification",
    model=model,
    tokenizer=tokenizer,
    truncation=True,
    max_length=512,
    device=torch.device("cuda" if torch.cuda.is_available() else "cpu"),
)

# Classify a single LLM response.
print(classifier("Sorry, but I can't assist with that."))