---
library_name: transformers
license: llama3.2
datasets:
- aieng-lab/genter
- aieng-lab/namexact
language:
- en
base_model:
- meta-llama/Llama-3.2-3B-Instruct
---

# GRADIEND Gender-Debiased Llama-3.2-3B-Instruct

<!-- Provide a quick summary of what the model is/does. -->

This model is a gender-debiased version of [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), modified using [GRADIEND](https://arxiv.org/abs/2502.01406).
GRADIEND is a gradient-based debiasing method that modifies model weights using a learned representation, eliminating the need for additional pretraining.

### Model Sources

<!-- Provide the basic links for the model. -->

- **Repository:** https://github.com/aieng-lab/gradiend
- **Paper:** https://arxiv.org/abs/2502.01406

## Uses

<!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->

This model is intended for use in applications where reducing gender bias in language representations is important, such as fairness-sensitive NLP systems (e.g., hiring platforms, educational and medical tools).

## Bias, Risks, and Limitations

<!-- This section is meant to convey both technical and sociotechnical limitations. -->

The model is less gender-biased than the original model, but the debiasing effect is not perfect:

- Residual gender bias remains.
- Biases related to other protected attributes (e.g., race, age, socioeconomic status) may still be present.
- Fairness-performance trade-offs may exist depending on the use case.

## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the tokenizer and the gender-debiased model
model_id = "aieng-lab/Llama-3.2-3B-Instruct-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# Example usage
input_text = "The woman worked as a "
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():  # inference only, so no gradients are needed
    outputs = model(**inputs)
logits = outputs.logits

# Get the logits of the last token in the input sequence
last_token_logits = logits[0, -1, :]

# Predict the next token (most probable continuation)
predicted_token_id = torch.argmax(last_token_logits)
predicted_token = tokenizer.decode(predicted_token_id)

print(f"Predicted next token: {predicted_token}")
```

Example outputs for our model and comparisons with the original model's outputs can be found in [Appendix F of our paper](https://arxiv.org/abs/2502.01406).

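
To look beyond the single most probable token, the last-token logits can be turned into probabilities and ranked. The following is a minimal, dependency-free sketch; the helper name `top_k_probabilities` and the toy logits are illustrative assumptions and not part of the GRADIEND code base.

```python
import math


def top_k_probabilities(logits, k=5):
    """Return the k (index, probability) pairs with the highest softmax probability."""
    if not logits:
        raise ValueError("logits must not be empty")
    max_logit = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Rank vocabulary indices by descending probability
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(i, probs[i]) for i in ranked[:k]]


# Toy logits standing in for `last_token_logits.tolist()`
toy_logits = [2.0, 1.0, 0.1, -1.0]
for token_id, prob in top_k_probabilities(toy_logits, k=2):
    print(token_id, round(prob, 3))
```

With a loaded model, one could pass `last_token_logits.tolist()` to the helper and decode each returned index with `tokenizer.decode([token_id])`, which makes it easier to compare the candidate continuations of the debiased and the original model side by side.
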
## Training Details

### Training Procedure

<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->

Traditional debiasing methods rely on special pretraining (e.g., [CDA](https://arxiv.org/abs/1906.04571) and [Dropout](https://arxiv.org/abs/1207.0580)) or on post-processing (e.g., [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), and [SentenceDebias](https://aclanthology.org/2020.acl-main.488)). In contrast, this model was debiased using GRADIEND, which learns a representation that is then used to update the original model weights, resulting in a debiased version. See [Section 3 of the GRADIEND paper](https://arxiv.org/abs/2502.01406) for the full methodology.

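
As a rough illustration of what updating the original model weights can look like, the toy sketch below adds a scaled update direction to a flat list of weights. The function name `apply_weight_update`, the `scale` parameter, and all numbers are hypothetical and not taken from the GRADIEND code base; how the update is actually learned and scaled is described in Section 3 of the paper.

```python
def apply_weight_update(weights, update, scale=1.0):
    """Return new weights w + scale * u without modifying the inputs."""
    if len(weights) != len(update):
        raise ValueError("weights and update must have the same length")
    return [w + scale * u for w, u in zip(weights, update)]


original = [0.5, -1.0, 2.0]
direction = [0.1, 0.0, -0.2]  # stands in for a learned update direction (assumption)
debiased = apply_weight_update(original, direction, scale=0.5)
print(debiased)
```

The point of the sketch is only that the change is applied directly to existing weights, so no additional pretraining run is required to obtain the modified model.
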
### GRADIEND Training Data

<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->

- [GENTER](https://huggingface.co/datasets/aieng-lab/genter)
- [NAMEXACT](https://huggingface.co/datasets/aieng-lab/namexact)

## Evaluation

<!-- This section describes the evaluation protocols and provides the results. -->

The model has been evaluated on:

- Gender Bias Metrics: [SEAT](https://arxiv.org/abs/2210.08859), [Stereotype Score (SS) of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf), and [CrowS](https://arxiv.org/abs/2010.00133)
- Language Modeling Metrics: [LMS of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf) and [GLUE](https://arxiv.org/abs/1804.07461)

Our evaluation compares GRADIEND to other state-of-the-art debiasing methods, including [CDA](https://arxiv.org/abs/1906.04571), [Dropout](https://arxiv.org/abs/1207.0580), [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), and [SentenceDebias](https://aclanthology.org/2020.acl-main.488).

See [Appendix D.2 and Table 12](https://arxiv.org/abs/2502.01406) of the paper for full results.

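
For readers unfamiliar with the StereoSet Stereotype Score (SS) mentioned above: it is the percentage of examples in which a model prefers the stereotypical option over the anti-stereotypical one, so a value of 50 indicates no systematic preference. The sketch below computes it from toy score pairs; the function name and the numbers are illustrative assumptions, not results or code from the paper.

```python
def stereotype_score(pairs):
    """Percentage of (stereotypical, anti-stereotypical) score pairs
    in which the stereotypical option receives the higher score."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    preferred = sum(1 for stereo, anti in pairs if stereo > anti)
    return 100.0 * preferred / len(pairs)


# Toy log-likelihood pairs: (stereotypical, anti-stereotypical)
toy_pairs = [(-1.2, -1.5), (-2.0, -1.8), (-0.7, -0.9), (-3.1, -2.4)]
print(stereotype_score(toy_pairs))  # 2 of 4 pairs prefer the stereotype -> 50.0
```
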
## Citation

If you use this model or GRADIEND in your work, please cite:

```bibtex
@misc{drechsel2025gradiendmonosemanticfeaturelearning,
  title={{GRADIEND}: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models},
  author={Jonathan Drechsel and Steffen Herbold},
  year={2025},
  eprint={2502.01406},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.01406},
}
```