---
library_name: transformers
license: llama3.2
datasets:
- aieng-lab/genter
- aieng-lab/namexact
language:
- en
base_model:
- meta-llama/Llama-3.2-3B-Instruct
---
# GRADIEND Gender-Debiased Llama-3.2-3B-Instruct
This model is a gender-debiased version of [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), modified using [GRADIEND](https://arxiv.org/abs/2502.01406).
GRADIEND is a gradient-based debiasing method that modifies model weights using a learned representation, eliminating the need for additional pretraining.
### Model Sources
- **Repository:** https://github.com/aieng-lab/gradiend
- **Paper:** https://arxiv.org/abs/2502.01406
## Uses
This model is intended for use in applications where reducing gender bias in language representations is important, such as fairness-sensitive NLP systems (e.g., hiring platforms, educational and medical tools).
## Bias, Risks, and Limitations
The model is designed to reduce gender bias, and it exhibits less gender bias than the original model; however, the debiasing effect is not perfect:
- Residual gender bias remains.
- Biases related to other protected attributes (e.g., race, age, socioeconomic status) may still be present.
- Fairness-performance trade-offs may exist depending on the use case.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the tokenizer and the gender-debiased model
model_id = "aieng-lab/Llama-3.2-3B-Instruct-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
# Example usage
input_text = "The woman worked as a "
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
# Get the logits of the last token in the input sequence
last_token_logits = logits[0, -1, :]
# Predict the next token (most probable continuation)
predicted_token_id = torch.argmax(last_token_logits).item()
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted next token: {predicted_token}")
```
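Because this is an instruction-tuned model, chat-style generation is usually more natural than raw next-token prediction. Continuing the example above, here is a minimal sketch using the tokenizer's chat template (the prompt and `max_new_tokens=64` are illustrative choices, not values from the paper):
```python
# Chat-style generation, reusing `tokenizer` and `model` from above
messages = [
    {"role": "user", "content": "Describe a typical day of a nurse in one sentence."}
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
with torch.no_grad():
    generated = model.generate(chat_inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(generated[0][chat_inputs.shape[-1]:], skip_special_tokens=True))
```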
Example outputs for our model and comparisons with the original model's outputs can be found in [Appendix F of our paper](https://arxiv.org/abs/2502.01406).
## Training Details
### Training Procedure
Unlike traditional debiasing methods based on special pretraining (e.g., [CDA](https://arxiv.org/abs/1906.04571) and [Dropout](https://arxiv.org/abs/1207.0580)) or post-processing (e.g., [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), [SentenceDebias](https://aclanthology.org/2020.acl-main.488)), this model was debiased with GRADIEND, which learns a representation that is used to update the original model weights, yielding a debiased version without additional pretraining. See [Section 3 of the GRADIEND paper](https://arxiv.org/abs/2502.01406) for the full methodology.
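At a high level, GRADIEND learns an encoder-decoder over model gradients, and the decoder's output is used to shift the original weights along a learned debiasing direction. The snippet below is a deliberately simplified conceptual sketch of that final update step, not the paper's implementation; the function name, the single-tensor view, and the scaling factor `lmbda` are illustrative assumptions (see Section 3 of the paper for the actual procedure):
```python
import torch

def apply_debiasing_update(weight: torch.Tensor,
                           decoded_direction: torch.Tensor,
                           lmbda: float) -> torch.Tensor:
    """Conceptual sketch: shift a weight tensor along a learned
    debiasing direction produced by a decoder. Both tensors are
    assumed to have the same shape."""
    return weight + lmbda * decoded_direction

# Toy usage with random tensors standing in for real model weights
W = torch.randn(4, 4)
direction = torch.randn(4, 4)
W_debiased = apply_debiasing_update(W, direction, lmbda=0.1)
print(W_debiased.shape)  # torch.Size([4, 4])
```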
### GRADIEND Training Data
- [GENTER](https://huggingface.co/datasets/aieng-lab/genter)
- [NAMEXACT](https://huggingface.co/datasets/aieng-lab/namexact)
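Both datasets are hosted on the Hugging Face Hub. A minimal sketch for inspecting them with the `datasets` library (assuming the default configuration; if a dataset defines multiple configurations, pass the configuration name as the second argument to `load_dataset`):
```python
from datasets import load_dataset

# Datasets used to train GRADIEND's debiasing representation
genter = load_dataset("aieng-lab/genter")
namexact = load_dataset("aieng-lab/namexact")
print(genter)
print(namexact)
```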
## Evaluation
The model has been evaluated on:
- Gender Bias Metrics: [SEAT](https://arxiv.org/abs/2210.08859), [Stereotype Score (SS) of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf), and [CrowS](https://arxiv.org/abs/2010.00133)
- Language Modeling Metrics: [LMS of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf) and [GLUE](https://arxiv.org/abs/1804.07461)
Our evaluation compares GRADIEND to other state-of-the-art debiasing methods, including [CDA](https://arxiv.org/abs/1906.04571), [Dropout](https://arxiv.org/abs/1207.0580), [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), and [SentenceDebias](https://aclanthology.org/2020.acl-main.488).
See [Appendix D.2 and Table 12](https://arxiv.org/abs/2502.01406) of the paper for full results.
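As a quick local sanity check of language-modeling quality (not a replacement for the benchmark results above), you can compute the model's loss and perplexity on a sample sentence. A minimal sketch, reusing the `tokenizer` and `model` loaded earlier; the example sentence is an arbitrary choice:
```python
import torch

# Causal-LM loss on a sample sentence; lower perplexity = better fit
text = "The doctor asked the nurse to hand over the chart."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, labels=enc["input_ids"])
print(f"Loss: {out.loss.item():.3f}, perplexity: {torch.exp(out.loss).item():.1f}")
```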
## Citation
If you use this model or GRADIEND in your work, please cite:
```bibtex
@misc{drechsel2025gradiendmonosemanticfeaturelearning,
  title={{GRADIEND}: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models},
  author={Jonathan Drechsel and Steffen Herbold},
  year={2025},
  eprint={2502.01406},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.01406},
}
```