---
library_name: transformers
license: llama3.2
datasets:
- aieng-lab/genter
- aieng-lab/namexact
language:
- en
base_model:
- meta-llama/Llama-3.2-3B-Instruct
---
# GRADIEND Gender-Debiased Llama-3.2-3B-Instruct
This model is a gender-debiased version of [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct), modified using [GRADIEND](https://arxiv.org/abs/2502.01406).
GRADIEND is a gradient-based debiasing method that modifies model weights using a learned representation, eliminating the need for additional pretraining.
### Model Sources
- **Repository:** https://github.com/aieng-lab/gradiend
- **Paper:** https://arxiv.org/abs/2502.01406
## Uses
This model is intended for use in applications where reducing gender bias in language representations is important, such as fairness-sensitive NLP systems (e.g., hiring platforms, educational and medical tools).
## Bias, Risks, and Limitations
The model is designed to reduce gender bias, and it exhibits less gender bias than the original model; however, the debiasing effect is not perfect:
- Residual gender bias remains.
- Biases related to other protected attributes (e.g., race, age, socioeconomic status) may still be present.
- Fairness-performance trade-offs may exist depending on the use case.
## How to Get Started with the Model
Use the code below to get started with the model.
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the tokenizer and the gender-debiased model
model_id = "aieng-lab/Llama-3.2-3B-Instruct-gradiend-gender-debiased"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
# Example usage
input_text = "The woman worked as a "
inputs = tokenizer(input_text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
logits = outputs.logits
# Get the logits of the last token in the input sequence
last_token_logits = logits[0, -1, :]
# Predict the next token (most probable continuation)
predicted_token_id = torch.argmax(last_token_logits).item()
predicted_token = tokenizer.decode(predicted_token_id)
print(f"Predicted next token: {predicted_token}")
```
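Because this is an instruction-tuned model, chat-style generation is usually more natural than raw next-token prediction. Continuing the example above, here is a minimal sketch using the tokenizer's chat template (the prompt and `max_new_tokens=64` are illustrative choices, not values from the paper):
```python
# Chat-style generation, reusing `tokenizer` and `model` from above
messages = [
    {"role": "user", "content": "Describe a typical day of a nurse in one sentence."}
]
chat_inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
with torch.no_grad():
    generated = model.generate(chat_inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(generated[0][chat_inputs.shape[-1]:], skip_special_tokens=True))
```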
Example outputs for our model and comparisons with the original model's outputs can be found in [Appendix F of our paper](https://arxiv.org/abs/2502.01406).
## Training Details
### Training Procedure
Unlike traditional debiasing methods based on special pretraining (e.g., [CDA](https://arxiv.org/abs/1906.04571) and [Dropout](https://arxiv.org/abs/1207.0580)) or post-processing (e.g., [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), [SentenceDebias](https://aclanthology.org/2020.acl-main.488)), this model was debiased with GRADIEND, which learns a representation that is used to update the original model weights, yielding a debiased version without additional pretraining. See [Section 3 of the GRADIEND paper](https://arxiv.org/abs/2502.01406) for the full methodology.
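At a high level, GRADIEND learns an encoder-decoder over model gradients, and the decoder's output is used to shift the original weights along a learned debiasing direction. The snippet below is a deliberately simplified conceptual sketch of that final update step, not the paper's implementation; the function name, the single-tensor view, and the scaling factor `lmbda` are illustrative assumptions (see Section 3 of the paper for the actual procedure):
```python
import torch

def apply_debiasing_update(weight: torch.Tensor,
                           decoded_direction: torch.Tensor,
                           lmbda: float) -> torch.Tensor:
    """Conceptual sketch: shift a weight tensor along a learned
    debiasing direction produced by a decoder. Both tensors are
    assumed to have the same shape."""
    return weight + lmbda * decoded_direction

# Toy usage with random tensors standing in for real model weights
W = torch.randn(4, 4)
direction = torch.randn(4, 4)
W_debiased = apply_debiasing_update(W, direction, lmbda=0.1)
print(W_debiased.shape)  # torch.Size([4, 4])
```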
### GRADIEND Training Data
- [GENTER](https://huggingface.co/datasets/aieng-lab/genter)
- [NAMEXACT](https://huggingface.co/datasets/aieng-lab/namexact)
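Both datasets are hosted on the Hugging Face Hub. A minimal sketch for inspecting them with the `datasets` library (assuming the default configuration; if a dataset defines multiple configurations, pass the configuration name as the second argument to `load_dataset`):
```python
from datasets import load_dataset

# Datasets used to train GRADIEND's debiasing representation
genter = load_dataset("aieng-lab/genter")
namexact = load_dataset("aieng-lab/namexact")
print(genter)
print(namexact)
```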
## Evaluation
The model has been evaluated on:
- Gender Bias Metrics: [SEAT](https://arxiv.org/abs/2210.08859), [Stereotype Score (SS) of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf), and [CrowS](https://arxiv.org/abs/2010.00133)
- Language Modeling Metrics: [LMS of StereoSet](https://aclanthology.org/2021.acl-long.416.pdf) and [GLUE](https://arxiv.org/abs/1804.07461)
Our evaluation compares GRADIEND to other state-of-the-art debiasing methods, including [CDA](https://arxiv.org/abs/1906.04571), [Dropout](https://arxiv.org/abs/1207.0580), [INLP](https://arxiv.org/abs/2004.07667), [RLACE](https://arxiv.org/abs/2201.12091), [LEACE](https://arxiv.org/abs/2306.03819), [SelfDebias](https://arxiv.org/abs/2402.01981), and [SentenceDebias](https://aclanthology.org/2020.acl-main.488).
See [Appendix D.2 and Table 12](https://arxiv.org/abs/2502.01406) of the paper for full results.
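As a quick local sanity check of language-modeling quality (not a replacement for the benchmark results above), you can compute the model's loss and perplexity on a sample sentence. A minimal sketch, reusing the `tokenizer` and `model` loaded earlier; the example sentence is an arbitrary choice:
```python
import torch

# Causal-LM loss on a sample sentence; lower perplexity = better fit
text = "The doctor asked the nurse to hand over the chart."
enc = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, labels=enc["input_ids"])
print(f"Loss: {out.loss.item():.3f}, perplexity: {torch.exp(out.loss).item():.1f}")
```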
## Citation
If you use this model or GRADIEND in your work, please cite:
```bibtex
@misc{drechsel2025gradiendmonosemanticfeaturelearning,
  title={{GRADIEND}: Monosemantic Feature Learning within Neural Networks Applied to Gender Debiasing of Transformer Models},
  author={Jonathan Drechsel and Steffen Herbold},
  year={2025},
  eprint={2502.01406},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2502.01406},
}
```