---
license: mit
---

# gated-deltanet-attn-0.4B-10B

Gated DeltaNet + full attention (0.4B params, 10B tokens)

## Overview

* **Training**: gated-deltanet-attn-0.4B-10B was trained on [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), which is released under [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/)
* **Parameters**: 0.4B
* **Task**: Language modeling
* **Framework**: HuggingFace, [flash-linear-attention](https://github.com/fla-org/flash-linear-attention)
* **Output structure**: [batch_size, sequence_length, vocab_size] (see the sketch at the end of this card)

## Performance

Evaluation results are reported in the paper cited below.

## Running Code

* Minimal code to instantiate the model and perform inference:

```python
# Requires flash-linear-attention (https://github.com/fla-org/flash-linear-attention)
import fla  # registers the Gated DeltaNet architectures with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# path_to_model: local path or Hub ID of this checkpoint
model = AutoModelForCausalLM.from_pretrained(path_to_model).cuda()
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

input_ids = tokenizer("All human beings are", return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, max_length=15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## License

Gated DeltaNet is released under the [MIT License](LICENSE.txt)

## Citation

If you find our work useful, please cite the following publication:

```bibtex
@misc{he_alleviating_2025,
  title     = {Alleviating {Forgetfulness} of {Linear} {Attention} by {Hybrid} {Sparse} {Attention} and {Contextualized} {Learnable} {Token} {Eviction}},
  url       = {http://arxiv.org/abs/2510.20787},
  doi       = {10.48550/arXiv.2510.20787},
  publisher = {arXiv},
  author    = {He, Mutian and Garner, Philip N.},
  month     = oct,
  year      = {2025},
  note      = {arXiv:2510.20787 [cs]},
}
```
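
The output-structure bullet in the Overview refers to the raw logit tensor returned by a forward pass, which the generation snippet above does not show. Below is a minimal sketch, assuming the same `fla`/`transformers` setup as that snippet, with `path_to_model` again a placeholder for this checkpoint's local path or Hub ID:

```python
# A minimal sketch, assuming the same setup as the generation snippet above;
# `path_to_model` is a placeholder for the local path or Hub ID of this checkpoint.
import torch

import fla  # registers the Gated DeltaNet architectures with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(path_to_model).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

input_ids = tokenizer("All human beings are", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    outputs = model(input_ids)

# A causal LM returns one logit vector per input position,
# so the shape is [batch_size, sequence_length, vocab_size].
print(outputs.logits.shape)
```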