---
license: mit
---

# gated-deltanet-attn-0.4B-10B

Gated DeltaNet + full attention (0.4B params, 10B tokens)

## Overview

* **Training**: gated-deltanet-attn-0.4B-10B was trained on [FineWeb-Edu](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), which is released under [ODC-By v1.0](https://opendatacommons.org/licenses/by/1-0/)
* **Parameters**: 0.4B
* **Task**: Language modeling
* **Framework**: HuggingFace, [flash-linear-attention](https://github.com/fla-org/flash-linear-attention)
* **Output structure**: [batch_size, sequence_length, vocab_size] (see the sketch at the end of this card)

## Performance

Evaluation results are reported in the paper cited below.

## Running Code

* Minimal code to instantiate the model and perform inference:

```python
# Requires flash-linear-attention (https://github.com/fla-org/flash-linear-attention)
import fla  # registers the Gated DeltaNet architectures with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

# path_to_model: local path or Hub ID of this checkpoint
model = AutoModelForCausalLM.from_pretrained(path_to_model).cuda()
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

input_ids = tokenizer("All human beings are", return_tensors="pt").input_ids.cuda()
outputs = model.generate(input_ids, max_length=15)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

## License

Gated DeltaNet is released under the [MIT License](LICENSE.txt)

## Citation

If you find our work useful, please cite the following publication:

```bibtex
@misc{he_alleviating_2025,
  title     = {Alleviating {Forgetfulness} of {Linear} {Attention} by {Hybrid} {Sparse} {Attention} and {Contextualized} {Learnable} {Token} {Eviction}},
  url       = {http://arxiv.org/abs/2510.20787},
  doi       = {10.48550/arXiv.2510.20787},
  publisher = {arXiv},
  author    = {He, Mutian and Garner, Philip N.},
  month     = oct,
  year      = {2025},
  note      = {arXiv:2510.20787 [cs]},
}
```
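
The output-structure bullet in the Overview refers to the raw logit tensor returned by a forward pass, which the generation snippet above does not show. Below is a minimal sketch, assuming the same `fla`/`transformers` setup as that snippet, with `path_to_model` again a placeholder for this checkpoint's local path or Hub ID:

```python
# A minimal sketch, assuming the same setup as the generation snippet above;
# `path_to_model` is a placeholder for the local path or Hub ID of this checkpoint.
import torch

import fla  # registers the Gated DeltaNet architectures with transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(path_to_model).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(path_to_model)

input_ids = tokenizer("All human beings are", return_tensors="pt").input_ids.cuda()
with torch.no_grad():
    outputs = model(input_ids)

# A causal LM returns one logit vector per input position,
# so the shape is [batch_size, sequence_length, vocab_size].
print(outputs.logits.shape)
```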