# DAT Byte Small (200M)
**DAT Byte** is a family of byte-level **D**ifferential-**A**ttention **T**ransformers, trained from scratch on an RTX 5090.
This model is the smallest in the family, at roughly 200 million parameters. It was trained on Discord-style chat data, public-domain books, and English Bible translations; larger models in the family were trained on a larger and more diverse dataset.
---
## Training Data
As the smallest DAT Byte model, this version was trained on a reduced dataset totaling ~727MB, composed exclusively of the following sources:
- [**Gutenberg English**](https://huggingface.co/datasets/sedthh/gutenberg_english) — English books in the public domain
- [**OpenDiscord**](https://huggingface.co/datasets/hudsongouge/Open-Discord) — Discord dumps in ChatML format
- Proprietary Discord dumps (similar structure and tone to OpenDiscord)
- A diverse set of public domain English Bible translations (~34MB total)
> All listed datasets were used **in full**, and **no additional data sources** were used.
The Discord datasets (combined ~693MB) were formatted in **ChatML**, with usernames serving as speaker roles, enabling the model to learn natural dialogue structure and dynamics.
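
The sketch below illustrates this formatting convention. The exact special tokens used during training are not documented in this card, so the standard `<|im_start|>` / `<|im_end|>` ChatML markers and the helper name `to_chatml` are assumptions for illustration only.

```python
def to_chatml(messages: list[tuple[str, str]]) -> str:
    """Render (username, text) pairs as ChatML, with usernames as speaker roles.

    Assumes the conventional <|im_start|>/<|im_end|> delimiters; the actual
    delimiters used in training may differ.
    """
    return "".join(
        f"<|im_start|>{user}\n{text}<|im_end|>\n" for user, text in messages
    )

print(to_chatml([("alice", "anyone up for a game?"), ("bob", "sure, give me 5")]))
```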
---
## Architecture
This model follows the structure proposed in [**Differential Transformer** (Ye et al., 2024)](https://arxiv.org/abs/2410.05258), which introduces *Differential Attention*.
Differential Attention is particularly useful for byte-level LLMs: it reduces attention noise, which helps the model recover semantic structure despite the very fine granularity of byte tokens.
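
As a rough illustration of the mechanism from the paper, the single-head sketch below computes two softmax attention maps from separate query/key projections and subtracts them, scaled by a learnable λ. This is a minimal sketch, not the model's actual implementation: the paper re-parameterizes λ and splits it per head, which is simplified to a plain scalar here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttentionHead(nn.Module):
    """Single-head sketch of Differential Attention (Ye et al., 2024).

    Two attention maps are computed from separate Q/K projections and their
    difference, scaled by a learnable lambda, weights the values. Subtracting
    the second map cancels attention noise common to both.
    """

    def __init__(self, hidden_size: int, head_dim: int, lambda_init: float = 0.8):
        super().__init__()
        # Two query/key projections and a double-width value projection,
        # following the layout described in the paper.
        self.q_proj = nn.Linear(hidden_size, 2 * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, 2 * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, 2 * head_dim, bias=False)
        self.head_dim = head_dim
        # Simplification: the paper uses a re-parameterized lambda; a plain
        # learnable scalar stands in for it in this sketch.
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)

        scale = self.head_dim ** -0.5
        seq_len = x.size(1)
        # Causal mask: each byte attends only to earlier positions.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )

        def attn_map(q, k):
            scores = (q @ k.transpose(-2, -1)) * scale
            scores = scores.masked_fill(mask, float("-inf"))
            return F.softmax(scores, dim=-1)

        a1 = attn_map(q1, k1)
        a2 = attn_map(q2, k2)
        # Differential attention: the difference of the two maps weights V.
        return (a1 - self.lmbda * a2) @ v
```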
Key architectural details:
- **Model Type:** Decoder-only Transformer
- **Positional Encoding:** RoPE (Rotary Positional Embeddings)
- **Normalization:** Pre-layernorm (LayerNorm before attention and MLP blocks)
- **Hidden Size:** 768
- **FFN Size:** 3,072
- **Attention Heads:** 12
- **Layers:** 28
- **Vocabulary Size:** 259 (256 byte tokens + 3 special tokens)
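
The 259-token vocabulary means tokenization is just raw UTF-8 bytes plus three specials. The snippet below shows what byte-level encoding and decoding could look like; the names and IDs of the three special tokens are not documented in this card, so `BOS_ID`/`EOS_ID`/`PAD_ID` and their values are placeholders.

```python
# Hypothetical special-token IDs; the actual three specials used in training
# are not documented here, so these names and values are assumptions.
BOS_ID, EOS_ID, PAD_ID = 256, 257, 258
VOCAB_SIZE = 259  # 256 raw byte values + 3 special tokens

def encode(text: str) -> list[int]:
    """Map UTF-8 bytes directly to token IDs 0-255, framed by special tokens."""
    return [BOS_ID] + list(text.encode("utf-8")) + [EOS_ID]

def decode(ids: list[int]) -> str:
    """Drop special tokens and decode the remaining bytes back to text."""
    raw = bytes(i for i in ids if i < 256)
    return raw.decode("utf-8", errors="replace")

assert decode(encode("héllo")) == "héllo"
```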
---
## Benchmarks
Coming soon
---
## Citation
If you use **DAT Byte Small** in your research, fine-tune it, or build on this work, please cite the original author:
### BibTeX entry
```bibtex
@misc{gouge2025datbyte,
  title  = {DAT Byte: Byte-Level Differential Attention Transformers},
  author = {Hudson Gouge},
  year   = {2025},
  url    = {https://huggingface.co/hudsongouge/dat-byte-small},
  note   = {DAT Byte Small (200M) Model Card}
}
```
Please include this citation in any derivative work, publication, or project that makes use of the DAT Byte architecture or training artifacts.