# DAT Byte Small (200M)
**DAT Byte** is a family of byte-level **D**ifferential-**A**ttention **T**ransformers, trained from scratch on an RTX 5090.
This model is the smallest in the family, at roughly 200 million parameters. It was trained on Discord-style chat data, public-domain books, and English Bible translations; larger models in the family were trained on a larger and more diverse dataset.
---
## Training Data
As the smallest DAT Byte model, this version was trained on a reduced dataset totaling ~727MB, composed exclusively of the following sources:
- [**Gutenberg English**](https://huggingface.co/datasets/sedthh/gutenberg_english) — English books in the public domain
- [**OpenDiscord**](https://huggingface.co/datasets/hudsongouge/Open-Discord) — Discord dumps in ChatML format
- Proprietary Discord dumps (similar structure and tone to OpenDiscord)
- A diverse set of public domain English Bible translations (~34MB total)
> All listed datasets were used **in full**, and **no additional data sources** were used.
The Discord datasets (combined ~693MB) were formatted in **ChatML**, with usernames serving as speaker roles, enabling the model to learn natural dialogue structure and dynamics.
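
The sketch below illustrates this formatting convention. The exact special tokens used during training are not documented in this card, so the standard `<|im_start|>` / `<|im_end|>` ChatML markers and the helper name `to_chatml` are assumptions for illustration only.

```python
def to_chatml(messages: list[tuple[str, str]]) -> str:
    """Render (username, text) pairs as ChatML, with usernames as speaker roles.

    Assumes the conventional <|im_start|>/<|im_end|> delimiters; the actual
    delimiters used in training may differ.
    """
    return "".join(
        f"<|im_start|>{user}\n{text}<|im_end|>\n" for user, text in messages
    )

print(to_chatml([("alice", "anyone up for a game?"), ("bob", "sure, give me 5")]))
```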
---
## Architecture
This model follows the structure proposed in [**Differential Transformer** (Ye et al., 2024)](https://arxiv.org/abs/2410.05258), which introduces *Differential Attention*.
Differential Attention is particularly useful for byte-level LLMs: it reduces attention noise, which helps the model recover semantic structure despite the very fine granularity of byte tokens.
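
As a rough illustration of the mechanism from the paper, the single-head sketch below computes two softmax attention maps from separate query/key projections and subtracts them, scaled by a learnable λ. This is a minimal sketch, not the model's actual implementation: the paper re-parameterizes λ and splits it per head, which is simplified to a plain scalar here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttentionHead(nn.Module):
    """Single-head sketch of Differential Attention (Ye et al., 2024).

    Two attention maps are computed from separate Q/K projections and their
    difference, scaled by a learnable lambda, weights the values. Subtracting
    the second map cancels attention noise common to both.
    """

    def __init__(self, hidden_size: int, head_dim: int, lambda_init: float = 0.8):
        super().__init__()
        # Two query/key projections and a double-width value projection,
        # following the layout described in the paper.
        self.q_proj = nn.Linear(hidden_size, 2 * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden_size, 2 * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden_size, 2 * head_dim, bias=False)
        self.head_dim = head_dim
        # Simplification: the paper uses a re-parameterized lambda; a plain
        # learnable scalar stands in for it in this sketch.
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, hidden_size)
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)

        scale = self.head_dim ** -0.5
        seq_len = x.size(1)
        # Causal mask: each byte attends only to earlier positions.
        mask = torch.triu(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device),
            diagonal=1,
        )

        def attn_map(q, k):
            scores = (q @ k.transpose(-2, -1)) * scale
            scores = scores.masked_fill(mask, float("-inf"))
            return F.softmax(scores, dim=-1)

        a1 = attn_map(q1, k1)
        a2 = attn_map(q2, k2)
        # Differential attention: the difference of the two maps weights V.
        return (a1 - self.lmbda * a2) @ v
```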
Key architectural details:
- **Model Type:** Decoder-only Transformer
- **Positional Encoding:** RoPE (Rotary Positional Embeddings)
- **Normalization:** Pre-layernorm (LayerNorm before attention and MLP blocks)
- **Hidden Size:** 768
- **FFN Size:** 3,072
- **Attention Heads:** 12
- **Layers:** 28
- **Vocabulary Size:** 259 (256 byte tokens + 3 special tokens)
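
The 259-token vocabulary means tokenization is just raw UTF-8 bytes plus three specials. The snippet below shows what byte-level encoding and decoding could look like; the names and IDs of the three special tokens are not documented in this card, so `BOS_ID`/`EOS_ID`/`PAD_ID` and their values are placeholders.

```python
# Hypothetical special-token IDs; the actual three specials used in training
# are not documented here, so these names and values are assumptions.
BOS_ID, EOS_ID, PAD_ID = 256, 257, 258
VOCAB_SIZE = 259  # 256 raw byte values + 3 special tokens

def encode(text: str) -> list[int]:
    """Map UTF-8 bytes directly to token IDs 0-255, framed by special tokens."""
    return [BOS_ID] + list(text.encode("utf-8")) + [EOS_ID]

def decode(ids: list[int]) -> str:
    """Drop special tokens and decode the remaining bytes back to text."""
    raw = bytes(i for i in ids if i < 256)
    return raw.decode("utf-8", errors="replace")

assert decode(encode("héllo")) == "héllo"
```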
---
## Benchmarks
Coming soon
---
## Citation
If you use **DAT Byte Small** in your research, fine-tune it, or build on this work, please cite the original author:
### BibTeX entry
```bibtex
@misc{gouge2025datbyte,
  title  = {DAT Byte: Byte-Level Differential Attention Transformers},
  author = {Hudson Gouge},
  year   = {2025},
  url    = {https://huggingface.co/hudsongouge/dat-byte-small},
  note   = {DAT Byte Small (200M) Model Card}
}
```
Please include this citation in any derivative work, publication, or project that makes use of the DAT Byte architecture or training artifacts.