An 896-million-parameter GPT-style language model trained from scratch on a single NVIDIA H200 GPU at the Anuradha and Vikas Sinha Department of Data Science, housed in the College of Information at the University of North Texas.
This model is based on karpathy/nanochat and was built as an educational demonstration of the transformer architecture described in "Attention Is All You Need" (Vaswani et al., 2017; arXiv:1706.03762).
| Property | Value |
|---|---|
| Parameters | 896 million |
| Layers (depth) | 20 |
| Attention heads | 10 |
| Embedding size | 1280 |
| Vocabulary size | 32,768 |
| Context length | 2048 tokens |
| Training tokens | 5.2 billion |
| Training time | ~8.5 hours |
| Hardware | 1x NVIDIA H200 (143GB) |
| Tokenizer | BPE (rustbpe) |
| Training data | ClimbMix |

| Benchmark | Score |
|---|---|
| HellaSwag (10-shot) | 0.522 |
| Winograd (0-shot) | 0.626 |
| Winogrande (0-shot) | 0.546 |
| ARC-Easy (10-shot) | 0.328 |
| PIQA (10-shot) | 0.568 |
| CORE | 0.2462 |
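The properties table above implies a few derived quantities that are useful when comparing this run with others: the per-head dimension, the number of training tokens per parameter, and the average training throughput. The sketch below only does arithmetic on the reported values. The function name `derived_quantities` is hypothetical and not part of nanochat, and the throughput figure is an average that assumes the reported "~8.5 hours" covers the training run only.

```python
def derived_quantities(n_params, n_heads, d_model, train_tokens, train_hours):
    """Compute simple ratios from the figures reported in the model card.

    Hypothetical helper for illustration; not part of nanochat.
    """
    if d_model % n_heads != 0:
        raise ValueError("embedding size must be divisible by the head count")
    return {
        # Width of each attention head.
        "head_dim": d_model // n_heads,
        # Training tokens seen per model parameter.
        "tokens_per_param": train_tokens / n_params,
        # Average throughput over the whole run, in tokens per second.
        "tokens_per_second": train_tokens / (train_hours * 3600),
    }


# Values copied from the properties table; training time is approximate.
stats = derived_quantities(
    n_params=896e6,
    n_heads=10,
    d_model=1280,
    train_tokens=5.2e9,
    train_hours=8.5,
)

for name, value in stats.items():
    print(f"{name}: {value:,.2f}")
```

With the reported figures this gives a head dimension of 128, about 5.8 training tokens per parameter, and an average of roughly 170,000 tokens per second on the single H200.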
This model was trained as part of a workshop demonstrating the full ML pipeline.
This is an educational base model, not a production system. It has no instruction tuning or safety training. It will make factual errors and produce repetitive text on tasks requiring reasoning or arithmetic.
@misc{whitworth2026nanochat,
  author = {Clifford K. Whitworth},
  title = {nanochat-d20: A GPT trained from scratch on UNT H200s},
  year = {2026},
  publisher = {HuggingFace},
  url = {https://huggingface.co/cliffo4567/nanochat-d20}
}