---
license: mit
datasets:
- HuggingFaceFW/fineweb-edu
- common-pile/arxiv_papers_filtered
- tiiuae/falcon-refinedweb
- manu/project_gutenberg
- nampdn-ai/tiny-textbooks
- SciPhi/textbooks-are-all-you-need-lite
- abehandlerorg/ccnews
base_model:
- openai-community/gpt2
pipeline_tag: text-generation
---

# GPT-2 from Scratch

This model implements the GPT-2 architecture (125M parameters) and was trained from scratch.

## Model Description

- Model type: GPT-2 (125M parameters)
- Architecture: Transformer-based autoregressive language model following the original GPT-2 design
- Training data: Multiple public datasets (see the dataset tags above), roughly 18 billion tokens in total
- Language: English
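
A minimal generation sketch, assuming the weights are published as `thecr7guy/gpt2-pretrain` (the repo id used in the evaluation table below) and that the checkpoint loads through the standard `transformers` GPT-2 classes; the tokenizer is taken from the upstream `openai-community/gpt2` base model:

```python
# Minimal text-generation example (assumption: the checkpoint is in standard
# transformers format under the repo id used in the evaluation table).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("openai-community/gpt2")  # base tokenizer
model = GPT2LMHeadModel.from_pretrained("thecr7guy/gpt2-pretrain")      # assumed repo id
model.eval()

prompt = "The history of the transistor begins with"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=True,
        top_k=50,
        temperature=0.8,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
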

## Performance and Evaluation

| Dataset | Metric | thecr7guy/gpt2-pretrain | GPT-2 (baseline) |
|---|---|---|---|
| HellaSwag | acc | 0.291 | 0.289 |
| SciQ | acc | 0.754 | 0.752 |
| Winogrande | acc | 0.491 | 0.516 |
| TruthfulQA MC1 | acc | 0.236 | 0.228 |
| MMLU (overall) | acc | 0.230 | 0.229 |
| - Humanities | acc | 0.242 | 0.242 |
| - Social Sci. | acc | 0.217 | 0.217 |
| - STEM | acc | 0.213 | 0.213 |
| - Other | acc | 0.239 | 0.238 |
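
The card does not state which harness produced these scores. The sketch below shows how comparable zero-shot numbers could be collected with EleutherAI's lm-evaluation-harness (`pip install lm_eval`); the harness, task names, and repo id are assumptions, not a record of the original evaluation setup.

```python
# Hypothetical reproduction of the table above with lm-evaluation-harness.
# (Assumption: the published numbers may have been produced with a different tool.)
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=thecr7guy/gpt2-pretrain",
    tasks=["hellaswag", "sciq", "winogrande", "truthfulqa_mc1", "mmlu"],
    batch_size=8,
)
for task, metrics in results["results"].items():
    print(task, metrics)
```
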

## Training Details

- Training corpus: Approximately 18B tokens (~120 GB)
- Training duration: 1 epoch (approximately 8 hours total)
- Hardware: 8× NVIDIA A100 PCIe GPUs via runpod.io
- Estimated cost: Approximately $108 (8 × $13.52) for the complete training run
- Token context: 1024 tokens
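
A sketch of how a corpus of this size maps onto fixed 1024-token training windows with the GPT-2 BPE tokenizer (`tiktoken` appears in the setup commands below); the windowing and dtype choices are assumptions rather than the repo's actual preprocessing.

```python
# Sketch: GPT-2 BPE tokenization into fixed 1024-token windows.
# (Assumption: the repo's real preprocessing pipeline may differ.)
import numpy as np
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # 50,257-token GPT-2 vocabulary
context_len = 1024

def tokenize_document(text: str) -> list[int]:
    # GPT-2-style document separator followed by the BPE tokens of the text.
    return [enc.eot_token] + enc.encode_ordinary(text)

def make_windows(token_stream: list[int]) -> np.ndarray:
    # Keep only whole windows; uint16 suffices for the 50,257-token vocabulary.
    n = (len(token_stream) // context_len) * context_len
    return np.array(token_stream[:n], dtype=np.uint16).reshape(-1, context_len)

windows = make_windows(tokenize_document("Example document text. " * 2000))
print(windows.shape)  # (num_windows, 1024)
```
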

## Hyperparameters

- context_len: 1024
- seed: 42
- epochs: 2
- batch_size: 64
- total_batch_size: 524288 tokens
- grad_clip: 1.0
- optimizer: "adamw"
- max_lr: 6.0e-4
- min_lr: 6.0e-5
- beta1: 0.9
- beta2: 0.95
- weight_decay: 0.1
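
The values above correspond to an AdamW optimizer with a scheduled learning rate. The sketch below assumes a linear-warmup plus cosine-decay schedule between `max_lr` and `min_lr`, which is common in GPT-2 reproductions but is not stated on this card; only the numeric values (learning-rate bounds, betas, weight decay, grad clip, token counts) come from the list above.

```python
# Optimizer and LR-schedule sketch using the hyperparameters listed above.
# The warmup length and cosine shape are assumptions; the numbers are from the card.
import math
import torch

max_lr, min_lr = 6.0e-4, 6.0e-5
warmup_steps = 700                          # assumption: not stated in the card
max_steps = 18_000_000_000 // 524_288       # ~34k steps for one 18B-token pass

def lr_at(step: int) -> float:
    if step < warmup_steps:                 # linear warmup
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:                   # hold at the floor after decay
        return min_lr
    ratio = (step - warmup_steps) / (max_steps - warmup_steps)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))  # cosine from max_lr to min_lr
    return min_lr + coeff * (max_lr - min_lr)

model = torch.nn.Linear(768, 768)           # stand-in for the real GPT-2 module
optimizer = torch.optim.AdamW(
    model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.1
)

# One illustrative step: backward, clip to grad_clip=1.0, set the scheduled LR, update.
loss = model(torch.randn(8, 768)).pow(2).mean()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
for group in optimizer.param_groups:
    group["lr"] = lr_at(0)
optimizer.step()
optimizer.zero_grad(set_to_none=True)
```

Note that 64 (per-GPU batch) × 1024 (context) × 8 GPUs = 524,288 tokens, which matches `total_batch_size`, so the global batch appears to be filled in a single step without gradient accumulation (an inference from the listed values, not a statement from the card).
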

## Setup and Training Commands

- pip install wandb
- pip install tiktoken
- pip install --upgrade huggingface_hub
- pip install torchinfo
- pip install datasets
- sudo apt update && sudo apt install tmux
- tmux new -s training
- wandb login
- CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 NCCL_P2P_DISABLE=1 torchrun --standalone --nproc_per_node=8 train.py
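
The `torchrun` launch above starts eight processes, one per GPU. Below is a hedged skeleton of the DDP boilerplate such an entry point typically contains; the actual structure of `train.py` is not shown on this card.

```python
# DDP skeleton matching the torchrun launch above (a sketch; the real train.py
# in the repository may be organised differently).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main() -> None:
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every spawned process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(768, 50257).cuda(local_rank)  # stand-in for the GPT-2 module
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=6.0e-4)

    for step in range(10):                                 # placeholder training loop
        x = torch.randn(64, 768, device=f"cuda:{local_rank}")
        loss = model(x).pow(2).mean()
        optimizer.zero_grad(set_to_none=True)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        if dist.get_rank() == 0 and step % 5 == 0:
            print(f"step {step}: loss {loss.item():.4f}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
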

## Contact

GitHub: thecr7guy2