pankajmathur/nanochat-d34-sft

+# nanochat training report
+Generated: 2025-12-08 13:02:11
+## Environment
+### Git Information
+- Branch: pankaj_dev
+- Commit: 3289b19 (dirty)
+- Message: Adjust device_batch_size in run_d34_finetune.sh from 4 to 6 for mid-training and
+### Hardware
+- Platform: Linux
+- CPUs: 240 cores (240 logical)
+- Memory: 1771.7 GB
+- GPUs: 8x NVIDIA A100-SXM4-80GB
+- GPU Memory: 634.0 GB total
+- CUDA Version: 12.8
+- Hourly Rate: $14.32/hour
+### Software
+- Python: 3.10.12
+- PyTorch: 2.8.0+cu128
+### Bloat
+- Characters: 446,068
+- Lines: 10,895
+- Files: 53
+- Tokens (approx): 111,517
+- Dependencies (uv.lock lines): 2,218
+Run started: 2025-12-08 13:02:14
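The "Tokens (approx)" figure in the Bloat list above matches the character count divided by four. The sketch below reproduces it under that assumption. The helper name `approx_tokens` is hypothetical, and the result is a rule-of-thumb estimate rather than a real tokenizer count.

```python
# Assumption: "Tokens (approx)" is the character count floor-divided by 4,
# a common rule of thumb, not the output of an actual tokenizer.
def approx_tokens(num_chars: int, chars_per_token: int = 4) -> int:
    """Estimate a token count from a character count."""
    return num_chars // chars_per_token


bloat_chars = 446_068  # "Characters" from the Bloat list
bloat_tokens = approx_tokens(bloat_chars)
print(f"{bloat_chars:,} characters -> ~{bloat_tokens:,} tokens")
```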
+---
+## Midtraining
+timestamp: 2025-12-08 04:41:26
+- run: d34_finetune
+- device_type:
+- dtype: bfloat16
+- num_iterations: -1
+- max_seq_len: 2048
+- device_batch_size: 4
+- unembedding_lr: 0.0040
+- embedding_lr: 0.2000
+- matrix_lr: 0.0200
+- init_lr_frac: 1.0000
+- weight_decay: 0.0000
+- eval_every: 150
+- eval_tokens: 10,485,760
+- total_batch_size: 524,288
+- dry_run: 0
+- Number of iterations: 810
+- DDP world size: 8
+- Minimum validation bpb: 0.3282
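The batch settings above are related by simple arithmetic. The sketch below assumes that the gap between the per-pass token count and `total_batch_size` is covered by gradient accumulation; the report does not print `grad_accum_steps`, so that name and value are derived here rather than quoted.

```python
# Values copied from the Midtraining section of this report.
device_batch_size = 4        # sequences per GPU per forward/backward pass
max_seq_len = 2048           # tokens per sequence
world_size = 8               # DDP world size (one rank per GPU)
total_batch_size = 524_288   # tokens per optimizer step
num_iterations = 810         # "Number of iterations"

# Tokens processed across all ranks in one forward/backward pass.
tokens_per_pass = device_batch_size * max_seq_len * world_size

# Assumption: the remaining factor is made up by gradient accumulation.
assert total_batch_size % tokens_per_pass == 0
grad_accum_steps = total_batch_size // tokens_per_pass

# Total tokens seen during midtraining under these settings.
total_tokens = num_iterations * total_batch_size

print(f"tokens per pass:  {tokens_per_pass:,}")
print(f"grad accum steps: {grad_accum_steps}")
print(f"total tokens:     {total_tokens:,}")
```

Under these assumptions each optimizer step accumulates 8 passes of 65,536 tokens, and the 810 iterations cover roughly 425M tokens.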
+## Chat evaluation mid
+timestamp: 2025-12-08 05:06:16
+- source: mid
+- task_name: None
+- dtype: bfloat16
+- temperature: 0.0000
+- max_new_tokens: 512
+- num_samples: 1
+- top_k: 50
+- batch_size: 8
+- model_tag: None
+- step: None
+- max_problems: None
+- device_type:
+- ARC-Easy: 0.6961
+- ARC-Challenge: 0.5367
+- MMLU: 0.4229
+- GSM8K: 0.1137
+- HumanEval: 0.1098
+- SpellingBee: 0.9961
+- ChatCORE metric: 0.4045
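The ChatCORE value above can be reproduced from the six task accuracies if each one is rescaled against a random-guess baseline and the results are averaged. The baselines below are assumptions (0.25 for the four-way multiple-choice tasks, 0 for the generative ones), and the function names are hypothetical. The report itself does not state the formula.

```python
# Assumed random-guess baselines: 4-way multiple choice scores 0.25 by chance,
# open-ended generative tasks score 0. These values are not stated in the report.
BASELINES = {
    "ARC-Easy": 0.25,
    "ARC-Challenge": 0.25,
    "MMLU": 0.25,
    "GSM8K": 0.0,
    "HumanEval": 0.0,
    "SpellingBee": 0.0,
}

# Accuracies copied from the "Chat evaluation mid" section.
MID = {
    "ARC-Easy": 0.6961,
    "ARC-Challenge": 0.5367,
    "MMLU": 0.4229,
    "GSM8K": 0.1137,
    "HumanEval": 0.1098,
    "SpellingBee": 0.9961,
}


def centered(acc: float, baseline: float) -> float:
    """Rescale accuracy so that chance maps to 0 and a perfect score maps to 1."""
    return (acc - baseline) / (1.0 - baseline)


def chatcore(results: dict, baselines: dict) -> float:
    """Mean of the baseline-centered accuracies over all tasks."""
    return sum(centered(results[t], baselines[t]) for t in results) / len(results)


mid_score = chatcore(MID, BASELINES)
print(f"ChatCORE (mid): {mid_score:.4f}")  # prints 0.4045
```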
+## Chat SFT
+timestamp: 2025-12-08 05:18:08
+- run: d34_finetune
+- source: mid
+- device_type:
+- dtype: bfloat16
+- device_batch_size: 4
+- num_epochs: 1
+- num_iterations: -1
+- target_examples_per_step: 32
+- unembedding_lr: 0.0040
+- embedding_lr: 0.2000
+- matrix_lr: 0.0200
+- weight_decay: 0.0000
+- init_lr_frac: 0.0200
+- eval_every: 100
+- eval_steps: 100
+- eval_metrics_every: 200
+- eval_metrics_max_problems: 1024
+- Training rows: 22,439
+- Number of iterations: 701
+- Training loss: 0.4230
+- Validation loss: 0.8044
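The iteration count above is consistent with one epoch over the training rows at 32 examples per step. The sketch assumes a DDP world size of 8 (as reported for midtraining, and matching the 8 GPUs) and assumes the final partial batch is dropped; neither is stated in this section.

```python
# Values copied from the Chat SFT section of this report.
training_rows = 22_439
target_examples_per_step = 32
device_batch_size = 4
num_epochs = 1

# Assumption: same 8-rank DDP setup as in midtraining.
world_size = 8

# Examples consumed across all ranks in one forward/backward pass.
examples_per_pass = device_batch_size * world_size

# Passes accumulated per optimizer step to reach the target.
assert target_examples_per_step % examples_per_pass == 0
grad_accum_steps = target_examples_per_step // examples_per_pass

# Assumption: floor division, i.e. the trailing partial batch is not used.
sft_iterations = (training_rows // target_examples_per_step) * num_epochs
leftover_rows = training_rows % target_examples_per_step

print(f"grad accum steps: {grad_accum_steps}")
print(f"iterations:       {sft_iterations}")
print(f"unused rows:      {leftover_rows}")
```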
+## Chat evaluation sft
+timestamp: 2025-12-08 05:42:31
+- source: sft
+- task_name: None
+- dtype: bfloat16
+- temperature: 0.0000
+- max_new_tokens: 512
+- num_samples: 1
+- top_k: 50
+- batch_size: 8
+- model_tag: None
+- step: None
+- max_problems: None
+- device_type:
+- ARC-Easy: 0.7210
+- ARC-Challenge: 0.5418
+- MMLU: 0.4304
+- GSM8K: 0.1327
+- HumanEval: 0.1037
+- SpellingBee: 1.0000
+- ChatCORE metric: 0.4157
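The section timestamps and the hourly rate from the Hardware list allow a rough cost estimate for the stages that follow midtraining. This is a sketch under one assumption: each timestamp is written when its stage finishes, so the gap between consecutive timestamps approximates the duration of the later stage. Midtraining itself is excluded because no earlier timestamp is available to bound it.

```python
from datetime import datetime

HOURLY_RATE = 14.32  # USD per hour, from the Hardware section

# Section timestamps copied from this report, in order.
STAMPS = [
    ("Midtraining", "2025-12-08 04:41:26"),
    ("Chat evaluation mid", "2025-12-08 05:06:16"),
    ("Chat SFT", "2025-12-08 05:18:08"),
    ("Chat evaluation sft", "2025-12-08 05:42:31"),
]

parsed = [(name, datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")) for name, ts in STAMPS]

# Assumption: gap between consecutive timestamps ~ duration of the later stage.
durations = {}
for (_, prev_time), (name, this_time) in zip(parsed, parsed[1:]):
    durations[name] = int((this_time - prev_time).total_seconds())

total_seconds = sum(durations.values())
total_cost = total_seconds / 3600 * HOURLY_RATE

for name, secs in durations.items():
    print(f"{name:<22} {secs:>5} s  ${secs / 3600 * HOURLY_RATE:.2f}")
print(f"{'Total':<22} {total_seconds:>5} s  ${total_cost:.2f}")
```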
+## Summary
+- Characters: 440,256
+- Lines: 10,727
+- Files: 52
+- Tokens (approx): 110,064
+- Dependencies (uv.lock lines): 2,218
+| Metric          | BASE     | MID      | SFT      | RL       |
+|-----------------|----------|----------|----------|----------|
+| ARC-Challenge   | -        | 0.5367   | 0.5418   | -        |
+| ARC-Easy        | -        | 0.6961   | 0.7210   | -        |
+| GSM8K           | -        | 0.1137   | 0.1327   | -        |
+| HumanEval       | -        | 0.1098   | 0.1037   | -        |
+| MMLU            | -        | 0.4229   | 0.4304   | -        |
+| ChatCORE        | -        | 0.4045   | 0.4157   | -        |
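To make the MID to SFT change easier to read, the snippet below computes per-metric deltas from the table values and lists any metric that regressed. It is a small illustration, not part of the generated report, and the variable names are hypothetical.

```python
# (MID, SFT) pairs copied from the summary table.
SUMMARY = {
    "ARC-Challenge": (0.5367, 0.5418),
    "ARC-Easy": (0.6961, 0.7210),
    "GSM8K": (0.1137, 0.1327),
    "HumanEval": (0.1098, 0.1037),
    "MMLU": (0.4229, 0.4304),
    "ChatCORE": (0.4045, 0.4157),
}

# Absolute change from MID to SFT for each metric.
deltas = {name: sft - mid for name, (mid, sft) in SUMMARY.items()}

# Metrics whose score dropped after SFT.
regressions = [name for name, d in deltas.items() if d < 0]

for name, d in deltas.items():
    print(f"{name:<15} {d:+.4f}")
print("Regressed after SFT:", ", ".join(regressions) or "none")
```

On these numbers SFT improves every metric except HumanEval, which drops by about 0.006.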