AI-Sweden-Models/ModernBERT-large

timpal0l committed on Sep 21

Commit 0a09cfa · verified · 1 Parent(s): 69b6272

Update README.md

Files changed (1)

README.md +21 -21

@@ -33,7 +33,7 @@ The training is done in one stage with 8192 tokens per sample for the whole run.
 |---|---|
 | Parameters | 395 M |
 | Context length | 8 192 tokens (RoPE + local-global attention) |
-| Tokens processed | 9.82 × 10<sup>11</sup> / 1.20 × 10<sup>12</sup> (≈ 82 %) |
+| Tokens processed | 1.20 × 10<sup>12</sup> |
 | Tokens per batch | 1 572 864 |
 | Global batch | 192 sequences (micro-batch = 3) |
 | Optimizer & schedule | Decoupled StableAdamW, lr 2 e-4, cosine decay (1 % warm-up) |
@@ -43,26 +43,26 @@ The training is done in one stage with 8192 tokens per sample for the whole run.
 See training details [here](https://github.com/timpal0l/ModernBERT/blob/main/training/trainer_lumi.yaml)
 ## Training Stats
 ```python
-[token=982585522155/1198510347252]:
-     Train time/batch: 716208
-     Train time/sample: 137511936
-     Train time/batch_in_epoch: 716208
-     Train time/sample_in_epoch: 137511936
-     Train time/token: 982584117341
-     Train time/token_in_epoch: 982584117341
-     Train trainer/device_train_microbatch_size: 3
-     Train loss/train/total: 0.8162
-     Train throughput/batches_per_sec: 0.6466
-     Train throughput/samples_per_sec: 124.1393
-     Train throughput/device/batches_per_sec: 0.0101
-     Train throughput/device/samples_per_sec: 1.9397
-     Train throughput/tokens_per_sec: 887795.9110
-     Train throughput/device/tokens_per_sec: 13871.8111
-     Train time/train: 317.5722
-     Train time/val: 0.0000
-     Train time/total: 317.5722
-     Train lr-StableAdamW/group0: 0.0000
-     Train lr-StableAdamW/group1: 0.0000
+[token=1198511677292/1198510347252]:
+	 Train time/batch: 873585
+	 Train time/sample: 167728320
+	 Train time/batch_in_epoch: 3558
+	 Train time/sample_in_epoch: 683136
+	 Train time/token: 1198510256276
+	 Train time/token_in_epoch: 4882888303
+	 Train trainer/device_train_microbatch_size: 3
+	 Train loss/train/total: 0.7730
+	 Train throughput/batches_per_sec: 0.6293
+	 Train throughput/samples_per_sec: 120.8212
+	 Train throughput/device/batches_per_sec: 0.0098
+	 Train throughput/device/samples_per_sec: 1.8878
+	 Train throughput/tokens_per_sec: 865578.9851
+	 Train throughput/device/tokens_per_sec: 13524.6716
+	 Train time/train: 385.2930
+	 Train time/val: 0.0000
+	 Train time/total: 385.2930
+	 Train lr-StableAdamW/group0: 0.0000
+	 Train lr-StableAdamW/group1: 0.0000
 ```
 ## Intended Use
 This is a **research artefact** and is only intended for **research purposes**.
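
The figures touched by this commit can be cross-checked with a little arithmetic. The sketch below is not part of the commit or of the training code; it only recomputes values that already appear in the README table and the Training Stats log. The constant and function names are illustrative. The device count is not stated anywhere in the diff, so the value derived here from the throughput ratio is an inference, not a documented setting.

```python
# Values copied from the README table and the Training Stats log in this diff.
SEQ_LEN = 8192                      # tokens per sample
GLOBAL_BATCH = 192                  # sequences per optimizer step
MICRO_BATCH = 3                     # device_train_microbatch_size
TARGET_TOKENS = 1_198_510_347_252   # denominator of the [token=.../...] header

# Counters from the removed (old) and added (new) log blocks.
OLD = {"batches": 716_208, "samples": 137_511_936, "tokens": 982_585_522_155}
NEW = {"batches": 873_585, "samples": 167_728_320, "tokens": 1_198_511_677_292}


def tokens_per_batch(seq_len: int = SEQ_LEN, global_batch: int = GLOBAL_BATCH) -> int:
    """Upper bound on tokens per step if every sequence is completely full."""
    return seq_len * global_batch


def progress_percent(tokens: int, target: int = TARGET_TOKENS) -> float:
    """Share of the token budget consumed, in percent."""
    return 100.0 * tokens / target


def implied_devices(global_rate: float, device_rate: float) -> int:
    """Device count implied by global vs. per-device throughput (an inference)."""
    return round(global_rate / device_rate)


if __name__ == "__main__":
    print("tokens per batch:", tokens_per_batch())
    for name, run in (("old", OLD), ("new", NEW)):
        # The sample counter should equal batches * global batch size.
        consistent = run["batches"] * GLOBAL_BATCH == run["samples"]
        print(f"{name}: samples consistent = {consistent}, "
              f"progress = {progress_percent(run['tokens']):.2f} %")
    # Throughput values from the new log block.
    print("implied devices:", implied_devices(865578.9851, 13524.6716))
```

The old counters reproduce the "≈ 82 %" that the commit removes from the table, and the new counters slightly exceed the token budget, which fits the table now listing only the total. Multiplying the batch count by 1 572 864 overshoots the logged token counter, so the counter presumably excludes padding or partially filled sequences; that reading is an assumption, not something the diff states.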