Fill-Mask
Transformers
Safetensors
modernbert
masked-lm
long-context
timpal0l committed
Commit 0a09cfa · verified · 1 Parent(s): 69b6272

Update README.md

Files changed (1)
  1. README.md +21 -21
README.md CHANGED
@@ -33,7 +33,7 @@ The training is done in one stage with 8192 tokens per sample for the whole run.
  |---|---|
  | Parameters | 395 M |
  | Context length | 8 192 tokens (RoPE + local-global attention) |
- | Tokens processed | 9.82 × 10<sup>11</sup> / 1.20 × 10<sup>12</sup> (≈ 82 %) |
+ | Tokens processed | 1.20 × 10<sup>12</sup> |
  | Tokens per batch | 1 572 864 |
  | Global batch | 192 sequences (micro-batch = 3) |
  | Optimizer & schedule | Decoupled StableAdamW, lr 2 e-4, cosine decay (1 % warm-up) |
@@ -43,26 +43,26 @@ The training is done in one stage with 8192 tokens per sample for the whole run.
  See training details [here](https://github.com/timpal0l/ModernBERT/blob/main/training/trainer_lumi.yaml)
  ## Training Stats
  ```python
- [token=982585522155/1198510347252]:
- Train time/batch: 716208
- Train time/sample: 137511936
- Train time/batch_in_epoch: 716208
- Train time/sample_in_epoch: 137511936
- Train time/token: 982584117341
- Train time/token_in_epoch: 982584117341
- Train trainer/device_train_microbatch_size: 3
- Train loss/train/total: 0.8162
- Train throughput/batches_per_sec: 0.6466
- Train throughput/samples_per_sec: 124.1393
- Train throughput/device/batches_per_sec: 0.0101
- Train throughput/device/samples_per_sec: 1.9397
- Train throughput/tokens_per_sec: 887795.9110
- Train throughput/device/tokens_per_sec: 13871.8111
- Train time/train: 317.5722
- Train time/val: 0.0000
- Train time/total: 317.5722
- Train lr-StableAdamW/group0: 0.0000
- Train lr-StableAdamW/group1: 0.0000
+ [token=1198511677292/1198510347252]:
+ Train time/batch: 873585
+ Train time/sample: 167728320
+ Train time/batch_in_epoch: 3558
+ Train time/sample_in_epoch: 683136
+ Train time/token: 1198510256276
+ Train time/token_in_epoch: 4882888303
+ Train trainer/device_train_microbatch_size: 3
+ Train loss/train/total: 0.7730
+ Train throughput/batches_per_sec: 0.6293
+ Train throughput/samples_per_sec: 120.8212
+ Train throughput/device/batches_per_sec: 0.0098
+ Train throughput/device/samples_per_sec: 1.8878
+ Train throughput/tokens_per_sec: 865578.9851
+ Train throughput/device/tokens_per_sec: 13524.6716
+ Train time/train: 385.2930
+ Train time/val: 0.0000
+ Train time/total: 385.2930
+ Train lr-StableAdamW/group0: 0.0000
+ Train lr-StableAdamW/group1: 0.0000
  ```
  ## Intended Use
  This is a **research artefact** and is only intended for **research purposes**.
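
The changed hyperparameter rows are easy to sanity-check by hand. The sketch below is not part of the README being diffed; it just reproduces the arithmetic behind the "Tokens per batch" row and the "≈ 82 %" progress figure that this commit removes.

```python
# Quick arithmetic check of the hyperparameter table (values taken from the diff above).
context_length = 8_192   # tokens per sequence
global_batch = 192       # sequences per optimizer step (micro-batch = 3)

tokens_per_batch = context_length * global_batch
print(tokens_per_batch)  # 1572864 -> matches the "Tokens per batch" row

# The removed "Tokens processed" entry reported progress against the full token budget:
tokens_seen = 982_585_522_155
token_budget = 1_198_510_347_252
print(f"{tokens_seen / token_budget:.1%}")  # ~82.0%, i.e. the old "≈ 82 %" value
```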
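The final stats block is also internally consistent. The log never states the device count, so the figure derived below is an inference from the global versus per-device throughput numbers, not something quoted in the card.

```python
# Cross-check of the final training stats (numbers copied from the log above;
# the device count is inferred, not logged).
samples_per_sec = 120.8212        # Train throughput/samples_per_sec
device_samples_per_sec = 1.8878   # Train throughput/device/samples_per_sec
micro_batch = 3                   # Train trainer/device_train_microbatch_size

devices = round(samples_per_sec / device_samples_per_sec)
print(devices)                    # 64 -> implied number of devices

print(devices * micro_batch)      # 192 -> matches the "Global batch" row
```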