dahara1 committed
Commit 5524075 · verified · 1 Parent(s): 6b02c3d

Update README.md

Browse files
Files changed (1):
  1. README.md +87 -3
README.md CHANGED
@@ -1,3 +1,87 @@
- ---
- license: mit
- ---
+ ---
+ license: mit
+ language:
+ - ja
+ ---
+
+ ## Environment
+
+ ### Git Information
+ - Branch: master
+ - Commit: d4a77fb (dirty)
+ - Message: fix token eval2
+
+ ### Hardware
+ - Platform: Linux
+ - CPUs: 64 cores (64 logical)
+ - Memory: 2015.6 GB
+ - GPUs: 8x NVIDIA H100 80GB HBM3
+ - GPU Memory: 633.5 GB total
+ - CUDA Version: 12.8
+ - Hourly Rate: $24.00/hour
+
+ ### Software
+ - Python: 3.11.9
+ - PyTorch: 2.9.0+cu128
+
+
+ ### Bloat
+ - Characters: 382,832
+ - Lines: 9,485
+ - Files: 57
+ - Tokens (approx): 95,708
+ - Dependencies (uv.lock lines): 2,004
+
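+ The "Tokens (approx)" figure is exactly Characters / 4 (382,832 / 4 = 95,708), which looks like the common four-characters-per-token heuristic; a one-line sanity check, assuming that heuristic:
+
+ ```python
+ # Assumption: "Tokens (approx)" uses the common chars/4 heuristic.
+ chars = 382_832
+ approx_tokens = chars // 4  # 95_708, matching the Bloat figure above
+ ```
+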
+ Run started: 2025-10-16 16:25:24
+
+ ## Tokenizer evaluation
+ timestamp: 2025-10-16 16:25:26
+
+ ### Comparison with GPT-2
+
+ | Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio (bytes/token) | Ours Tokens | Ours Ratio (bytes/token) | Relative Diff % |
+ |-----------|-------|--------------|---------------------------|-------------|--------------------------|-----------------|
+ | news | 1819 | 404 | 4.50 | 705 | 2.58 | -74.5% |
+ | korean | 893 | 745 | 1.20 | 729 | 1.22 | +2.1% |
+ | code | 1259 | 576 | 2.19 | 708 | 1.78 | -22.9% |
+ | math | 1834 | 936 | 1.96 | 1063 | 1.73 | -13.6% |
+ | science | 1112 | 260 | 4.28 | 455 | 2.44 | -75.0% |
+ | japanese | 3618 | 2056 | 1.76 | 630 | 5.74 | +69.4% |
+
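+ In both tables the ratio columns are bytes per token (higher means better compression), and Relative Diff % compares token counts against the baseline, so a positive value means our tokenizer needs fewer tokens on that text. A minimal sketch of the arithmetic (the function name is illustrative; the row values come from the table above, not from the repo's code):
+
+ ```python
+ def compare_row(n_bytes: int, baseline_tokens: int, ours_tokens: int) -> None:
+     """Reproduce the ratio and Relative Diff % columns for one row."""
+     baseline_ratio = n_bytes / baseline_tokens  # bytes per token
+     ours_ratio = n_bytes / ours_tokens
+     # Positive: ours encodes the text in fewer tokens than the baseline.
+     rel_diff = (baseline_tokens - ours_tokens) / baseline_tokens * 100
+     print(f"{baseline_ratio:.2f} | {ours_ratio:.2f} | {rel_diff:+.1f}%")
+
+ compare_row(1819, 404, 705)   # news vs GPT-2     -> 4.50 | 2.58 | -74.5%
+ compare_row(3618, 2056, 630)  # japanese vs GPT-2 -> 1.76 | 5.74 | +69.4%
+ ```
+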
+ ### Comparison with GPT-4
+
+ | Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio (bytes/token) | Ours Tokens | Ours Ratio (bytes/token) | Relative Diff % |
+ |-----------|-------|--------------|---------------------------|-------------|--------------------------|-----------------|
+ | news | 1819 | 387 | 4.70 | 705 | 2.58 | -82.2% |
+ | korean | 893 | 364 | 2.45 | 729 | 1.22 | -100.3% |
+ | code | 1259 | 309 | 4.07 | 708 | 1.78 | -129.1% |
+ | math | 1834 | 832 | 2.20 | 1063 | 1.73 | -27.8% |
+ | science | 1112 | 249 | 4.47 | 455 | 2.44 | -82.7% |
+ | japanese | 3618 | 1458 | 2.48 | 630 | 5.74 | +56.8% |
+
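+ The baseline columns can be reproduced with the tiktoken library; a hedged sketch, assuming the GPT-4 column uses the cl100k_base encoding (the evaluation texts are not included in this README, so `text` is a placeholder):
+
+ ```python
+ import tiktoken
+
+ gpt2 = tiktoken.get_encoding("gpt2")
+ gpt4 = tiktoken.get_encoding("cl100k_base")  # assumption: the GPT-4 tokenizer
+
+ def baseline_counts(text: str) -> None:
+     n_bytes = len(text.encode("utf-8"))  # the Bytes column
+     for name, enc in (("gpt2", gpt2), ("gpt4", gpt4)):
+         n_tokens = len(enc.encode(text))
+         print(f"{name}: {n_tokens} tokens, {n_bytes / n_tokens:.2f} bytes/token")
+ ```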
+
+ ## Base model training (Japanese)
+ timestamp: 2025-10-16 16:17:09
+
+ - run: d20-jp-1760620493
+ - depth: 20
+ - max_seq_len: 2048
+ - target_param_data_ratio: 20
+ - num_iterations: -1
+ - device_batch_size: 32
+ - total_batch_size: 524,288
+ - embedding_lr: 0.2000
+ - unembedding_lr: 0.0040
+ - matrix_lr: 0.0200
+ - weight_decay: 0.0000
+ - eval_every: 250
+ - eval_tokens: 10,485,760
+ - DATASET_REPO_ID: kajuma/ABEJA-CC-JA-edu
+ - CONFIG_NAME: 10%
+ - SPLIT: train
+ - TOTAL_SHARDS: 378
+ - DOWNLOAD_CACHE_DIR: download_cache_jp
+ - Number of parameters: 560,988,160
+ - Number of training tokens: 11,219,763,200
+ - Minimum validation bpb: 0.6473
+ - Final validation bpb: 0.6682
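+
+ With num_iterations set to -1, the step count is presumably derived from target_param_data_ratio: 20 tokens per parameter gives 20 × 560,988,160 = 11,219,763,200 training tokens, which at 524,288 tokens per step works out to exactly 21,400 optimizer steps. A sketch of that arithmetic (variable names are illustrative):
+
+ ```python
+ params = 560_988_160          # Number of parameters
+ ratio = 20                    # target_param_data_ratio (tokens per parameter)
+ tokens_per_step = 524_288     # total_batch_size
+
+ train_tokens = ratio * params            # 11_219_763_200, matching the report
+ steps = train_tokens // tokens_per_step  # 21_400 steps
+ print(train_tokens, steps)
+ ```
+
+ The validation bpb numbers are bits per byte: cross-entropy normalized by UTF-8 bytes rather than by tokens, which keeps the metric comparable across tokenizers with different compression rates.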