alea-institute committed on
Commit 29080ce · verified · 1 Parent(s): bc5bc26

Upload KL3M multi-word tokenizer v2 (128K) - Update README

Files changed (1): README.md (+64 -1)

README.md CHANGED
@@ -23,7 +23,8 @@ This is the **131,072 token** variant of the KL3M (Kelvin Legal Large Language M

 The KL3M multi-word tokenizers v2 are an improved family of byte-pair encoding (BPE) tokenizers trained on ~44GB of legal domain text from the [KL3M dataset](https://aleainstitute.ai/work/kl3m/) (a copyright-clean legal corpus from the ALEA Institute). These tokenizers:

- - **Capture multi-word legal phrases** as single tokens (e.g., "Licensee", "hereinafter", "indemnification")
+ - **Capture multi-word phrases as single tokens** (e.g., "United States", "set forth", "accordance with")
+ - **Encode complex legal terms efficiently** (e.g., "Licensee", "hereinafter", "indemnification" as single tokens)
 - **Use hierarchical vocabulary nesting** where smaller vocabularies are proper subsets of larger ones
 - **Outperform GPT-4 on legal text** with 7.5% better compression on legal documents
 - **Enable vocabulary expansion experiments** and transfer learning across vocabulary sizes

@@ -35,6 +36,68 @@ The KL3M multi-word tokenizers v2 are an improved family of byte-pair encoding (
 - **Superior compression**: 5.32 chars/token on legal text (vs 4.92 for GPT-4)
 - **Smaller file sizes**: More efficient tokenizer representation

+ ## Multi-Word Tokenization Examples
+
+ **What is multi-word tokenization?** Unlike standard tokenizers, which split text into subword pieces, these tokenizers capture **multiple words as single tokens**. This is especially powerful for legal text, which is full of recurring multi-word phrases.
+
+ ### Example: "The United States Department of Transportation is responsible for"
+
+ **4K vocabulary** (16 tokens):
+ ```
+ [The][ ][United][ ][States][ ][Department][ of][ ][Trans][port][ation][ is][ ][responsible][ for]
+ ```
+
+ **32K vocabulary** (10 tokens):
+ ```
+ [The][ United][ States][ Department][ of][ ][Transportation][ is][ responsible][ for]
+ ```
+
+ **128K vocabulary** (8 tokens): **"United States" is ONE token!**
+ ```
+ [The][ United States][ Department][ of][ Transportation][ is][ responsible][ for]
+ ```
+
+ ### Example: "In accordance with the terms and conditions set forth herein"
+
+ **4K vocabulary** (16 tokens):
+ ```
+ [In][ ][accordance][ with][ the][ ][terms][ and][ con][ditions][ ][set][ for][th][ ][herein]
+ ```
+
+ **32K vocabulary** (10 tokens): **"accordance with" is ONE token!**
+ ```
+ [In][ ][accordance with][ the][ terms][ and][ conditions][ set][ forth][ herein]
+ ```
+
+ **128K vocabulary** (8 tokens): **"accordance with" and "set forth" are single tokens!**
+ ```
+ [In][ accordance with][ the][ terms][ and][ conditions][ set forth][ herein]
+ ```
+
+ ### Example: "The Supreme Court of the United States held that"
+
+ **4K vocabulary** (17 tokens):
+ ```
+ [The][ S][up][re][me][ C][our][t][ of][ the][ ][United][ ][States][ h][eld][ that]
+ ```
+
+ **32K vocabulary** (9 tokens):
+ ```
+ [The][ Sup][reme][ Court][ of the][ United][ States][ held][ that]
+ ```
+
+ **128K vocabulary** (7 tokens): **"United States" is ONE token!**
+ ```
+ [The][ Supreme][ Court][ of the][ United States][ held][ that]
+ ```
+
+ ### Why This Matters
+
+ 1. **Shorter sequences** = faster inference and training
+ 2. **Semantic coherence** = "United States" as one unit, not two separate words
+ 3. **Better legal understanding** = common legal phrases encoded atomically
+ 4. **Efficient compression** = 7.5% fewer tokens than GPT-4 on legal text
+
 ## Performance Comparison

 On a realistic 3,743-character legal document (Software License Agreement):
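
The bracketed token displays in the added section are straightforward to reproduce. Below is a minimal sketch, not part of the commit: it assumes the tokenizer loads through the standard Hugging Face `transformers` `AutoTokenizer` API, and the repo id shown is a placeholder to be replaced with the actual KL3M multi-word tokenizer repository for the vocabulary size you want to inspect.

```python
# Minimal sketch (not from the commit): reproduce the bracketed token
# displays and the chars/token compression metric quoted in the README.
# NOTE: the repo id below is a placeholder -- substitute the actual
# KL3M multi-word tokenizer repo for the size you want (4K/32K/128K).
from transformers import AutoTokenizer

REPO_ID = "alea-institute/kl3m-multi-word-128k"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)

text = "The United States Department of Transportation is responsible for"
ids = tokenizer.encode(text, add_special_tokens=False)
pieces = [tokenizer.decode([i]) for i in ids]

# Same [token][token] notation as the examples in the README.
print(f"{len(pieces)} tokens: " + "".join(f"[{p}]" for p in pieces))

# Compression metric: characters per token (higher is better).
print(f"{len(text) / len(ids):.2f} chars/token")
```

With the 128K tokenizer this should yield 8 tokens, with `[ United States]` as a single piece, matching the first example above; running the same loop with the 4K and 32K variants should reproduce the 16- and 10-token splits.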
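The hierarchical-nesting bullet ("smaller vocabularies are proper subsets of larger ones") is also directly checkable. Another minimal sketch, again with placeholder repo ids; it checks token-string membership only, since the README does not say whether token ids are shared across sizes:

```python
# Minimal sketch (placeholder repo ids): verify that the 32K vocabulary
# is a subset of the 128K vocabulary, as the nesting bullet claims.
from transformers import AutoTokenizer

small = AutoTokenizer.from_pretrained("alea-institute/kl3m-multi-word-32k")   # placeholder
large = AutoTokenizer.from_pretrained("alea-institute/kl3m-multi-word-128k")  # placeholder

# get_vocab() returns a {token_string: token_id} mapping.
missing = set(small.get_vocab()) - set(large.get_vocab())
print(f"32K tokens missing from the 128K vocab: {len(missing)}")  # expect 0 if nested
```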