Upload KL3M multi-word tokenizer v2 (128K) - Update README
README.md CHANGED
@@ -23,7 +23,8 @@ This is the **131,072 token** variant of the KL3M (Kelvin Legal Large Language M
The KL3M multi-word tokenizers v2 are an improved family of byte-pair encoding (BPE) tokenizers trained on ~44GB of legal domain text from the [KL3M dataset](https://aleainstitute.ai/work/kl3m/) (copyright-clean legal corpus from the ALEA Institute). These tokenizers:
- **Capture multi-word phrases as single tokens** (e.g., "United States", "set forth", "accordance with")
- **Encode complex legal terms efficiently** (e.g., "Licensee", "hereinafter", "indemnification" as single tokens)
- **Use hierarchical vocabulary nesting** where smaller vocabularies are proper subsets of larger ones (see the sketch after this list)
- **Outperform GPT-4 on legal text** with 7.5% better compression on legal documents
- **Enable vocabulary expansion experiments** and transfer learning across vocabulary sizes
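
The nesting claim above can be checked directly from the published vocabularies. Below is a minimal sketch assuming the tokenizers load through Hugging Face's `AutoTokenizer`; the repository IDs are placeholders, not confirmed model names.

```python
# Minimal sketch: verify that a smaller KL3M v2 vocabulary nests inside a larger one.
# NOTE: both repository IDs are placeholders; substitute the actual repos you use.
from transformers import AutoTokenizer

small = AutoTokenizer.from_pretrained("alea-institute/kl3m-multi-word-32k")   # placeholder ID
large = AutoTokenizer.from_pretrained("alea-institute/kl3m-multi-word-128k")  # placeholder ID

small_vocab = set(small.get_vocab())  # get_vocab() maps token string -> id; keep the strings
large_vocab = set(large.get_vocab())

print(f"32K vocabulary size:  {len(small_vocab):,}")
print(f"128K vocabulary size: {len(large_vocab):,}")
print("All 32K tokens present in 128K:", small_vocab <= large_vocab)
```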
- **Superior compression**: 5.32 chars/token on legal text (vs 4.92 for GPT-4)
- **Smaller file sizes**: More efficient tokenizer representation
## Multi-Word Tokenization Examples
**What is multi-word tokenization?** Unlike standard tokenizers that split text into subword pieces, these tokenizers capture **multiple words as single tokens**. This is especially powerful for legal text with common multi-word phrases.
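
The bracketed views in the examples below can be reproduced by decoding each token id on its own, so multi-word tokens show up as single spans. A small sketch, assuming the tokenizer loads through `AutoTokenizer` and using a placeholder repository ID:

```python
# Sketch: print a bracketed view of how a phrase is split into tokens.
# NOTE: the repository ID is a placeholder, not a confirmed model name.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-multi-word-128k")  # placeholder ID

text = "The United States Department of Transportation is responsible for"
ids = tokenizer.encode(text, add_special_tokens=False)

# Decoding one id at a time shows the exact text span each token covers.
pieces = [tokenizer.decode([i]) for i in ids]
print(f"{len(ids)} tokens:", "".join(f"[{p}]" for p in pieces))
```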
### Example: "The United States Department of Transportation is responsible for"
**4K vocabulary** (16 tokens):
```
[The][ ][United][ ][States][ ][Department][ of][ ][Trans][port][ation][ is][ ][responsible][ for]
```
**32K vocabulary** (10 tokens):
```
[The][ United][ States][ Department][ of][ ][Transportation][ is][ responsible][ for]
```
**128K vocabulary** (8 tokens) - **"United States" is ONE token!**
```
[The][ United States][ Department][ of][ Transportation][ is][ responsible][ for]
```
### Example: "In accordance with the terms and conditions set forth herein"
**4K vocabulary** (16 tokens):
```
[In][ ][accordance][ with][ the][ ][terms][ and][ con][ditions][ ][set][ for][th][ ][herein]
```
**32K vocabulary** (10 tokens) - **"accordance with" is ONE token!**
```
[In][ ][accordance with][ the][ terms][ and][ conditions][ set][ forth][ herein]
```
**128K vocabulary** (8 tokens) - **"accordance with" and "set forth" are single tokens!**
```
[In][ accordance with][ the][ terms][ and][ conditions][ set forth][ herein]
```
### Example: "The Supreme Court of the United States held that"
**4K vocabulary** (17 tokens):
```
[The][ S][up][re][me][ C][our][t][ of][ the][ ][United][ ][States][ h][eld][ that]
```
**32K vocabulary** (9 tokens):
```
[The][ Sup][reme][ Court][ of the][ United][ States][ held][ that]
```
**128K vocabulary** (7 tokens) - **"United States" is ONE token!**
```
[The][ Supreme][ Court][ of the][ United States][ held][ that]
```
### Why This Matters
1. **Shorter sequences** = faster inference and training
2. **Semantic coherence** = "United States" as one unit, not two separate words
3. **Better legal understanding** = common legal phrases encoded atomically
4. **Efficient compression** = 7.5% fewer tokens than GPT-4 on legal text
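
The compression figures above come from the README's own benchmark (see the comparison below); one way to sanity-check them on your own documents is to compare characters per token against GPT-4's `cl100k_base` encoding via `tiktoken`. A rough sketch, again with a placeholder repository ID:

```python
# Sketch: compare chars/token between a KL3M v2 tokenizer and GPT-4's cl100k_base encoding.
# NOTE: the repository ID is a placeholder; point it at the tokenizer you actually use.
import tiktoken
from transformers import AutoTokenizer

kl3m = AutoTokenizer.from_pretrained("alea-institute/kl3m-multi-word-128k")  # placeholder ID
gpt4 = tiktoken.get_encoding("cl100k_base")

# Any legal document works here; longer texts give more stable ratios.
text = open("software_license_agreement.txt", encoding="utf-8").read()

n_kl3m = len(kl3m.encode(text, add_special_tokens=False))
n_gpt4 = len(gpt4.encode(text))

print(f"KL3M 128K:   {n_kl3m} tokens, {len(text) / n_kl3m:.2f} chars/token")
print(f"cl100k_base: {n_gpt4} tokens, {len(text) / n_gpt4:.2f} chars/token")
```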
## Performance Comparison
On a realistic 3,743-character legal document (Software License Agreement):