alea-institute committed on
Commit 29080ce · verified · 1 Parent(s): bc5bc26

Upload KL3M multi-word tokenizer v2 (128K) - Update README

Files changed (1): README.md (+64 -1)

README.md CHANGED
@@ -23,7 +23,8 @@ This is the **131,072 token** variant of the KL3M (Kelvin Legal Large Language M

 The KL3M multi-word tokenizers v2 are an improved family of byte-pair encoding (BPE) tokenizers trained on ~44GB of legal domain text from the [KL3M dataset](https://aleainstitute.ai/work/kl3m/) (a copyright-clean legal corpus from the ALEA Institute). These tokenizers:

- - **Capture multi-word legal phrases** as single tokens (e.g., "Licensee", "hereinafter", "indemnification")
+ - **Capture multi-word phrases as single tokens** (e.g., "United States", "set forth", "accordance with")
+ - **Encode complex legal terms efficiently** (e.g., "Licensee", "hereinafter", "indemnification" as single tokens)
 - **Use hierarchical vocabulary nesting** where smaller vocabularies are proper subsets of larger ones
 - **Outperform GPT-4 on legal text** with 7.5% better compression on legal documents
 - **Enable vocabulary expansion experiments** and transfer learning across vocabulary sizes

@@ -35,6 +36,68 @@ The KL3M multi-word tokenizers v2 are an improved family of byte-pair encoding (
 - **Superior compression**: 5.32 chars/token on legal text (vs 4.92 for GPT-4)
 - **Smaller file sizes**: More efficient tokenizer representation

+ ## Multi-Word Tokenization Examples
+
+ **What is multi-word tokenization?** Unlike standard tokenizers, which split text into subword pieces, these tokenizers capture **multiple words as single tokens**. This is especially powerful for legal text, which is full of recurring multi-word phrases.
+
+ ### Example: "The United States Department of Transportation is responsible for"
+
+ **4K vocabulary** (16 tokens):
+ ```
+ [The][ ][United][ ][States][ ][Department][ of][ ][Trans][port][ation][ is][ ][responsible][ for]
+ ```
+
+ **32K vocabulary** (10 tokens):
+ ```
+ [The][ United][ States][ Department][ of][ ][Transportation][ is][ responsible][ for]
+ ```
+
+ **128K vocabulary** (8 tokens): **"United States" is ONE token!**
+ ```
+ [The][ United States][ Department][ of][ Transportation][ is][ responsible][ for]
+ ```
+
+ ### Example: "In accordance with the terms and conditions set forth herein"
+
+ **4K vocabulary** (16 tokens):
+ ```
+ [In][ ][accordance][ with][ the][ ][terms][ and][ con][ditions][ ][set][ for][th][ ][herein]
+ ```
+
+ **32K vocabulary** (10 tokens): **"accordance with" is ONE token!**
+ ```
+ [In][ ][accordance with][ the][ terms][ and][ conditions][ set][ forth][ herein]
+ ```
+
+ **128K vocabulary** (8 tokens): **"accordance with" and "set forth" are single tokens!**
+ ```
+ [In][ accordance with][ the][ terms][ and][ conditions][ set forth][ herein]
+ ```
+
+ ### Example: "The Supreme Court of the United States held that"
+
+ **4K vocabulary** (17 tokens):
+ ```
+ [The][ S][up][re][me][ C][our][t][ of][ the][ ][United][ ][States][ h][eld][ that]
+ ```
+
+ **32K vocabulary** (9 tokens):
+ ```
+ [The][ Sup][reme][ Court][ of the][ United][ States][ held][ that]
+ ```
+
+ **128K vocabulary** (7 tokens): **"United States" is ONE token!**
+ ```
+ [The][ Supreme][ Court][ of the][ United States][ held][ that]
+ ```
+
+ ### Why This Matters
+
+ 1. **Shorter sequences** = faster inference and training
+ 2. **Semantic coherence** = "United States" as one unit, not two separate words
+ 3. **Better legal understanding** = common legal phrases encoded atomically
+ 4. **Efficient compression** = 7.5% fewer tokens than GPT-4 on legal text
+
 ## Performance Comparison

 On a realistic 3,743-character legal document (Software License Agreement):
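
The bracketed token displays in the added section are straightforward to reproduce. Below is a minimal sketch, not part of the commit: it assumes the tokenizer loads through the standard Hugging Face `transformers` `AutoTokenizer` API, and the repo id shown is a placeholder to be replaced with the actual KL3M multi-word tokenizer repository for the vocabulary size you want to inspect.

```python
# Minimal sketch (not from the commit): reproduce the bracketed token
# displays and the chars/token compression metric quoted in the README.
# NOTE: the repo id below is a placeholder -- substitute the actual
# KL3M multi-word tokenizer repo for the size you want (4K/32K/128K).
from transformers import AutoTokenizer

REPO_ID = "alea-institute/kl3m-multi-word-128k"  # placeholder repo id

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)

text = "The United States Department of Transportation is responsible for"
ids = tokenizer.encode(text, add_special_tokens=False)
pieces = [tokenizer.decode([i]) for i in ids]

# Same [token][token] notation as the examples in the README.
print(f"{len(pieces)} tokens: " + "".join(f"[{p}]" for p in pieces))

# Compression metric: characters per token (higher is better).
print(f"{len(text) / len(ids):.2f} chars/token")
```

With the 128K tokenizer this should yield 8 tokens, with `[ United States]` as a single piece, matching the first example above; running the same loop with the 4K and 32K variants should reproduce the 16- and 10-token splits.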
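The hierarchical-nesting bullet ("smaller vocabularies are proper subsets of larger ones") is also directly checkable. Another minimal sketch, again with placeholder repo ids; it checks token-string membership only, since the README does not say whether token ids are shared across sizes:

```python
# Minimal sketch (placeholder repo ids): verify that the 32K vocabulary
# is a subset of the 128K vocabulary, as the nesting bullet claims.
from transformers import AutoTokenizer

small = AutoTokenizer.from_pretrained("alea-institute/kl3m-multi-word-32k")   # placeholder
large = AutoTokenizer.from_pretrained("alea-institute/kl3m-multi-word-128k")  # placeholder

# get_vocab() returns a {token_string: token_id} mapping.
missing = set(small.get_vocab()) - set(large.get_vocab())
print(f"32K tokens missing from the 128K vocab: {len(missing)}")  # expect 0 if nested
```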