KL3M Multi-Word Tokenizer v2 - 4K
This is the 4,096-token variant of the KL3M (Kelvin Legal Large Language Model) multi-word tokenizer family v2, optimized for legal-domain text and built with hierarchical vocabulary nesting.
Overview
The KL3M multi-word tokenizers v2 are an improved family of byte-pair encoding (BPE) tokenizers trained on ~44GB of legal domain text from the KL3M dataset (copyright-clean legal corpus from the ALEA Institute). These tokenizers:
- Capture multi-word phrases as single tokens (e.g., "United States", "set forth", "accordance with")
- Encode complex legal terms efficiently (e.g., "Licensee", "hereinafter", "indemnification" as single tokens)
- Use hierarchical vocabulary nesting where smaller vocabularies are proper subsets of larger ones
- Outperform GPT-4's tokenizer on legal text, with the 128K variant using 7.5% fewer tokens on legal documents
- Enable vocabulary expansion experiments and transfer learning across vocabulary sizes
What's New in v2
- Cleaner special token design: 7 special tokens (removed experimental symbols)
- Improved legal domain optimization: Better encoding of common legal terminology
- Superior compression: 5.32 chars/token on legal text with the 128K variant (vs. 4.92 for GPT-4)
- Smaller file sizes: More efficient tokenizer representation
Multi-Word Tokenization Examples
What is multi-word tokenization? Unlike standard tokenizers that split text into subword pieces, these tokenizers capture multiple words as single tokens. This is especially powerful for legal text with common multi-word phrases.
Example: "The United States Department of Transportation is responsible for"
4K vocabulary (16 tokens):
[The][ ][United][ ][States][ ][Department][ of][ ][Trans][port][ation][ is][ ][responsible][ for]
32K vocabulary (10 tokens):
[The][ United][ States][ Department][ of][ ][Transportation][ is][ responsible][ for]
128K vocabulary (8 tokens) - "United States" is ONE token!
[The][ United States][ Department][ of][ Transportation][ is][ responsible][ for]
Example: "In accordance with the terms and conditions set forth herein"
4K vocabulary (16 tokens):
[In][ ][accordance][ with][ the][ ][terms][ and][ con][ditions][ ][set][ for][th][ ][herein]
32K vocabulary (10 tokens) - "accordance with" is ONE token!
[In][ ][accordance with][ the][ terms][ and][ conditions][ set][ forth][ herein]
128K vocabulary (8 tokens) - "accordance with" and "set forth" are single tokens!
[In][ accordance with][ the][ terms][ and][ conditions][ set forth][ herein]
Example: "The Supreme Court of the United States held that"
4K vocabulary (17 tokens):
[The][ S][up][re][me][ C][our][t][ of][ the][ ][United][ ][States][ h][eld][ that]
32K vocabulary (9 tokens):
[The][ Sup][reme][ Court][ of the][ United][ States][ held][ that]
128K vocabulary (7 tokens) - "United States" is ONE token!
[The][ Supreme][ Court][ of the][ United States][ held][ that]
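These comparisons are straightforward to reproduce. Below is a minimal sketch using the `tokenizers` library, with repository names taken from the family table further down this card:

```python
# Compare tokenizations of the same sentence across three vocabulary sizes.
from tokenizers import Tokenizer

text = "The Supreme Court of the United States held that"

for size in ("4k", "32k", "128k"):
    tok = Tokenizer.from_pretrained(f"alea-institute/kl3m-multi-word-002-{size}")
    enc = tok.encode(text, add_special_tokens=False)
    print(f"{size:>4}: {len(enc.ids)} tokens -> {enc.tokens}")
```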
Why This Matters
- Shorter sequences = faster inference and training
- Semantic coherence = "United States" as one unit, not two separate words
- Better legal understanding = common legal phrases encoded atomically
- Efficient compression = 7.5% fewer tokens than GPT-4 on legal text (128K variant)
Performance Comparison
On a realistic 3,743-character legal document (Software License Agreement):
| Tokenizer | Vocab Size | Tokens | Chars/Token | vs GPT-4 |
|---|---|---|---|---|
| KL3M v2-128K | 131,072 | 704 | 5.32 | -7.5% |
| GPT-4o/5 | 200,019 | 757 | 4.94 | -0.5% |
| GPT-4 | 100,277 | 761 | 4.92 | baseline |
| KL3M v2-64K | 65,536 | 802 | 4.67 | +5.4% |
| GPT-2 | 50,257 | 858 | 4.36 | +12.7% |
| KL3M v2-32K | 32,768 | 943 | 3.97 | +23.9% |
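The table can be re-derived from raw text. A hedged sketch follows: the sample file name is hypothetical, and `tiktoken`'s `cl100k_base` encoding stands in for GPT-4:

```python
import tiktoken
from tokenizers import Tokenizer

# Any legal document can stand in for the license agreement used above.
document = open("license_agreement.txt").read()  # hypothetical sample file

kl3m = Tokenizer.from_pretrained("alea-institute/kl3m-multi-word-002-128k")
gpt4 = tiktoken.get_encoding("cl100k_base")

n_kl3m = len(kl3m.encode(document).ids)
n_gpt4 = len(gpt4.encode(document))
print(f"KL3M v2-128K: {len(document) / n_kl3m:.2f} chars/token")
print(f"GPT-4:        {len(document) / n_gpt4:.2f} chars/token")
print(f"Tokens vs GPT-4: {(n_kl3m - n_gpt4) / n_gpt4:+.1%}")
```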
Legal Terminology Efficiency
Common legal terms as single tokens (128K vocab):
| Term | KL3M v2-128K | GPT-4 | GPT-4o/5 |
|---|---|---|---|
| "Licensee" | 1 token | 2 tokens | 2 tokens |
| "hereinafter" | 1 token | 3 tokens | 3 tokens |
| "indemnification" | 1 token | 4 tokens | 3 tokens |
| "arbitration" | 1 token | 3 tokens | 3 tokens |
| "WHEREAS" | 1 token | 2 tokens | 2 tokens |
| "non-exclusive" | 1 token | 2 tokens | 2 tokens |
Tokenizer Family
This tokenizer is part of a hierarchically nested family. Token IDs in smaller vocabularies are identical across all larger vocabularies, enabling seamless vocabulary expansion:
| Vocabulary Size | HuggingFace Repository | File Size |
|---|---|---|
| 4,096 (4K) | alea-institute/kl3m-multi-word-002-4k | 248 KB |
| 8,192 (8K) | alea-institute/kl3m-multi-word-002-8k | 516 KB |
| 16,384 (16K) | alea-institute/kl3m-multi-word-002-16k | 1.1 MB |
| 32,768 (32K) | alea-institute/kl3m-multi-word-002-32k | 2.1 MB |
| 65,536 (64K) | alea-institute/kl3m-multi-word-002-64k | 4.4 MB |
| 131,072 (128K) | alea-institute/kl3m-multi-word-002-128k | 8.9 MB |
→ You are viewing: 4,096 (4K)
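The nesting claim can be verified directly: every (token, ID) pair in a smaller vocabulary should appear unchanged in a larger one. A minimal check:

```python
from tokenizers import Tokenizer

small = Tokenizer.from_pretrained("alea-institute/kl3m-multi-word-002-4k")
large = Tokenizer.from_pretrained("alea-institute/kl3m-multi-word-002-128k")

small_vocab = small.get_vocab()  # token -> id
large_vocab = large.get_vocab()

# Every 4K entry must exist in the 128K vocabulary with the same ID.
assert all(large_vocab.get(t) == i for t, i in small_vocab.items())
print("4K vocabulary is a proper subset of 128K with identical IDs")
```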
Key Features
1. Multi-Word Tokenization
Legal text contains frequent multi-word phrases that benefit from being treated as single tokens. The larger vocabularies capture increasingly sophisticated legal terminology:
4K vocabulary examples:
- Common legal particles: "herein", "thereof", "pursuant"
- Basic legal terms: "shall", "party", "agreement"
32K vocabulary examples:
- Complex terms: "Licensee", "Licensor", "jurisdiction", "arbitration"
- Multi-word phrases: "intellectual property", "force majeure"
128K vocabulary examples:
- Specialized terms: "indemnification", "confidentiality", "non-exclusive"
- Complete phrases: "representations and warranties", "WHEREAS"
2. Hierarchical Vocabulary Nesting
Token IDs 0-4,095 are identical across all tokenizer sizes. This enables:
- Vocabulary expansion during training: Start with a 4K vocab and expand to 32K mid-training (sketched after this list)
- Transfer learning: Initialize larger vocab models from smaller vocab checkpoints
- Controlled ablations: Compare vocab sizes while maintaining token alignment
- Model compression: Train with large vocab, deploy with smaller vocab
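A minimal sketch of the expansion mechanics: because the first 4,096 IDs are shared, a larger embedding table can inherit the trained rows of a smaller one, with only the new rows freshly initialized. The shapes and init scale here are illustrative assumptions:

```python
import torch

def expand_embeddings(small_weight: torch.Tensor, new_vocab_size: int) -> torch.Tensor:
    """Copy the rows for shared token IDs; randomly initialize the rest."""
    old_vocab_size, hidden = small_weight.shape
    expanded = torch.empty(new_vocab_size, hidden).normal_(mean=0.0, std=0.02)
    expanded[:old_vocab_size] = small_weight  # nested IDs keep their embeddings
    return expanded

# e.g. expanding a 4K-vocab checkpoint to 32K with 768-dim embeddings
small = torch.randn(4096, 768)
large = expand_embeddings(small, 32768)
assert torch.equal(large[:4096], small)
```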
3. Legal Domain Optimization
Trained on the KL3M corpus (44GB of legal text):
- Court opinions and case law
- Contracts and agreements
- Patents and IP documents
- Legal briefs and filings
- Statutory and regulatory text
This specialized training produces:
- Better compression on legal documents (up to 5.32 chars/token with the 128K variant vs. 4.92 for GPT-4)
- Semantic coherence for legal multi-word expressions
- Reduced sequence lengths leading to faster inference
4. Special Tokens (v2)
Seven essential special tokens for language model training:
| Token | ID | Purpose |
|---|---|---|
| `<\|start\|>` | 0 | Start of sequence |
| `<\|end\|>` | 1 | End of sequence |
| `<\|pad\|>` | 2 | Padding token |
| `<\|unk\|>` | 3 | Unknown token |
| `<\|cls\|>` | 4 | Classification (BERT) |
| `<\|sep\|>` | 5 | Separator (BERT) |
| `<\|mask\|>` | 6 | Mask token (MLM) |
Note: v2 removes the experimental symbol tokens from v1 for a cleaner design.
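The IDs can be confirmed with a quick lookup:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("alea-institute/kl3m-multi-word-002-4k")
for t in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]:
    print(t, "->", tok.token_to_id(t))  # expected: 0 through 6
```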
Usage
With Transformers
```python
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-4k"
)

# Tokenize text
text = "The Licensor hereby grants to Licensee a non-exclusive license."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```
With tokenizers Library
```python
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained(
    "alea-institute/kl3m-multi-word-002-4k"
)

# Encode
text = "The Licensor hereby grants to Licensee a non-exclusive license."
encoding = tokenizer.encode(text)
print(f"Tokens: {encoding.tokens}")
print(f"IDs: {encoding.ids}")
```
Training a Model
```python
from transformers import AutoConfig, AutoModelForMaskedLM, PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-4k"
)

# Create model config; len(tokenizer) counts added special tokens,
# which tokenizer.vocab_size can omit
config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    vocab_size=len(tokenizer),
)

# Initialize model with random weights from the config
model = AutoModelForMaskedLM.from_config(config)

# Train with the HuggingFace Trainer...
```
Technical Details
Training Corpus
- Source: KL3M (Kelvin Legal Large Language Model) dataset
- Size: ~44.2 GB (44,168,540,153 bytes)
- Lines: 1,018,355,750
- Words: 5,997,814,602
- Domain: Legal documents (copyright-clean)
Training Parameters
```bash
bbpe train \
  --max-entropy 7.0 \
  --preprocessor unicode-whitespace \
  --preprocessor-probability 0.1 \
  --vocab-size 131072 \
  --family-size 65536 --family-size 32768 \
  --family-size 16384 --family-size 8192 --family-size 4096
```
- Max entropy: 7.0 (balances multi-word phrases with common tokens)
- Preprocessing: Unicode whitespace normalization (10% probability)
- Byte fallback: Enabled (handles any input)
Vocabulary Structure
- Base vocabulary: 256 bytes + 49 extended chars = 305 base tokens
- Learned merges: vocab_size - 305 base tokens - 7 special tokens (for this 4K tokenizer: 4,096 - 305 - 7 = 3,784 merges)
- Nesting property: All tokens in size N exist in size 2N
Recommendations
By Use Case
Legal Document Processing (contracts, patents, briefs):
- Best: 128K or 64K vocab
- Rationale: Maximum compression of legal terminology
- Benefit: Shorter sequences, faster inference
Resource-Constrained Environments:
- Best: 16K or 32K vocab
- Rationale: Good balance of compression and model size
- Benefit: Smaller embedding layers, less memory
Experimentation / Research:
- Best: Multiple sizes with vocabulary expansion
- Rationale: Leverage nested structure for novel training strategies
- Benefit: Test curriculum learning, transfer learning
Model Size Guidelines
Choose vocab size based on your model parameter count:
| Model Size | Recommended Vocab | Embedding Parameters |
|---|---|---|
| <200M params | 4K-8K | 3-6M |
| 200M-500M params | 8K-16K | 6-13M |
| 500M-1B params | 16K-32K | 13-26M |
| 1B-3B params | 32K-64K | 26-52M |
| >3B params | 64K-128K | 52-104M |
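The arithmetic behind the last column is simply vocab_size x hidden_size (doubled if input and output embeddings are untied); the hidden sizes below are illustrative assumptions, not fixed by the tokenizer:

```python
# Embedding parameter counts for a few (vocab, hidden) combinations.
for vocab, hidden in [(4096, 768), (32768, 768), (131072, 768)]:
    print(f"{vocab:>6} tokens x {hidden} dims = {vocab * hidden / 1e6:.1f}M embedding params")
```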
Limitations
- Training domain: Optimized for legal English text; may underperform on other domains
- Multilingual: Trained primarily on English; limited non-English support
- Code: Less optimized for code compared to code-specific tokenizers
- Vocabulary size: Larger vocabs (64K+) require more embedding memory
Citation
If you use these tokenizers in your research, please cite:
```bibtex
@misc{kl3m-multi-word-002,
  title={KL3M Multi-Word Tokenizers v2: Hierarchically Nested BPE for Legal Domain},
  author={ALEA Institute},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/alea-institute/kl3m-multi-word-002-4k}
}
```
License
MIT License
About ALEA Institute
The ALEA Institute develops open-source tools and datasets for legal AI, including the KL3M corpus and multi-word tokenizers.
Related Resources
- KL3M Dataset: aleainstitute.ai/work/kl3m
- bbpe Tokenizer Trainer: github.com/microsoft/blingfire
- v1 Tokenizers: alea-institute/kl3m-multi-word-001-*
Version History
- v2 (002 series): Improved special token design, better legal optimization
- v1 (001 series): Initial release with 10 special tokens
Contact
For questions or issues, please contact: [contact information TBD]