---
language:
- en
license: mit
tags:
- tokenizer
- legal
- bpe
- byte-pair-encoding
- multi-word
- kl3m
- legal-domain
- hierarchical
pipeline_tag: fill-mask
library_name: transformers
---

# KL3M Multi-Word Tokenizer v2 - 32K

This is the **32,768 token** variant of the KL3M (Kelvin Legal Large Language Model) multi-word tokenizer family v2, optimized for legal domain text with hierarchical vocabulary nesting.

## Overview

The KL3M multi-word tokenizers v2 are an improved family of byte-pair encoding (BPE) tokenizers trained on ~44GB of legal domain text from the [KL3M dataset](https://aleainstitute.ai/work/kl3m/) (a copyright-clean legal corpus from the ALEA Institute). These tokenizers:

- **Capture multi-word phrases as single tokens** (e.g., "United States", "set forth", "accordance with")
- **Encode complex legal terms efficiently** (e.g., "Licensee", "hereinafter", "indemnification" as single tokens)
- **Use hierarchical vocabulary nesting** where smaller vocabularies are proper subsets of larger ones
- **Outperform GPT-4 on legal text** with 7.5% better compression on legal documents
- **Enable vocabulary expansion experiments** and transfer learning across vocabulary sizes

## What's New in v2

- **Cleaner special token design**: 7 special tokens (removed experimental symbols)
- **Improved legal domain optimization**: Better encoding of common legal terminology
- **Superior compression**: 5.32 chars/token on legal text (vs 4.92 for GPT-4)
- **Smaller file sizes**: More efficient tokenizer representation

## Multi-Word Tokenization Examples

**What is multi-word tokenization?** Unlike standard tokenizers that split text into subword pieces, these tokenizers capture **multiple words as single tokens**. This is especially powerful for legal text with common multi-word phrases.

### Example: "The United States Department of Transportation is responsible for"

**4K vocabulary** (16 tokens):

```
[The][ ][United][ ][States][ ][Department][ of][ ][Trans][port][ation][ is][ ][responsible][ for]
```

**32K vocabulary** (10 tokens):

```
[The][ United][ States][ Department][ of][ ][Transportation][ is][ responsible][ for]
```

**128K vocabulary** (8 tokens) - **"United States" is ONE token!**

```
[The][ United States][ Department][ of][ Transportation][ is][ responsible][ for]
```

### Example: "In accordance with the terms and conditions set forth herein"

**4K vocabulary** (16 tokens):

```
[In][ ][accordance][ with][ the][ ][terms][ and][ con][ditions][ ][set][ for][th][ ][herein]
```

**32K vocabulary** (10 tokens) - **"accordance with" is ONE token!**

```
[In][ ][accordance with][ the][ terms][ and][ conditions][ set][ forth][ herein]
```

**128K vocabulary** (8 tokens) - **"accordance with" and "set forth" are single tokens!**

```
[In][ accordance with][ the][ terms][ and][ conditions][ set forth][ herein]
```

### Example: "The Supreme Court of the United States held that"

**4K vocabulary** (17 tokens):

```
[The][ S][up][re][me][ C][our][t][ of][ the][ ][United][ ][States][ h][eld][ that]
```

**32K vocabulary** (9 tokens):

```
[The][ Sup][reme][ Court][ of the][ United][ States][ held][ that]
```

**128K vocabulary** (7 tokens) - **"United States" is ONE token!**

```
[The][ Supreme][ Court][ of the][ United States][ held][ that]
```

### Why This Matters

1. **Shorter sequences** = faster inference and training
2. **Semantic coherence** = "United States" as one unit, not two separate words
3. **Better legal understanding** = common legal phrases encoded atomically
4. **Efficient compression** = 7.5% fewer tokens than GPT-4 on legal text
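
The per-vocabulary counts above can be reproduced directly from the published tokenizers. Below is a minimal sketch (assuming `transformers` is installed and the Hub repositories from the family table below are reachable); it decodes each token id individually so the multi-word pieces are visible:

```python
from transformers import PreTrainedTokenizerFast

text = "In accordance with the terms and conditions set forth herein"

# Repo IDs are taken from the "Tokenizer Family" table below.
for repo in (
    "alea-institute/kl3m-multi-word-002-4k",
    "alea-institute/kl3m-multi-word-002-32k",
    "alea-institute/kl3m-multi-word-002-128k",
):
    tokenizer = PreTrainedTokenizerFast.from_pretrained(repo)
    ids = tokenizer.encode(text, add_special_tokens=False)  # count content tokens only
    pieces = [tokenizer.decode([i]) for i in ids]           # one decoded string per token id
    print(f"{repo.rsplit('-', 1)[-1]:>4}: {len(ids)} tokens -> {pieces}")
```
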
## Performance Comparison

On a realistic 3,743-character legal document (Software License Agreement):

| Tokenizer | Vocab Size | Tokens | Chars/Token | Tokens vs GPT-4 |
|-----------|------------|--------|-------------|-----------------|
| **KL3M v2-128K** | 131,072 | **704** | **5.32** | **-7.5%** |
| GPT-4o/5 | 200,019 | 757 | 4.94 | -0.5% |
| GPT-4 | 100,277 | 761 | 4.92 | baseline |
| GPT-2 | 50,257 | 858 | 4.36 | +12.7% |
| KL3M v2-64K | 65,536 | 802 | 4.67 | +5.4% |
| KL3M v2-32K | 32,768 | 943 | 3.97 | +23.9% |

### Legal Terminology Efficiency

Common legal terms as single tokens (128K vocab):

| Term | KL3M v2-128K | GPT-4 | GPT-4o/5 |
|------|--------------|-------|----------|
| "Licensee" | 1 token | 2 tokens | 2 tokens |
| "hereinafter" | 1 token | 3 tokens | 3 tokens |
| "indemnification" | 1 token | 4 tokens | 3 tokens |
| "arbitration" | 1 token | 3 tokens | 3 tokens |
| "WHEREAS" | 1 token | 2 tokens | 2 tokens |
| "non-exclusive" | 1 token | 2 tokens | 2 tokens |

## Tokenizer Family

This tokenizer is part of a hierarchically nested family. Token IDs in smaller vocabularies are **identical** across all larger vocabularies, enabling seamless vocabulary expansion:

| Vocabulary Size | HuggingFace Repository | File Size |
|----------------|------------------------|-----------|
| 4,096 (4K) | [alea-institute/kl3m-multi-word-002-4k](https://huggingface.co/alea-institute/kl3m-multi-word-002-4k) | 248 KB |
| 8,192 (8K) | [alea-institute/kl3m-multi-word-002-8k](https://huggingface.co/alea-institute/kl3m-multi-word-002-8k) | 516 KB |
| 16,384 (16K) | [alea-institute/kl3m-multi-word-002-16k](https://huggingface.co/alea-institute/kl3m-multi-word-002-16k) | 1.1 MB |
| 32,768 (32K) | [alea-institute/kl3m-multi-word-002-32k](https://huggingface.co/alea-institute/kl3m-multi-word-002-32k) | 2.1 MB |
| 65,536 (64K) | [alea-institute/kl3m-multi-word-002-64k](https://huggingface.co/alea-institute/kl3m-multi-word-002-64k) | 4.4 MB |
| 131,072 (128K) | [alea-institute/kl3m-multi-word-002-128k](https://huggingface.co/alea-institute/kl3m-multi-word-002-128k) | 8.9 MB |

**→ You are viewing: 32,768 (32K)**

## Key Features

### 1. Multi-Word Tokenization

Legal text contains frequent multi-word phrases that benefit from being treated as single tokens. The larger vocabularies capture increasingly sophisticated legal terminology:

**4K vocabulary examples:**
- Common legal particles: "herein", "thereof", "pursuant"
- Basic legal terms: "shall", "party", "agreement"

**32K vocabulary examples:**
- Complex terms: "Licensee", "Licensor", "jurisdiction", "arbitration"
- Multi-word phrases: "intellectual property", "force majeure"

**128K vocabulary examples:**
- Specialized terms: "indemnification", "confidentiality", "non-exclusive"
- Complete phrases: "representations and warranties", "WHEREAS"

### 2. Hierarchical Vocabulary Nesting

Token IDs 0-4,095 are **identical** across all tokenizer sizes. This enables:

- **Vocabulary expansion during training**: Start with a 4K vocab, expand to 32K mid-training (see the sketch after this list)
- **Transfer learning**: Initialize larger vocab models from smaller vocab checkpoints
- **Controlled ablations**: Compare vocab sizes while maintaining token alignment
- **Model compression**: Train with a large vocab, deploy with a smaller vocab
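
A minimal sketch of how the nesting can be used (non-authoritative; `torch` and Hub access are assumed, and the hidden size is a toy value): it first checks that every 4K token id maps to the same string in the 32K vocabulary, then copies the first 4,096 rows of a small-vocab embedding matrix into a larger one, which is the basic move behind vocabulary expansion and transfer learning.

```python
import torch
from transformers import PreTrainedTokenizerFast

small = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-4k")
large = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-32k")

# 1) Nesting check: every (token, id) pair in the 4K vocab appears unchanged in the 32K vocab.
small_vocab = small.get_vocab()   # token string -> id
large_vocab = large.get_vocab()
assert all(large_vocab.get(token) == idx for token, idx in small_vocab.items())

# 2) Vocabulary expansion: reuse small-vocab embeddings as the prefix of a larger matrix.
hidden_size = 256                                              # toy value, not a recommendation
small_emb = torch.nn.Embedding(len(small_vocab), hidden_size)  # stands in for a trained checkpoint
large_emb = torch.nn.Embedding(len(large_vocab), hidden_size)  # rows beyond 4,095 stay freshly initialized
with torch.no_grad():
    large_emb.weight[: small_emb.num_embeddings] = small_emb.weight
```
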
### 3. Legal Domain Optimization

Trained on the KL3M corpus (44GB of legal text):

- Court opinions and case law
- Contracts and agreements
- Patents and IP documents
- Legal briefs and filings
- Statutory and regulatory text

This specialized training produces:

- **Better compression** on legal documents (5.32 chars/token vs 4.92 for GPT-4)
- **Semantic coherence** for legal multi-word expressions
- **Reduced sequence lengths** leading to faster inference

### 4. Special Tokens (v2)

Seven essential special tokens for language model training:

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|start\|>` | 0 | Start of sequence |
| `<\|end\|>` | 1 | End of sequence |
| `<\|pad\|>` | 2 | Padding token |
| `<\|unk\|>` | 3 | Unknown token |
| `<\|cls\|>` | 4 | Classification (BERT) |
| `<\|sep\|>` | 5 | Separator (BERT) |
| `<\|mask\|>` | 6 | Mask token (MLM) |

*Note: v2 removes the experimental symbols (⧈, ⚖, ⏵) used in v1 for a cleaner design.*

## Usage

### With Transformers

```python
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-32k"
)

# Tokenize text
text = "The Licensor hereby grants to Licensee a non-exclusive license."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")

# Decode
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```

### With the tokenizers Library

```python
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained(
    "alea-institute/kl3m-multi-word-002-32k"
)

# Encode
text = "The Licensor hereby grants to Licensee a non-exclusive license."
encoding = tokenizer.encode(text)
print(f"Tokens: {encoding.tokens}")
print(f"IDs: {encoding.ids}")
```

### Training a Model

```python
from transformers import AutoConfig, AutoModelForMaskedLM, PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-32k"
)

# Create model config
config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    vocab_size=tokenizer.vocab_size,
)

# Initialize model
model = AutoModelForMaskedLM.from_config(config)

# Train with HuggingFace Trainer...
```
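
When wiring the tokenizer into a model as above, it can be useful to confirm that the special-token IDs match the table in the Special Tokens section. A short sketch (assuming the seven token strings are registered in the tokenizer exactly as listed):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-32k")

special_tokens = ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]
for token in special_tokens:
    # convert_tokens_to_ids falls back to the unknown-token id for strings not in the vocab
    print(f"{token:10} -> {tokenizer.convert_tokens_to_ids(token)}")
```
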
## Technical Details

### Training Corpus

- **Source**: KL3M (Kelvin Legal Large Language Model) dataset
- **Size**: ~44.2 GB (44,168,540,153 bytes)
- **Lines**: 1,018,355,750
- **Words**: 5,997,814,602
- **Domain**: Legal documents (copyright-clean)

### Training Parameters

```bash
bbpe train \
  --max-entropy 7.0 \
  --preprocessor unicode-whitespace \
  --preprocessor-probability 0.1 \
  --vocab-size 131072 \
  --family-size 65536 --family-size 32768 \
  --family-size 16384 --family-size 8192 --family-size 4096
```

- **Max entropy**: 7.0 (balances multi-word phrases with common tokens)
- **Preprocessing**: Unicode whitespace normalization (10% probability)
- **Byte fallback**: Enabled (handles any input)

### Vocabulary Structure

- **Base vocabulary**: 256 bytes + 49 extended chars = 305 base tokens
- **Learned merges**: vocab_size - 305 - 7 (special tokens)
- **Nesting property**: All tokens in size N exist in size 2N

## Recommendations

### By Use Case

**Legal Document Processing** (contracts, patents, briefs):
- **Best**: 128K or 64K vocab
- **Rationale**: Maximum compression of legal terminology
- **Benefit**: Shorter sequences, faster inference

**Resource-Constrained Environments**:
- **Best**: 16K or 32K vocab
- **Rationale**: Good balance of compression and model size
- **Benefit**: Smaller embedding layers, less memory

**Experimentation / Research**:
- **Best**: Multiple sizes with vocabulary expansion
- **Rationale**: Leverage the nested structure for novel training strategies
- **Benefit**: Test curriculum learning, transfer learning

### Model Size Guidelines

Choose vocab size based on your model parameter count:

| Model Size | Recommended Vocab | Embedding Parameters |
|------------|-------------------|----------------------|
| <200M params | 4K-8K | 3-6M |
| 200M-500M params | 8K-16K | 6-13M |
| 500M-1B params | 16K-32K | 13-26M |
| 1B-3B params | 32K-64K | 26-52M |
| >3B params | 64K-128K | 52-104M |

## Limitations

- **Training domain**: Optimized for legal English text; may underperform on other domains
- **Multilingual**: Trained primarily on English; limited non-English support
- **Code**: Less optimized for code compared to code-specific tokenizers
- **Vocabulary size**: Larger vocabs (64K+) require more embedding memory

## Citation

If you use these tokenizers in your research, please cite:

```bibtex
@misc{kl3m-multi-word-002,
  title={KL3M Multi-Word Tokenizers v2: Hierarchically Nested BPE for Legal Domain},
  author={ALEA Institute},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/alea-institute/kl3m-multi-word-002-32k}
}
```

## License

MIT License

## About ALEA Institute

The [ALEA Institute](https://aleainstitute.ai) develops open-source tools and datasets for legal AI, including the KL3M corpus and multi-word tokenizers.

## Related Resources

- **KL3M Dataset**: [aleainstitute.ai/work/kl3m](https://aleainstitute.ai/work/kl3m/)
- **bbpe Tokenizer Trainer**: [github.com/microsoft/blingfire](https://github.com/microsoft/blingfire)
- **v1 Tokenizers**: [alea-institute/kl3m-multi-word-001-*](https://huggingface.co/alea-institute/kl3m-multi-word-001-4k)

## Version History

- **v2 (002 series)**: Improved special token design, better legal optimization
- **v1 (001 series)**: Initial release with 10 special tokens

## Contact

For questions or issues, please contact: [contact information TBD]