---
language:
- en
license: mit
tags:
- tokenizer
- legal
- bpe
- byte-pair-encoding
- multi-word
- kl3m
- legal-domain
- hierarchical
pipeline_tag: fill-mask
library_name: transformers
---

# KL3M Multi-Word Tokenizer v2 - 32K

This is the **32,768 token** variant of the KL3M (Kelvin Legal Large Language Model) multi-word tokenizer family v2, optimized for legal domain text with hierarchical vocabulary nesting.

## Overview

The KL3M multi-word tokenizers v2 are an improved family of byte-pair encoding (BPE) tokenizers trained on ~44GB of legal domain text from the [KL3M dataset](https://aleainstitute.ai/work/kl3m/) (a copyright-clean legal corpus from the ALEA Institute). These tokenizers:

- **Capture multi-word phrases as single tokens** (e.g., "United States", "set forth", "accordance with")
- **Encode complex legal terms efficiently** (e.g., "Licensee", "hereinafter", "indemnification" as single tokens)
- **Use hierarchical vocabulary nesting** where smaller vocabularies are proper subsets of larger ones
- **Outperform GPT-4 on legal text** with 7.5% better compression on legal documents
- **Enable vocabulary expansion experiments** and transfer learning across vocabulary sizes

## What's New in v2

- **Cleaner special token design**: 7 special tokens (removed experimental symbols)
- **Improved legal domain optimization**: Better encoding of common legal terminology
- **Superior compression**: 5.32 chars/token on legal text (vs 4.92 for GPT-4)
- **Smaller file sizes**: More efficient tokenizer representation

## Multi-Word Tokenization Examples

**What is multi-word tokenization?** Unlike standard tokenizers that split text into subword pieces, these tokenizers capture **multiple words as single tokens**. This is especially powerful for legal text with common multi-word phrases.

### Example: "The United States Department of Transportation is responsible for"

**4K vocabulary** (16 tokens):

```
[The][ ][United][ ][States][ ][Department][ of][ ][Trans][port][ation][ is][ ][responsible][ for]
```

**32K vocabulary** (10 tokens):

```
[The][ United][ States][ Department][ of][ ][Transportation][ is][ responsible][ for]
```

**128K vocabulary** (8 tokens) - **"United States" is ONE token!**

```
[The][ United States][ Department][ of][ Transportation][ is][ responsible][ for]
```

### Example: "In accordance with the terms and conditions set forth herein"

**4K vocabulary** (16 tokens):

```
[In][ ][accordance][ with][ the][ ][terms][ and][ con][ditions][ ][set][ for][th][ ][herein]
```

**32K vocabulary** (10 tokens) - **"accordance with" is ONE token!**

```
[In][ ][accordance with][ the][ terms][ and][ conditions][ set][ forth][ herein]
```

**128K vocabulary** (8 tokens) - **"accordance with" and "set forth" are single tokens!**

```
[In][ accordance with][ the][ terms][ and][ conditions][ set forth][ herein]
```

### Example: "The Supreme Court of the United States held that"

**4K vocabulary** (17 tokens):

```
[The][ S][up][re][me][ C][our][t][ of][ the][ ][United][ ][States][ h][eld][ that]
```

**32K vocabulary** (9 tokens):

```
[The][ Sup][reme][ Court][ of the][ United][ States][ held][ that]
```

**128K vocabulary** (7 tokens) - **"United States" is ONE token!**

```
[The][ Supreme][ Court][ of the][ United States][ held][ that]
```

### Why This Matters

1. **Shorter sequences** = faster inference and training
2. **Semantic coherence** = "United States" as one unit, not two separate words
3. **Better legal understanding** = common legal phrases encoded atomically
4. **Efficient compression** = 7.5% fewer tokens than GPT-4 on legal text
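
The per-vocabulary counts above can be reproduced directly from the published tokenizers. Below is a minimal sketch (assuming `transformers` is installed and the Hub repositories from the family table below are reachable); it decodes each token id individually so the multi-word pieces are visible:

```python
from transformers import PreTrainedTokenizerFast

text = "In accordance with the terms and conditions set forth herein"

# Repo IDs are taken from the "Tokenizer Family" table below.
for repo in (
    "alea-institute/kl3m-multi-word-002-4k",
    "alea-institute/kl3m-multi-word-002-32k",
    "alea-institute/kl3m-multi-word-002-128k",
):
    tokenizer = PreTrainedTokenizerFast.from_pretrained(repo)
    ids = tokenizer.encode(text, add_special_tokens=False)  # count content tokens only
    pieces = [tokenizer.decode([i]) for i in ids]           # one decoded string per token id
    print(f"{repo.rsplit('-', 1)[-1]:>4}: {len(ids)} tokens -> {pieces}")
```
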
## Performance Comparison

On a realistic 3,743-character legal document (Software License Agreement):

| Tokenizer | Vocab Size | Tokens | Chars/Token | Tokens vs GPT-4 |
|-----------|------------|--------|-------------|-----------------|
| **KL3M v2-128K** | 131,072 | **704** | **5.32** | **-7.5%** |
| GPT-4o/5 | 200,019 | 757 | 4.94 | -0.5% |
| GPT-4 | 100,277 | 761 | 4.92 | baseline |
| GPT-2 | 50,257 | 858 | 4.36 | +12.7% |
| KL3M v2-64K | 65,536 | 802 | 4.67 | +5.4% |
| KL3M v2-32K | 32,768 | 943 | 3.97 | +23.9% |

### Legal Terminology Efficiency

Common legal terms as single tokens (128K vocab):

| Term | KL3M v2-128K | GPT-4 | GPT-4o/5 |
|------|--------------|-------|----------|
| "Licensee" | 1 token | 2 tokens | 2 tokens |
| "hereinafter" | 1 token | 3 tokens | 3 tokens |
| "indemnification" | 1 token | 4 tokens | 3 tokens |
| "arbitration" | 1 token | 3 tokens | 3 tokens |
| "WHEREAS" | 1 token | 2 tokens | 2 tokens |
| "non-exclusive" | 1 token | 2 tokens | 2 tokens |

## Tokenizer Family

This tokenizer is part of a hierarchically nested family. Token IDs in smaller vocabularies are **identical** across all larger vocabularies, enabling seamless vocabulary expansion:

| Vocabulary Size | HuggingFace Repository | File Size |
|----------------|------------------------|-----------|
| 4,096 (4K) | [alea-institute/kl3m-multi-word-002-4k](https://huggingface.co/alea-institute/kl3m-multi-word-002-4k) | 248 KB |
| 8,192 (8K) | [alea-institute/kl3m-multi-word-002-8k](https://huggingface.co/alea-institute/kl3m-multi-word-002-8k) | 516 KB |
| 16,384 (16K) | [alea-institute/kl3m-multi-word-002-16k](https://huggingface.co/alea-institute/kl3m-multi-word-002-16k) | 1.1 MB |
| 32,768 (32K) | [alea-institute/kl3m-multi-word-002-32k](https://huggingface.co/alea-institute/kl3m-multi-word-002-32k) | 2.1 MB |
| 65,536 (64K) | [alea-institute/kl3m-multi-word-002-64k](https://huggingface.co/alea-institute/kl3m-multi-word-002-64k) | 4.4 MB |
| 131,072 (128K) | [alea-institute/kl3m-multi-word-002-128k](https://huggingface.co/alea-institute/kl3m-multi-word-002-128k) | 8.9 MB |

**→ You are viewing: 32,768 (32K)**

## Key Features

### 1. Multi-Word Tokenization

Legal text contains frequent multi-word phrases that benefit from being treated as single tokens. The larger vocabularies capture increasingly sophisticated legal terminology:

**4K vocabulary examples:**
- Common legal particles: "herein", "thereof", "pursuant"
- Basic legal terms: "shall", "party", "agreement"

**32K vocabulary examples:**
- Complex terms: "Licensee", "Licensor", "jurisdiction", "arbitration"
- Multi-word phrases: "intellectual property", "force majeure"

**128K vocabulary examples:**
- Specialized terms: "indemnification", "confidentiality", "non-exclusive"
- Complete phrases: "representations and warranties", "WHEREAS"

### 2. Hierarchical Vocabulary Nesting

Token IDs 0-4,095 are **identical** across all tokenizer sizes. This enables:

- **Vocabulary expansion during training**: Start with a 4K vocab, expand to 32K mid-training (see the sketch after this list)
- **Transfer learning**: Initialize larger vocab models from smaller vocab checkpoints
- **Controlled ablations**: Compare vocab sizes while maintaining token alignment
- **Model compression**: Train with a large vocab, deploy with a smaller vocab
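
A minimal sketch of how the nesting can be used (non-authoritative; `torch` and Hub access are assumed, and the hidden size is a toy value): it first checks that every 4K token id maps to the same string in the 32K vocabulary, then copies the first 4,096 rows of a small-vocab embedding matrix into a larger one, which is the basic move behind vocabulary expansion and transfer learning.

```python
import torch
from transformers import PreTrainedTokenizerFast

small = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-4k")
large = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-32k")

# 1) Nesting check: every (token, id) pair in the 4K vocab appears unchanged in the 32K vocab.
small_vocab = small.get_vocab()   # token string -> id
large_vocab = large.get_vocab()
assert all(large_vocab.get(token) == idx for token, idx in small_vocab.items())

# 2) Vocabulary expansion: reuse small-vocab embeddings as the prefix of a larger matrix.
hidden_size = 256                                              # toy value, not a recommendation
small_emb = torch.nn.Embedding(len(small_vocab), hidden_size)  # stands in for a trained checkpoint
large_emb = torch.nn.Embedding(len(large_vocab), hidden_size)  # rows beyond 4,095 stay freshly initialized
with torch.no_grad():
    large_emb.weight[: small_emb.num_embeddings] = small_emb.weight
```
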
### 3. Legal Domain Optimization

Trained on the KL3M corpus (44GB of legal text):

- Court opinions and case law
- Contracts and agreements
- Patents and IP documents
- Legal briefs and filings
- Statutory and regulatory text

This specialized training produces:

- **Better compression** on legal documents (5.32 chars/token vs 4.92 for GPT-4)
- **Semantic coherence** for legal multi-word expressions
- **Reduced sequence lengths** leading to faster inference

### 4. Special Tokens (v2)

Seven essential special tokens for language model training:

| Token | ID | Purpose |
|-------|-----|---------|
| `<\|start\|>` | 0 | Start of sequence |
| `<\|end\|>` | 1 | End of sequence |
| `<\|pad\|>` | 2 | Padding token |
| `<\|unk\|>` | 3 | Unknown token |
| `<\|cls\|>` | 4 | Classification (BERT) |
| `<\|sep\|>` | 5 | Separator (BERT) |
| `<\|mask\|>` | 6 | Mask token (MLM) |

*Note: v2 removes the experimental symbols (⧈, ⚖, ⏵) used in v1 for a cleaner design.*

## Usage

### With Transformers

```python
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-32k"
)

# Tokenize text
text = "The Licensor hereby grants to Licensee a non-exclusive license."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")

# Decode
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```

### With the tokenizers Library

```python
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained(
    "alea-institute/kl3m-multi-word-002-32k"
)

# Encode
text = "The Licensor hereby grants to Licensee a non-exclusive license."
encoding = tokenizer.encode(text)
print(f"Tokens: {encoding.tokens}")
print(f"IDs: {encoding.ids}")
```

### Training a Model

```python
from transformers import AutoConfig, AutoModelForMaskedLM, PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-32k"
)

# Create model config
config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    vocab_size=tokenizer.vocab_size,
)

# Initialize model
model = AutoModelForMaskedLM.from_config(config)

# Train with HuggingFace Trainer...
```
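
When wiring the tokenizer into a model as above, it can be useful to confirm that the special-token IDs match the table in the Special Tokens section. A short sketch (assuming the seven token strings are registered in the tokenizer exactly as listed):

```python
from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-32k")

special_tokens = ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]
for token in special_tokens:
    # convert_tokens_to_ids falls back to the unknown-token id for strings not in the vocab
    print(f"{token:10} -> {tokenizer.convert_tokens_to_ids(token)}")
```
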
## Technical Details

### Training Corpus

- **Source**: KL3M (Kelvin Legal Large Language Model) dataset
- **Size**: ~44.2 GB (44,168,540,153 bytes)
- **Lines**: 1,018,355,750
- **Words**: 5,997,814,602
- **Domain**: Legal documents (copyright-clean)

### Training Parameters

```bash
bbpe train \
  --max-entropy 7.0 \
  --preprocessor unicode-whitespace \
  --preprocessor-probability 0.1 \
  --vocab-size 131072 \
  --family-size 65536 --family-size 32768 \
  --family-size 16384 --family-size 8192 --family-size 4096
```

- **Max entropy**: 7.0 (balances multi-word phrases with common tokens)
- **Preprocessing**: Unicode whitespace normalization (10% probability)
- **Byte fallback**: Enabled (handles any input)

### Vocabulary Structure

- **Base vocabulary**: 256 bytes + 49 extended chars = 305 base tokens
- **Learned merges**: vocab_size - 305 - 7 (special tokens)
- **Nesting property**: All tokens in size N exist in size 2N

## Recommendations

### By Use Case

**Legal Document Processing** (contracts, patents, briefs):
- **Best**: 128K or 64K vocab
- **Rationale**: Maximum compression of legal terminology
- **Benefit**: Shorter sequences, faster inference

**Resource-Constrained Environments**:
- **Best**: 16K or 32K vocab
- **Rationale**: Good balance of compression and model size
- **Benefit**: Smaller embedding layers, less memory

**Experimentation / Research**:
- **Best**: Multiple sizes with vocabulary expansion
- **Rationale**: Leverage the nested structure for novel training strategies
- **Benefit**: Test curriculum learning, transfer learning

### Model Size Guidelines

Choose vocab size based on your model parameter count:

| Model Size | Recommended Vocab | Embedding Parameters |
|------------|-------------------|----------------------|
| <200M params | 4K-8K | 3-6M |
| 200M-500M params | 8K-16K | 6-13M |
| 500M-1B params | 16K-32K | 13-26M |
| 1B-3B params | 32K-64K | 26-52M |
| >3B params | 64K-128K | 52-104M |

## Limitations

- **Training domain**: Optimized for legal English text; may underperform on other domains
- **Multilingual**: Trained primarily on English; limited non-English support
- **Code**: Less optimized for code compared to code-specific tokenizers
- **Vocabulary size**: Larger vocabs (64K+) require more embedding memory

## Citation

If you use these tokenizers in your research, please cite:

```bibtex
@misc{kl3m-multi-word-002,
  title={KL3M Multi-Word Tokenizers v2: Hierarchically Nested BPE for Legal Domain},
  author={ALEA Institute},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/alea-institute/kl3m-multi-word-002-32k}
}
```

## License

MIT License

## About ALEA Institute

The [ALEA Institute](https://aleainstitute.ai) develops open-source tools and datasets for legal AI, including the KL3M corpus and multi-word tokenizers.

## Related Resources

- **KL3M Dataset**: [aleainstitute.ai/work/kl3m](https://aleainstitute.ai/work/kl3m/)
- **bbpe Tokenizer Trainer**: [github.com/microsoft/blingfire](https://github.com/microsoft/blingfire)
- **v1 Tokenizers**: [alea-institute/kl3m-multi-word-001-*](https://huggingface.co/alea-institute/kl3m-multi-word-001-4k)

## Version History

- **v2 (002 series)**: Improved special token design, better legal optimization
- **v1 (001 series)**: Initial release with 10 special tokens

## Contact

For questions or issues, please contact: [contact information TBD]