KL3M Multi-Word Tokenizer v2 - 4K
This is the 4,096-token variant of the KL3M (Kelvin Legal Large Language Model) multi-word tokenizer family v2, optimized for legal-domain text and built with hierarchical vocabulary nesting.
Overview
The KL3M multi-word tokenizers v2 are an improved family of byte-pair encoding (BPE) tokenizers trained on ~44GB of legal domain text from the KL3M dataset (copyright-clean legal corpus from the ALEA Institute). These tokenizers:
- Capture multi-word phrases as single tokens (e.g., "United States", "set forth", "accordance with")
- Encode complex legal terms efficiently (e.g., "Licensee", "hereinafter", "indemnification" as single tokens)
- Use hierarchical vocabulary nesting where smaller vocabularies are proper subsets of larger ones
- Outperform GPT-4's tokenizer on legal text, with the 128K variant using 7.5% fewer tokens on legal documents
- Enable vocabulary expansion experiments and transfer learning across vocabulary sizes
What's New in v2
- Cleaner special token design: 7 special tokens (removed experimental symbols)
- Improved legal domain optimization: Better encoding of common legal terminology
- Superior compression: 5.32 chars/token on legal text with the 128K variant (vs. 4.92 for GPT-4)
- Smaller file sizes: More efficient tokenizer representation
Multi-Word Tokenization Examples
What is multi-word tokenization? Unlike standard tokenizers that split text into subword pieces, these tokenizers capture multiple words as single tokens. This is especially powerful for legal text with common multi-word phrases.
Example: "The United States Department of Transportation is responsible for"
4K vocabulary (16 tokens):
[The][ ][United][ ][States][ ][Department][ of][ ][Trans][port][ation][ is][ ][responsible][ for]
32K vocabulary (10 tokens):
[The][ United][ States][ Department][ of][ ][Transportation][ is][ responsible][ for]
128K vocabulary (8 tokens) - "United States" is ONE token!
[The][ United States][ Department][ of][ Transportation][ is][ responsible][ for]
Example: "In accordance with the terms and conditions set forth herein"
4K vocabulary (16 tokens):
[In][ ][accordance][ with][ the][ ][terms][ and][ con][ditions][ ][set][ for][th][ ][herein]
32K vocabulary (10 tokens) - "accordance with" is ONE token!
[In][ ][accordance with][ the][ terms][ and][ conditions][ set][ forth][ herein]
128K vocabulary (8 tokens) - "accordance with" and "set forth" are single tokens!
[In][ accordance with][ the][ terms][ and][ conditions][ set forth][ herein]
Example: "The Supreme Court of the United States held that"
4K vocabulary (17 tokens):
[The][ S][up][re][me][ C][our][t][ of][ the][ ][United][ ][States][ h][eld][ that]
32K vocabulary (9 tokens):
[The][ Sup][reme][ Court][ of the][ United][ States][ held][ that]
128K vocabulary (7 tokens) - "United States" is ONE token!
[The][ Supreme][ Court][ of the][ United States][ held][ that]
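These comparisons are straightforward to reproduce. Below is a minimal sketch using the `tokenizers` library, with repository names taken from the family table further down this card:

```python
# Compare tokenizations of the same sentence across three vocabulary sizes.
from tokenizers import Tokenizer

text = "The Supreme Court of the United States held that"

for size in ("4k", "32k", "128k"):
    tok = Tokenizer.from_pretrained(f"alea-institute/kl3m-multi-word-002-{size}")
    enc = tok.encode(text, add_special_tokens=False)
    print(f"{size:>4}: {len(enc.ids)} tokens -> {enc.tokens}")
```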
Why This Matters
- Shorter sequences = faster inference and training
- Semantic coherence = "United States" as one unit, not two separate words
- Better legal understanding = common legal phrases encoded atomically
- Efficient compression = 7.5% fewer tokens than GPT-4 on legal text (128K variant)
Performance Comparison
On a realistic 3,743-character legal document (Software License Agreement):
| Tokenizer | Vocab Size | Tokens | Chars/Token | vs GPT-4 |
|---|---|---|---|---|
| KL3M v2-128K | 131,072 | 704 | 5.32 | -7.5% |
| GPT-4o/5 | 200,019 | 757 | 4.94 | -0.5% |
| GPT-4 | 100,277 | 761 | 4.92 | baseline |
| KL3M v2-64K | 65,536 | 802 | 4.67 | +5.4% |
| GPT-2 | 50,257 | 858 | 4.36 | +12.7% |
| KL3M v2-32K | 32,768 | 943 | 3.97 | +23.9% |
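The table can be re-derived from raw text. A hedged sketch follows: the sample file name is hypothetical, and `tiktoken`'s `cl100k_base` encoding stands in for GPT-4:

```python
import tiktoken
from tokenizers import Tokenizer

# Any legal document can stand in for the license agreement used above.
document = open("license_agreement.txt").read()  # hypothetical sample file

kl3m = Tokenizer.from_pretrained("alea-institute/kl3m-multi-word-002-128k")
gpt4 = tiktoken.get_encoding("cl100k_base")

n_kl3m = len(kl3m.encode(document).ids)
n_gpt4 = len(gpt4.encode(document))
print(f"KL3M v2-128K: {len(document) / n_kl3m:.2f} chars/token")
print(f"GPT-4:        {len(document) / n_gpt4:.2f} chars/token")
print(f"Tokens vs GPT-4: {(n_kl3m - n_gpt4) / n_gpt4:+.1%}")
```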
Legal Terminology Efficiency
Common legal terms as single tokens (128K vocab):
| Term | KL3M v2-128K | GPT-4 | GPT-4o/5 |
|---|---|---|---|
| "Licensee" | 1 token | 2 tokens | 2 tokens |
| "hereinafter" | 1 token | 3 tokens | 3 tokens |
| "indemnification" | 1 token | 4 tokens | 3 tokens |
| "arbitration" | 1 token | 3 tokens | 3 tokens |
| "WHEREAS" | 1 token | 2 tokens | 2 tokens |
| "non-exclusive" | 1 token | 2 tokens | 2 tokens |
Tokenizer Family
This tokenizer is part of a hierarchically nested family. Token IDs in smaller vocabularies are identical across all larger vocabularies, enabling seamless vocabulary expansion:
| Vocabulary Size | HuggingFace Repository | File Size |
|---|---|---|
| 4,096 (4K) | alea-institute/kl3m-multi-word-002-4k | 248 KB |
| 8,192 (8K) | alea-institute/kl3m-multi-word-002-8k | 516 KB |
| 16,384 (16K) | alea-institute/kl3m-multi-word-002-16k | 1.1 MB |
| 32,768 (32K) | alea-institute/kl3m-multi-word-002-32k | 2.1 MB |
| 65,536 (64K) | alea-institute/kl3m-multi-word-002-64k | 4.4 MB |
| 131,072 (128K) | alea-institute/kl3m-multi-word-002-128k | 8.9 MB |
→ You are viewing: 4,096 (4K)
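The nesting claim can be verified directly: every (token, ID) pair in a smaller vocabulary should appear unchanged in a larger one. A minimal check:

```python
from tokenizers import Tokenizer

small = Tokenizer.from_pretrained("alea-institute/kl3m-multi-word-002-4k")
large = Tokenizer.from_pretrained("alea-institute/kl3m-multi-word-002-128k")

small_vocab = small.get_vocab()  # token -> id
large_vocab = large.get_vocab()

# Every 4K entry must exist in the 128K vocabulary with the same ID.
assert all(large_vocab.get(t) == i for t, i in small_vocab.items())
print("4K vocabulary is a proper subset of 128K with identical IDs")
```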
Key Features
1. Multi-Word Tokenization
Legal text contains frequent multi-word phrases that benefit from being treated as single tokens. The larger vocabularies capture increasingly sophisticated legal terminology:
4K vocabulary examples:
- Common legal particles: "herein", "thereof", "pursuant"
- Basic legal terms: "shall", "party", "agreement"
32K vocabulary examples:
- Complex terms: "Licensee", "Licensor", "jurisdiction", "arbitration"
- Multi-word phrases: "intellectual property", "force majeure"
128K vocabulary examples:
- Specialized terms: "indemnification", "confidentiality", "non-exclusive"
- Complete phrases: "representations and warranties", "WHEREAS"
2. Hierarchical Vocabulary Nesting
Token IDs 0-4,095 are identical across all tokenizer sizes. This enables:
- Vocabulary expansion during training: Start with a 4K vocab and expand to 32K mid-training (sketched after this list)
- Transfer learning: Initialize larger vocab models from smaller vocab checkpoints
- Controlled ablations: Compare vocab sizes while maintaining token alignment
- Model compression: Train with large vocab, deploy with smaller vocab
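A minimal sketch of the expansion mechanics: because the first 4,096 IDs are shared, a larger embedding table can inherit the trained rows of a smaller one, with only the new rows freshly initialized. The shapes and init scale here are illustrative assumptions:

```python
import torch

def expand_embeddings(small_weight: torch.Tensor, new_vocab_size: int) -> torch.Tensor:
    """Copy the rows for shared token IDs; randomly initialize the rest."""
    old_vocab_size, hidden = small_weight.shape
    expanded = torch.empty(new_vocab_size, hidden).normal_(mean=0.0, std=0.02)
    expanded[:old_vocab_size] = small_weight  # nested IDs keep their embeddings
    return expanded

# e.g. expanding a 4K-vocab checkpoint to 32K with 768-dim embeddings
small = torch.randn(4096, 768)
large = expand_embeddings(small, 32768)
assert torch.equal(large[:4096], small)
```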
3. Legal Domain Optimization
Trained on the KL3M corpus (44GB of legal text):
- Court opinions and case law
- Contracts and agreements
- Patents and IP documents
- Legal briefs and filings
- Statutory and regulatory text
This specialized training produces:
- Better compression on legal documents (up to 5.32 chars/token with the 128K variant vs. 4.92 for GPT-4)
- Semantic coherence for legal multi-word expressions
- Reduced sequence lengths leading to faster inference
4. Special Tokens (v2)
Seven essential special tokens for language model training:
| Token | ID | Purpose |
|---|---|---|
| `<\|start\|>` | 0 | Start of sequence |
| `<\|end\|>` | 1 | End of sequence |
| `<\|pad\|>` | 2 | Padding token |
| `<\|unk\|>` | 3 | Unknown token |
| `<\|cls\|>` | 4 | Classification (BERT) |
| `<\|sep\|>` | 5 | Separator (BERT) |
| `<\|mask\|>` | 6 | Mask token (MLM) |
Note: v2 removes the experimental symbol tokens from v1 for a cleaner design.
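The IDs can be confirmed with a quick lookup:

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_pretrained("alea-institute/kl3m-multi-word-002-4k")
for t in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]:
    print(t, "->", tok.token_to_id(t))  # expected: 0 through 6
```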
Usage
With Transformers
```python
from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-4k"
)

# Tokenize text
text = "The Licensor hereby grants to Licensee a non-exclusive license."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")

# Decode back to text
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")
```
With tokenizers Library
```python
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained(
    "alea-institute/kl3m-multi-word-002-4k"
)

# Encode
text = "The Licensor hereby grants to Licensee a non-exclusive license."
encoding = tokenizer.encode(text)
print(f"Tokens: {encoding.tokens}")
print(f"IDs: {encoding.ids}")
```
Training a Model
```python
from transformers import AutoConfig, AutoModelForMaskedLM, PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-4k"
)

# Create model config; len(tokenizer) counts added special tokens,
# which tokenizer.vocab_size can omit
config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    vocab_size=len(tokenizer),
)

# Initialize model with random weights from the config
model = AutoModelForMaskedLM.from_config(config)

# Train with the HuggingFace Trainer...
```
Technical Details
Training Corpus
- Source: KL3M (Kelvin Legal Large Language Model) dataset
- Size: ~44.2 GB (44,168,540,153 bytes)
- Lines: 1,018,355,750
- Words: 5,997,814,602
- Domain: Legal documents (copyright-clean)
Training Parameters
```bash
bbpe train \
  --max-entropy 7.0 \
  --preprocessor unicode-whitespace \
  --preprocessor-probability 0.1 \
  --vocab-size 131072 \
  --family-size 65536 --family-size 32768 \
  --family-size 16384 --family-size 8192 --family-size 4096
```
- Max entropy: 7.0 (balances multi-word phrases with common tokens)
- Preprocessing: Unicode whitespace normalization (10% probability)
- Byte fallback: Enabled (handles any input)
Vocabulary Structure
- Base vocabulary: 256 bytes + 49 extended chars = 305 base tokens
- Learned merges: vocab_size - 305 base tokens - 7 special tokens (for this 4K tokenizer: 4,096 - 305 - 7 = 3,784 merges)
- Nesting property: All tokens in size N exist in size 2N
Recommendations
By Use Case
Legal Document Processing (contracts, patents, briefs):
- Best: 128K or 64K vocab
- Rationale: Maximum compression of legal terminology
- Benefit: Shorter sequences, faster inference
Resource-Constrained Environments:
- Best: 16K or 32K vocab
- Rationale: Good balance of compression and model size
- Benefit: Smaller embedding layers, less memory
Experimentation / Research:
- Best: Multiple sizes with vocabulary expansion
- Rationale: Leverage nested structure for novel training strategies
- Benefit: Test curriculum learning, transfer learning
Model Size Guidelines
Choose vocab size based on your model parameter count:
| Model Size | Recommended Vocab | Embedding Parameters |
|---|---|---|
| <200M params | 4K-8K | 3-6M |
| 200M-500M params | 8K-16K | 6-13M |
| 500M-1B params | 16K-32K | 13-26M |
| 1B-3B params | 32K-64K | 26-52M |
| >3B params | 64K-128K | 52-104M |
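The arithmetic behind the last column is simply vocab_size x hidden_size (doubled if input and output embeddings are untied); the hidden sizes below are illustrative assumptions, not fixed by the tokenizer:

```python
# Embedding parameter counts for a few (vocab, hidden) combinations.
for vocab, hidden in [(4096, 768), (32768, 768), (131072, 768)]:
    print(f"{vocab:>6} tokens x {hidden} dims = {vocab * hidden / 1e6:.1f}M embedding params")
```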
Limitations
- Training domain: Optimized for legal English text; may underperform on other domains
- Multilingual: Trained primarily on English; limited non-English support
- Code: Less optimized for code compared to code-specific tokenizers
- Vocabulary size: Larger vocabs (64K+) require more embedding memory
Citation
If you use these tokenizers in your research, please cite:
```bibtex
@misc{kl3m-multi-word-002,
  title={KL3M Multi-Word Tokenizers v2: Hierarchically Nested BPE for Legal Domain},
  author={ALEA Institute},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/alea-institute/kl3m-multi-word-002-4k}
}
```
License
MIT License
About ALEA Institute
The ALEA Institute develops open-source tools and datasets for legal AI, including the KL3M corpus and multi-word tokenizers.
Related Resources
- KL3M Dataset: aleainstitute.ai/work/kl3m
- bbpe Tokenizer Trainer: github.com/microsoft/blingfire
- v1 Tokenizers: alea-institute/kl3m-multi-word-001-*
Version History
- v2 (002 series): Improved special token design, better legal optimization
- v1 (001 series): Initial release with 10 special tokens
Contact
For questions or issues, please contact: [contact information TBD]