KL3M Multi-Word Tokenizer v2 - 4K

This is the 4,096-token vocabulary variant of the KL3M (Kelvin Legal Large Language Model) multi-word tokenizer family v2, optimized for legal domain text with hierarchical vocabulary nesting.

Overview

The KL3M multi-word tokenizers v2 are an improved family of byte-pair encoding (BPE) tokenizers trained on ~44GB of legal domain text from the KL3M dataset (copyright-clean legal corpus from the ALEA Institute). These tokenizers:

  • Capture multi-word phrases as single tokens (e.g., "United States", "set forth", "accordance with")
  • Encode complex legal terms efficiently (e.g., "Licensee", "hereinafter", "indemnification" as single tokens)
  • Use hierarchical vocabulary nesting where smaller vocabularies are proper subsets of larger ones
  • Outperform the GPT-4 tokenizer on legal documents, with ~7.5% better compression
  • Enable vocabulary expansion experiments and transfer learning across vocabulary sizes

What's New in v2

  • Cleaner special token design: 7 special tokens (removed experimental symbols)
  • Improved legal domain optimization: Better encoding of common legal terminology
  • Superior compression: 5.32 chars/token on legal text (vs 4.92 for GPT-4)
  • Smaller file sizes: More efficient tokenizer representation

Multi-Word Tokenization Examples

What is multi-word tokenization? Unlike standard tokenizers that split text into subword pieces, these tokenizers capture multiple words as single tokens. This is especially powerful for legal text with common multi-word phrases.
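
The bracketed renderings in the examples below can be reproduced by decoding each token id individually; a minimal sketch (the sentence and the choice of family member are just for illustration):

from transformers import PreTrainedTokenizerFast

# Any member of the family works; this page documents the 4K variant.
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-4k"
)

text = "The United States Department of Transportation is responsible for"
ids = tokenizer.encode(text)

# Decode each id on its own to see the exact text span each token covers.
pieces = [tokenizer.decode([i]) for i in ids]
print(len(ids), "tokens")
print("".join(f"[{p}]" for p in pieces))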

Example: "The United States Department of Transportation is responsible for"

4K vocabulary (16 tokens):

[The][ ][United][ ][States][ ][Department][ of][ ][Trans][port][ation][ is][ ][responsible][ for]

32K vocabulary (10 tokens):

[The][ United][ States][ Department][ of][ ][Transportation][ is][ responsible][ for]

128K vocabulary (8 tokens) - "United States" is ONE token!

[The][ United States][ Department][ of][ Transportation][ is][ responsible][ for]

Example: "In accordance with the terms and conditions set forth herein"

4K vocabulary (16 tokens):

[In][ ][accordance][ with][ the][ ][terms][ and][ con][ditions][ ][set][ for][th][ ][herein]

32K vocabulary (10 tokens) - "accordance with" is ONE token!

[In][ ][accordance with][ the][ terms][ and][ conditions][ set][ forth][ herein]

128K vocabulary (8 tokens) - "accordance with" and "set forth" are single tokens!

[In][ accordance with][ the][ terms][ and][ conditions][ set forth][ herein]

Example: "The Supreme Court of the United States held that"

4K vocabulary (17 tokens):

[The][ S][up][re][me][ C][our][t][ of][ the][ ][United][ ][States][ h][eld][ that]

32K vocabulary (9 tokens):

[The][ Sup][reme][ Court][ of the][ United][ States][ held][ that]

128K vocabulary (7 tokens) - "United States" is ONE token!

[The][ Supreme][ Court][ of the][ United States][ held][ that]

Why This Matters

  1. Shorter sequences = faster inference and training
  2. Semantic coherence = "United States" as one unit, not two separate words
  3. Better legal understanding = common legal phrases encoded atomically
  4. Efficient compression = 7.5% fewer tokens than GPT-4 on legal text

Performance Comparison

On a realistic 3,743-character legal document (Software License Agreement):

Tokenizer | Vocab Size | Tokens | Chars/Token | vs GPT-4 (tokens)
KL3M v2-128K | 131,072 | 704 | 5.32 | -7.5%
GPT-4o/5 | 200,019 | 757 | 4.94 | -0.5%
GPT-4 | 100,277 | 761 | 4.92 | baseline
KL3M v2-64K | 65,536 | 802 | 4.67 | +5.4%
GPT-2 | 50,257 | 858 | 4.36 | +12.7%
KL3M v2-32K | 32,768 | 943 | 3.97 | +23.9%
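
These figures can be reproduced approximately with a snippet along these lines (a minimal sketch: it assumes the tiktoken package is installed, and license_agreement.txt is a placeholder for whatever legal document you measure):

import tiktoken
from transformers import PreTrainedTokenizerFast

# Placeholder path; substitute any legal document to measure.
sample = open("license_agreement.txt").read()

kl3m = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-128k"
)
gpt4 = tiktoken.get_encoding("cl100k_base")   # GPT-4 encoding
gpt4o = tiktoken.get_encoding("o200k_base")   # GPT-4o encoding

for name, n_tokens in [
    ("KL3M v2-128K", len(kl3m.encode(sample))),
    ("GPT-4", len(gpt4.encode(sample))),
    ("GPT-4o/5", len(gpt4o.encode(sample))),
]:
    print(f"{name}: {n_tokens} tokens, {len(sample) / n_tokens:.2f} chars/token")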

Legal Terminology Efficiency

Common legal terms as single tokens (128K vocab):

Term | KL3M v2-128K | GPT-4 | GPT-4o/5
"Licensee" | 1 token | 2 tokens | 2 tokens
"hereinafter" | 1 token | 3 tokens | 3 tokens
"indemnification" | 1 token | 4 tokens | 3 tokens
"arbitration" | 1 token | 3 tokens | 3 tokens
"WHEREAS" | 1 token | 2 tokens | 2 tokens
"non-exclusive" | 1 token | 2 tokens | 2 tokens

Tokenizer Family

This tokenizer is part of a hierarchically nested family. Token IDs in smaller vocabularies are identical across all larger vocabularies, enabling seamless vocabulary expansion:

Vocabulary Size | HuggingFace Repository | File Size
4,096 (4K) | alea-institute/kl3m-multi-word-002-4k | 248 KB
8,192 (8K) | alea-institute/kl3m-multi-word-002-8k | 516 KB
16,384 (16K) | alea-institute/kl3m-multi-word-002-16k | 1.1 MB
32,768 (32K) | alea-institute/kl3m-multi-word-002-32k | 2.1 MB
65,536 (64K) | alea-institute/kl3m-multi-word-002-64k | 4.4 MB
131,072 (128K) | alea-institute/kl3m-multi-word-002-128k | 8.9 MB

→ You are viewing: 4,096 (4K)
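
The nesting can be checked directly from the published files: every (token, id) pair in a smaller vocabulary should appear with the same id in a larger one. A minimal sketch using the tokenizers library:

from tokenizers import Tokenizer

small = Tokenizer.from_pretrained("alea-institute/kl3m-multi-word-002-4k")
large = Tokenizer.from_pretrained("alea-institute/kl3m-multi-word-002-32k")

small_vocab = small.get_vocab()
large_vocab = large.get_vocab()

# Every token in the 4K vocabulary should map to the same id in the 32K vocabulary.
mismatches = [t for t, i in small_vocab.items() if large_vocab.get(t) != i]
print(f"4K tokens: {len(small_vocab)}, ids that differ in 32K: {len(mismatches)}")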

Key Features

1. Multi-Word Tokenization

Legal text contains frequent multi-word phrases that benefit from being treated as single tokens. The larger vocabularies capture increasingly sophisticated legal terminology:

4K vocabulary examples:

  • Common legal particles: "herein", "thereof", "pursuant"
  • Basic legal terms: "shall", "party", "agreement"

32K vocabulary examples:

  • Complex terms: "Licensee", "Licensor", "jurisdiction", "arbitration"
  • Multi-word phrases: "intellectual property", "force majeure"

128K vocabulary examples:

  • Specialized terms: "indemnification", "confidentiality", "non-exclusive"
  • Complete phrases: "representations and warranties", "WHEREAS"

2. Hierarchical Vocabulary Nesting

Token IDs 0-4,095 are identical across all tokenizer sizes. This enables:

  • Vocabulary expansion during training: Start with 4K vocab, expand to 32K mid-training
  • Transfer learning: Initialize larger vocab models from smaller vocab checkpoints (see the sketch after this list)
  • Controlled ablations: Compare vocab sizes while maintaining token alignment
  • Model compression: Train with large vocab, deploy with smaller vocab
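
For example, transfer learning between family members amounts to copying the shared embedding rows; a minimal sketch, where the checkpoint paths and the masked-LM architecture are illustrative assumptions rather than released models:

import torch
from transformers import AutoConfig, AutoModelForMaskedLM

# Hypothetical checkpoint trained with the 4K vocabulary (placeholder path).
small_model = AutoModelForMaskedLM.from_pretrained("path/to/model-4k")

# Same architecture, but with the 32K vocabulary.
config = AutoConfig.from_pretrained("path/to/model-4k", vocab_size=32768)
large_model = AutoModelForMaskedLM.from_config(config)

small_emb = small_model.get_input_embeddings().weight
large_emb = large_model.get_input_embeddings().weight

# Token ids 0-4,095 are identical in both vocabularies, so their embedding
# rows transfer directly; the new rows keep their random initialization.
# (Output embeddings would be copied the same way if they are not tied.)
with torch.no_grad():
    large_emb[: small_emb.shape[0]] = small_emb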

3. Legal Domain Optimization

Trained on the KL3M corpus (44GB of legal text):

  • Court opinions and case law
  • Contracts and agreements
  • Patents and IP documents
  • Legal briefs and filings
  • Statutory and regulatory text

This specialized training produces:

  • Better compression on legal documents (5.32 chars/token vs 4.92 for GPT-4)
  • Semantic coherence for legal multi-word expressions
  • Reduced sequence lengths leading to faster inference

4. Special Tokens (v2)

Seven essential special tokens for language model training:

Token | ID | Purpose
<|start|> | 0 | Start of sequence
<|end|> | 1 | End of sequence
<|pad|> | 2 | Padding token
<|unk|> | 3 | Unknown token
<|cls|> | 4 | Classification (BERT)
<|sep|> | 5 | Separator (BERT)
<|mask|> | 6 | Mask token (MLM)

Note: v2 removes the experimental symbol tokens from v1 (⧈, ⚖, etc.) for a cleaner design.
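
The reserved ids can be confirmed once the tokenizer is loaded (a minimal sketch):

from transformers import PreTrainedTokenizerFast

tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-4k"
)

# Each special token occupies one of the first seven ids listed above.
for token in ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]:
    print(token, tokenizer.convert_tokens_to_ids(token))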

Usage

With Transformers

from transformers import PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-4k"
)

# Tokenize text
text = "The Licensor hereby grants to Licensee a non-exclusive license."
tokens = tokenizer.encode(text)
print(f"Tokens: {len(tokens)}")

# Decode
decoded = tokenizer.decode(tokens)
print(f"Decoded: {decoded}")

With tokenizers Library

from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained(
    "alea-institute/kl3m-multi-word-002-4k"
)

# Encode
text = "The Licensor hereby grants to Licensee a non-exclusive license."
encoding = tokenizer.encode(text)
print(f"Tokens: {encoding.tokens}")
print(f"IDs: {encoding.ids}")

Training a Model

from transformers import AutoConfig, AutoModelForMaskedLM, PreTrainedTokenizerFast

# Load tokenizer
tokenizer = PreTrainedTokenizerFast.from_pretrained(
    "alea-institute/kl3m-multi-word-002-4k"
)

# Create model config
config = AutoConfig.from_pretrained(
    "bert-base-uncased",
    vocab_size=tokenizer.vocab_size,
)

# Initialize model
model = AutoModelForMaskedLM.from_config(config)

# Train with HuggingFace Trainer...

Technical Details

Training Corpus

  • Source: KL3M (Kelvin Legal Large Language Model) dataset
  • Size: ~44.2 GB (44,168,540,153 bytes)
  • Lines: 1,018,355,750
  • Words: 5,997,814,602
  • Domain: Legal documents (copyright-clean)

Training Parameters

bbpe train \
  --max-entropy 7.0 \
  --preprocessor unicode-whitespace \
  --preprocessor-probability 0.1 \
  --vocab-size 131072 \
  --family-size 65536 --family-size 32768 \
  --family-size 16384 --family-size 8192 --family-size 4096

  • Max entropy: 7.0 (balances multi-word phrases with common tokens)
  • Preprocessing: Unicode whitespace normalization (10% probability)
  • Byte fallback: Enabled (handles any input)

Vocabulary Structure

  • Base vocabulary: 256 bytes + 49 extended chars = 305 base tokens
  • Learned merges: vocab_size - 305 - 7 (special tokens)
  • Nesting property: All tokens in size N exist in size 2N
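
The learned-merge count for each family member follows from that formula; a quick check:

# Learned merges = vocab_size - 305 base tokens - 7 special tokens
for vocab_size in [4096, 8192, 16384, 32768, 65536, 131072]:
    print(f"{vocab_size:>7}: {vocab_size - 305 - 7} learned merges")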

Recommendations

By Use Case

Legal Document Processing (contracts, patents, briefs):

  • Best: 128K or 64K vocab
  • Rationale: Maximum compression of legal terminology
  • Benefit: Shorter sequences, faster inference

Resource-Constrained Environments:

  • Best: 16K or 32K vocab
  • Rationale: Good balance of compression and model size
  • Benefit: Smaller embedding layers, less memory

Experimentation / Research:

  • Best: Multiple sizes with vocabulary expansion
  • Rationale: Leverage nested structure for novel training strategies
  • Benefit: Test curriculum learning, transfer learning

Model Size Guidelines

Choose vocab size based on your model parameter count:

Model Size | Recommended Vocab | Embedding Parameters
<200M params | 4K-8K | 3-6M
200M-500M params | 8K-16K | 6-13M
500M-1B params | 16K-32K | 13-26M
1B-3B params | 32K-64K | 26-52M
>3B params | 64K-128K | 52-104M
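
The embedding parameter column is simply vocab_size × hidden_size for the input embeddings; a hidden size of 800 (an illustrative assumption, not a released configuration) roughly reproduces the figures above:

# Embedding parameters = vocab_size * hidden_size (input embeddings only).
hidden_size = 800  # illustrative assumption
for vocab_size in [4096, 8192, 16384, 32768, 65536, 131072]:
    print(f"{vocab_size:>7} vocab: {vocab_size * hidden_size / 1e6:.1f}M embedding parameters")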

Limitations

  • Training domain: Optimized for legal English text; may underperform on other domains
  • Multilingual: Trained primarily on English; limited non-English support
  • Code: Less optimized for code compared to code-specific tokenizers
  • Vocabulary size: Larger vocabs (64K+) require more embedding memory

Citation

If you use these tokenizers in your research, please cite:

@misc{kl3m-multi-word-002,
  title={KL3M Multi-Word Tokenizers v2: Hierarchically Nested BPE for Legal Domain},
  author={ALEA Institute},
  year={2025},
  publisher={HuggingFace},
  url={https://huggingface.co/alea-institute/kl3m-multi-word-002-4k}
}

License

MIT License

About ALEA Institute

The ALEA Institute develops open-source tools and datasets for legal AI, including the KL3M corpus and multi-word tokenizers.

Related Resources

Version History

  • v2 (002 series): Improved special token design, better legal optimization
  • v1 (001 series): Initial release with 10 special tokens

Contact

For questions or issues, please contact: [contact information TBD]
