alea-institute committed
Commit bc5bc26 · verified · 1 Parent(s): b599178

Upload KL3M multi-word tokenizer v2 (128K) - Update README

Files changed (1):
  1. README.md +218 -116

README.md CHANGED
@@ -1,199 +1,301 @@
  ---
  library_name: transformers
- tags: []
  ---
-
- # Model Card for Model ID
-
- <!-- Provide a quick summary of what the model is/does. -->
-
- ## Model Details
-
- ### Model Description
-
- <!-- Provide a longer summary of what this model is. -->
-
- This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-
- - **Developed by:** [More Information Needed]
- - **Funded by [optional]:** [More Information Needed]
- - **Shared by [optional]:** [More Information Needed]
- - **Model type:** [More Information Needed]
- - **Language(s) (NLP):** [More Information Needed]
- - **License:** [More Information Needed]
- - **Finetuned from model [optional]:** [More Information Needed]
-
- ### Model Sources [optional]
-
- <!-- Provide the basic links for the model. -->
-
- - **Repository:** [More Information Needed]
- - **Paper [optional]:** [More Information Needed]
- - **Demo [optional]:** [More Information Needed]
-
- ## Uses
-
- <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-
- ### Direct Use
-
- <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-
- [More Information Needed]
-
- ### Downstream Use [optional]
-
- <!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-
- [More Information Needed]
-
- ### Out-of-Scope Use
-
- <!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-
- [More Information Needed]
-
- ## Bias, Risks, and Limitations
-
- <!-- This section is meant to convey both technical and sociotechnical limitations. -->
-
- [More Information Needed]
-
- ### Recommendations
-
- <!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-
- Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-
- ## How to Get Started with the Model
-
- Use the code below to get started with the model.
-
- [More Information Needed]
-
- ## Training Details
-
- ### Training Data
-
- <!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-
- [More Information Needed]
-
- ### Training Procedure
-
- <!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-
- #### Preprocessing [optional]
-
- [More Information Needed]
-
- #### Training Hyperparameters
-
- - **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-
- #### Speeds, Sizes, Times [optional]
-
- <!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-
- [More Information Needed]
-
- ## Evaluation
-
- <!-- This section describes the evaluation protocols and provides the results. -->
-
- ### Testing Data, Factors & Metrics
-
- #### Testing Data
-
- <!-- This should link to a Dataset Card if possible. -->
-
- [More Information Needed]
-
- #### Factors
-
- <!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-
- [More Information Needed]
-
- #### Metrics
-
- <!-- These are the evaluation metrics being used, ideally with a description of why. -->
-
- [More Information Needed]
-
- ### Results
-
- [More Information Needed]
-
- #### Summary
-
- ## Model Examination [optional]
-
- <!-- Relevant interpretability work for the model goes here -->
-
- [More Information Needed]
-
- ## Environmental Impact
-
- <!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-
- Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
-
- - **Hardware Type:** [More Information Needed]
- - **Hours used:** [More Information Needed]
- - **Cloud Provider:** [More Information Needed]
- - **Compute Region:** [More Information Needed]
- - **Carbon Emitted:** [More Information Needed]
-
- ## Technical Specifications [optional]
-
- ### Model Architecture and Objective
-
- [More Information Needed]
-
- ### Compute Infrastructure
-
- [More Information Needed]
-
- #### Hardware
-
- [More Information Needed]
-
- #### Software
-
- [More Information Needed]
-
- ## Citation [optional]
-
- <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
-
- **BibTeX:**
-
- [More Information Needed]
-
- **APA:**
-
- [More Information Needed]
-
- ## Glossary [optional]
-
- <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-
- [More Information Needed]
-
- ## More Information [optional]
-
- [More Information Needed]
-
- ## Model Card Authors [optional]
-
- [More Information Needed]
-
- ## Model Card Contact
-
- [More Information Needed]
  ---
+ language:
+ - en
+ license: mit
+ tags:
+ - tokenizer
+ - legal
+ - bpe
+ - byte-pair-encoding
+ - multi-word
+ - kl3m
+ - legal-domain
+ - hierarchical
+ pipeline_tag: fill-mask
  library_name: transformers
  ---

+ # KL3M Multi-Word Tokenizer v2 - 128K

+ This is the **131,072 token** variant of the KL3M (Kelvin Legal Large Language Model) multi-word tokenizer family v2, optimized for legal domain text with hierarchical vocabulary nesting.

+ ## Overview

+ The KL3M multi-word tokenizers v2 are an improved family of byte-pair encoding (BPE) tokenizers trained on ~44 GB of legal domain text from the [KL3M dataset](https://aleainstitute.ai/work/kl3m/) (a copyright-clean legal corpus from the ALEA Institute). These tokenizers:

+ - **Capture long legal terms and multi-word phrases** as single tokens (e.g., "indemnification", "force majeure", "hereinafter")
+ - **Use hierarchical vocabulary nesting**, where smaller vocabularies are proper subsets of larger ones
+ - **Outperform GPT-4 on legal text**, with 7.5% fewer tokens on a representative legal document
+ - **Enable vocabulary expansion experiments** and transfer learning across vocabulary sizes

+ ## What's New in v2

+ - **Cleaner special token design**: 7 special tokens (removed experimental symbols)
+ - **Improved legal domain optimization**: Better encoding of common legal terminology
+ - **Superior compression**: 5.32 chars/token on legal text (vs 4.92 for GPT-4)
+ - **Smaller file sizes**: More efficient tokenizer representation

+ ## Performance Comparison

+ On a realistic 3,743-character legal document (a Software License Agreement):

+ | Tokenizer | Vocab Size | Tokens | Chars/Token | Tokens vs GPT-4 |
+ |-----------|------------|--------|-------------|-----------------|
+ | **KL3M v2-128K** | 131,072 | **704** | **5.32** | **-7.5%** |
+ | GPT-4o/5 | 200,019 | 757 | 4.94 | -0.5% |
+ | GPT-4 | 100,277 | 761 | 4.92 | baseline |
+ | GPT-2 | 50,257 | 858 | 4.36 | +12.7% |
+ | KL3M v2-64K | 65,536 | 802 | 4.67 | +5.4% |
+ | KL3M v2-32K | 32,768 | 943 | 3.97 | +23.9% |

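+ A comparison along these lines can be reproduced with the short sketch below. It assumes `tiktoken` and `transformers` are installed; the file path is a placeholder for a document of your own, and token counts can shift slightly depending on whether special tokens are added.

+ ```python
+ import tiktoken
+ from transformers import PreTrainedTokenizerFast

+ # Placeholder path: substitute any legal document you want to benchmark.
+ text = open("license_agreement.txt", encoding="utf-8").read()

+ kl3m = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-128k")
+ gpt4 = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding

+ for name, n_tokens in [
+     ("KL3M v2-128K", len(kl3m.encode(text, add_special_tokens=False))),
+     ("GPT-4 (cl100k_base)", len(gpt4.encode(text))),
+ ]:
+     print(f"{name}: {n_tokens} tokens, {len(text) / n_tokens:.2f} chars/token")
+ ```
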
+ ### Legal Terminology Efficiency

+ Common legal terms as single tokens (128K vocab):

+ | Term | KL3M v2-128K | GPT-4 | GPT-4o/5 |
+ |------|--------------|-------|----------|
+ | "Licensee" | 1 token | 2 tokens | 2 tokens |
+ | "hereinafter" | 1 token | 3 tokens | 3 tokens |
+ | "indemnification" | 1 token | 4 tokens | 3 tokens |
+ | "arbitration" | 1 token | 3 tokens | 3 tokens |
+ | "WHEREAS" | 1 token | 2 tokens | 2 tokens |
+ | "non-exclusive" | 1 token | 2 tokens | 2 tokens |

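+ As a quick sanity check of the single-token claims, each term can be encoded in isolation; a sketch (tokenization of a term can differ slightly when it appears with leading whitespace inside a sentence):

+ ```python
+ from transformers import PreTrainedTokenizerFast

+ tok = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-128k")
+ for term in ["Licensee", "hereinafter", "indemnification", "arbitration", "WHEREAS", "non-exclusive"]:
+     ids = tok.encode(term, add_special_tokens=False)
+     print(f"{term!r}: {len(ids)} token(s)")
+ ```
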
+ ## Tokenizer Family

+ This tokenizer is part of a hierarchically nested family. Token IDs in smaller vocabularies are **identical** across all larger vocabularies, enabling seamless vocabulary expansion:

+ | Vocabulary Size | HuggingFace Repository | File Size |
+ |----------------|------------------------|-----------|
+ | 4,096 (4K) | [alea-institute/kl3m-multi-word-002-4k](https://huggingface.co/alea-institute/kl3m-multi-word-002-4k) | 248 KB |
+ | 8,192 (8K) | [alea-institute/kl3m-multi-word-002-8k](https://huggingface.co/alea-institute/kl3m-multi-word-002-8k) | 516 KB |
+ | 16,384 (16K) | [alea-institute/kl3m-multi-word-002-16k](https://huggingface.co/alea-institute/kl3m-multi-word-002-16k) | 1.1 MB |
+ | 32,768 (32K) | [alea-institute/kl3m-multi-word-002-32k](https://huggingface.co/alea-institute/kl3m-multi-word-002-32k) | 2.1 MB |
+ | 65,536 (64K) | [alea-institute/kl3m-multi-word-002-64k](https://huggingface.co/alea-institute/kl3m-multi-word-002-64k) | 4.4 MB |
+ | 131,072 (128K) | [alea-institute/kl3m-multi-word-002-128k](https://huggingface.co/alea-institute/kl3m-multi-word-002-128k) | 8.9 MB |

+ **→ You are viewing: 131,072 (128K)**

+ ## Key Features

+ ### 1. Multi-Word Tokenization

+ Legal text contains frequent multi-word phrases that benefit from being treated as single tokens. The larger vocabularies capture increasingly sophisticated legal terminology:

+ **4K vocabulary examples:**
+ - Common legal particles: "herein", "thereof", "pursuant"
+ - Basic legal terms: "shall", "party", "agreement"

+ **32K vocabulary examples:**
+ - Complex terms: "Licensee", "Licensor", "jurisdiction", "arbitration"
+ - Multi-word phrases: "intellectual property", "force majeure"

+ **128K vocabulary examples:**
+ - Specialized terms: "indemnification", "confidentiality", "non-exclusive"
+ - Complete phrases: "representations and warranties", "WHEREAS"

+ ### 2. Hierarchical Vocabulary Nesting

+ Token IDs 0-4,095 are **identical** across all tokenizer sizes (see the verification sketch after this list). This enables:

+ - **Vocabulary expansion during training**: Start with a 4K vocab, expand to 32K mid-training
+ - **Transfer learning**: Initialize larger-vocab models from smaller-vocab checkpoints
+ - **Controlled ablations**: Compare vocab sizes while maintaining token alignment
+ - **Model compression**: Train with a large vocab, deploy with a smaller vocab

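+ The nesting property can be checked directly by comparing vocabularies; a minimal sketch using the 32K and 128K variants listed above:

+ ```python
+ from transformers import PreTrainedTokenizerFast

+ small = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-32k")
+ large = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-128k")

+ # Every (token, id) pair in the smaller vocabulary should appear unchanged in the larger one.
+ small_vocab, large_vocab = small.get_vocab(), large.get_vocab()
+ mismatches = [t for t, i in small_vocab.items() if large_vocab.get(t) != i]
+ print("nested:", len(mismatches) == 0)
+ ```
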
+ ### 3. Legal Domain Optimization

+ Trained on the KL3M corpus (44GB of legal text):

+ - Court opinions and case law
+ - Contracts and agreements
+ - Patents and IP documents
+ - Legal briefs and filings
+ - Statutory and regulatory text

+ This specialized training produces:

+ - **Better compression** on legal documents (5.32 chars/token vs 4.92 for GPT-4)
+ - **Semantic coherence** for legal multi-word expressions
+ - **Reduced sequence lengths** leading to faster inference

+ ### 4. Special Tokens (v2)

+ Seven essential special tokens for language model training:

+ | Token | ID | Purpose |
+ |-------|----|---------|
+ | `<\|start\|>` | 0 | Start of sequence |
+ | `<\|end\|>` | 1 | End of sequence |
+ | `<\|pad\|>` | 2 | Padding token |
+ | `<\|unk\|>` | 3 | Unknown token |
+ | `<\|cls\|>` | 4 | Classification (BERT) |
+ | `<\|sep\|>` | 5 | Separator (BERT) |
+ | `<\|mask\|>` | 6 | Mask token (MLM) |

+ *Note: v2 removes experimental symbols (⧈, ⚖, ⏵) from v1 for cleaner design.*

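+ If you need these IDs programmatically, the mapping in the table can be recovered from the tokenizer itself; a sketch that assumes the special tokens are registered under these exact strings in the released tokenizer config:

+ ```python
+ from transformers import PreTrainedTokenizerFast

+ tok = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-128k")
+ specials = ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]
+ print(tok.convert_tokens_to_ids(specials))  # expected to match the IDs in the table above
+ ```
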
+ ## Usage

+ ### With Transformers

+ ```python
+ from transformers import PreTrainedTokenizerFast

+ # Load tokenizer
+ tokenizer = PreTrainedTokenizerFast.from_pretrained(
+     "alea-institute/kl3m-multi-word-002-128k"
+ )

+ # Tokenize text
+ text = "The Licensor hereby grants to Licensee a non-exclusive license."
+ tokens = tokenizer.encode(text)
+ print(f"Tokens: {len(tokens)}")

+ # Decode
+ decoded = tokenizer.decode(tokens)
+ print(f"Decoded: {decoded}")
+ ```

+ ### With the tokenizers Library

+ ```python
+ from tokenizers import Tokenizer

+ # Load tokenizer
+ tokenizer = Tokenizer.from_pretrained(
+     "alea-institute/kl3m-multi-word-002-128k"
+ )

+ # Encode a sample sentence
+ text = "The Licensor hereby grants to Licensee a non-exclusive license."
+ encoding = tokenizer.encode(text)
+ print(f"Tokens: {encoding.tokens}")
+ print(f"IDs: {encoding.ids}")
+ ```

+ ### Training a Model

+ ```python
+ from transformers import AutoConfig, AutoModelForMaskedLM, PreTrainedTokenizerFast

+ # Load tokenizer
+ tokenizer = PreTrainedTokenizerFast.from_pretrained(
+     "alea-institute/kl3m-multi-word-002-128k"
+ )

+ # Create model config
+ config = AutoConfig.from_pretrained(
+     "bert-base-uncased",
+     vocab_size=tokenizer.vocab_size,
+ )

+ # Initialize model
+ model = AutoModelForMaskedLM.from_config(config)

+ # Train with HuggingFace Trainer...
+ ```

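+ A masked-language-modeling batch can then be built with the standard collator. This is a minimal sketch; it assumes the `<|mask|>` and `<|pad|>` tokens from the table above are, or are set as, the tokenizer's mask and pad tokens, and it omits the dataset and Trainer setup:

+ ```python
+ from transformers import DataCollatorForLanguageModeling, PreTrainedTokenizerFast

+ tokenizer = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-128k")

+ # Set the special tokens explicitly if the loaded config does not already declare them.
+ tokenizer.mask_token = tokenizer.mask_token or "<|mask|>"
+ tokenizer.pad_token = tokenizer.pad_token or "<|pad|>"

+ collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
+ batch = collator([tokenizer(t) for t in [
+     "WHEREAS, the parties agree as follows:",
+     "Licensee shall indemnify Licensor.",
+ ]])
+ print(batch["input_ids"].shape)
+ ```
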
+ ## Technical Details

+ ### Training Corpus

+ - **Source**: KL3M (Kelvin Legal Large Language Model) dataset
+ - **Size**: ~44.2 GB (44,168,540,153 bytes)
+ - **Lines**: 1,018,355,750
+ - **Words**: 5,997,814,602
+ - **Domain**: Legal documents (copyright-clean)

+ ### Training Parameters

+ ```bash
+ bbpe train \
+     --max-entropy 7.0 \
+     --preprocessor unicode-whitespace \
+     --preprocessor-probability 0.1 \
+     --vocab-size 131072 \
+     --family-size 65536 --family-size 32768 \
+     --family-size 16384 --family-size 8192 --family-size 4096
+ ```

+ - **Max entropy**: 7.0 (balances multi-word phrases with common tokens)
+ - **Preprocessing**: Unicode whitespace normalization (applied with 10% probability)
+ - **Byte fallback**: Enabled, so any input can be encoded (see the sketch below)

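+ As a quick illustration of the byte-fallback behavior (a sketch; whether the round-trip is byte-exact depends on the decoder settings shipped with the released tokenizer):

+ ```python
+ from transformers import PreTrainedTokenizerFast

+ tok = PreTrainedTokenizerFast.from_pretrained("alea-institute/kl3m-multi-word-002-128k")
+ text = "Cláusula 7.2 (force majeure) ⚖"
+ ids = tok.encode(text, add_special_tokens=False)
+ print("unknown tokens used:", tok.unk_token_id in ids if tok.unk_token_id is not None else False)
+ print("round-trip:", tok.decode(ids) == text)
+ ```
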
+ ### Vocabulary Structure

+ - **Base vocabulary**: 256 bytes + 49 extended characters = 305 base tokens
+ - **Learned merges**: vocab_size - 305 base tokens - 7 special tokens
+ - **Nesting property**: all tokens in a size-N vocabulary exist, with the same IDs, in the size-2N vocabulary

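+ For this 128K variant, that works out to 131,072 - 305 - 7 = 130,760 learned merges.
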
+ ## Recommendations

+ ### By Use Case

+ **Legal Document Processing** (contracts, patents, briefs):
+ - **Best**: 128K or 64K vocab
+ - **Rationale**: Maximum compression of legal terminology
+ - **Benefit**: Shorter sequences, faster inference

+ **Resource-Constrained Environments**:
+ - **Best**: 16K or 32K vocab
+ - **Rationale**: Good balance of compression and model size
+ - **Benefit**: Smaller embedding layers, less memory

+ **Experimentation / Research**:
+ - **Best**: Multiple sizes with vocabulary expansion
+ - **Rationale**: Leverage the nested structure for novel training strategies
+ - **Benefit**: Test curriculum learning and transfer learning (see the sketch below)

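+ One way to realize such vocabulary expansion is to copy the shared embedding rows from a smaller-vocab checkpoint into a larger-vocab model before continuing training. A rough sketch; the checkpoint paths are hypothetical placeholders, and a tied or separate output head would need the same treatment:

+ ```python
+ import torch
+ from transformers import AutoModelForMaskedLM

+ # Hypothetical paths: a trained 32K-vocab checkpoint and a freshly initialized 128K-vocab model.
+ small_model = AutoModelForMaskedLM.from_pretrained("path/to/kl3m-32k-checkpoint")
+ large_model = AutoModelForMaskedLM.from_pretrained("path/to/kl3m-128k-init")

+ small_emb = small_model.get_input_embeddings().weight
+ large_emb = large_model.get_input_embeddings().weight

+ with torch.no_grad():
+     # Because of nesting, rows 0..len(small_vocab)-1 refer to the same tokens in both models.
+     large_emb[: small_emb.size(0)] = small_emb
+ ```
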
+ ### Model Size Guidelines

+ Choose vocab size based on your model parameter count:

+ | Model Size | Recommended Vocab | Embedding Parameters |
+ |------------|-------------------|----------------------|
+ | <200M params | 4K-8K | 3-6M |
+ | 200M-500M params | 8K-16K | 6-13M |
+ | 500M-1B params | 16K-32K | 13-26M |
+ | 1B-3B params | 32K-64K | 26-52M |
+ | >3B params | 64K-128K | 52-104M |

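+ These embedding counts follow from embedding parameters ≈ vocab_size × hidden_size; for instance, a 768-dimensional embedding over this 128K vocabulary is 131,072 × 768 ≈ 100.7M parameters, which falls within the 52-104M band above (the exact figure depends on the hidden size assumed).
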
+ ## Limitations

+ - **Training domain**: Optimized for legal English text; may underperform on other domains
+ - **Multilingual**: Trained primarily on English; limited non-English support
+ - **Code**: Less optimized for code compared to code-specific tokenizers
+ - **Vocabulary size**: Larger vocabs (64K+) require more embedding memory

+ ## Citation

+ If you use these tokenizers in your research, please cite:

+ ```bibtex
+ @misc{kl3m-multi-word-002,
+   title={KL3M Multi-Word Tokenizers v2: Hierarchically Nested BPE for Legal Domain},
+   author={ALEA Institute},
+   year={2025},
+   publisher={HuggingFace},
+   url={https://huggingface.co/alea-institute/kl3m-multi-word-002-128k}
+ }
+ ```

+ ## License

+ MIT License

+ ## About ALEA Institute

+ The [ALEA Institute](https://aleainstitute.ai) develops open-source tools and datasets for legal AI, including the KL3M corpus and multi-word tokenizers.

+ ## Related Resources

+ - **KL3M Dataset**: [aleainstitute.ai/work/kl3m](https://aleainstitute.ai/work/kl3m/)
+ - **bbpe Tokenizer Trainer**: [github.com/microsoft/blingfire](https://github.com/microsoft/blingfire)
+ - **v1 Tokenizers**: [alea-institute/kl3m-multi-word-001-*](https://huggingface.co/alea-institute/kl3m-multi-word-001-4k)

+ ## Version History

+ - **v2 (002 series)**: Improved special token design, better legal optimization
+ - **v1 (001 series)**: Initial release with 10 special tokens

+ ## Contact

+ For questions or issues, please contact: [contact information TBD]