mjbommar committed
Commit 90f419e · verified · 1 Parent(s): 86ee340

Upload binary-tokenizer-001-8k tokenizer

Files changed (4)
  1. .gitattributes +2 -35
  2. README.md +217 -0
  3. analysis_results.json +140 -0
  4. tokenizer.json +0 -0
.gitattributes CHANGED
@@ -1,35 +1,2 @@
- *.7z filter=lfs diff=lfs merge=lfs -text
- *.arrow filter=lfs diff=lfs merge=lfs -text
- *.bin filter=lfs diff=lfs merge=lfs -text
- *.bz2 filter=lfs diff=lfs merge=lfs -text
- *.ckpt filter=lfs diff=lfs merge=lfs -text
- *.ftz filter=lfs diff=lfs merge=lfs -text
- *.gz filter=lfs diff=lfs merge=lfs -text
- *.h5 filter=lfs diff=lfs merge=lfs -text
- *.joblib filter=lfs diff=lfs merge=lfs -text
- *.lfs.* filter=lfs diff=lfs merge=lfs -text
- *.mlmodel filter=lfs diff=lfs merge=lfs -text
- *.model filter=lfs diff=lfs merge=lfs -text
- *.msgpack filter=lfs diff=lfs merge=lfs -text
- *.npy filter=lfs diff=lfs merge=lfs -text
- *.npz filter=lfs diff=lfs merge=lfs -text
- *.onnx filter=lfs diff=lfs merge=lfs -text
- *.ot filter=lfs diff=lfs merge=lfs -text
- *.parquet filter=lfs diff=lfs merge=lfs -text
- *.pb filter=lfs diff=lfs merge=lfs -text
- *.pickle filter=lfs diff=lfs merge=lfs -text
- *.pkl filter=lfs diff=lfs merge=lfs -text
- *.pt filter=lfs diff=lfs merge=lfs -text
- *.pth filter=lfs diff=lfs merge=lfs -text
- *.rar filter=lfs diff=lfs merge=lfs -text
- *.safetensors filter=lfs diff=lfs merge=lfs -text
- saved_model/**/* filter=lfs diff=lfs merge=lfs -text
- *.tar.* filter=lfs diff=lfs merge=lfs -text
- *.tar filter=lfs diff=lfs merge=lfs -text
- *.tflite filter=lfs diff=lfs merge=lfs -text
- *.tgz filter=lfs diff=lfs merge=lfs -text
- *.wasm filter=lfs diff=lfs merge=lfs -text
- *.xz filter=lfs diff=lfs merge=lfs -text
- *.zip filter=lfs diff=lfs merge=lfs -text
- *.zst filter=lfs diff=lfs merge=lfs -text
- *tfevents* filter=lfs diff=lfs merge=lfs -text
+ *.json filter=lfs diff=lfs merge=lfs -text
+ *.txt filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,217 @@
+ ---
+ language:
+ - code
+ tags:
+ - tokenizer
+ - binary-analysis
+ - binary-tokenization
+ - bpe
+ - byte-pair-encoding
+ - reverse-engineering
+ - malware-analysis
+ - cybersecurity
+ - executable-analysis
+ license: mit
+ pipeline_tag: feature-extraction
+ library_name: tokenizers
+ ---
+
+ # binary-tokenizer-001-8k
+
+ A cross-platform BPE tokenizer for binary executables and machine code. Trained on 13 GB of diverse binaries spanning Linux, Windows, macOS, and Android platforms.
+
+ **🔗 Model**: [`mjbommar/binary-tokenizer-001-8k`](https://huggingface.co/mjbommar/binary-tokenizer-001-8k)
+ **📊 Dataset**: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
+ **📄 Paper**: *Binary BPE: Cross-Platform Tokenization for Binary Analysis* (arXiv preprint coming soon)
+
+ ## Overview
+
+ - **Vocabulary Size**: 8,192 tokens (2^13)
+ - **Token Composition**: 256 base bytes + 7,929 learned merges + 7 special tokens
+ - **Average Token Length**: 3.312 bytes
+ - **3-byte Instructions**: 21.7% of vocabulary (1,774 tokens)
+ - **Compression Ratio**: ~2.2 bytes/token on typical binaries
+
+ ---
+
+ ## Training Configuration
+
+ **Training Corpus**:
+ - Source: [`mjbommar/binary-30k-tokenized`](https://huggingface.co/datasets/mjbommar/binary-30k-tokenized)
+ - Size: ~13 GB
+ - Files: 30,738 binary files
+ - Platforms: Linux (ELF), Windows (PE), macOS (Mach-O), Android (APK)
+ - Architectures: x86-64, x86, ARM64, ARM, MIPS, RISC-V
+
+ **Training Parameters**:
+ - Vocabulary size: 8,192 (including 7 special tokens)
+ - Min frequency: 10
+ - Chunk size: 8,192 bytes
+ - Allowed lengths: DEFAULT (1-16 bytes)
+ - Training duration: ~2-3 hours
+
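+ The tokenizer in this repository was produced by `train_tokenizers.sh` (not included here). As a rough illustration only, a comparable setup with the Hugging Face `tokenizers` library might look like the sketch below; the exact trainer used by `bbpe`, the corpus paths, and the `binary_chunks` helper are assumptions, so the learned merges would not match this release exactly.
+
+ ```python
+ from tokenizers import Tokenizer, models, trainers
+
+ SPECIAL_TOKENS = ["<|start|>", "<|end|>", "<|pad|>", "<|unk|>", "<|cls|>", "<|sep|>", "<|mask|>"]
+
+ # Bytes are represented as latin-1 characters so every value 0-255 survives round-tripping.
+ tokenizer = Tokenizer(models.BPE())
+ trainer = trainers.BpeTrainer(
+     vocab_size=8192,
+     min_frequency=10,
+     special_tokens=SPECIAL_TOKENS,
+     initial_alphabet=[chr(i) for i in range(256)],  # seed all 256 base bytes
+     max_token_length=16,  # mirrors the 1-16 byte allowed lengths (needs a recent tokenizers release)
+ )
+
+ def binary_chunks(paths, chunk_size=8192):
+     """Yield fixed-size chunks of each binary, decoded as latin-1."""
+     for path in paths:
+         with open(path, "rb") as f:
+             while chunk := f.read(chunk_size):
+                 yield chunk.decode("latin-1")
+
+ # paths = [...]  # ~30k binaries from the training corpus
+ # tokenizer.train_from_iterator(binary_chunks(paths), trainer=trainer)
+ # tokenizer.save("tokenizer.json")
+ ```
+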
+ ---
+
+ ## Vocabulary Statistics
+
+ **Composition**:
+ - Base bytes (0-255): 256 tokens
+ - Learned merges: 7,929 tokens
+ - Special tokens: 7 tokens (`<|start|>`, `<|end|>`, `<|pad|>`, `<|unk|>`, `<|cls|>`, `<|sep|>`, `<|mask|>`)
+ - **Total**: 8,192 tokens
+
+ **Quality Metrics**:
+ - All tokens reachable: ✓ Yes
+ - Valid merges: 7,929 / 7,929
+ - Power-of-2 size: ✓ Yes (2^13)
+
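+ A quick way to spot-check these counts against the released file; the filter on `<|...|>` names assumes only the seven specials use that pattern:
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-8k")
+
+ print(tokenizer.get_vocab_size())        # 8192, including special tokens
+ print(tokenizer.token_to_id("<|pad|>"))  # ID of the padding token
+ print(tokenizer.id_to_token(0))          # token stored at ID 0
+
+ vocab = tokenizer.get_vocab()
+ specials = [t for t in vocab if t.startswith("<|") and t.endswith("|>")]
+ print(len(specials), len(vocab) - len(specials))  # 7 special, 8,185 base + merged tokens
+ ```
+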
+ ---
+
+ ## Token Length Distribution
+
+ | Length | Count | Percentage | Description |
+ |--------|-------|------------|-------------|
+ | 1 byte | 256 | 3.1% | Base bytes |
+ | 2 bytes | 3,659 | 44.7% | Byte pairs |
+ | 3 bytes | 1,774 | 21.7% | Complete x86-64 instructions |
+ | 4 bytes | 1,464 | 17.9% | Instructions with operands |
+ | 5 bytes | 290 | 3.5% | Complex patterns |
+ | 6 bytes | 245 | 3.0% | Complex patterns |
+ | 7 bytes | 96 | 1.2% | Complex patterns |
+ | 8 bytes | 146 | 1.8% | Complex patterns |
+ | 9+ bytes | 262 | 3.2% | Long patterns |
+
+ **Average Token Length**: 3.312 bytes
+
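+ The distribution can be recomputed from the vocabulary itself: token strings are latin-1, so their encoded length equals the token's byte length. A minimal sketch, again skipping the `<|...|>` specials:
+
+ ```python
+ from collections import Counter
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-8k")
+
+ lengths = Counter()
+ for token in tokenizer.get_vocab():
+     if token.startswith("<|") and token.endswith("|>"):
+         continue  # skip the 7 special tokens
+     lengths[len(token.encode("latin-1"))] += 1
+
+ total = sum(lengths.values())                         # 8,185 base + merged tokens
+ avg = sum(n * c for n, c in lengths.items()) / total  # ~3.312 bytes
+ print(f"avg={avg:.3f}", dict(sorted(lengths.items())))
+ ```
+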
+ ---
+
+ ## Byte Content Analysis
+
+ **Content Categories**:
+ - Contains NULL byte (0x00): 2,145 tokens (26.2%)
+ - ASCII printable (0x20-0x7E): 1,831 tokens (22.4%)
+ - All ASCII (<0x80): 3,810 tokens (46.5%)
+ - High bytes (≥0x80): 4,375 tokens (53.5%)
+
+ **Most Common Bytes in Tokens**:
+ - `0x00` (NULL): 5,087 occurrences - Padding and alignment
+ - `0xFF`: 829 occurrences - Sentinel values
+ - `0x48` (REX.W): 684 occurrences - x86-64 REX prefix
+ - `0x8B` (MOV): 505 occurrences - x86-64 MOV opcode
+ - `0xCC` (INT3): 347 occurrences - Debug breakpoint padding
+
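+ The per-byte counts above come from tallying every byte value across the learned token strings (raw numbers are in `analysis_results.json` under `byte_content.byte_distribution`). A minimal sketch:
+
+ ```python
+ from collections import Counter
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-8k")
+
+ byte_counts = Counter()
+ for token in tokenizer.get_vocab():
+     if token.startswith("<|") and token.endswith("|>"):
+         continue  # skip special tokens
+     byte_counts.update(token.encode("latin-1"))  # bytes iterate as ints 0-255
+
+ for value, count in byte_counts.most_common(5):
+     print(f"0x{value:02X}: {count}")  # expect 0x00, 0xFF, 0x48 (REX.W), ... as above
+ ```
+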
+ ---
+
+ ## Sequence Coverage
+
+ **N-byte Sequence Diversity**:
+ | Length | Learned Tokens | Possible Sequences | Coverage |
+ |--------|----------------|-------------------|----------|
+ | 1-byte | 256 | 256 | 100.00% |
+ | 2-byte | 3,659 | 65,536 | 5.58% |
+ | 3-byte | 1,774 | 16,777,216 | 0.011% |
+ | 4-byte | 1,464 | 4,294,967,296 | 0.000034% |
+
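+ Coverage is the number of learned n-byte tokens divided by the 256^n possible n-byte sequences, expressed as a percentage (for example, 3,659 / 65,536 ≈ 5.58% for 2-byte tokens). A short check using the counts from the table:
+
+ ```python
+ learned = {1: 256, 2: 3659, 3: 1774, 4: 1464}  # counts from the table above
+ for n, count in learned.items():
+     print(f"{n}-byte coverage: {100 * count / 256**n:.6f}%")
+ ```
+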
+ ---
+
+ ## Files
+
+ - `tokenizer.json` - Trained tokenizer model (592 KB)
+ - `analysis_results.json` - Detailed analysis statistics
+ - `training.log` - Training output log (if available)
+ - `training_stats.txt` - Training summary (if available)
+
+ ---
+
+ ## Usage
+
+ **Load from HuggingFace Hub**:
+ ```python
+ from tokenizers import Tokenizer
+
+ # Load directly from HuggingFace
+ tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-8k")
+ ```
+
+ **Command-line usage (bbpe CLI) with a local copy**:
+ ```bash
+ # With bbpe CLI
+ bbpe encode --tokenizer tokenizer.json /path/to/binary
+ bbpe info tokenizer.json
+ ```
+
+ **Complete Python Example**:
+ ```python
+ from tokenizers import Tokenizer
+
+ # Load from HuggingFace or local file
+ tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-8k")
+ # OR: tokenizer = Tokenizer.from_file("tokenizer.json")
+
+ # Read binary file and decode as latin-1 (preserves all byte values 0-255)
+ with open("/usr/bin/ls", "rb") as f:
+     data = f.read()
+ data_str = data.decode("latin-1")
+
+ # Encode the binary data
+ encoding = tokenizer.encode(data_str)
+ print(f"File size: {len(data)} bytes")
+ print(f"Total tokens: {len(encoding.ids)}")
+ print(f"Compression: {len(data) / len(encoding.ids):.3f} bytes/token")
+
+ # First 10 tokens
+ for i, (token_id, token) in enumerate(zip(encoding.ids[:10], encoding.tokens[:10])):
+     token_bytes = token.encode("latin-1")
+     print(f"  Token {i}: ID={token_id:5d} hex={token_bytes.hex():20s} ({len(token_bytes)} bytes)")
+
+ # Decode tokens back to bytes
+ decoded_str = tokenizer.decode(encoding.ids)
+ decoded_bytes = decoded_str.encode("latin-1")
+ assert decoded_bytes == data  # Perfect reconstruction
+ ```
+
+ **Example output for `/usr/bin/ls` (142,312 bytes)**:
+ ```
+ File size: 142312 bytes
+ Total tokens: 65458
+ Compression: 2.174 bytes/token
+
+ First 10 tokens:
+   Token 0: ID= 127 hex=7f (1 bytes)
+   Token 1: ID= 7652 hex=454c (2 bytes)
+   Token 2: ID= 70 hex=46 (1 bytes)
+   Token 3: ID= 2 hex=02 (1 bytes)
+   Token 4: ID= 772 hex=0101 (2 bytes)
+   Token 5: ID= 1332 hex=000000000000000000 (9 bytes)
+   Token 6: ID= 531 hex=0300 (2 bytes)
+   Token 7: ID= 2802 hex=3e00 (2 bytes)
+   Token 8: ID= 556 hex=01000000 (4 bytes)
+   Token 9: ID= 48 hex=30 (1 bytes)
+
+ Decoded: 7f454c4602010100000000000000000003003e000100000030...
+ (ELF header: 7f 45 4c 46 = ELF magic bytes)
+ ```
+
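+ For downstream model training you will usually window and frame the token IDs yourself; do not assume `encode` inserts the special tokens automatically. A minimal, hypothetical sketch that frames fixed-size windows with `<|start|>`, `<|end|>`, and `<|pad|>` (the 512-token context length is an arbitrary choice for illustration):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained("mjbommar/binary-tokenizer-001-8k")
+ start_id = tokenizer.token_to_id("<|start|>")
+ end_id = tokenizer.token_to_id("<|end|>")
+ pad_id = tokenizer.token_to_id("<|pad|>")
+
+ with open("/usr/bin/ls", "rb") as f:
+     ids = tokenizer.encode(f.read().decode("latin-1")).ids
+
+ # Split into fixed-length windows, framing and padding each one
+ window = 510  # leaves room for <|start|> and <|end|> in a 512-token context
+ batches = []
+ for i in range(0, len(ids), window):
+     chunk = [start_id] + ids[i:i + window] + [end_id]
+     chunk += [pad_id] * (window + 2 - len(chunk))
+     batches.append(chunk)
+
+ print(len(batches), len(batches[0]))  # number of windows, each 512 IDs long
+ ```
+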
+ ---
+
+ ## Citation
+
+ If you use this tokenizer in your research, please cite:
+
+ ```bibtex
+ @article{bommarito2025binarybpe,
+   title={Binary BPE: Cross-Platform Tokenization for Binary Analysis},
+   author={Bommarito II, Michael J.},
+   journal={arXiv preprint},
+   year={2025},
+   note={Preprint coming soon}
+ }
+ ```
+
+ **Author**: Michael J. Bommarito II ([[email protected]](mailto:[email protected]))
+
+ ---
+
+ **Generated**: November 12, 2025
+ **Training Script**: `train_tokenizers.sh`
+ **Analysis Script**: `analyze_tokenizer.py`
analysis_results.json ADDED
@@ -0,0 +1,140 @@
+ {
+   "vocab_size": {
+     "total": 8185,
+     "total_with_special": 8192,
+     "base": 256,
+     "merges": 7929,
+     "special": 7,
+     "is_power_of_2": true,
+     "power": 13,
+     "matches_expected": true
+   },
+   "reachability": {
+     "valid_merges": 7929,
+     "invalid_merges": 0,
+     "reachable": 8185,
+     "unreachable": 0,
+     "all_reachable": true
+   },
+   "length_dist": {
+     "distribution": {
+       "1": 256,
+       "2": 3659,
+       "3": 1774,
+       "4": 1464,
+       "5": 290,
+       "6": 245,
+       "7": 96,
+       "8": 146,
+       "9": 36,
+       "10": 41,
+       "11": 26,
+       "12": 34,
+       "13": 16,
+       "14": 16,
+       "15": 10,
+       "16": 32,
+       "17": 4,
+       "18": 3,
+       "19": 4,
+       "20": 4,
+       "21": 3,
+       "22": 2,
+       "23": 2,
+       "24": 4,
+       "25": 1,
+       "27": 2,
+       "28": 1,
+       "29": 1,
+       "31": 1,
+       "32": 11,
+       "30": 1
+     },
+     "avg_length": 3.3121563836285888,
+     "min_length": 1,
+     "max_length": 32,
+     "length_3_count": 1774,
+     "length_3_percent": 21.67379352474038
+   },
+   "byte_content": {
+     "null_tokens": 2145,
+     "ascii_printable": 1831,
+     "ascii_only": 3810,
+     "high_byte": 4375,
+     "mixed": 2060,
+     "byte_distribution": {
+       "0": 5087,
+       "255": 829,
+       "72": 684,
+       "1": 633,
+       "32": 528,
+       "139": 505,
+       "3": 455,
+       "116": 381,
+       "36": 358,
+       "2": 351,
+       "204": 347,
+       "64": 339,
+       "101": 333,
+       "65": 286,
+       "249": 279,
+       "128": 261,
+       "97": 259,
+       "4": 248,
+       "137": 247,
+       "114": 235,
+       "110": 234,
+       "15": 227,
+       "105": 226,
+       "115": 216,
+       "8": 213,
+       "111": 211,
+       "68": 182,
+       "108": 182,
+       "16": 176,
+       "232": 173,
+       "99": 173,
+       "131": 170,
+       "117": 161,
+       "145": 159,
+       "169": 159,
+       "76": 159,
+       "112": 157,
+       "100": 154,
+       "6": 149,
+       "84": 148,
+       "69": 147,
+       "5": 144,
+       "48": 144,
+       "192": 142,
+       "95": 142,
+       "31": 138,
+       "224": 135,
+       "141": 127,
+       "170": 123,
+       "102": 122
+     }
+   },
+   "diversity": {
+     "1": {
+       "learned": 256,
+       "possible": 256,
+       "coverage": 100.0
+     },
+     "2": {
+       "learned": 3659,
+       "possible": 65536,
+       "coverage": 5.58319091796875
+     },
+     "3": {
+       "learned": 1774,
+       "possible": 16777216,
+       "coverage": 0.010573863983154297
+     },
+     "4": {
+       "learned": 1464,
+       "possible": 4294967296,
+       "coverage": 3.4086406230926514e-05
+     }
+   }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff