philipp-zettl ibibrahim committed on
Commit 248fdaa · verified · 0 Parent(s):

Duplicate from ibm-granite/granite-docling-258M


Co-authored-by: Ibrahim Ibrahim <[email protected]>

.gitattributes ADDED
@@ -0,0 +1,38 @@
*.7z filter=lfs diff=lfs merge=lfs -text
*.arrow filter=lfs diff=lfs merge=lfs -text
*.bin filter=lfs diff=lfs merge=lfs -text
*.bz2 filter=lfs diff=lfs merge=lfs -text
*.ckpt filter=lfs diff=lfs merge=lfs -text
*.ftz filter=lfs diff=lfs merge=lfs -text
*.gz filter=lfs diff=lfs merge=lfs -text
*.h5 filter=lfs diff=lfs merge=lfs -text
*.joblib filter=lfs diff=lfs merge=lfs -text
*.lfs.* filter=lfs diff=lfs merge=lfs -text
*.mlmodel filter=lfs diff=lfs merge=lfs -text
*.model filter=lfs diff=lfs merge=lfs -text
*.msgpack filter=lfs diff=lfs merge=lfs -text
*.npy filter=lfs diff=lfs merge=lfs -text
*.npz filter=lfs diff=lfs merge=lfs -text
*.onnx filter=lfs diff=lfs merge=lfs -text
*.ot filter=lfs diff=lfs merge=lfs -text
*.parquet filter=lfs diff=lfs merge=lfs -text
*.pb filter=lfs diff=lfs merge=lfs -text
*.pickle filter=lfs diff=lfs merge=lfs -text
*.pkl filter=lfs diff=lfs merge=lfs -text
*.pt filter=lfs diff=lfs merge=lfs -text
*.pth filter=lfs diff=lfs merge=lfs -text
*.rar filter=lfs diff=lfs merge=lfs -text
*.safetensors filter=lfs diff=lfs merge=lfs -text
saved_model/**/* filter=lfs diff=lfs merge=lfs -text
*.tar.* filter=lfs diff=lfs merge=lfs -text
*.tar filter=lfs diff=lfs merge=lfs -text
*.tflite filter=lfs diff=lfs merge=lfs -text
*.tgz filter=lfs diff=lfs merge=lfs -text
*.wasm filter=lfs diff=lfs merge=lfs -text
*.xz filter=lfs diff=lfs merge=lfs -text
*.zip filter=lfs diff=lfs merge=lfs -text
*.zst filter=lfs diff=lfs merge=lfs -text
*tfevents* filter=lfs diff=lfs merge=lfs -text
granite_docling.png filter=lfs diff=lfs merge=lfs -text
assets/new_arxiv.png filter=lfs diff=lfs merge=lfs -text
assets/granite_docling_split_page.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,583 @@
---
license: apache-2.0
datasets:
- ds4sd/SynthCodeNet
- ds4sd/SynthFormulaNet
- ds4sd/SynthChartNet
- HuggingFaceM4/DoclingMatix
tags:
- text-generation
- documents
- code
- formula
- chart
- ocr
- layout
- table
- document-parse
- docling
- granite
- extraction
- math
language:
- en
pipeline_tag: image-text-to-text
library_name: transformers
---

# granite-docling-258m
<div style="display: flex; align-items: center;">
<img src="https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/granite_docling.png" alt="Granite Docling Logo" style="width: 200px; height: auto; margin-right: 20px;">
<div>
<p>Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. It preserves the core features of Docling while maintaining seamless integration with <a href="https://docling-project.github.io/docling">DoclingDocuments</a> to ensure full compatibility.</p>
</div>
</div>

**Model Summary**:

Granite Docling 258M builds upon the Idefics3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM. Try out our [Granite-Docling-258M](https://huggingface.co/spaces/ibm-granite/granite-docling-258m-demo) demo today.

- **Developed by**: IBM Research
- **Model type**: Multi-modal model (image+text-to-text)
- **Language(s)**: English (NLP)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Release Date**: September 17, 2025

Granite-docling-258M is fully integrated into the Docling pipelines, carrying over existing [features](https://huggingface.co/ds4sd/SmolDocling-256M-preview) while introducing a number of powerful new features, including:

- 🔢 Enhanced Equation Recognition: More accurate detection and formatting of mathematical formulas
- 🧩 Flexible Inference Modes: Choose between full-page inference and bbox-guided region inference
- 🧘 Improved Stability: Tends to avoid infinite loops more effectively
- 🧮 Enhanced Inline Equations: Better inline math recognition
- 🧾 Document Element QA: Answer questions about a document's structure, such as the presence and order of document elements
- 🌍 Japanese, Arabic and Chinese support (_experimental_)


## Getting started

The easiest way to use this model is through the [🐥Docling](https://github.com/docling-project/docling) library. It will automatically download this model and convert documents to various formats for you.

Install the latest version of `docling` through pip, then use the following CLI commands:

```sh
# Convert to HTML and Markdown:
docling --to html --to md --pipeline vlm --vlm-model granite_docling "https://arxiv.org/pdf/2501.17887" # accepts files, urls or directories

# Convert to HTML including layout visualization:
docling --to html_split_page --show-layout --pipeline vlm --vlm-model granite_docling "https://arxiv.org/pdf/2501.17887"
```

<p align="center">
<img src="https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/granite_docling_split_page.png" alt="GraniteDocling result in split page view" width="900"/>
</p>

<details>
<summary>You can also set this model up within the Docling SDK:</summary>

```python
from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import (
    VlmPipelineOptions,
)
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

source = "https://arxiv.org/pdf/2501.17887"

###### USING SIMPLE DEFAULT VALUES
# - GraniteDocling model
# - Using the transformers framework

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
        ),
    }
)

doc = converter.convert(source=source).document

print(doc.export_to_markdown())


###### USING MACOS MPS ACCELERATOR
# For more options see the compare_vlm_models.py example.

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_MLX,
)

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        ),
    }
)

doc = converter.convert(source=source).document

print(doc.export_to_markdown())
```
</details>


Alternatively, you can use bare **transformers**, **vllm**, **onnx** or **mlx-vlm** to perform inference, and [docling-core](https://github.com/docling-project/docling-core) APIs to convert results to a variety of output formats (md, html, etc.):

<details>
<summary>📄 Single page image inference using plain 🤗 transformers 🤖</summary>

```python
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("https://huggingface.co/ibm-granite/granite-docling-258M/resolve/main/assets/new_arxiv.png")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "sdpa",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

print(f"DocTags: \n{doctags}\n")


# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
# create a docling document
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
print(f"Markdown:\n{doc.export_to_markdown()}\n")

## export as any format.
# Path("out/").mkdir(parents=True, exist_ok=True)
# HTML:
# output_path_html = Path("out/") / "example.html"
# doc.save_as_html(output_path_html)
# Markdown:
# output_path_md = Path("out/") / "example.md"
# doc.save_as_markdown(output_path_md)
```
</details>


<details>
<summary>🚀 Fast Batch Inference with vLLM</summary>

```python
# Prerequisites:
# pip install vllm
# pip install docling_core
# place page images you want to convert into "img/" dir

import time
import os
from vllm import LLM, SamplingParams
from transformers import AutoProcessor
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from pathlib import Path

# Configuration
MODEL_PATH = "ibm-granite/granite-docling-258M"
IMAGE_DIR = "img/"  # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert this page to docling."

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": PROMPT_TEXT},
        ],
    },
]


# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, revision="untied", limit_mm_per_prompt={"image": 1})
processor = AutoProcessor.from_pretrained(MODEL_PATH)

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192,
    skip_special_tokens=False,
)

# Load and prepare all images and prompts up front
batched_inputs = []
image_names = []

for img_file in sorted(os.listdir(IMAGE_DIR)):
    if img_file.lower().endswith((".png", ".jpg", ".jpeg")):
        img_path = os.path.join(IMAGE_DIR, img_file)
        with Image.open(img_path) as im:
            image = im.convert("RGB")

        prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
        batched_inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})
        image_names.append(os.path.splitext(img_file)[0])

# Run batch inference
start_time = time.time()
outputs = llm.generate(batched_inputs, sampling_params=sampling_params)

# Postprocess all results
for img_fn, output, input_data in zip(image_names, outputs, batched_inputs):
    doctags = output.outputs[0].text
    output_path_dt = Path(OUTPUT_DIR) / f"{img_fn}.dt"
    output_path_md = Path(OUTPUT_DIR) / f"{img_fn}.md"

    with open(output_path_dt, "w", encoding="utf-8") as f:
        f.write(doctags)

    # Convert to DoclingDocument and save markdown
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [input_data["multi_modal_data"]["image"]])
    doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
    doc.save_as_markdown(output_path_md)

print(f"Total time: {time.time() - start_time:.2f} sec")
```
</details>

💻 Local inference on Apple Silicon with MLX: [see here](https://huggingface.co/ibm-granite/granite-docling-258M-mlx)

ℹ️ If you run into trouble running granite-docling with the code above, check the troubleshooting section at the bottom ⬇️.

## Intended Use
Granite-Docling is designed to complement the Docling library, not replace it. It integrates as a component within the larger Docling library, consolidating the functions of multiple single-purpose models into a single, compact VLM.
However, Granite-Docling is **not** intended for general image understanding. For tasks focused solely on image-text input, we recommend using [Granite Vision models](https://huggingface.co/collections/ibm-granite/granite-vision-models-67b3bd4ff90c915ba4cd2800), which are purpose-built and optimized for image-text processing.

## Evaluations
A comprehensive discussion of evaluation methods and findings has already been presented in our previous publication [[citation](https://arxiv.org/pdf/2503.11576)]. As this model is an update, we refer readers to that work for additional details.
The evaluation can be performed using the [docling-eval](https://github.com/docling-project/docling-eval) framework for the document-related tasks, and [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) for MMStar and OCRBench.

<table>
<thead>
<tr><th colspan="5"><b>Layout</b></th></tr>
<tr>
<th></th>
<th>MAP ↑</th>
<th>F1 ↑</th>
<th>Precision ↑</th>
<th>Recall ↑</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>smoldocling-256m-preview</b></td>
<td>0.23</td><td>0.85</td><td>0.9</td><td>0.84</td>
</tr>
<tr>
<td><b>granite-docling-258m</b></td>
<td><b>0.27</b></td><td><b>0.86</b></td><td><b>0.92</b></td><td><b>0.88</b></td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr><th colspan="7"><b>Full Page OCR</b></th></tr>
<tr>
<th></th>
<th>Edit-distance ↓</th>
<th>F1 ↑</th>
<th>Precision ↑</th>
<th>Recall ↑</th>
<th>BLEU ↑</th>
<th>Meteor ↑</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>smoldocling-256m-preview</b></td>
<td>0.48</td><td>0.80</td><td>0.89</td>
<td>0.79</td><td>0.58</td><td>0.67</td>
</tr>
<tr>
<td><b>granite-docling-258m</b></td>
<td><b>0.45</b></td><td><b>0.84</b></td><td><b>0.91</b></td>
<td><b>0.83</b></td><td><b>0.65</b></td><td><b>0.72</b></td>
</tr>
</tbody>
<thead>
<tr><th colspan="7"><b>Code Recognition</b></th></tr>
<tr>
<th></th>
<th>Edit-distance ↓</th>
<th>F1 ↑</th>
<th>Precision ↑</th>
<th>Recall ↑</th>
<th>BLEU ↑</th>
<th>Meteor ↑</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>smoldocling-256m-preview</b></td>
<td>0.114</td><td>0.915</td><td>0.94</td><td>0.909</td><td>0.875</td><td>0.889</td>
</tr>
<tr>
<td><b>granite-docling-258m</b></td>
<td><b>0.013</b></td><td><b>0.988</b></td><td><b>0.99</b></td><td><b>0.988</b></td>
<td><b>0.983</b></td><td><b>0.986</b></td>
</tr>
</tbody>
<thead>
<tr><th colspan="7"><b>Equation Recognition</b></th></tr>
<tr>
<th></th>
<th>Edit-distance ↓</th>
<th>F1 ↑</th>
<th>Precision ↑</th>
<th>Recall ↑</th>
<th>BLEU ↑</th>
<th>Meteor ↑</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>smoldocling-256m-preview</b></td>
<td>0.119</td><td>0.947</td><td>0.959</td><td>0.941</td><td>0.824</td><td>0.878</td>
</tr>
<tr>
<td><b>granite-docling-258m</b></td>
<td><b>0.073</b></td><td><b>0.968</b></td><td><b>0.968</b></td><td><b>0.969</b></td>
<td><b>0.893</b></td><td><b>0.927</b></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr><th colspan="3"><b>Table Recognition (FinTabNet 150dpi)</b></th></tr>
<tr>
<th></th>
<th>TEDS (structure) ↑</th>
<th>TEDS (w/content) ↑</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>smoldocling-256m-preview</b></td>
<td>0.82</td><td>0.76</td>
</tr>
<tr>
<td><b>granite-docling-258m</b></td>
<td><b>0.97</b></td><td><b>0.96</b></td>
</tr>
</tbody>
</table>
<table>
<thead>
<tr><th colspan="3"><b>Other Benchmarks</b></th></tr>
<tr>
<th></th>
<th>MMStar ↑</th>
<th>OCRBench ↑</th>
</tr>
</thead>
<tbody>
<tr>
<td><b>smoldocling-256m-preview</b></td>
<td>0.17</td><td>338</td>
</tr>
<tr>
<td><b>granite-docling-258m</b></td>
<td><b>0.30</b></td><td><b>500</b></td>
</tr>
</tbody>
</table>


## Supported Instructions

<table>
<tr>
<th>Description</th>
<th>Instruction</th>
<th>Short Instruction</th>
</tr>
<tr>
<td><b>Full conversion</b></td>
<td>Convert this page to docling.</td>
<td>-</td>
</tr>
<tr>
<td><b>Chart</b></td>
<td>Convert chart to table.</td>
<td><code>&lt;chart&gt;</code></td>
</tr>
<tr>
<td><b>Formula</b></td>
<td>Convert formula to LaTeX.</td>
<td><code>&lt;formula&gt;</code></td>
</tr>
<tr>
<td><b>Code</b></td>
<td>Convert code to text.</td>
<td><code>&lt;code&gt;</code></td>
</tr>
<tr>
<td><b>Table</b></td>
<td>Convert table to OTSL. (<a href="https://arxiv.org/pdf/2305.03393">Lysak et al., 2023</a>)</td>
<td><code>&lt;otsl&gt;</code></td>
</tr>
<tr>
<td rowspan="4"><b>Actions and Pipelines</b></td>
<td>OCR the text in a specific location: &lt;loc_155&gt;&lt;loc_233&gt;&lt;loc_206&gt;&lt;loc_237&gt;</td>
<td>-</td>
</tr>
<tr>
<td>Identify element at: &lt;loc_247&gt;&lt;loc_482&gt;&lt;loc_252&gt;&lt;loc_486&gt;</td>
<td>-</td>
</tr>
<tr>
<td>Find all 'text' elements on the page, retrieve all section headers.</td>
<td>-</td>
</tr>
<tr>
<td>Detect footer elements on the page.</td>
<td>-</td>
</tr>
</table>
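
Outside the Docling pipeline, these instructions drop into the plain transformers example above by simply swapping the prompt text. A minimal sketch for element-level recognition (illustrative only: `formula_crop.png` is a hypothetical cropped image of a single formula, and `processor`, `model`, `DEVICE` and `load_image` are reused from the transformers example above):

```python
formula_image = load_image("formula_crop.png")  # hypothetical crop of a formula region

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert formula to LaTeX."},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[formula_image], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(**inputs, max_new_tokens=512)
latex = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,  # drop DocTags markers, keep the LaTeX body
)[0].strip()
print(latex)
```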


## Model Architecture

The architecture of granite-docling-258m consists of the following components:

(1) Vision encoder: [siglip2-base-patch16-512](https://huggingface.co/google/siglip2-base-patch16-512).

(2) Vision-language connector: pixel shuffle projector (as in Idefics3).

(3) Large language model: Granite 165M.

We built upon [Idefics3](https://huggingface.co/docs/transformers/en/model_doc/idefics3) to train our model. We incorporated DocTags into our LLM's supervised fine-tuning (SFT) data to help the model become familiar with the format, enabling faster convergence and mitigating issues previously observed with SmolDocling.
The model was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM) framework, which provides a lightweight and efficient training setup for vision-language models.
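
For illustration, a minimal sketch of the pixel-shuffle (space-to-depth) idea behind the connector, with shapes taken from this repo's config.json and processor_config.json (vision hidden size 768, scale_factor 4, LLM hidden size 576, 64 visual tokens per image). This is an assumption-based re-implementation, not the exact training code; the real connector lives in the transformers Idefics3 implementation and may order dimensions differently:

```python
import torch
import torch.nn as nn

class PixelShuffleConnector(nn.Module):
    """Fold each s x s block of patch embeddings into the channel dimension,
    then project to the LLM width: s**2 fewer tokens, s**2 wider features."""

    def __init__(self, vision_dim: int = 768, text_dim: int = 576, scale_factor: int = 4):
        super().__init__()
        self.s = scale_factor
        self.proj = nn.Linear(vision_dim * scale_factor**2, text_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, vision_dim), with num_patches = h * w and h == w
        b, n, d = x.shape
        h = w = int(n**0.5)
        x = x.view(b, h // self.s, self.s, w // self.s, self.s, d)
        # Group each s x s patch block into a single, wider token.
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // self.s) * (w // self.s), d * self.s**2)
        return self.proj(x)

# A 512x512 image with 16x16 patches gives 32x32 = 1024 patch embeddings,
# reduced to 64 visual tokens (matching image_seq_len in processor_config.json).
tokens = PixelShuffleConnector()(torch.randn(1, 1024, 768))
print(tokens.shape)  # torch.Size([1, 64, 576])
```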

**Training Data**: Our training corpus consists of two principal sources: (1) publicly available datasets and (2) internally constructed synthetic datasets designed to elicit specific document understanding capabilities.

In particular, we incorporate:

* [**SynthCodeNet**](https://huggingface.co/datasets/ds4sd/SynthCodeNet) — a large-scale collection of synthetically rendered code snippets spanning over 50 programming languages
* [**SynthFormulaNet**](https://huggingface.co/datasets/ds4sd/SynthFormulaNet) — a dataset of synthetic mathematical expressions paired with ground-truth LaTeX representations
* [**SynthChartNet**](https://huggingface.co/datasets/ds4sd/SynthChartNet) — synthetic chart images annotated with structured table outputs
* [**DoclingMatix**](https://huggingface.co/datasets/HuggingFaceM4/DoclingMatix) — a curated corpus of real-world document pages sampled from diverse domains


**Infrastructure**: We train granite-docling-258m using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models across thousands of GPUs.

**Responsible Use and Limitations**: Some use cases for Vision Language Models can trigger certain risks and ethical considerations, including but not limited to: bias and fairness, misinformation, and autonomous decision-making.
Although our alignment processes include safety considerations, the model may in some cases produce inaccurate, biased, offensive or unwanted responses to user prompts. Additionally, it remains uncertain whether smaller models exhibit increased susceptibility
to hallucination in generation scenarios due to their reduced size, which could limit their ability to generate coherent and contextually accurate responses. This aspect is currently an active area of research,
and we anticipate more rigorous exploration, comprehension, and mitigations in this domain. We urge the community to use granite-docling-258m in a responsible way and avoid any malicious utilization. We recommend using this model only as part of the Docling library.
More general vision tasks may pose higher inherent risks of triggering unwanted output. To enhance safety, we recommend using granite-docling-258m alongside Granite Guardian. Granite Guardian is a fine-tuned instruct model designed to detect and flag risks in prompts and responses across key dimensions outlined in the IBM AI Risk Atlas.
Its training, which includes both human-annotated and synthetic data informed by internal red-teaming, enables it to outperform similar open-source models on standard benchmarks, providing an additional layer of safety.

**Resources**

- ⭐️ Learn about the latest updates with Docling: https://docling-project.github.io/docling/#features
- 🚀 Get started with Docling concepts, integrations and tutorials: https://docling-project.github.io/docling/getting_started/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
- 🖥️ Learn more about how to use Granite-Docling, explore the Docling library, and see what's coming next for Docling in the release blog: https://ibm.com/new/announcements/granite-docling-end-to-end-document-conversion

## Troubleshooting

**Running with vLLM**

1. You receive `AttributeError: 'LlamaModel' object has no attribute 'wte'` when launching the model through vLLM.

With current versions of vLLM (including 0.10.2), support for tied weights as used in granite-docling is limited and loading breaks. We provide a version with untied weights on the `untied` branch of this model repo.
To use the untied version, please pass the `revision` argument to vLLM:

```sh
# Serve the model through vLLM
$> vllm serve ibm-granite/granite-docling-258M --revision untied
```

```python
# If using the vLLM python SDK:
from vllm import LLM
...

llm = LLM(model=MODEL_PATH, revision="untied", limit_mm_per_prompt={"image": 1})
```

2. The model outputs only exclamation marks (i.e. "!!!!!!!!!!!!!!!").

This is seen on older NVIDIA GPUs, such as the T4 GPU available in Google Colab, because they lack support for the `bfloat16` format.
You can work around it by setting the `dtype` to `float32`.

```sh
# Serve the model through vLLM
$> vllm serve ibm-granite/granite-docling-258M --revision untied --dtype float32
```

```python
# If using the vLLM python SDK:
from vllm import LLM
...

llm = LLM(model=MODEL_PATH, revision="untied", limit_mm_per_prompt={"image": 1}, dtype="float32")
```
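
The same `bfloat16` limitation affects the plain transformers example above; a sketch of the equivalent workaround, assuming the weights are simply cast to `float32` on load:

```python
import torch
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.float32,  # avoid bfloat16 on e.g. T4 GPUs
    _attn_implementation="sdpa",
)
```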
added_tokens.json ADDED
@@ -0,0 +1,3 @@
{
  "<end_of_utterance>": 100352
}
assets/granite_docling_split_page.png ADDED

Git LFS Details

  • SHA256: a1bd51ff1cea9daabc7f00522c7cc4b5905fa5cd3f67c40d0d707ee0686ce94b
  • Pointer size: 132 Bytes
  • Size of remote file: 2.32 MB
assets/new_arxiv.png ADDED

Git LFS Details

  • SHA256: 15e72aca956d9e796788eaa4b0debb9ae988ca7a9637ae3b6df1e6ce671d73d0
  • Pointer size: 131 Bytes
  • Size of remote file: 523 kB
chat_template.jinja ADDED
@@ -0,0 +1,21 @@
{%- for message in messages -%}
{{- '<|start_of_role|>' + message['role'] + '<|end_of_role|>' -}}
{%- if message['content'] is string -%}
{{- message['content'] -}}
{%- else -%}
{%- for part in message['content'] -%}
{%- if part['type'] == 'text' -%}
{{- part['text'] -}}
{%- elif part['type'] == 'image' -%}
{{- '<image>' -}}
{%- endif -%}
{%- endfor -%}
{%- endif -%}
{{- '<|end_of_text|>
' -}}
{%- endfor -%}
{%- if add_generation_prompt -%}
{{- '<|start_of_role|>assistant' -}}
{%- if controls -%}{{- ' ' + controls | tojson() -}}{%- endif -%}
{{- '<|end_of_role|>' -}}
{%- endif -%}
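
For reference, applying this template to the single-image message used in the README examples should render the following prompt (derived by hand from the template above):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Convert this page to docling."},
    ]},
]
print(processor.apply_chat_template(messages, add_generation_prompt=True))
# Expected form (the processor later expands <image> into actual image tokens):
# <|start_of_role|>user<|end_of_role|><image>Convert this page to docling.<|end_of_text|>
# <|start_of_role|>assistant<|end_of_role|>
```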
config.json ADDED
@@ -0,0 +1,66 @@
{
  "architectures": [
    "Idefics3ForConditionalGeneration"
  ],
  "bos_token_id": 100264,
  "dtype": "bfloat16",
  "eos_token_id": 100257,
  "image_token_id": 100270,
  "model_type": "idefics3",
  "pad_token_id": 100257,
  "scale_factor": 4,
  "text_config": {
    "_name_or_path": "models/granitev06_hf_ai4k_sft_data_v4",
    "architectures": [
      "LlamaForCausalLM"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
    "bos_token_id": 100264,
    "dtype": "bfloat16",
    "eos_token_id": 100257,
    "head_dim": 64,
    "hidden_act": "silu",
    "hidden_size": 576,
    "initializer_range": 0.02,
    "intermediate_size": 1536,
    "max_position_embeddings": 8192,
    "mlp_bias": false,
    "model_type": "llama",
    "num_attention_heads": 9,
    "num_hidden_layers": 30,
    "num_key_value_heads": 3,
    "pad_token_id": 100257,
    "pretraining_tp": 1,
    "rms_norm_eps": 1e-05,
    "rope_scaling": null,
    "rope_theta": 100000.0,
    "tie_word_embeddings": true,
    "use_cache": false,
    "vocab_size": 100352
  },
  "tie_word_embeddings": true,
  "transformers_version": "4.56.1",
  "use_cache": true,
  "vision_config": {
    "attention_dropout": 0.0,
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 768,
    "image_size": 512,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-06,
    "max_image_size": {
      "longest_edge": 512
    },
    "model_type": "idefics3_vision",
    "num_attention_heads": 12,
    "num_channels": 3,
    "num_hidden_layers": 12,
    "patch_size": 16,
    "size": {
      "longest_edge": 512
    }
  },
  "vocab_size": 100352
}
generation_config.json ADDED
@@ -0,0 +1,8 @@
{
  "_from_model_config": true,
  "bos_token_id": 100264,
  "eos_token_id": 100257,
  "pad_token_id": 100257,
  "transformers_version": "4.56.1",
  "use_cache": false
}
granite_docling.png ADDED

Git LFS Details

  • SHA256: f4d43939df541ab6958e989e7a6761a8db1ccb484dcb0c2749fedbb1357d2bb8
  • Pointer size: 132 Bytes
  • Size of remote file: 2.19 MB
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:1cdad234deb1cde18ee6a586f849057f19851daf1fedce2e40aff791dbe46f61
size 515093104
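
As a rough sanity check, the checkpoint size is consistent with the model name: bfloat16 stores two bytes per parameter, so:

```python
print(515093104 / 2 / 1e6)  # ~257.5 million parameters, i.e. the "258M" in the name
```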
preprocessor_config.json ADDED
@@ -0,0 +1,28 @@
{
  "do_convert_rgb": true,
  "do_image_splitting": true,
  "do_normalize": true,
  "do_pad": true,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.5,
    0.5,
    0.5
  ],
  "image_processor_type": "Idefics3ImageProcessor",
  "image_std": [
    0.5,
    0.5,
    0.5
  ],
  "max_image_size": {
    "longest_edge": 512
  },
  "processor_class": "Idefics3Processor",
  "resample": 1,
  "rescale_factor": 0.00392156862745098,
  "size": {
    "longest_edge": 2048
  }
}
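
A worked check of the constants above: with `do_rescale` and `do_normalize` enabled, pixel values are mapped from [0, 255] to [-1, 1]:

```python
assert abs(0.00392156862745098 - 1 / 255) < 1e-12  # rescale_factor is 1/255
x = 255  # a white pixel
print((x * (1 / 255) - 0.5) / 0.5)  # 1.0; a black pixel (0) maps to -1.0
```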
processor_config.json ADDED
@@ -0,0 +1,4 @@
{
  "image_seq_len": 64,
  "processor_class": "Idefics3Processor"
}
special_tokens_map.json ADDED
@@ -0,0 +1,40 @@
{
  "additional_special_tokens": [
    {
      "content": "<fake_token_around_image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    },
    {
      "content": "<image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false
    }
  ],
  "bos_token": {
    "content": "<|start_of_role|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "eos_token": {
    "content": "<|end_of_text|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": "<|end_of_text|>",
  "unk_token": {
    "content": "<|unk|>",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,789 @@
{
  "add_bos_token": false,
  "add_prefix_space": false,
  "added_tokens_decoder": {
    "100256": {
      "content": "<|pad|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100257": {
      "content": "<|end_of_text|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100258": {
      "content": "<row_1_col_1>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100259": {
      "content": "<row_1_col_2>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100260": {
      "content": "<text>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100261": {
      "content": "<row_1_col_3>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100262": {
      "content": "<row_1_col_4>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100263": {
      "content": "<row_2_col_1>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100264": {
      "content": "<|start_of_role|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100265": {
      "content": "<|end_of_role|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100266": {
      "content": "</title>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100267": {
      "content": "<row_2_col_2>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100268": {
      "content": "<row_2_col_3>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100269": {
      "content": "<title>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100270": {
      "content": "<image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100271": {
      "content": "<caption>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100272": {
      "content": "</caption>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100273": {
      "content": "<footnote>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100274": {
      "content": "</footnote>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100275": {
      "content": "<formula>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100276": {
      "content": "</formula>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100277": {
      "content": "<list_item>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100278": {
      "content": "</list_item>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100279": {
      "content": "<page_footer>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100280": {
      "content": "</page_footer>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100281": {
      "content": "<page_header>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100282": {
      "content": "</page_header>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100283": {
      "content": "<picture>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100284": {
      "content": "</picture>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100285": {
      "content": "<section_header_level_1>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100286": {
      "content": "</section_header_level_1>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100287": {
      "content": "<section_header_level_2>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100288": {
      "content": "</section_header_level_2>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100289": {
      "content": "<section_header_level_3>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100290": {
      "content": "</section_header_level_3>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100291": {
      "content": "<section_header_level_4>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100292": {
      "content": "</section_header_level_4>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100293": {
      "content": "<section_header_level_5>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100294": {
      "content": "</section_header_level_5>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100295": {
      "content": "<section_header_level_6>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100296": {
      "content": "</section_header_level_6>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100297": {
      "content": "<otsl>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100298": {
      "content": "</otsl>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100299": {
      "content": "<checkbox_selected>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100300": {
      "content": "</checkbox_selected>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100301": {
      "content": "<checkbox_unselected>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100302": {
      "content": "</checkbox_unselected>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100303": {
      "content": "<form>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100304": {
      "content": "</form>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100305": {
      "content": "<key_value_region>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100306": {
      "content": "</key_value_region>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100307": {
      "content": "<key_",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100308": {
      "content": "</key_",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100309": {
      "content": "<value_",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100310": {
      "content": "</value_",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100311": {
      "content": "<link_",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100312": {
      "content": "<chart>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100313": {
      "content": "</chart>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100314": {
      "content": "<page_break>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100315": {
      "content": "<smiles>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100316": {
      "content": "</smiles>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100317": {
      "content": "</text>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100318": {
      "content": "<paragraph>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100319": {
      "content": "</paragraph>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100320": {
      "content": "<references>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100321": {
      "content": "</references>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100322": {
      "content": "<ordered_list>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100323": {
      "content": "</ordered_list>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100324": {
      "content": "<unordered_list>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100325": {
      "content": "</unordered_list>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100326": {
      "content": "<group>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100327": {
      "content": "<doctag>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100328": {
      "content": "</doctag>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100329": {
      "content": "<rec_",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100330": {
      "content": "<fcel>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100331": {
      "content": "<ecel>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100332": {
      "content": "<lcel>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100333": {
      "content": "<ucel>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100334": {
      "content": "<xcel>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100335": {
      "content": "<nl>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100336": {
      "content": "<ched>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100337": {
      "content": "<rhed>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100338": {
      "content": "<|unk|>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100339": {
      "content": "<fake_token_around_image>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100340": {
      "content": "<global-img>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100341": {
      "content": "<row_2_col_4>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100342": {
      "content": "<row_3_col_1>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100343": {
      "content": "<row_3_col_2>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100344": {
      "content": "<row_3_col_3>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100345": {
      "content": "<row_3_col_4>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100346": {
      "content": "<row_4_col_1>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100347": {
      "content": "<row_4_col_2>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100348": {
      "content": "<row_4_col_3>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100349": {
      "content": "<row_4_col_4>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100350": {
      "content": "<code>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100351": {
      "content": "</code>",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "additional_special_tokens": [
    "<fake_token_around_image>",
    "<image>"
  ],
  "bos_token": "<|start_of_role|>",
  "clean_up_tokenization_spaces": false,
  "eos_token": "<|end_of_text|>",
  "errors": "replace",
  "extra_special_tokens": {},
  "model_max_length": 8192,
  "pad_token": "<|end_of_text|>",
  "padding_side": "left",
  "processor_class": "Idefics3Processor",
  "tokenizer_class": "GPT2Tokenizer",
  "unk_token": "<|unk|>"
}
vocab.json ADDED
The diff for this file is too large to render. See raw diff