mlconti committed on
Commit
be3bf7a
·
verified ·
1 Parent(s): 4ddb7ed

Create README.md

Files changed (1)
  1. README.md +114 -0
README.md ADDED
@@ -0,0 +1,114 @@
+ ---
+ license: mit
+ library_name: colpali
+ base_model: vidore/colmodernvbert-base
+ language:
+ - en
+ tags:
+ - colpali
+ - vidore-experimental
+ - vidore
+ pipeline_tag: visual-document-retrieval
+ ---
+ # ColModernVBERT
+
+ <div>
+ <img src="bg.png" width="100%" alt="ModernVBERT" />
+ </div>
+
+ ## Model
+ This is the model card for `ColModernVBERT`, the late-interaction version of ModernVBERT fine-tuned for visual document retrieval. It is our most performant model on this task.
+ This version has the LoRA adapters merged into the base model.
+
+ ## Table of Contents
+ 1. [Overview](#overview)
+ 2. [Usage](#usage)
+ 3. [Evaluation](#evaluation)
+ 4. [License](#license)
+ 5. [Citation](#citation)
+
+ ## Overview
+
+ The [ModernVBERT](https://arxiv.org/abs/2510.01149) suite is a family of compact 250M-parameter vision-language encoders that achieve state-of-the-art performance in this size class and match the performance of models up to 10x larger.
+
+ For more information about ModernVBERT, see the [arXiv preprint](https://arxiv.org/abs/2510.01149).
+
+ ### Models
+ - `ColModernVBERT` is the late-interaction version, fine-tuned for visual document retrieval. It is our most performant model on this task.
+ - `BiModernVBERT` is the bi-encoder version, fine-tuned for visual document retrieval.
+ - `ModernVBERT-embed` is the bi-encoder version after modality alignment (using an MLM objective) and contrastive learning, without document specialization.
+ - `ModernVBERT` is the base model after modality alignment (using an MLM objective).
+
+
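To make the difference between the late-interaction and bi-encoder variants listed above concrete, here is a minimal, library-free sketch of the two scoring schemes. It assumes the standard ColBERT-style MaxSim operator for late interaction and a plain dot product for the bi-encoder. The function names and toy vectors are hypothetical, and the actual colpali implementation works on batched tensors and may differ in details such as normalization.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def bi_encoder_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Bi-encoder: one vector per query and one vector per document."""
    return dot(query_vec, doc_vec)


def late_interaction_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late interaction (MaxSim): for each query token vector, keep its best
    match among the document vectors, then sum over the query tokens."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


# Toy example (hypothetical values): 2 query token vectors, 3 document patch vectors
query_vecs = [[1.0, 0.0], [0.0, 1.0]]
doc_vecs = [[0.5, 0.0], [0.0, 2.0], [1.0, 1.0]]

print(late_interaction_score(query_vecs, doc_vecs))
```

The late-interaction score keeps one vector per token or image patch, so a query term can match a specific region of the page, at the cost of storing more vectors per document than a bi-encoder.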
+ ## Usage
+
+ **🏎️ If your GPU supports it, we recommend using ModernVBERT with Flash Attention 2 to achieve the highest GPU throughput. To do so, install Flash Attention 2 (for example with `pip install flash-attn`), then use the model as normal.**
+
+ The branch that adds ColModernVBERT support is not yet merged into the official colpali repository, so for now you need to clone the repository and check out the `vbert` branch:
+
+ ```bash
+ git clone https://github.com/illuin-tech/colpali.git
+ cd colpali
+ # Switch to the branch that contains the ColModernVBERT classes
+ git checkout vbert
+ # Install the package in editable mode from the local checkout
+ pip install -e .
+ ```
+
+ Here is an example of scoring a text query against a document image with ColModernVBERT:
+
+ ```python
+ import torch
+ from colpali_engine.models import ColModernVBert, ColModernVBertProcessor
+ from PIL import Image
+ from huggingface_hub import hf_hub_download
+
+ model_id = "ModernVBERT/colmodernvbert-merged"
+
+ # Load the processor and the merged model weights
+ processor = ColModernVBertProcessor.from_pretrained(model_id)
+ model = ColModernVBert.from_pretrained(
+     model_id,
+     torch_dtype=torch.float32,
+     trust_remote_code=True
+ )
+
+ # Download a sample image and define a text query
+ image = Image.open(hf_hub_download("HuggingFaceTB/SmolVLM", "example_images/rococo.jpg", repo_type="space"))
+ text = "This is a text"
+
+ # Prepare inputs
+ text_inputs = processor.process_texts([text])
+ image_inputs = processor.process_images([image])
+
+ # Inference (gradients are not needed for retrieval)
+ with torch.no_grad():
+     q_embeddings = model(**text_inputs)
+     corpus_embeddings = model(**image_inputs)
+
+ # Get the similarity scores
+ scores = processor.score(q_embeddings, corpus_embeddings)
+
+ print("Similarity scores:", scores)
+ ```
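With more than one query or page image, the scores can be turned into a ranking. The sketch below assumes `scores` has one row per query and one column per image, and works on a plain nested list such as the one `scores.tolist()` would return for a 2-D tensor. The `top_k_pages` helper and the numbers are hypothetical and not part of colpali.

```python
from typing import List, Tuple


def top_k_pages(score_row: List[float], k: int = 3) -> List[Tuple[int, float]]:
    """Return (page_index, score) pairs for the k highest-scoring pages."""
    ranked = sorted(enumerate(score_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical scores for 2 queries against 4 page images
scores_as_lists = [
    [11.2, 14.8, 9.5, 13.1],
    [7.4, 6.9, 12.3, 8.0],
]

for query_idx, row in enumerate(scores_as_lists):
    print(f"query {query_idx}:", top_k_pages(row, k=2))
```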
+
+ ## Evaluation
+ <div>
+ <img src="table.png" width="100%" alt="ModernVBERT Results" />
+ </div>
+ ColModernVBERT matches the performance of models nearly 10x larger on visual document benchmarks. It also offers attractive CPU inference speed compared with models of similar performance.
+
+ ## License
+
+ We release the ModernVBERT model architectures, model weights, and training codebase under the MIT license.
+
+ ## Citation
+
+ If you use ModernVBERT in your work, please cite:
+
+ ```
+ @misc{teiletche2025modernvbertsmallervisualdocument,
+ title={ModernVBERT: Towards Smaller Visual Document Retrievers},
+ author={Paul Teiletche and Quentin Macé and Max Conti and Antonio Loison and Gautier Viaud and Pierre Colombo and Manuel Faysse},
+ year={2025},
+ eprint={2510.01149},
+ archivePrefix={arXiv},
+ primaryClass={cs.IR},
+ url={https://arxiv.org/abs/2510.01149},
+ }
+ ```