Improve model card: Add transformers, image-text-to-text tags, paper, project page, and usage example

#1
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +283 -3
README.md CHANGED

---
license: mit
library_name: transformers
pipeline_tag: image-text-to-text
---

# NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints (NeurIPS 2025)

[📜 Paper](https://huggingface.co/papers/2510.08565) | [⭐️ Project Page](https://internvl.github.io/blog/2025-10-10-NaViL/) | [💻 GitHub Repository](https://github.com/OpenGVLab/NaViL) | [🤗 Models](https://huggingface.co/collections/OpenGVLab/navil-68e62e7d20ea3e4097b56778) | [📝 Chinese README](https://github.com/OpenGVLab/NaViL/blob/main/README-zh.md)

## Abstract

Compositional training has been the de-facto paradigm in existing Multimodal Large Language Models (MLLMs), where pre-trained vision encoders are connected with pre-trained LLMs through continuous multimodal pre-training. However, the multimodal scaling property of this paradigm remains difficult to explore due to the separated training. In this paper, we focus on the native training of MLLMs in an end-to-end manner and systematically study its design space and scaling property under a practical setting, i.e., data constraint. Through careful study of various choices in MLLM, we obtain the optimal meta-architecture that best balances performance and training cost. After that, we further explore the scaling properties of the native MLLM and indicate the positively correlated scaling relationship between visual encoders and LLMs. Based on these findings, we propose a native MLLM called NaViL, combined with a simple and cost-effective recipe. Experimental results on 14 multimodal benchmarks confirm the competitive performance of NaViL against existing MLLMs. Besides that, our findings and results provide in-depth insights for the future study of native MLLMs.

## 💡 Core Insights

We conducted a systematic study on the design and scaling properties of native MLLMs, leading to five key conclusions that guided the design of NaViL:

1. **LLM Initialization is Crucial**: Initializing the model from a pre-trained LLM significantly accelerates the convergence of multimodal training and generally yields better performance than training from scratch, even when a large amount of multimodal data is available.

<p align="center">
<img src="https://huggingface.co/OpenGVLab/NaViL/resolve/main/images/comparison_llm_init.png" alt="LLM Initialization Comparison" style="width: 80%; height: auto;" />
</p>

2. **MoE Architecture is Effective**: The Mixture-of-Experts (MoE) architecture significantly enhances the model's ability to process heterogeneous data and improves overall performance without increasing inference cost (activated parameters). We found that introducing modality-specific experts for both the attention and feed-forward network (FFN) layers yields the best results.

<p align="center">
<img src="https://huggingface.co/OpenGVLab/NaViL/resolve/main/images/comparison_moe.png" alt="MoE Architecture Comparison" style="width: 60%; height: auto;" />
</p>

3. **Flexibility of Visual Encoder Architecture**: For a given parameter budget, the performance of the visual encoder is nearly optimal across a wide range of depth and width configurations. Shallower encoders converge faster in the early stages of training, while deeper encoders perform slightly better with more data.

4. **Asymmetric Scaling Effects**: Scaling up the LLM consistently improves multimodal performance, following traditional language model scaling laws. However, the benefits of scaling the visual encoder diminish, with its performance ceiling constrained by the LLM's capacity.

5. **Joint Scaling Law for Vision and Language**: Our research reveals for the first time that **the optimal scale of the visual encoder is directly proportional to the scale of the LLM on a logarithmic scale**. This implies that they should be scaled jointly, and it highlights the sub-optimality of existing compositional MLLMs that pair a fixed-size visual encoder with LLMs of different sizes (a toy sketch of this relationship follows the figure below).

<p align="center">
<img src="https://huggingface.co/OpenGVLab/NaViL/resolve/main/images/comparison_vit_size_vs_llm_size.png" alt="Visual Encoder vs LLM Scaling" style="width: 60%; height: auto;" />
</p>
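
To make the joint scaling law concrete, the toy sketch below evaluates a log-linear rule of the form `log10(N_vision) = a * log10(N_llm) + b`. The slope and intercept here are invented purely for illustration; the fitted values are reported in the paper, not in this card.

```python
import math

# Hypothetical coefficients, for illustration only -- NOT the values fitted in the paper.
a, b = 0.9, -1.0

def toy_optimal_vision_size(n_llm_params: float) -> float:
    """Toy log-linear rule: log10(N_vision) = a * log10(N_llm) + b."""
    return 10 ** (a * math.log10(n_llm_params) + b)

for n_llm in (2e9, 9e9, 70e9):
    n_vis = toy_optimal_vision_size(n_llm)
    print(f"LLM ~{n_llm / 1e9:.0f}B params -> vision encoder ~{n_vis / 1e6:.0f}M params")
```

The takeaway is the trend rather than the numbers: under a rule of this form, the optimal visual encoder grows as a power of the LLM size instead of staying fixed.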

For more details, please refer to the original [paper](https://huggingface.co/papers/2510.08565).

## 🏗️ NaViL Architecture

Based on the insights above, we built NaViL: a native, MoE-based MLLM that is trained end-to-end and natively supports images of arbitrary resolution.

<p align="center">
<img src="https://huggingface.co/OpenGVLab/NaViL/resolve/main/images/arch.png" alt="NaViL Architecture Diagram" style="width: 100%; height: auto;" />
</p>

- **Visual Encoder**: Responsible for the initial extraction of visual information.
- **MLP Connector**: Projects visual features into the LLM's feature space.
- **MoE-extended LLM**: Contains modality-specific attention (MHA-MMoE) and feed-forward networks (FFN-MMoE) to fuse visual and text information more effectively (a minimal routing sketch follows this list).
- **Visual Multi-scale Packing**: Further enhances model performance during inference by processing image inputs at multiple scales (see `anyres_preprocess_multi_scale` in the inference example below).
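
To illustrate the idea behind the FFN-MMoE layers, here is a minimal, self-contained sketch in which image and text tokens share a layer but are routed to modality-specific experts. The class name, sizes, and hard modality mask are made up for this sketch and do not reflect the actual NaViL implementation.

```python
import torch
import torch.nn as nn

class ModalityMoEFFN(nn.Module):
    """Toy FFN-MMoE: one FFN expert per modality, selected by a boolean modality mask."""

    def __init__(self, hidden_size: int = 64, intermediate_size: int = 256):
        super().__init__()
        def make_ffn():
            return nn.Sequential(
                nn.Linear(hidden_size, intermediate_size),
                nn.GELU(),
                nn.Linear(intermediate_size, hidden_size),
            )
        self.text_expert = make_ffn()
        self.vision_expert = make_ffn()

    def forward(self, hidden_states: torch.Tensor, is_vision: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); is_vision: (batch, seq_len) bool mask.
        # For simplicity both experts run on every token and the mask picks the output.
        text_out = self.text_expert(hidden_states)
        vision_out = self.vision_expert(hidden_states)
        return torch.where(is_vision.unsqueeze(-1), vision_out, text_out)

# Example: a sequence with 3 image tokens followed by 5 text tokens.
layer = ModalityMoEFFN()
tokens = torch.randn(1, 8, 64)
mask = torch.tensor([[True, True, True, False, False, False, False, False]])
print(layer(tokens, mask).shape)  # torch.Size([1, 8, 64])
```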

## 📊 Main Results

We conducted a comprehensive evaluation of NaViL on 14 mainstream multimodal benchmarks, covering general capabilities, visual question answering, OCR, chart, and document understanding.

### Comparison with SOTA Models

With comparable parameter sizes, NaViL-2B and NaViL-9B **surpass all existing native MLLMs in average performance** and achieve a level comparable to top-tier compositional MLLMs (e.g., InternVL-2.5, Qwen2.5-VL). This demonstrates the superiority of our proposed native training paradigm and scaling laws.

| Model | #A-Param | Avg | MMVet | MMMU | MMB | MME | MathVista | OCR-Bench | TextVQA | DocVQA | AI2D | ChartQA | InfoVQA |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| **_Compositional MLLMs_** |
| [Qwen2.5-VL](https://github.com/QwenLM/Qwen-VL) | 8.2B | 80.2 | 67.1 | 58.6 | 83.5 | 2347 | 68.2 | 864 | 84.9 | 95.7 | 83.9 | 87.3 | 82.6 |
| [InternVL-2.5](https://github.com/OpenGVLab/InternVL) | 8.1B | 77.3 | 62.8 | 56.0 | 84.6 | 2344 | 64.4 | 822 | 79.1 | 91.9 | 84.5 | 84.8 | 75.7 |
| **_Native MLLMs_** |
| [EVEv2](https://github.com/baaivision/EVE) | 7B | 62.3 | 45.0 | 39.3 | 66.3 | 1709 | 60.0\* | 702 | 71.1 | 77.4\* | 74.8 | 73.9 | 45.8\* |
| [SAIL](https://github.com/ByteDance-Seed/SAIL) | 7B | 63.7 | 46.3 | 38.6\* | 70.1 | 1719 | 57.0 | 783 | 77.1 | 78.4\* | 76.7 | 69.7\* | 47.3\* |
| **NaViL-2B (ours)** | **2.4B** | **68.8** | **78.3** | **41.8** | **71.2** | **1822** | **50.0** | **796** | **76.9** | **85.4** | **74.6** | **78.0** | **56.0** |
| **NaViL-9B (ours)** | **9.2B** | **77.0** | **79.6** | **54.7** | **76.5** | **2225** | **66.7** | **837** | **77.2** | **90.6** | **82.4** | **85.4** | **70.2** |

> * \* denotes results tested locally using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit) and [OpenCompass](https://rank.opencompass.org.cn/leaderboard-multimodal/?m=REALTIME).
> * The average score is computed by normalizing each metric to a range of 0-100.
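
As a concrete illustration of the normalization noted above, the minimal sketch below rescales the two columns that are not already on a 0-100 scale before averaging. The full scores used here (2800 for MME, 1000 for OCR-Bench) are the conventional maxima for those benchmarks and are an assumption of this sketch, not something stated in this card.

```python
def normalize(score: float, max_score: float) -> float:
    """Rescale a raw benchmark score to 0-100."""
    return 100.0 * score / max_score

# NaViL-9B row from the table above; MME (assumed max 2800) and OCR-Bench
# (assumed max 1000) are rescaled, the other metrics are already percentages.
scores = {
    "MMVet": 79.6, "MMMU": 54.7, "MMB": 76.5,
    "MME": normalize(2225, 2800), "MathVista": 66.7,
    "OCR-Bench": normalize(837, 1000), "TextVQA": 77.2, "DocVQA": 90.6,
    "AI2D": 82.4, "ChartQA": 85.4, "InfoVQA": 70.2,
}
print(round(sum(scores.values()) / len(scores), 1))
```

With these assumptions the NaViL-9B row averages to about 77.0, in line with the Avg column above.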

### Qualitative Analysis

By visualizing attention maps, we found that a sufficiently large visual encoder (following our joint scaling law) helps the model focus on global information in shallower layers and promotes earlier interaction between visual and text features, which explains the performance improvement.

<p align="center">
<img src="https://huggingface.co/OpenGVLab/NaViL/resolve/main/images/visualization_attention_matrix.png" alt="Attention Map Visualization" style="width: 100%; height: auto;" />
</p>

*Top: Using a 150M visual encoder; Bottom: Using a 1.2B visual encoder. The latter exhibits stronger global attention and cross-modal interaction even in shallow layers (Layer 1).*

## 🚀 Getting Started

```bash
# 1. Clone the repository
git clone https://github.com/OpenGVLab/NaViL.git
cd NaViL

# 2. Create and activate the conda environment
conda create -n navil python=3.10 -y
conda activate navil

# 3. Install dependencies
pip install -r requirements.txt

# 4. Run the inference demo

## 2B version
python -u demo.py --model_name_or_path OpenGVLab/NaViL-2B
## 9B version
python -u demo.py --model_name_or_path OpenGVLab/NaViL-9B
```

## ✨ Inference Example

Below is example code for multimodal question answering with NaViL using the `transformers` library.

> Please use `transformers==4.51.0` to ensure the model works correctly.

<details>
<summary>Inference Example Code (Click to expand)</summary>

```python
import torch
from transformers import AutoTokenizer, AutoModel
from PIL import Image


def anyres_preprocess_multi_scale(images, image_processor, max_pixels=-1, min_pixels=-1, scale_downsample_ratio=0.7071):
    """Visual multi-scale packing: encode each image at its full grid plus progressively coarser scales."""
    assert min_pixels > 0 and max_pixels > 0, 'min_pixels and max_pixels must be set'
    if not isinstance(images, list):
        images = [images]

    pixel_values_all, image_grid_thws_all, num_scales_all = [], [], []
    for image in images:
        ret = image_processor(image, return_tensors="pt", min_pixels=min_pixels, max_pixels=max_pixels)
        image_grid_thws = [ret['image_grid_thw'][0]]
        pixel_values = ret['pixel_values'].reshape(ret['image_grid_thw'].prod(), -1, image_processor.patch_size, image_processor.patch_size)

        # Repeatedly re-encode the image with a smaller pixel budget, prepending the coarser scales.
        while True:
            current_pixels = image_grid_thws[0].prod() * (image_processor.patch_size ** 2)
            max_pixels = current_pixels * (scale_downsample_ratio ** 2)
            if max_pixels < min_pixels:
                break
            ret = image_processor(image, return_tensors="pt", min_pixels=min_pixels, max_pixels=max_pixels)
            if ret['image_grid_thw'].prod() >= image_grid_thws[0].prod():
                break
            image_grid_thws.insert(0, ret['image_grid_thw'][0])
            pixel_values = torch.cat([ret['pixel_values'].reshape(ret['image_grid_thw'].prod(), -1, image_processor.patch_size, image_processor.patch_size), pixel_values], dim=0)

        pixel_values_all.append(pixel_values)
        image_grid_thws_all.extend(image_grid_thws)
        num_scales_all.append(len(image_grid_thws))
    pixel_values = torch.cat(pixel_values_all, dim=0)
    return pixel_values, image_grid_thws_all, num_scales_all


def load_image(
    image_files,
    image_processor,
    patch_size=16,
    max_num=24576,
    min_num=256,
    upscale=False,
    scale_downsample_ratio=0.7071,
):
    if not isinstance(image_files, list):
        image_files = [image_files]

    images = []
    for image_file in image_files:
        image = Image.open(image_file).convert('RGB')
        if upscale:
            image = image.resize((image.width * 2, image.height * 2), Image.BILINEAR)
        images.append(image)

    # Convert patch-count limits into pixel limits before multi-scale packing.
    min_pixels = min_num * (patch_size ** 2)
    max_pixels = max_num * (patch_size ** 2)
    pixel_values, image_grid_thws, num_scales = anyres_preprocess_multi_scale(
        images=images,
        image_processor=image_processor,
        max_pixels=max_pixels,
        min_pixels=min_pixels,
        scale_downsample_ratio=scale_downsample_ratio,
    )

    image_grid_thws = torch.stack(image_grid_thws)
    num_scales = torch.tensor(num_scales)
    return pixel_values, image_grid_thws, num_scales


def load_model_tokenizer(model_path):
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)

    device = torch.cuda.current_device()
    model = AutoModel.from_pretrained(
        model_path,
        low_cpu_mem_usage=True,
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,
        load_in_8bit=False
    ).eval()
    model.init_special_token_ids(tokenizer)

    # Re-tie word embeddings to avoid a size-mismatch issue when embeddings are tied.
    if hasattr(model.config, "tie_word_embeddings") and model.config.tie_word_embeddings:
        model.language_model.tie_weights()

    model = model.to(device)

    return model, tokenizer


def generate(message, model, tokenizer):
    image_num = len([x for x in message if x['type'] == 'image'])
    # Join all text segments into a single prompt.
    prompt = '\n'.join([x['value'] for x in message if x['type'] == 'text'])

    if image_num > 0:
        image_paths = [x['value'] for x in message if x['type'] == 'image']
        pixel_values, image_grid_thws, num_scales = load_image(
            image_paths,
            model.image_processor,
            max_num=model.config.max_dynamic_patch,
            min_num=model.config.min_dynamic_patch,
            patch_size=model.config.vision_config.patch_size,
            scale_downsample_ratio=model.config.scale_downsample_ratio,
        )
        pixel_values = pixel_values.cuda().to(torch.bfloat16)
        image_grid_thws = image_grid_thws.cuda()
        num_scales = num_scales.cuda()
    else:
        pixel_values, image_grid_thws, num_scales = None, None, None

    generation_config = dict(do_sample=False, max_new_tokens=1024, top_p=None, num_beams=1)
    with torch.no_grad():
        try:
            response = model.chat(
                tokenizer,
                pixel_values=pixel_values,
                question=prompt,
                generation_config=generation_config,
                verbose=True,
                anyres_image_size=True,
                num_patches_list=image_grid_thws,
                num_scales=num_scales,
            )
        except Exception as e:
            print(f"Error in model chat: {e}")
            raise e
    return response


# --- Main Program ---
# Select the model to load
# model_path = "OpenGVLab/NaViL-2B"
model_path = "OpenGVLab/NaViL-9B"

print(f"Loading model from {model_path}...")
model, tokenizer = load_model_tokenizer(model_path)

# Prepare the input message.
# The input format is a list of dictionaries, supporting multiple images and text segments.
message = [
    {"type": "image", "value": "./examples/image1.jpg"},
    {"type": "text", "value": "Please briefly describe the image."},
]

print("Generating response...")
response = generate(message, model, tokenizer)

print("\n=== Response ===")
print(response)
```

</details>
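
Since the message format supports multiple images and text segments, a hypothetical multi-image query (the file paths below are placeholders) can reuse the same `generate` helper from the example above:

```python
message = [
    {"type": "image", "value": "./examples/image1.jpg"},
    {"type": "image", "value": "./examples/image2.jpg"},  # placeholder path
    {"type": "text", "value": "What are the differences between these two images?"},
]
response = generate(message, model, tokenizer)
print(response)
```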

## ✍️ How to Cite

If you find NaViL or our findings useful in your research, please consider citing our paper:

```bibtex
@article{tian2025navil,
  title={NaViL: Rethinking Scaling Properties of Native Multimodal Large Language Models under Data Constraints},
  author={Tian, Changyao and Li, Hao and Luo, Gen and Zhu, Xizhou and Su, Weijie and Deng, Hanming and Zhu, Jinguo and Shao, Jie and Zhu, Ziran and Liu, Yunpeng and Lu, Lewei and Wang, Wenhai and Li, Hongsheng and Dai, Jifeng},
  journal={arXiv preprint arXiv:2510.08565},
  year={2025}
}
```