---
license: cc-by-nc-4.0
pipeline_tag: image-text-to-text
library_name: transformers
base_model:
- google/paligemma-3b-mix-448
- Qwen/Qwen2.5-0.5B-Instruct
- google/siglip-so400m-patch14-384
base_model_relation: merge
language:
- zho
- eng
- fra
- spa
- por
- deu
- ita
- rus
- jpn
- kor
- vie
- tha
- ara
tags:
- eagle
- VLM
---

# Eagle-2

[\[📂 GitHub\]](https://github.com/NVlabs/EAGLE) [\[📜 Eagle2 Tech Report\]](http://arxiv.org/abs/2501.14818)
[\[🤗 HF Demo\]](https://huggingface.co/spaces/nvidia/Eagle2-Demo)

## News
- We have updated the model architecture to `eagle_2_5_vl` to support the `generate` method.

## Introduction

We are thrilled to release our latest Eagle2 series of vision-language models. Open-source Vision-Language Models (VLMs) have made significant strides in narrowing the gap with proprietary models. However, critical details about data strategies and implementation are often missing, limiting reproducibility and innovation. In this project, we focus on VLM post-training from a data-centric perspective, sharing insights into building effective data strategies from scratch. By combining these strategies with robust training recipes and model design, we introduce Eagle2, a family of performant VLMs. Our work aims to empower the open-source community to develop competitive VLMs with transparent processes.

In this repo, we are open-sourcing Eagle2-1B, a compact and efficient model designed for scenarios that require fast inference and minimal computational resources, without compromising essential performance.

## Model Zoo
We provide the following models:

| Model | LLM | Vision | Max Length | HF Link |
| ----------- | ------- | --------- | - | - |
| Eagle2-1B | [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | SigLIP | 16K | [🤗 link](https://huggingface.co/NVIDIA/Eagle2-1B) |
| Eagle2-2B | [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | SigLIP | 16K | [🤗 link](https://huggingface.co/NVIDIA/Eagle2-2B) |
| Eagle2-9B | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) | SigLIP+ConvNeXt | 16K | [🤗 link](https://huggingface.co/NVIDIA/Eagle2-9B) |

## Benchmark Results
| Benchmark | LLaVA-OneVision-0.5B | InternVL2-1B | InternVL2.5-1B | Qwen2-VL-2B | Eagle2-1B |
| :--------------------------: | :------------------: | :----------------: | :----------: | :----------: | :----------: |
| DocVQA<sub>test</sub> | 70.0 | 81.7 | 84.8 | 90.1 | 81.8 |
| ChartQA<sub>test</sub> | 61.4 | 72.9 | 75.9 | 73.0 | 77.0 |
| InfoVQA<sub>test</sub> | 41.8 | 50.9 | 56.0 | 65.5 | 54.8 |
| TextVQA<sub>val</sub> | - | 70.0 | 72.0 | 79.7 | 76.6 |
| OCRBench | 565 | 754 | 785 | 809 | 767 |
| MME<sub>sum</sub> | 1438.0 | 1794.4 | 1950.5 | 1872.0 | 1790.2 |
| RealWorldQA | 55.6 | 50.3 | 57.5 | 62.6 | 55.4 |
| AI2D<sub>test</sub> | 57.1 | 64.1 | 69.3 | 74.7 | 70.9 |
| MMMU<sub>val</sub> | 31.4 | 36.7 | 40.9 | 41.1 | 38.8 |
| MMVet<sub>GPT-4-Turbo</sub> | 32.2 | 32.7 | 48.8 | 49.5 | 40.9 |
| HallBench<sub>avg</sub> | 27.9 | 34.0 | 39.0 | **41.7** | 35.3 |
| MathVista<sub>testmini</sub> | 33.8 | 37.7 | 43.2 | 43.0 | 45.3 |
| MMStar | 37.7 | 45.7 | 50.1 | 48.0 | 48.5 |

## Quick Start

We provide an [inference script](./demo.py) to help you get started with the model quickly. The following input types are supported:
- pure text input (see the text-only example below)
- single image input
- multiple image input
- video input

### Install the dependencies

```bash
pip install transformers
pip install flash-attn
```
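
### pure text input

The same pipeline also handles text-only chat. The snippet below is a minimal sketch that simply omits the visual entries from the messages; it assumes the processor accepts `images=None` and `videos=None` when no images or videos are present (see [demo.py](./demo.py) for the reference implementation).

```python
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

# A text-only conversation: no "image" or "video" entries in the content list.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Briefly explain what a vision-language model is."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
# With no visual content, images/videos are assumed to be None here.
inputs = processor(text=text_list, images=None, videos=None, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=512)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```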

### single image

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

# Load the model and processor (remote code is required for the Eagle2 architecture).
model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Build the prompt from the chat template and extract the visual inputs from the messages.
text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### stream generation

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel, AutoTokenizer
from transformers import TextIteratorStreamer
import torch
import threading

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, attn_implementation='flash_attention_2', torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")

# Stream tokens as they are generated: run generate() in a background thread
# and consume the streamer in the main thread.
streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

generation_kwargs = dict(
    **inputs,
    streamer=streamer,
    max_new_tokens=1024,
    do_sample=True,
    top_p=0.95,
    temperature=0.8
)
thread = threading.Thread(target=model.generate, kwargs=generation_kwargs)
thread.start()

for new_text in streamer:
    print(new_text, end="", flush=True)
thread.join()
```

### multiple images

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/[email protected]",
            },
            {"type": "text", "text": "Describe these two images."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs = processor.process_vision_info(messages)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### single video

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)

inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### multiple videos

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "../Eagle2-8B/space_woaudio.mp4",
                "nframes": 10,
            },
            {
                "type": "video",
                "video": "../Eagle2-8B/video_ocr.mp4",
                "nframes": 10,
            },
            {"type": "text", "text": "Describe these two videos respectively."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)]
image_inputs, video_inputs, video_kwargs = processor.process_vision_info(messages, return_video_kwargs=True)
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True, videos_kwargs=video_kwargs)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

### batch inference

```python
from PIL import Image
import requests
from transformers import AutoProcessor, AutoModel
import torch

model = AutoModel.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, torch_dtype=torch.bfloat16)
processor = AutoProcessor.from_pretrained("nvidia/Eagle2-1B", trust_remote_code=True, use_fast=True)
processor.tokenizer.padding_side = "left"

messages1 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.ilankelman.org/stopsigns/australia.jpg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

messages2 = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://www.nvidia.com/content/dam/en-zz/Solutions/about-nvidia/logo-and-brand/[email protected]",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

text_list = [processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) for messages in [messages1, messages2]]
image_inputs, video_inputs = processor.process_vision_info([messages1, messages2])
inputs = processor(text=text_list, images=image_inputs, videos=video_inputs, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")
model = model.to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=1024)
output_text = processor.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
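
Note that, as with other decoder-only models, `generate` returns the prompt tokens followed by the newly generated tokens, so the decoded strings above still contain the chat-template prompt. If you only want the model responses, a small sketch like the one below (reusing `inputs` and `generated_ids` from the batch example) slices off the prompt before decoding:

```python
# Keep only the newly generated tokens for each sample before decoding.
trimmed_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(inputs.input_ids, generated_ids)
]
response_text = processor.batch_decode(
    trimmed_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(response_text)
```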

## TODO
- [ ] Support vLLM Inference
- [ ] Provide AWQ Quantization Weights
- [ ] Provide fine-tuning scripts

## License/Terms of Use
- The code is released under the Apache 2.0 license, as found in the [LICENSE](https://huggingface.co/NVEagle/Eagle-X5-13B-Chat/blob/main/LICENSE) file.
- The pretrained model weights are released under the [Creative Commons Attribution-NonCommercial 4.0 International](https://spdx.org/licenses/CC-BY-NC-4.0) license.
- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
  - Model License of Qwen2.5-0.5B-Instruct: [Apache-2.0](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct/blob/main/LICENSE)
  - Model License of PaliGemma: [Gemma license](https://ai.google.dev/gemma/terms)

## Citation

## Ethical Considerations
NVIDIA believes Trustworthy AI is a shared responsibility, and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI concerns [here](https://www.nvidia.com/en-us/support/submit-security-vulnerability/).