---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - multimodal
  - vision-language
  - qwen3-vl
  - image-to-text
  - video-understanding
---

<!-- README Version: v1.0 -->

# Qwen3-VL-32B-Instruct

## Model Description

Qwen3-VL-32B-Instruct is a multimodal large language model developed by the Qwen team at Alibaba Cloud. A 33-billion-parameter dense model in the Qwen3-VL series, it delivers comprehensive upgrades across multiple dimensions: stronger text understanding and generation, deeper visual perception and reasoning, an extended context length, better comprehension of spatial relationships and video dynamics, and more capable agent interaction.

### Key Capabilities

- **Vision-Language Understanding**: Advanced multimodal reasoning combining visual and textual information
- **Visual Agent**: Operates PC and mobile GUIs; recognizes on-screen elements, understands their functions, invokes tools, and completes tasks
- **Visual Coding**: Generates Draw.io diagrams, HTML, CSS, and JavaScript from images and videos
- **Spatial Perception**: Judges object positions, viewpoints, and occlusions; provides 2D and 3D grounding for spatial reasoning
- **Video Understanding**: Processes and analyzes video content with temporal indexing and dynamics comprehension
- **Long Context**: Native 256K context window, expandable to 1 million tokens
- **Multilingual OCR**: Optical character recognition across 32 languages
- **STEM Reasoning**: Multimodal mathematical and scientific reasoning capabilities

## Repository Contents

**Note**: This directory is prepared for storing Qwen3-VL-32B-Instruct model files. Model files should be downloaded from the official Hugging Face repository.

### Expected Files (when downloaded):

```
qwen3-vl-32b-instruct/
├── config.json                    # Model configuration
├── generation_config.json         # Generation parameters
├── model-*.safetensors            # Model weight shards (multiple files)
├── model.safetensors.index.json   # Weight shard index
├── preprocessor_config.json       # Preprocessing configuration
├── tokenizer.json                 # Tokenizer vocabulary
├── tokenizer_config.json          # Tokenizer configuration
├── merges.txt                     # BPE merges
└── vocab.json                     # Vocabulary file
```
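
One way to fetch the official weights into this directory, assuming the Hugging Face CLI is installed (a sketch; adjust `--local-dir` to wherever this repository lives):

```bash
pip install -U "huggingface_hub[cli]"

# Download the official checkpoint into the local directory
huggingface-cli download Qwen/Qwen3-VL-32B-Instruct \
    --local-dir ./qwen3-vl-32b-instruct
```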

### Estimated Storage Requirements

- **Model Files**: ~65-70 GB (BF16 precision)
- **Total Repository**: ~70 GB

## Hardware Requirements

### Minimum Requirements
- **VRAM**: 80 GB GPU memory (A100 80GB or equivalent)
- **RAM**: 128 GB system memory
- **Disk Space**: 100 GB free space (for model files and cache)
- **GPU**: NVIDIA GPU with CUDA capability (A100, H100 recommended)

### Recommended Setup
- **Multi-GPU**: 2x A100 80GB or 4x A100 40GB for optimal performance
- **Flash Attention 2**: Strongly recommended for memory efficiency and speed
- **Mixed Precision**: BF16 or FP16 for reduced memory footprint
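
A quick way to confirm a machine meets these VRAM numbers before loading the model (a minimal sketch using PyTorch's standard device-query API):

```python
import torch

# List each visible GPU and its total memory; the BF16 checkpoint needs
# roughly 70 GB of aggregate VRAM plus headroom for activations and cache
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")
```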

### Performance Optimization
- Enable `flash_attention_2` for faster attention and lower memory use
- Use `torch.bfloat16` or automatic dtype selection
- Consider device mapping for multi-GPU setups
- Use gradient checkpointing for fine-tuning scenarios

## Usage Examples

### Installation

```bash
pip install transformers accelerate torch pillow
pip install flash-attn --no-build-isolation  # Optional but recommended
```
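
Qwen3-VL support landed in `transformers` relatively recently; a quick way to confirm the installed release includes it (a sketch):

```python
# Raises ImportError on transformers releases that predate Qwen3-VL support
from transformers import Qwen3VLForConditionalGeneration
import transformers

print(transformers.__version__)
```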

### Basic Usage with Transformers

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Model path - update with your local path
model_path = "E:/huggingface/qwen3-vl-32b-instruct"

# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Recommended
)

processor = AutoProcessor.from_pretrained(model_path)

# Example: Image understanding
image = Image.open("path/to/your/image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding=True
)
inputs = inputs.to(model.device)

# Generate response
output_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False
)

# Decode only the newly generated tokens (drop the echoed prompt)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(generated_text)
```

### Video Understanding Example

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
import cv2
import numpy as np

model_path = "E:/huggingface/qwen3-vl-32b-instruct"

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)

processor = AutoProcessor.from_pretrained(model_path)

# Load video frames
def load_video_frames(video_path, max_frames=16):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, max_frames, dtype=int)

    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))

    cap.release()
    return frames

# Process video
video_frames = load_video_frames("path/to/video.mp4")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Summarize what happens in this video."}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    videos=[video_frames],
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate
output_ids = model.generate(
    **inputs,
    max_new_tokens=2048
)

# Decode only the newly generated tokens (drop the echoed prompt)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]

print(response)
```

### Multi-Image Reasoning

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_path = "E:/huggingface/qwen3-vl-32b-instruct"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)

processor = AutoProcessor.from_pretrained(model_path)

# Load multiple images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg")
]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Compare these three images and explain the differences."}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=images,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
```

## Model Specifications

### Architecture Details

- **Model Type**: Multimodal Vision-Language Model
- **Parameters**: 33 billion
- **Architecture Innovations**:
  - **Interleaved-MRoPE**: Enhanced positional embeddings across temporal and spatial dimensions
  - **DeepStack**: Multi-level vision transformer feature fusion
  - **Text-Timestamp Alignment**: Precise video temporal grounding
- **Precision**: BF16 (Brain Float 16)
- **Format**: Safetensors
- **Context Window**: 256K tokens (native), expandable to 1M tokens
- **Max Output Tokens**:
  - Vision-language tasks: 16,384 tokens
  - Pure text tasks: 32,768 tokens
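
These figures can be checked against a downloaded checkpoint by inspecting its configuration (a sketch; exact field names vary across `transformers` versions):

```python
from transformers import AutoConfig

# Example local path; point this at your downloaded checkpoint
config = AutoConfig.from_pretrained("E:/huggingface/qwen3-vl-32b-instruct")

# The printed config includes hidden sizes, layer counts, rope/positional
# settings, and the maximum context length recorded in the checkpoint
print(config)
```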

### Supported Modalities

- **Input**: Text, Images (single/multiple), Video frames
- **Output**: Text with multimodal understanding and reasoning
- **Image Formats**: JPEG, PNG, WebP, and other common formats
- **Video Processing**: Frame-based with temporal indexing

### Languages Supported

- Primary: English, Chinese
- OCR Support: 32 languages including major European, Asian, and Middle Eastern languages

## Performance Tips and Optimization

### Memory Optimization

1. **Enable Flash Attention 2**:
   ```python
   model = Qwen3VLForConditionalGeneration.from_pretrained(
       model_path,
       attn_implementation="flash_attention_2"
   )
   ```

2. **Use Mixed Precision**:
   ```python
   model = Qwen3VLForConditionalGeneration.from_pretrained(
       model_path,
       torch_dtype=torch.bfloat16
   )
   ```

3. **Device Mapping for Multi-GPU**:
   ```python
   model = Qwen3VLForConditionalGeneration.from_pretrained(
       model_path,
       device_map="auto"  # Automatic distribution across GPUs
   )
   ```

4. **Gradient Checkpointing** (for fine-tuning):
   ```python
   model.gradient_checkpointing_enable()
   ```

### Inference Speed Optimization

- Use batch processing for multiple images when possible
- Preload and cache the model to avoid repeated loading
- Consider quantization (FP8, INT8) for production deployment; see the serving sketch after this list
- Utilize tensor parallelism for very large batch sizes
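
For production serving, an inference engine can combine quantized weights with tensor parallelism in a single command (a sketch, assuming a vLLM build with Qwen3-VL support; adjust the parallel size to your GPU count):

```bash
# Serves the FP8 variant listed under "Additional Variants" across two GPUs
vllm serve Qwen/Qwen3-VL-32B-Instruct-FP8 --tensor-parallel-size 2
```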

### Quality Optimization

- For complex reasoning tasks, increase `max_new_tokens`
- Use temperature sampling for creative tasks
- Adjust `top_p` and `top_k` for controlled generation
- Enable `do_sample=True` for more diverse outputs, as in the sketch below
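
A sampling setup along these lines works for creative tasks (the values are illustrative starting points, not officially recommended defaults; assumes `model` and `inputs` from the basic example above):

```python
# Sampling instead of greedy decoding for more diverse output
output_ids = model.generate(
    **inputs,
    max_new_tokens=2048,  # raise for long, complex reasoning
    do_sample=True,       # sample from the distribution instead of argmax
    temperature=0.7,      # higher -> more diverse, lower -> more deterministic
    top_p=0.9,            # nucleus sampling cutoff
    top_k=50,             # restrict sampling to the 50 most likely tokens
)
```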

## License

This model is released under the **Apache License 2.0**.

You are free to:
- Use the model commercially
- Modify and distribute the model
- Use the model for research purposes

Conditions:
- Preserve copyright and license notices
- State significant changes made to the model
- Include the license text with distributions

See the full license at: https://www.apache.org/licenses/LICENSE-2.0

## Citation

If you use Qwen3-VL-32B-Instruct in your research or applications, please cite:

```bibtex
@misc{qwen3vl2025,
  title={Qwen3-VL: The Most Powerful Vision-Language Model in the Qwen Series},
  author={Qwen Team},
  year={2025},
  howpublished={\url{https://github.com/QwenLM/Qwen3-VL}}
}
```

## Official Resources

- **Official Model**: [https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct)
- **GitHub Repository**: [https://github.com/QwenLM/Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
- **Documentation**: [https://huggingface.co/docs/transformers/model_doc/qwen3_vl](https://huggingface.co/docs/transformers/model_doc/qwen3_vl)
- **Model Collection**: [https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe](https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe)
- **Qwen Website**: [https://qwenlm.github.io](https://qwenlm.github.io)

## Additional Variants

- **Qwen3-VL-32B-Instruct-FP8**: Fine-grained FP8 quantized version for reduced memory usage
- **Qwen3-VL-32B-Instruct-GGUF**: GGUF format for llama.cpp compatibility
- **Qwen3-VL-2B-Instruct**: Smaller 2B parameter version for edge devices
- **Qwen3-VL-30B-A3B-Instruct**: MoE architecture variant

## Contact and Support

For questions, issues, or feedback:
- GitHub Issues: [https://github.com/QwenLM/Qwen3-VL/issues](https://github.com/QwenLM/Qwen3-VL/issues)
- Hugging Face Community: [https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct/discussions](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct/discussions)
