Qwen3-VL-32B-Instruct
Model Description
Qwen3-VL-32B-Instruct is a state-of-the-art multimodal large language model developed by the Qwen team at Alibaba Cloud. A 33-billion-parameter dense model in Qwen3-VL, the most powerful vision-language series in the Qwen family to date, it delivers comprehensive upgrades across multiple dimensions: superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced comprehension of spatial relations and video dynamics, and stronger agent interaction capabilities.
Key Capabilities
- Vision-Language Understanding: Advanced multimodal reasoning combining visual and textual information
- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks
- Visual Coding: Generates Draw.io diagrams, HTML, CSS, and JavaScript from images and videos
- Spatial Perception: Judges object positions, viewpoints, and occlusions; provides 2D and 3D grounding for spatial reasoning (see the prompt sketch after this list)
- Video Understanding: Processes and analyzes video content with temporal indexing and dynamics comprehension
- Long Context: Native 256K context window, expandable to 1 million tokens
- Multilingual OCR: Optical character recognition across 32 languages
- STEM Reasoning: Multimodal mathematical and scientific reasoning capabilities
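As a concrete illustration of the spatial-perception capability, the sketch below shows a hypothetical 2D-grounding request. The image path and question wording are placeholders, and the exact coordinate format the model returns is not guaranteed here; the messages structure matches the chat format used in the Usage Examples section further down.
from PIL import Image
# Hypothetical 2D-grounding prompt; the image path and wording are placeholders,
# and the returned box format should be verified against the official docs.
image = Image.open("path/to/scene.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Locate every person in this image and return a bounding box for each one."}
        ]
    }
]
# Run `messages` and `image` through the processor/generate pipeline shown
# under Usage Examples below.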
Repository Contents
Note: This directory is prepared for storing the Qwen3-VL-32B-Instruct model files, which should be downloaded from the official Hugging Face repository (a download sketch follows the file listing).
Expected Files (when downloaded):
qwen3-vl-32b-instruct/
├── config.json # Model configuration
├── generation_config.json # Generation parameters
├── model-*.safetensors # Model weight shards (multiple files)
├── model.safetensors.index.json # Weight shard index
├── preprocessor_config.json # Preprocessing configuration
├── tokenizer.json # Tokenizer vocabulary
├── tokenizer_config.json # Tokenizer configuration
├── merges.txt # BPE merges
└── vocab.json # Vocabulary file
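One way to populate this directory with the official files is the huggingface_hub client, as sketched below; the target path is assumed to match the model_path used in the examples later in this document.
from huggingface_hub import snapshot_download
# Download the official checkpoint into the local directory used by the
# usage examples; adjust local_dir to your own layout.
snapshot_download(
    repo_id="Qwen/Qwen3-VL-32B-Instruct",
    local_dir="E:/huggingface/qwen3-vl-32b-instruct",
)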
Estimated Storage Requirements
- Model Files: ~65-70 GB (BF16 precision)
- Total Repository: ~70 GB
Hardware Requirements
Minimum Requirements
- VRAM: 80 GB of GPU memory (A100 80GB or equivalent); a quick check script is sketched after this list
- RAM: 128 GB system memory
- Disk Space: 100 GB free space (for model files and cache)
- GPU: NVIDIA GPU with CUDA capability (A100, H100 recommended)
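The short PyTorch sketch below offers a rough check against these minimums before loading the model; it only reports GPU count and per-device memory and ignores activation and KV-cache overhead.
import torch
# Rough hardware check: list each visible GPU and its total memory in GB.
if not torch.cuda.is_available():
    print("No CUDA-capable GPU detected.")
else:
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")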
Recommended Setup
- Multi-GPU: 2x A100 80GB or 4x A100 40GB for optimal performance
- Flash Attention 2: Strongly recommended for memory efficiency and speed
- Mixed Precision: BF16 or FP16 for reduced memory footprint
Performance Optimization
- Enable `flash_attention_2` for better acceleration and memory savings
- Use `torch.bfloat16` or automatic dtype selection
- Consider device mapping for multi-GPU setups
- Use gradient checkpointing for fine-tuning scenarios
Usage Examples
Installation
pip install transformers accelerate torch pillow
pip install flash-attn --no-build-isolation # Optional but recommended
Basic Usage with Transformers
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
# Model path - update with your local path
model_path = "E:/huggingface/qwen3-vl-32b-instruct"
# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2" # Recommended
)
processor = AutoProcessor.from_pretrained(model_path)
# Example: Image understanding
image = Image.open("path/to/your/image.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Describe this image in detail."}
]
}
]
# Prepare inputs
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
text=[text],
images=[image],
return_tensors="pt",
padding=True
)
inputs = inputs.to(model.device)
# Generate response
output_ids = model.generate(
**inputs,
max_new_tokens=1024,
do_sample=False
)
# Decode only the newly generated tokens (strip the prompt portion)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
print(generated_text)
Video Understanding Example
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
import cv2
import numpy as np
model_path = "E:/huggingface/qwen3-vl-32b-instruct"
# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2"
)
processor = AutoProcessor.from_pretrained(model_path)
# Load video frames
def load_video_frames(video_path, max_frames=16):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, max_frames, dtype=int)
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))
    cap.release()
    return frames
# Process video
video_frames = load_video_frames("path/to/video.mp4")
messages = [
{
"role": "user",
"content": [
{"type": "video"},
{"type": "text", "text": "Summarize what happens in this video."}
]
}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
text=[text],
videos=[video_frames],
return_tensors="pt"
)
inputs = inputs.to(model.device)
# Generate
output_ids = model.generate(
**inputs,
max_new_tokens=2048
)
# Decode only the newly generated tokens
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]
print(response)
Multi-Image Reasoning
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
model_path = "E:/huggingface/qwen3-vl-32b-instruct"
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2"
)
processor = AutoProcessor.from_pretrained(model_path)
# Load multiple images
images = [
Image.open("image1.jpg"),
Image.open("image2.jpg"),
Image.open("image3.jpg")
]
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image"},
{"type": "image"},
{"type": "text", "text": "Compare these three images and explain the differences."}
]
}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
text=[text],
images=images,
return_tensors="pt"
)
inputs = inputs.to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
Model Specifications
Architecture Details
- Model Type: Multimodal Vision-Language Model
- Parameters: 33 billion
- Architecture Innovations:
- Interleaved-MRoPE: Enhanced positional embeddings across temporal and spatial dimensions
- DeepStack: Multi-level vision transformer feature fusion
- Text-Timestamp Alignment: Precise video temporal grounding
- Precision: BF16 (Brain Float 16)
- Format: Safetensors
- Context Window: 256K tokens (native), expandable to 1M tokens
- Max Output Tokens:
- Vision-language tasks: 16,384 tokens
- Pure text tasks: 32,768 tokens
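If the files listed under Repository Contents are already downloaded, these settings can be cross-checked against the local config.json. The sketch below only prints fields that are actually present, since exact key names depend on the transformers version that produced the file.
import json
# Inspect the local model configuration; exact key names vary by version,
# so only generic fields are printed here.
config_path = "E:/huggingface/qwen3-vl-32b-instruct/config.json"
with open(config_path, encoding="utf-8") as f:
    config = json.load(f)
print("Top-level keys:", sorted(config.keys()))
print("model_type:", config.get("model_type"))
print("architectures:", config.get("architectures"))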
Supported Modalities
- Input: Text, Images (single/multiple), Video frames
- Output: Text with multimodal understanding and reasoning
- Image Formats: JPEG, PNG, WebP, and other common formats
- Video Processing: Frame-based with temporal indexing
Languages Supported
- Primary: English, Chinese
- OCR Support: 32 languages including major European, Asian, and Middle Eastern languages
Performance Tips and Optimization
Memory Optimization
Enable Flash Attention 2:
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2"
)
Use Mixed Precision:
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16
)
Device Mapping for Multi-GPU:
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    device_map="auto"  # Automatic distribution across GPUs
)
Gradient Checkpointing (for fine-tuning):
model.gradient_checkpointing_enable()
Inference Speed Optimization
- Use batch processing for multiple images when possible
- Preload and cache the model to avoid repeated loading
- Consider quantization (FP8, INT8) for production deployment; an INT8 loading sketch follows this list
- Utilize tensor parallelism for very large batch sizes
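For the quantization point above, one option (among several) is 8-bit loading with bitsandbytes through transformers. This is a sketch under the assumption that bitsandbytes is installed and that the checkpoint loads cleanly in 8-bit; the official FP8 variant listed under Additional Variants is an alternative route.
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
model_path = "E:/huggingface/qwen3-vl-32b-instruct"
# 8-bit weight quantization via bitsandbytes; reduces VRAM at some quality cost.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)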
Quality Optimization
- For complex reasoning tasks, increase `max_new_tokens`
- Use temperature sampling for creative tasks
- Adjust `top_p` and `top_k` for controlled generation
- Enable `do_sample=True` for more diverse outputs (see the example after this list)
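To apply these settings, sampling can be enabled directly in the generate call used throughout the usage examples above; the values below are illustrative placeholders rather than tuned defaults from generation_config.json.
# Sampling-based generation for more diverse outputs; `model` and `inputs`
# come from the earlier examples, and the values here are placeholders.
output_ids = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
)
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)[0]
print(response)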
License
This model is released under the Apache License 2.0.
You are free to:
- Use the model commercially
- Modify and distribute the model
- Use the model for research purposes
Conditions:
- Preserve copyright and license notices
- State significant changes made to the model
- Include the license text with distributions
See the full license at: https://www.apache.org/licenses/LICENSE-2.0
Citation
If you use Qwen3-VL-32B-Instruct in your research or applications, please cite:
@article{qwen3vl2025,
title={Qwen3-VL: The Most Powerful Vision-Language Model in the Qwen Series},
author={Qwen Team},
journal={arXiv preprint},
year={2025},
institution={Alibaba Cloud}
}
Official Resources
- Official Model: https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct
- GitHub Repository: https://github.com/QwenLM/Qwen3-VL
- Documentation: https://huggingface.co/docs/transformers/model_doc/qwen3_vl
- Model Collection: https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe
- Qwen Website: https://qwenlm.github.io
Additional Variants
- Qwen3-VL-32B-Instruct-FP8: Fine-grained FP8 quantized version for reduced memory usage
- Qwen3-VL-32B-Instruct-GGUF: GGUF format for llama.cpp compatibility
- Qwen3-VL-2B-Instruct: Smaller 2B parameter version for edge devices
- Qwen3-VL-30B-A3B-Instruct: MoE architecture variant
Contact and Support
For questions, issues, or feedback:
- GitHub Issues: https://github.com/QwenLM/Qwen3-VL/issues
- Hugging Face Community: https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct/discussions