Qwen3-VL-32B-Instruct
Model Description
Qwen3-VL-32B-Instruct is a state-of-the-art multimodal large language model developed by the Qwen team at Alibaba Cloud. A 33-billion-parameter dense model in Qwen3-VL, the most powerful vision-language series in the Qwen family to date, it delivers comprehensive upgrades across multiple dimensions: superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced comprehension of spatial relations and video dynamics, and stronger agent interaction capabilities.
Key Capabilities
- Vision-Language Understanding: Advanced multimodal reasoning combining visual and textual information
- Visual Agent: Operates PC/mobile GUIs—recognizes elements, understands functions, invokes tools, completes tasks
- Visual Coding: Generates Draw.io diagrams, HTML, CSS, and JavaScript from images and videos
- Spatial Perception: Judges object positions, viewpoints, and occlusions; provides 2D and 3D grounding for spatial reasoning (see the prompt sketch after this list)
- Video Understanding: Processes and analyzes video content with temporal indexing and dynamics comprehension
- Long Context: Native 256K context window, expandable to 1 million tokens
- Multilingual OCR: Optical character recognition across 32 languages
- STEM Reasoning: Multimodal mathematical and scientific reasoning capabilities
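As a concrete illustration of the spatial-perception capability, the sketch below shows a hypothetical 2D-grounding request. The image path and question wording are placeholders, and the exact coordinate format the model returns is not guaranteed here; the messages structure matches the chat format used in the Usage Examples section further down.
from PIL import Image
# Hypothetical 2D-grounding prompt; the image path and wording are placeholders,
# and the returned box format should be verified against the official docs.
image = Image.open("path/to/scene.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Locate every person in this image and return a bounding box for each one."}
        ]
    }
]
# Run `messages` and `image` through the processor/generate pipeline shown
# under Usage Examples below.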
Repository Contents
Note: This directory is prepared for storing the Qwen3-VL-32B-Instruct model files, which should be downloaded from the official Hugging Face repository (a download sketch follows the file listing).
Expected Files (when downloaded):
qwen3-vl-32b-instruct/
├── config.json # Model configuration
├── generation_config.json # Generation parameters
├── model-*.safetensors # Model weight shards (multiple files)
├── model.safetensors.index.json # Weight shard index
├── preprocessor_config.json # Preprocessing configuration
├── tokenizer.json # Tokenizer vocabulary
├── tokenizer_config.json # Tokenizer configuration
├── merges.txt # BPE merges
└── vocab.json # Vocabulary file
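One way to populate this directory with the official files is the huggingface_hub client, as sketched below; the target path is assumed to match the model_path used in the examples later in this document.
from huggingface_hub import snapshot_download
# Download the official checkpoint into the local directory used by the
# usage examples; adjust local_dir to your own layout.
snapshot_download(
    repo_id="Qwen/Qwen3-VL-32B-Instruct",
    local_dir="E:/huggingface/qwen3-vl-32b-instruct",
)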
Estimated Storage Requirements
- Model Files: ~65-70 GB (BF16 precision)
- Total Repository: ~70 GB
Hardware Requirements
Minimum Requirements
- VRAM: 80 GB of GPU memory (A100 80GB or equivalent); a quick check script is sketched after this list
- RAM: 128 GB system memory
- Disk Space: 100 GB free space (for model files and cache)
- GPU: NVIDIA GPU with CUDA capability (A100, H100 recommended)
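The short PyTorch sketch below offers a rough check against these minimums before loading the model; it only reports GPU count and per-device memory and ignores activation and KV-cache overhead.
import torch
# Rough hardware check: list each visible GPU and its total memory in GB.
if not torch.cuda.is_available():
    print("No CUDA-capable GPU detected.")
else:
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")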
Recommended Setup
- Multi-GPU: 2x A100 80GB or 4x A100 40GB for optimal performance
- Flash Attention 2: Strongly recommended for memory efficiency and speed
- Mixed Precision: BF16 or FP16 for reduced memory footprint
Performance Optimization
- Enable `flash_attention_2` for better acceleration and memory savings
- Use `torch.bfloat16` or automatic dtype selection
- Consider device mapping for multi-GPU setups
- Use gradient checkpointing for fine-tuning scenarios
Usage Examples
Installation
pip install transformers accelerate torch pillow
pip install flash-attn --no-build-isolation # Optional but recommended
Basic Usage with Transformers
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
# Model path - update with your local path
model_path = "E:/huggingface/qwen3-vl-32b-instruct"
# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2" # Recommended
)
processor = AutoProcessor.from_pretrained(model_path)
# Example: Image understanding
image = Image.open("path/to/your/image.jpg")
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Describe this image in detail."}
]
}
]
# Prepare inputs
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
text=[text],
images=[image],
return_tensors="pt",
padding=True
)
inputs = inputs.to(model.device)
# Generate response
output_ids = model.generate(
**inputs,
max_new_tokens=1024,
do_sample=False
)
# Decode only the newly generated tokens (strip the prompt portion)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
print(generated_text)
Video Understanding Example
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
import cv2
import numpy as np
model_path = "E:/huggingface/qwen3-vl-32b-instruct"
# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2"
)
processor = AutoProcessor.from_pretrained(model_path)
# Load video frames
def load_video_frames(video_path, max_frames=16):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, max_frames, dtype=int)
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))
    cap.release()
    return frames
# Process video
video_frames = load_video_frames("path/to/video.mp4")
messages = [
{
"role": "user",
"content": [
{"type": "video"},
{"type": "text", "text": "Summarize what happens in this video."}
]
}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
text=[text],
videos=[video_frames],
return_tensors="pt"
)
inputs = inputs.to(model.device)
# Generate
output_ids = model.generate(
**inputs,
max_new_tokens=2048
)
# Decode only the newly generated tokens
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]
print(response)
Multi-Image Reasoning
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
model_path = "E:/huggingface/qwen3-vl-32b-instruct"
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_path,
torch_dtype=torch.bfloat16,
device_map="auto",
attn_implementation="flash_attention_2"
)
processor = AutoProcessor.from_pretrained(model_path)
# Load multiple images
images = [
Image.open("image1.jpg"),
Image.open("image2.jpg"),
Image.open("image3.jpg")
]
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "image"},
{"type": "image"},
{"type": "text", "text": "Compare these three images and explain the differences."}
]
}
]
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
text=[text],
images=images,
return_tensors="pt"
)
inputs = inputs.to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
Model Specifications
Architecture Details
- Model Type: Multimodal Vision-Language Model
- Parameters: 33 billion
- Architecture Innovations:
- Interleaved-MRoPE: Enhanced positional embeddings across temporal and spatial dimensions
- DeepStack: Multi-level vision transformer feature fusion
- Text-Timestamp Alignment: Precise video temporal grounding
- Precision: BF16 (Brain Float 16)
- Format: Safetensors
- Context Window: 256K tokens (native), expandable to 1M tokens
- Max Output Tokens:
- Vision-language tasks: 16,384 tokens
- Pure text tasks: 32,768 tokens
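If the files listed under Repository Contents are already downloaded, these settings can be cross-checked against the local config.json. The sketch below only prints fields that are actually present, since exact key names depend on the transformers version that produced the file.
import json
# Inspect the local model configuration; exact key names vary by version,
# so only generic fields are printed here.
config_path = "E:/huggingface/qwen3-vl-32b-instruct/config.json"
with open(config_path, encoding="utf-8") as f:
    config = json.load(f)
print("Top-level keys:", sorted(config.keys()))
print("model_type:", config.get("model_type"))
print("architectures:", config.get("architectures"))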
Supported Modalities
- Input: Text, Images (single/multiple), Video frames
- Output: Text with multimodal understanding and reasoning
- Image Formats: JPEG, PNG, WebP, and other common formats
- Video Processing: Frame-based with temporal indexing
Languages Supported
- Primary: English, Chinese
- OCR Support: 32 languages including major European, Asian, and Middle Eastern languages
Performance Tips and Optimization
Memory Optimization
Enable Flash Attention 2:
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    attn_implementation="flash_attention_2"
)
Use Mixed Precision:
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16
)
Device Mapping for Multi-GPU:
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    device_map="auto"  # Automatic distribution across GPUs
)
Gradient Checkpointing (for fine-tuning):
model.gradient_checkpointing_enable()
Inference Speed Optimization
- Use batch processing for multiple images when possible
- Preload and cache the model to avoid repeated loading
- Consider quantization (FP8, INT8) for production deployment; an INT8 loading sketch follows this list
- Utilize tensor parallelism for very large batch sizes
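For the quantization point above, one option (among several) is 8-bit loading with bitsandbytes through transformers. This is a sketch under the assumption that bitsandbytes is installed and that the checkpoint loads cleanly in 8-bit; the official FP8 variant listed under Additional Variants is an alternative route.
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
model_path = "E:/huggingface/qwen3-vl-32b-instruct"
# 8-bit weight quantization via bitsandbytes; reduces VRAM at some quality cost.
quant_config = BitsAndBytesConfig(load_in_8bit=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)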
Quality Optimization
- For complex reasoning tasks, increase `max_new_tokens`
- Use temperature sampling for creative tasks
- Adjust `top_p` and `top_k` for controlled generation
- Enable `do_sample=True` for more diverse outputs (see the example after this list)
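To apply these settings, sampling can be enabled directly in the generate call used throughout the usage examples above; the values below are illustrative placeholders rather than tuned defaults from generation_config.json.
# Sampling-based generation for more diverse outputs; `model` and `inputs`
# come from the earlier examples, and the values here are placeholders.
output_ids = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
)
response = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)[0]
print(response)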
License
This model is released under the Apache License 2.0.
You are free to:
- Use the model commercially
- Modify and distribute the model
- Use the model for research purposes
Conditions:
- Preserve copyright and license notices
- State significant changes made to the model
- Include the license text with distributions
See the full license at: https://www.apache.org/licenses/LICENSE-2.0
Citation
If you use Qwen3-VL-32B-Instruct in your research or applications, please cite:
@article{qwen3vl2025,
title={Qwen3-VL: The Most Powerful Vision-Language Model in the Qwen Series},
author={Qwen Team},
journal={arXiv preprint},
year={2025},
institution={Alibaba Cloud}
}
Official Resources
- Official Model: https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct
- GitHub Repository: https://github.com/QwenLM/Qwen3-VL
- Documentation: https://huggingface.co/docs/transformers/model_doc/qwen3_vl
- Model Collection: https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe
- Qwen Website: https://qwenlm.github.io
Additional Variants
- Qwen3-VL-32B-Instruct-FP8: Fine-grained FP8 quantized version for reduced memory usage
- Qwen3-VL-32B-Instruct-GGUF: GGUF format for llama.cpp compatibility
- Qwen3-VL-2B-Instruct: Smaller 2B parameter version for edge devices
- Qwen3-VL-30B-A3B-Instruct: MoE architecture variant
Contact and Support
For questions, issues, or feedback:
- GitHub Issues: https://github.com/QwenLM/Qwen3-VL/issues
- Hugging Face Community: https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct/discussions