Qwen3-VL-32B-Instruct

Model Description

Qwen3-VL-32B-Instruct is a state-of-the-art multimodal large language model developed by the Qwen team at Alibaba Cloud. With roughly 33 billion parameters, it belongs to Qwen3-VL, the most powerful vision-language series released by Qwen to date, and delivers comprehensive upgrades across multiple dimensions: superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video-dynamics comprehension, and stronger agent interaction capabilities.

Key Capabilities

  • Vision-Language Understanding: Advanced multimodal reasoning combining visual and textual information
  • Visual Agent: Operates PC and mobile GUIs by recognizing interface elements, understanding their functions, invoking tools, and completing tasks
  • Visual Coding: Generates Draw.io diagrams, HTML, CSS, and JavaScript from images and videos
  • Spatial Perception: Judges object positions, viewpoints, and occlusions; provides 2D grounding and 3D grounding for spatial reasoning
  • Video Understanding: Processes and analyzes video content with temporal indexing and dynamics comprehension
  • Long Context: Native 256K context window, expandable to 1 million tokens
  • Multilingual OCR: Optical character recognition across 32 languages
  • STEM Reasoning: Multimodal mathematical and scientific reasoning capabilities

Repository Contents

Note: This directory is prepared for storing Qwen3-VL-32B-Instruct model files. Model files should be downloaded from the official Hugging Face repository (see the download sketch after the file listing below).

Expected Files (when downloaded):

qwen3-vl-32b-instruct/
├── config.json                    # Model configuration
├── generation_config.json         # Generation parameters
├── model-*.safetensors           # Model weight shards (multiple files)
├── model.safetensors.index.json  # Weight shard index
├── preprocessor_config.json      # Preprocessing configuration
├── tokenizer.json                # Tokenizer vocabulary
├── tokenizer_config.json         # Tokenizer configuration
├── merges.txt                    # BPE merges
└── vocab.json                    # Vocabulary file
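
A hedged way to populate this directory is with huggingface_hub; the repo id and local path below are assumptions, so adjust them to your environment.

# Download sketch (assumed repo id and local path)
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-VL-32B-Instruct",              # assumed official repository id
    local_dir="E:/huggingface/qwen3-vl-32b-instruct",  # matches the path used in the usage examples
)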

Estimated Storage Requirements

  • Model Files: ~65-70 GB (BF16 precision)
  • Total Repository: ~70 GB

Hardware Requirements

Minimum Requirements

  • VRAM: 80 GB GPU memory (A100 80GB or equivalent)
  • RAM: 128 GB system memory
  • Disk Space: 100 GB free space (for model files and cache)
  • GPU: NVIDIA GPU with CUDA capability (A100, H100 recommended); a quick environment check is sketched after this list
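
Before downloading, a minimal check of GPU memory and free disk space can be run with standard PyTorch and the Python standard library; the thresholds simply mirror the figures above, and the path is an assumption.

import shutil
import torch

# Total GPU memory across visible devices (80 GB+ recommended for BF16 inference)
if torch.cuda.is_available():
    total_vram_gb = sum(
        torch.cuda.get_device_properties(i).total_memory
        for i in range(torch.cuda.device_count())
    ) / 1e9
    print(f"Total GPU memory: {total_vram_gb:.1f} GB")
else:
    print("No CUDA-capable GPU detected")

# Free disk space at the intended storage location (100 GB+ recommended)
free_gb = shutil.disk_usage("E:/huggingface").free / 1e9  # assumed path, adjust to your drive
print(f"Free disk space: {free_gb:.1f} GB")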

Recommended Setup

  • Multi-GPU: 2x A100 80GB or 4x A100 40GB for optimal performance
  • Flash Attention 2: Strongly recommended for memory efficiency and speed
  • Mixed Precision: BF16 or FP16 for reduced memory footprint

Performance Optimization

  • Enable flash_attention_2 for faster inference and lower memory usage
  • Use torch.bfloat16 or automatic dtype selection
  • Consider device mapping for multi-GPU setups
  • Use gradient checkpointing for fine-tuning scenarios

Usage Examples

Installation

pip install transformers accelerate torch pillow
pip install opencv-python                    # Needed for the video example below
pip install flash-attn --no-build-isolation  # Optional but recommended

Basic Usage with Transformers

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Model path - update with your local path
model_path = "E:/huggingface/qwen3-vl-32b-instruct"

# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Recommended
)

processor = AutoProcessor.from_pretrained(model_path)

# Example: Image understanding
image = Image.open("path/to/your/image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding=True
)
inputs = inputs.to(model.device)

# Generate response
output_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False
)

# Decode only the newly generated tokens (strip the echoed prompt)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(generated_text)

Video Understanding Example

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch
import cv2
import numpy as np

model_path = "E:/huggingface/qwen3-vl-32b-instruct"

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)

processor = AutoProcessor.from_pretrained(model_path)

# Load video frames
def load_video_frames(video_path, max_frames=16):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, max_frames, dtype=int)

    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))

    cap.release()
    return frames

# Process video
video_frames = load_video_frames("path/to/video.mp4")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Summarize what happens in this video."}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    videos=[video_frames],
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate
output_ids = model.generate(
    **inputs,
    max_new_tokens=2048
)

# Decode only the newly generated tokens (strip the echoed prompt)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]

print(response)

Multi-Image Reasoning

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_path = "E:/huggingface/qwen3-vl-32b-instruct"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)

processor = AutoProcessor.from_pretrained(model_path)

# Load multiple images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg")
]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Compare these three images and explain the differences."}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=images,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens (strip the echoed prompt)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)

Model Specifications

Architecture Details

  • Model Type: Multimodal Vision-Language Model
  • Parameters: 33 billion
  • Architecture Innovations:
    • Interleaved-MRoPE: Enhanced positional embeddings across temporal and spatial dimensions
    • DeepStack: Multi-level vision transformer feature fusion
    • Text-Timestamp Alignment: Precise video temporal grounding
  • Precision: BF16 (Brain Float 16)
  • Format: Safetensors
  • Context Window: 256K tokens (native), expandable to 1M tokens
  • Max Output Tokens:
    • Vision-language tasks: 16,384 tokens
    • Pure text tasks: 32,768 tokens

Supported Modalities

  • Input: Text, Images (single/multiple), Video frames
  • Output: Text with multimodal understanding and reasoning
  • Image Formats: JPEG, PNG, WebP, and other common formats
  • Video Processing: Frame-based with temporal indexing

Languages Supported

  • Primary: English, Chinese
  • OCR Support: 32 languages including major European, Asian, and Middle Eastern languages

Performance Tips and Optimization

Memory Optimization

  1. Enable Flash Attention 2:

    model = Qwen3VLForConditionalGeneration.from_pretrained(
        model_path,
        attn_implementation="flash_attention_2"
    )
    
  2. Use Mixed Precision:

    model = Qwen3VLForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=torch.bfloat16
    )
    
  3. Device Mapping for Multi-GPU:

    model = Qwen3VLForConditionalGeneration.from_pretrained(
        model_path,
        device_map="auto"  # Automatic distribution across GPUs
    )
    
  4. Gradient Checkpointing (for fine-tuning):

    model.gradient_checkpointing_enable()
    

Inference Speed Optimization

  • Use batch processing for multiple images when possible
  • Preload and cache the model to avoid repeated loading
  • Consider quantization (FP8, INT8) for production deployment; an INT8 loading sketch follows this list
  • Utilize tensor parallelism for very large batch sizes
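
As a sketch of the quantization route, the checkpoint can be loaded in 8-bit through bitsandbytes; this assumes bitsandbytes is installed and supports this architecture, and output quality should be validated against the BF16 baseline.

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

model_path = "E:/huggingface/qwen3-vl-32b-instruct"

# INT8 weight loading via bitsandbytes (`pip install bitsandbytes`); roughly halves VRAM vs BF16
quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)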

Quality Optimization

  • For complex reasoning tasks, increase max_new_tokens
  • Use temperature sampling for creative tasks
  • Adjust top_p and top_k for controlled generation
  • Enable do_sample=True for more diverse outputs (see the sampling sketch after this list)
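
For example, a sampling configuration along these lines can be passed to generate; the values are illustrative rather than official defaults.

output_ids = model.generate(
    **inputs,
    max_new_tokens=2048,  # raise for complex reasoning tasks
    do_sample=True,       # sample instead of greedy decoding
    temperature=0.7,      # illustrative; higher values give more diverse outputs
    top_p=0.9,            # nucleus sampling threshold
    top_k=50              # restrict sampling to the 50 most likely tokens
)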

License

This model is released under the Apache License 2.0.

You are free to:

  • Use the model commercially
  • Modify and distribute the model
  • Use the model for research purposes

Conditions:

  • Preserve copyright and license notices
  • State significant changes made to the model
  • Include the license text with distributions

See the full license at: https://www.apache.org/licenses/LICENSE-2.0

Citation

If you use Qwen3-VL-32B-Instruct in your research or applications, please cite:

@article{qwen3vl2025,
  title={Qwen3-VL: The Most Powerful Vision-Language Model in the Qwen Series},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2025},
  institution={Alibaba Cloud}
}

Official Resources

Additional Variants

  • Qwen3-VL-32B-Instruct-FP8: Fine-grained FP8 quantized version for reduced memory usage
  • Qwen3-VL-32B-Instruct-GGUF: GGUF format for llama.cpp compatibility
  • Qwen3-VL-2B-Instruct: Smaller 2B parameter version for edge devices
  • Qwen3-VL-30B-A3B-Instruct: MoE architecture variant

Contact and Support

For questions, issues, or feedback:


Generated with Claude Code - Professional model documentation for local Hugging Face repositories.
