---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- multimodal
- vision-language
- qwen3-vl
- image-to-text
- video-understanding
---

# Qwen3-VL-32B-Instruct

## Model Description

Qwen3-VL-32B-Instruct is a state-of-the-art multimodal large language model developed by the Qwen team at Alibaba Cloud. With 33 billion parameters, it is the most powerful dense vision-language model in the Qwen series to date. The model delivers comprehensive upgrades across multiple dimensions: superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities.

### Key Capabilities

- **Vision-Language Understanding**: Advanced multimodal reasoning that combines visual and textual information
- **Visual Agent**: Operates PC and mobile GUIs; recognizes elements, understands their functions, invokes tools, and completes tasks
- **Visual Coding**: Generates Draw.io diagrams, HTML, CSS, and JavaScript from images and videos
- **Spatial Perception**: Judges object positions, viewpoints, and occlusions; provides 2D and 3D grounding for spatial reasoning
- **Video Understanding**: Processes and analyzes video content with temporal indexing and dynamics comprehension
- **Long Context**: Native 256K-token context window, expandable to 1 million tokens
- **Multilingual OCR**: Optical character recognition across 32 languages
- **STEM Reasoning**: Multimodal mathematical and scientific reasoning

## Repository Contents

**Note**: This directory is prepared for storing the Qwen3-VL-32B-Instruct model files. Model files should be downloaded from the official Hugging Face repository; see the download sketch after the storage estimates below.

### Expected Files (when downloaded)

```
qwen3-vl-32b-instruct/
├── config.json                   # Model configuration
├── generation_config.json        # Generation parameters
├── model-*.safetensors           # Model weight shards (multiple files)
├── model.safetensors.index.json  # Weight shard index
├── preprocessor_config.json      # Preprocessing configuration
├── tokenizer.json                # Tokenizer vocabulary
├── tokenizer_config.json         # Tokenizer configuration
├── merges.txt                    # BPE merges
└── vocab.json                    # Vocabulary file
```

### Estimated Storage Requirements

- **Model Files**: ~65-70 GB (BF16 precision)
- **Total Repository**: ~70 GB
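The snippet below is a minimal sketch of one way to populate this directory using `huggingface_hub.snapshot_download`; the `local_dir` value is an assumption and should be adjusted to your own storage path.

```python
# Hedged sketch: download the official checkpoint into a local directory.
# The local_dir path is an assumption; point it at your own storage location.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="Qwen/Qwen3-VL-32B-Instruct",  # official Hugging Face repository
    local_dir="qwen3-vl-32b-instruct",     # assumed local target directory
)
```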
## Hardware Requirements

### Minimum Requirements

- **VRAM**: 80 GB GPU memory (A100 80GB or equivalent)
- **RAM**: 128 GB system memory
- **Disk Space**: 100 GB free space (model files plus cache)
- **GPU**: NVIDIA GPU with CUDA support (A100 or H100 recommended)

### Recommended Setup

- **Multi-GPU**: 2x A100 80GB or 4x A100 40GB for optimal performance
- **Flash Attention 2**: Strongly recommended for memory efficiency and speed
- **Mixed Precision**: BF16 or FP16 for a reduced memory footprint

### Performance Optimization

- Enable `flash_attention_2` for faster inference and lower memory use
- Use `torch.bfloat16` or automatic dtype selection
- Consider device mapping for multi-GPU setups
- Use gradient checkpointing for fine-tuning scenarios

## Usage Examples

### Installation

```bash
pip install transformers accelerate torch pillow
pip install flash-attn --no-build-isolation  # Optional but recommended
```

### Basic Usage with Transformers

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Model path - update with your local path
model_path = "E:/huggingface/qwen3-vl-32b-instruct"

# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Recommended
)
processor = AutoProcessor.from_pretrained(model_path)

# Example: Image understanding
image = Image.open("path/to/your/image.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding=True
)
inputs = inputs.to(model.device)

# Generate response
output_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False
)

# Decode output
generated_text = processor.batch_decode(
    output_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]
print(generated_text)
```

### Video Understanding Example

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image  # needed for Image.fromarray below
import torch
import cv2
import numpy as np

model_path = "E:/huggingface/qwen3-vl-32b-instruct"

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
processor = AutoProcessor.from_pretrained(model_path)

# Load video frames, sampled uniformly across the clip
def load_video_frames(video_path, max_frames=16):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, max_frames, dtype=int)
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))
    cap.release()
    return frames

# Process video
video_frames = load_video_frames("path/to/video.mp4")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Summarize what happens in this video."}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    videos=[video_frames],
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate
output_ids = model.generate(
    **inputs,
    max_new_tokens=2048
)

response = processor.batch_decode(
    output_ids,
    skip_special_tokens=True
)[0]
print(response)
```

### Multi-Image Reasoning

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_path = "E:/huggingface/qwen3-vl-32b-instruct"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)
processor = AutoProcessor.from_pretrained(model_path)

# Load multiple images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg")
]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Compare these three images and explain the differences."}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=images,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
response = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(response)
```

## Model Specifications

### Architecture Details

- **Model Type**: Multimodal vision-language model
- **Parameters**: 33 billion
- **Architecture Innovations**:
  - **Interleaved-MRoPE**: Enhanced positional embeddings across temporal and spatial dimensions
  - **DeepStack**: Multi-level vision transformer feature fusion
  - **Text-Timestamp Alignment**: Precise video temporal grounding
- **Precision**: BF16 (Brain Float 16)
- **Format**: Safetensors
- **Context Window**: 256K tokens (native), expandable to 1M tokens
- **Max Output Tokens**:
  - Vision-language tasks: 16,384 tokens
  - Pure text tasks: 32,768 tokens

### Supported Modalities

- **Input**: Text, images (single or multiple), video frames
- **Output**: Text with multimodal understanding and reasoning
- **Image Formats**: JPEG, PNG, WebP, and other common formats
- **Video Processing**: Frame-based with temporal indexing

### Languages Supported

- Primary: English, Chinese
- OCR support: 32 languages, including major European, Asian, and Middle Eastern languages

## Performance Tips and Optimization

### Memory Optimization

1. **Enable Flash Attention 2**:
   ```python
   model = Qwen3VLForConditionalGeneration.from_pretrained(
       model_path,
       attn_implementation="flash_attention_2"
   )
   ```
2. **Use Mixed Precision**:
   ```python
   model = Qwen3VLForConditionalGeneration.from_pretrained(
       model_path,
       torch_dtype=torch.bfloat16
   )
   ```
3. **Device Mapping for Multi-GPU**:
   ```python
   model = Qwen3VLForConditionalGeneration.from_pretrained(
       model_path,
       device_map="auto"  # Automatic distribution across GPUs
   )
   ```
4. **Gradient Checkpointing** (for fine-tuning):
   ```python
   model.gradient_checkpointing_enable()
   ```

### Inference Speed Optimization

- Use batch processing for multiple images when possible
- Preload and cache the model to avoid repeated loading
- Consider quantization (FP8, INT8) for production deployment (see the sketch after this list)
- Use tensor parallelism for very large batch sizes
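As a rough illustration of the quantization point above, the sketch below loads the model in 8-bit with bitsandbytes through transformers' `BitsAndBytesConfig`. This is an assumed setup, not an officially recommended configuration; the FP8 checkpoint or a dedicated serving engine may be preferable in production.

```python
# Hedged sketch: 8-bit loading via bitsandbytes (assumes `pip install bitsandbytes`).
# The official FP8 variant or a serving stack may suit production better.
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
import torch

model_path = "E:/huggingface/qwen3-vl-32b-instruct"  # same local path as above

quant_config = BitsAndBytesConfig(load_in_8bit=True)

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    quantization_config=quant_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # dtype for the non-quantized modules
)
processor = AutoProcessor.from_pretrained(model_path)
```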
### Quality Optimization

- For complex reasoning tasks, increase `max_new_tokens`
- Use temperature sampling for creative tasks
- Adjust `top_p` and `top_k` for controlled generation
- Enable `do_sample=True` for more diverse outputs (see the sketch after this list)
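The call below is a small illustrative sketch of sampled generation using the knobs listed above; the specific values are assumptions to tune per task, and it reuses `model`, `processor`, and `inputs` from the Basic Usage example.

```python
# Illustrative sampling settings; values are assumptions, not tuned recommendations.
# Reuses `model`, `inputs`, and `processor` from the Basic Usage example above.
output_ids = model.generate(
    **inputs,
    max_new_tokens=2048,  # allow longer answers for complex reasoning
    do_sample=True,       # enable sampling for more diverse outputs
    temperature=0.7,      # softens the distribution for creative tasks
    top_p=0.9,            # nucleus sampling cutoff
    top_k=50,             # restrict sampling to the 50 most likely tokens
)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```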
## License

This model is released under the **Apache License 2.0**.

You are free to:

- Use the model commercially
- Modify and distribute the model
- Use the model for research purposes

Conditions:

- Preserve copyright and license notices
- State significant changes made to the model
- Include the license text with distributions

See the full license at: https://www.apache.org/licenses/LICENSE-2.0

## Citation

If you use Qwen3-VL-32B-Instruct in your research or applications, please cite:

```bibtex
@article{qwen3vl2025,
  title={Qwen3-VL: The Most Powerful Vision-Language Model in the Qwen Series},
  author={Qwen Team},
  journal={arXiv preprint},
  year={2025},
  institution={Alibaba Cloud}
}
```

## Official Resources

- **Official Model**: [https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct)
- **GitHub Repository**: [https://github.com/QwenLM/Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
- **Documentation**: [https://huggingface.co/docs/transformers/model_doc/qwen3_vl](https://huggingface.co/docs/transformers/model_doc/qwen3_vl)
- **Model Collection**: [https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe](https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe)
- **Qwen Website**: [https://qwenlm.github.io](https://qwenlm.github.io)

## Additional Variants

- **Qwen3-VL-32B-Instruct-FP8**: Fine-grained FP8-quantized version for reduced memory usage
- **Qwen3-VL-32B-Instruct-GGUF**: GGUF format for llama.cpp compatibility
- **Qwen3-VL-2B-Instruct**: Smaller 2B-parameter version for edge devices
- **Qwen3-VL-30B-A3B-Instruct**: Mixture-of-Experts (MoE) architecture variant

## Contact and Support

For questions, issues, or feedback:

- GitHub Issues: [https://github.com/QwenLM/Qwen3-VL/issues](https://github.com/QwenLM/Qwen3-VL/issues)
- Hugging Face Community: [https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct/discussions](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct/discussions)