---
license: apache-2.0
library_name: transformers
pipeline_tag: image-text-to-text
tags:
  - multimodal
  - vision-language
  - qwen3-vl
  - image-to-text
  - video-understanding
---

<!-- README Version: v1.0 -->

# Qwen3-VL-32B-Instruct

## Model Description

Qwen3-VL-32B-Instruct is a multimodal large language model developed by the Qwen team at Alibaba Cloud. A 33-billion-parameter dense model in the Qwen3-VL series, it delivers comprehensive upgrades across multiple dimensions: stronger text understanding and generation, deeper visual perception and reasoning, an extended context length, better comprehension of spatial relationships and video dynamics, and more capable agent interaction.

### Key Capabilities

- **Vision-Language Understanding**: Advanced multimodal reasoning combining visual and textual information
- **Visual Agent**: Operates PC and mobile GUIs; recognizes on-screen elements, understands their functions, invokes tools, and completes tasks
- **Visual Coding**: Generates Draw.io diagrams, HTML, CSS, and JavaScript from images and videos
- **Spatial Perception**: Judges object positions, viewpoints, and occlusions; provides 2D and 3D grounding for spatial reasoning
- **Video Understanding**: Processes and analyzes video content with temporal indexing and dynamics comprehension
- **Long Context**: Native 256K context window, expandable to 1 million tokens
- **Multilingual OCR**: Optical character recognition across 32 languages
- **STEM Reasoning**: Multimodal mathematical and scientific reasoning capabilities

## Repository Contents

**Note**: This directory is prepared for storing Qwen3-VL-32B-Instruct model files. Model files should be downloaded from the official Hugging Face repository.

### Expected Files (when downloaded):

```
qwen3-vl-32b-instruct/
├── config.json                    # Model configuration
├── generation_config.json         # Generation parameters
├── model-*.safetensors            # Model weight shards (multiple files)
├── model.safetensors.index.json   # Weight shard index
├── preprocessor_config.json       # Preprocessing configuration
├── tokenizer.json                 # Tokenizer vocabulary
├── tokenizer_config.json          # Tokenizer configuration
├── merges.txt                     # BPE merges
└── vocab.json                     # Vocabulary file
```
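
One way to fetch the official weights into this directory, assuming the Hugging Face CLI is installed (a sketch; adjust `--local-dir` to wherever this repository lives):

```bash
pip install -U "huggingface_hub[cli]"

# Download the official checkpoint into the local directory
huggingface-cli download Qwen/Qwen3-VL-32B-Instruct \
    --local-dir ./qwen3-vl-32b-instruct
```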

### Estimated Storage Requirements

- **Model Files**: ~65-70 GB (BF16 precision)
- **Total Repository**: ~70 GB

## Hardware Requirements

### Minimum Requirements
- **VRAM**: 80 GB GPU memory (A100 80GB or equivalent)
- **RAM**: 128 GB system memory
- **Disk Space**: 100 GB free space (for model files and cache)
- **GPU**: NVIDIA GPU with CUDA capability (A100, H100 recommended)

### Recommended Setup
- **Multi-GPU**: 2x A100 80GB or 4x A100 40GB for optimal performance
- **Flash Attention 2**: Strongly recommended for memory efficiency and speed
- **Mixed Precision**: BF16 or FP16 for reduced memory footprint
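
A quick way to confirm a machine meets these VRAM numbers before loading the model (a minimal sketch using PyTorch's standard device-query API):

```python
import torch

# List each visible GPU and its total memory; the BF16 checkpoint needs
# roughly 70 GB of aggregate VRAM plus headroom for activations and cache
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.0f} GiB")
```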

### Performance Optimization
- Enable `flash_attention_2` for faster attention and lower memory use
- Use `torch.bfloat16` or automatic dtype selection
- Consider device mapping for multi-GPU setups
- Use gradient checkpointing for fine-tuning scenarios

## Usage Examples

### Installation

```bash
pip install transformers accelerate torch pillow
pip install flash-attn --no-build-isolation  # Optional but recommended
```
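
Qwen3-VL support landed in `transformers` relatively recently; a quick way to confirm the installed release includes it (a sketch):

```python
# Raises ImportError on transformers releases that predate Qwen3-VL support
from transformers import Qwen3VLForConditionalGeneration
import transformers

print(transformers.__version__)
```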

### Basic Usage with Transformers

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Model path - update with your local path
model_path = "E:/huggingface/qwen3-vl-32b-instruct"

# Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"  # Recommended
)

processor = AutoProcessor.from_pretrained(model_path)

# Example: Image understanding
image = Image.open("path/to/your/image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Prepare inputs
text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=[image],
    return_tensors="pt",
    padding=True
)
inputs = inputs.to(model.device)

# Generate response
output_ids = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=False
)

# Decode only the newly generated tokens (drop the echoed prompt)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
generated_text = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)[0]

print(generated_text)
```

### Video Understanding Example

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
import torch
import cv2
import numpy as np

model_path = "E:/huggingface/qwen3-vl-32b-instruct"

# Load model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)

processor = AutoProcessor.from_pretrained(model_path)

# Load video frames
def load_video_frames(video_path, max_frames=16):
    cap = cv2.VideoCapture(video_path)
    frames = []
    total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total_frames - 1, max_frames, dtype=int)

    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ret, frame = cap.read()
        if ret:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(Image.fromarray(frame))

    cap.release()
    return frames

# Process video
video_frames = load_video_frames("path/to/video.mp4")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "video"},
            {"type": "text", "text": "Summarize what happens in this video."}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    videos=[video_frames],
    return_tensors="pt"
)
inputs = inputs.to(model.device)

# Generate
output_ids = model.generate(
    **inputs,
    max_new_tokens=2048
)

# Decode only the newly generated tokens (drop the echoed prompt)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(
    generated_ids,
    skip_special_tokens=True
)[0]

print(response)
```

### Multi-Image Reasoning

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_path = "E:/huggingface/qwen3-vl-32b-instruct"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    attn_implementation="flash_attention_2"
)

processor = AutoProcessor.from_pretrained(model_path)

# Load multiple images
images = [
    Image.open("image1.jpg"),
    Image.open("image2.jpg"),
    Image.open("image3.jpg")
]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "image"},
            {"type": "image"},
            {"type": "text", "text": "Compare these three images and explain the differences."}
        ]
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    text=[text],
    images=images,
    return_tensors="pt"
)
inputs = inputs.to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
```

## Model Specifications

### Architecture Details

- **Model Type**: Multimodal Vision-Language Model
- **Parameters**: 33 billion
- **Architecture Innovations**:
  - **Interleaved-MRoPE**: Enhanced positional embeddings across temporal and spatial dimensions
  - **DeepStack**: Multi-level vision transformer feature fusion
  - **Text-Timestamp Alignment**: Precise video temporal grounding
- **Precision**: BF16 (Brain Float 16)
- **Format**: Safetensors
- **Context Window**: 256K tokens (native), expandable to 1M tokens
- **Max Output Tokens**:
  - Vision-language tasks: 16,384 tokens
  - Pure text tasks: 32,768 tokens
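
These figures can be checked against a downloaded checkpoint by inspecting its configuration (a sketch; exact field names vary across `transformers` versions):

```python
from transformers import AutoConfig

# Example local path; point this at your downloaded checkpoint
config = AutoConfig.from_pretrained("E:/huggingface/qwen3-vl-32b-instruct")

# The printed config includes hidden sizes, layer counts, rope/positional
# settings, and the maximum context length recorded in the checkpoint
print(config)
```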

### Supported Modalities

- **Input**: Text, Images (single/multiple), Video frames
- **Output**: Text with multimodal understanding and reasoning
- **Image Formats**: JPEG, PNG, WebP, and other common formats
- **Video Processing**: Frame-based with temporal indexing

### Languages Supported

- Primary: English, Chinese
- OCR Support: 32 languages including major European, Asian, and Middle Eastern languages

## Performance Tips and Optimization

### Memory Optimization

1. **Enable Flash Attention 2**:
   ```python
   model = Qwen3VLForConditionalGeneration.from_pretrained(
       model_path,
       attn_implementation="flash_attention_2"
   )
   ```

2. **Use Mixed Precision**:
   ```python
   model = Qwen3VLForConditionalGeneration.from_pretrained(
       model_path,
       torch_dtype=torch.bfloat16
   )
   ```

3. **Device Mapping for Multi-GPU**:
   ```python
   model = Qwen3VLForConditionalGeneration.from_pretrained(
       model_path,
       device_map="auto"  # Automatic distribution across GPUs
   )
   ```

4. **Gradient Checkpointing** (for fine-tuning):
   ```python
   model.gradient_checkpointing_enable()
   ```

### Inference Speed Optimization

- Use batch processing for multiple images when possible
- Preload and cache the model to avoid repeated loading
- Consider quantization (FP8, INT8) for production deployment; see the serving sketch after this list
- Utilize tensor parallelism for very large batch sizes
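
For production serving, an inference engine can combine quantized weights with tensor parallelism in a single command (a sketch, assuming a vLLM build with Qwen3-VL support; adjust the parallel size to your GPU count):

```bash
# Serves the FP8 variant listed under "Additional Variants" across two GPUs
vllm serve Qwen/Qwen3-VL-32B-Instruct-FP8 --tensor-parallel-size 2
```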

### Quality Optimization

- For complex reasoning tasks, increase `max_new_tokens`
- Use temperature sampling for creative tasks
- Adjust `top_p` and `top_k` for controlled generation
- Enable `do_sample=True` for more diverse outputs, as in the sketch below
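
A sampling setup along these lines works for creative tasks (the values are illustrative starting points, not officially recommended defaults; assumes `model` and `inputs` from the basic example above):

```python
# Sampling instead of greedy decoding for more diverse output
output_ids = model.generate(
    **inputs,
    max_new_tokens=2048,  # raise for long, complex reasoning
    do_sample=True,       # sample from the distribution instead of argmax
    temperature=0.7,      # higher -> more diverse, lower -> more deterministic
    top_p=0.9,            # nucleus sampling cutoff
    top_k=50,             # restrict sampling to the 50 most likely tokens
)
```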

## License

This model is released under the **Apache License 2.0**.

You are free to:
- Use the model commercially
- Modify and distribute the model
- Use the model for research purposes

Conditions:
- Preserve copyright and license notices
- State significant changes made to the model
- Include the license text with distributions

See the full license at: https://www.apache.org/licenses/LICENSE-2.0

## Citation

If you use Qwen3-VL-32B-Instruct in your research or applications, please cite:

```bibtex
@misc{qwen3vl2025,
  title={Qwen3-VL: The Most Powerful Vision-Language Model in the Qwen Series},
  author={Qwen Team},
  year={2025},
  howpublished={\url{https://github.com/QwenLM/Qwen3-VL}}
}
```

## Official Resources

- **Official Model**: [https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct)
- **GitHub Repository**: [https://github.com/QwenLM/Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
- **Documentation**: [https://huggingface.co/docs/transformers/model_doc/qwen3_vl](https://huggingface.co/docs/transformers/model_doc/qwen3_vl)
- **Model Collection**: [https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe](https://huggingface.co/collections/Qwen/qwen3-vl-68d2a7c1b8a8afce4ebd2dbe)
- **Qwen Website**: [https://qwenlm.github.io](https://qwenlm.github.io)

## Additional Variants

- **Qwen3-VL-32B-Instruct-FP8**: Fine-grained FP8 quantized version for reduced memory usage
- **Qwen3-VL-32B-Instruct-GGUF**: GGUF format for llama.cpp compatibility
- **Qwen3-VL-2B-Instruct**: Smaller 2B parameter version for edge devices
- **Qwen3-VL-30B-A3B-Instruct**: MoE architecture variant

## Contact and Support

For questions, issues, or feedback:
- GitHub Issues: [https://github.com/QwenLM/Qwen3-VL/issues](https://github.com/QwenLM/Qwen3-VL/issues)
- Hugging Face Community: [https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct/discussions](https://huggingface.co/Qwen/Qwen3-VL-32B-Instruct/discussions)
