Qwen2.5-VL-32B-Instruct (Abliterated)

A 32-billion parameter vision-language model with enhanced mathematical reasoning, multimodal understanding, and removed safety guardrails.

Qwen2.5-VL-32B-Instruct-Abliterated is an uncensored variant of the instruction-tuned Qwen2.5-VL-32B model. The base model was developed by the Qwen team at Alibaba Cloud; this variant has been post-processed with abliteration to remove its refusal mechanisms. It retains the original's advanced vision-language capabilities while producing unrestricted responses without safety filtering or content moderation.

Model Description

Core Capabilities

Vision-Language Understanding:

  • Visual Analysis: Comprehends common objects, text, charts, icons, and complex layouts within images
  • Document Parsing: Extracts structured information from invoices, forms, and tables with high accuracy (DocVQA: 94.8)
  • Long Video Comprehension: Processes videos exceeding 1 hour with event capture and temporal understanding
  • Visual Localization: Generates bounding boxes and precise point coordinates for object detection
  • Agentic Functionality: Acts as a visual agent that can reason about what it sees and direct tools for computer and phone interactions

Enhanced Reasoning:

  • Reinforcement learning-trained for superior mathematical problem-solving (MATH: 82.2, MathVista: 74.7)
  • Multi-step complex reasoning across vision and language domains
  • Detailed, well-formatted answers without content restrictions
  • Improved accuracy in visual logic deduction and content recognition

Abliteration Features:

  • Unrestricted Responses: Safety guardrails and refusal mechanisms removed
  • Uncensored Output: No content filtering or moderation applied
  • Direct Answers: Responds to all queries without ethical hedging or refusals
  • Research/Educational Use: Intended for research, development, and responsible applications

Technical Architecture

  • Parameters: 33 billion (BF16/F32 precision)
  • Context Length: 32,768 tokens (expandable to 128K with the YaRN technique)
  • Vision Tokens: 4-16,384 visual tokens per image (configurable via min/max pixels; see the sketch after this list)
  • ViT Enhancements: Window attention, SwiGLU activation, RMSNorm normalization
  • Dynamic Resolution: Adaptive image resolution and video frame rate sampling
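
The visual-token budget above maps to pixel limits on the processor; a minimal sketch, assuming the local model path used elsewhere in this README, of setting them at load time (each visual token corresponds to a 28x28 pixel patch):

from transformers import AutoProcessor

# Illustrative limits: bound the number of visual tokens spent per image.
min_pixels = 256 * 28 * 28    # floor of ~256 visual tokens
max_pixels = 1280 * 28 * 28   # cap of ~1280 visual tokens
processor = AutoProcessor.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    min_pixels=min_pixels,
    max_pixels=max_pixels
)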

Repository Contents

Model Files:

  • qwen2.5-vl-32b-instruct-abliterated.safetensors - Abliterated model weights (62.3GB, BF16 precision)

Additional Files (from Hugging Face):

  • config.json - Model configuration
  • preprocessor_config.json - Image/video preprocessing settings
  • tokenizer.json / tokenizer_config.json - Tokenizer files
  • generation_config.json - Text generation parameters
  • processor_config.json - Unified processor configuration

Total Repository Size: 62.3GB (single-file safetensors format)

Hardware Requirements

Minimum Requirements

  • GPU VRAM: 70-80GB (A100/H100 recommended for BF16 inference; see the rough estimate after this list)
  • System RAM: 32GB minimum
  • Disk Space: 70GB (including model files and runtime overhead)
  • CUDA: Version 11.8 or higher
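
The VRAM figure above is driven mainly by the raw weight size; a rough, weights-only estimate (KV cache and activations add further overhead):

# Back-of-the-envelope VRAM check: BF16 stores 2 bytes per parameter.
params = 33e9                    # ~33B parameters
weight_gb = params * 2 / 1e9     # weight footprint in GB
print(f"Weights alone: ~{weight_gb:.0f} GB")  # ~66 GB, hence the 70-80GB guidance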

Recommended Configurations

Standard Inference (BF16):

  • 1x NVIDIA A100 80GB or H100 80GB
  • 64GB system RAM
  • NVMe SSD storage for optimal loading times

Quantized Inference (INT8/INT4):

  • 1x NVIDIA RTX 4090 (24GB) with quantization
  • 32GB system RAM
  • Consider using quantized variants for reduced VRAM usage

Production Deployment:

  • Multi-GPU setup for batched inference
  • 128GB+ system RAM for concurrent requests
  • High-bandwidth GPU interconnect (NVLink/NVSwitch)

Performance Optimization

  • Enable Flash Attention 2 for 2-3x speed improvement
  • Use quantized models (INT8/INT4) for memory-constrained environments
  • Batch processing for throughput optimization
  • Consider vLLM or TGI for production inference serving

Usage Examples

Installation

# Install from transformers source (recommended)
pip install git+https://github.com/huggingface/transformers accelerate
pip install "qwen-vl-utils[decord]==0.0.8"

# Additional dependencies
pip install torch torchvision pillow

Basic Image Understanding

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained("E:/huggingface/qwen2.5-vl-32b-instruct")

# Prepare image and text input
image = Image.open("your_image.jpg")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Process and generate
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt", padding=True)
inputs = inputs.to("cuda")

# Generate response, then trim the prompt tokens so only the new text is decoded
output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
generated_text = processor.batch_decode(trimmed_ids, skip_special_tokens=True)
print(generated_text[0])

Multi-Image Analysis

# Multiple images in conversation
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "image1.jpg"},
            {"type": "image", "image": "image2.jpg"},
            {"type": "text", "text": "Compare these two images and identify the differences."}
        ]
    }
]

# Process with multiple images
images = [Image.open("image1.jpg"), Image.open("image2.jpg")]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=images, return_tensors="pt")
inputs = inputs.to("cuda")

output_ids = model.generate(**inputs, max_new_tokens=1024)
response = processor.batch_decode(output_ids, skip_special_tokens=True)
print(response[0])

Video Understanding

from qwen_vl_utils import process_vision_info

# Process video input
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "your_video.mp4", "fps": 1.0},
            {"type": "text", "text": "Summarize the main events in this video."}
        ]
    }
]

# Process video with configurable FPS
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt"
)
inputs = inputs.to("cuda")

# Generate video analysis
output_ids = model.generate(**inputs, max_new_tokens=1024)
response = processor.batch_decode(output_ids, skip_special_tokens=True)
print(response[0])

Mathematical Problem Solving

# Enhanced mathematical reasoning
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "math_diagram.png"},
            {"type": "text", "text": "Solve this geometry problem step by step."}
        ]
    }
]

# Process with detailed reasoning
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[Image.open("math_diagram.png")], return_tensors="pt")
inputs = inputs.to("cuda")

# Generate detailed solution
output_ids = model.generate(**inputs, max_new_tokens=2048, do_sample=False)
solution = processor.batch_decode(output_ids, skip_special_tokens=True)
print(solution[0])

Document Parsing (Structured Output)

# Extract structured information from documents
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "invoice.jpg"},
            {"type": "text", "text": "Extract all line items from this invoice in JSON format."}
        ]
    }
]

# Process document
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[Image.open("invoice.jpg")], return_tensors="pt")
inputs = inputs.to("cuda")

# Generate structured output (temperature only takes effect when sampling is enabled)
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True, temperature=0.1)
structured_data = processor.batch_decode(output_ids, skip_special_tokens=True)
print(structured_data[0])

Custom Resolution Configuration

# Adjust the visual-token budget via the image processor's pixel limits
processor.image_processor.min_pixels = 256 * 256    # minimum image resolution
processor.image_processor.max_pixels = 2048 * 2048  # maximum image resolution

# Per-image resizing is specified in the message content (handled by qwen_vl_utils),
# not as processor keyword arguments:
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "your_image.jpg", "resized_height": 1024, "resized_width": 1024},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, return_tensors="pt")

Batch Processing

# Efficient batch inference
batch_messages = [
    [{"role": "user", "content": [{"type": "image", "image": f"img{i}.jpg"},
                                   {"type": "text", "text": "Describe this image."}]}]
    for i in range(4)
]

# Process batch (decoder-only models need left padding for batched generation)
processor.tokenizer.padding_side = "left"
texts = [processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
         for msg in batch_messages]
images_batch = [Image.open(f"img{i}.jpg") for i in range(4)]
inputs = processor(text=texts, images=images_batch, return_tensors="pt", padding=True)
inputs = inputs.to("cuda")

# Batch generation
output_ids = model.generate(**inputs, max_new_tokens=512)
responses = processor.batch_decode(output_ids, skip_special_tokens=True)
for i, resp in enumerate(responses):
    print(f"Image {i}: {resp}")

Model Specifications

Architecture Details

  • Model Type: Vision-Language Transformer
  • Base Architecture: Qwen2.5 language model + Enhanced ViT vision encoder
  • Vision Encoder: Window attention with SwiGLU and RMSNorm
  • Precision: BF16 (default), FP32 supported
  • Tensor Format: .safetensors (recommended), PyTorch .bin

Training Enhancements

  • Reinforcement Learning: Enhanced mathematical and problem-solving capabilities
  • Dynamic Resolution Training: Adaptive image resolution and video FPS sampling
  • Human Preference Alignment: Improved response formatting and detail
  • Temporal Understanding: Extended dynamic resolution to video temporal dimension

Supported Input Formats

  • Images: JPEG, PNG, WebP, local paths, URLs, base64-encoded (see the input example after this list)
  • Videos: MP4, AVI, configurable FPS (0.5-30 fps typical)
  • Text: UTF-8 encoded, 32K context window
  • Structured Data: Tables, forms, invoices with JSON output capability
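
As a sketch of how those image sources are passed in practice (all paths, URLs, and the base64 string below are placeholders), qwen_vl_utils resolves each entry before the processor sees it:

from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

processor = AutoProcessor.from_pretrained("E:/huggingface/qwen2.5-vl-32b-instruct")

# Local file, remote URL, and base64 data URI sources in one conversation turn
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/local_image.png"},
        {"type": "image", "image": "https://example.com/remote_image.jpg"},
        {"type": "image", "image": "data:image;base64,/9j/4AAQ..."},  # truncated placeholder
        {"type": "text", "text": "Describe each of these images."}
    ]
}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, _ = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, padding=True, return_tensors="pt")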

Benchmark Performance

Benchmark          Score         Category
MMMU               70.0          Multimodal Understanding
MMMU-Pro           -             Advanced Reasoning
MathVista          74.7          Mathematical Reasoning
DocVQA             94.8          Document Understanding
Android Control    69.6 / 93.3   Agentic Interaction
MMLU               78.4          Language Understanding
MATH               82.2          Mathematical Problem Solving
HumanEval          91.5          Code Generation

Performance Tips and Optimization

Inference Optimization

Flash Attention 2 (Recommended):

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # 2-3x faster
    device_map="auto"
)

Quantization for Memory Efficiency:

# INT8 quantization (requires bitsandbytes)
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    llm_int8_threshold=6.0
)

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    quantization_config=quantization_config,
    device_map="auto"
)

Extended Context (YaRN):

# Set rope_scaling *before* the model is built (edit config.json or pass a modified
# config); changing model.config after loading does not rebuild the rotary embeddings.
from transformers import AutoConfig
config = AutoConfig.from_pretrained("E:/huggingface/qwen2.5-vl-32b-instruct")
config.rope_scaling = {
    "type": "yarn",
    "mrope_section": [16, 24, 24],  # keep the multimodal RoPE sections from the base config
    "factor": 4.0,                  # extend to ~128K tokens
    "original_max_position_embeddings": 32768
}
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct", config=config,
    torch_dtype=torch.bfloat16, device_map="auto")

Best Practices

  1. Image Resolution: Use 512-1024px for standard images, up to 2048px for detailed document parsing
  2. Video Processing: Adjust FPS based on content (1-2 fps for static scenes, 5-10 fps for action)
  3. Batch Size: Start with batch_size=1-2 for 80GB VRAM, scale based on sequence length
  4. Temperature: Use 0.1-0.3 for factual tasks, 0.7-0.9 for creative generation (see the sketch after this list)
  5. Max Tokens: Allocate 512 tokens for descriptions, 2048+ for detailed analysis or math
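
A minimal sketch of how those recommendations map onto generate() arguments, reusing the model and inputs prepared in the Usage Examples above (the top_p values are illustrative additions, not recommendations from this card):

# Factual or extraction-style task: low temperature, moderate token budget
output_ids = model.generate(
    **inputs,
    max_new_tokens=512,   # ~512 for descriptions, 2048+ for detailed analysis or math
    do_sample=True,
    temperature=0.2,      # 0.1-0.3 for factual tasks
    top_p=0.9
)

# Creative generation: higher temperature and a larger token budget
output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=True,
                            temperature=0.8, top_p=0.95)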

Memory Management

# Clear CUDA cache between runs
import torch
torch.cuda.empty_cache()

# Gradient checkpointing for fine-tuning
model.gradient_checkpointing_enable()

# CPU offloading for large batches
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "E:/huggingface/qwen2.5-vl-32b-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    offload_folder="offload",
    offload_state_dict=True
)

Production Deployment

Using vLLM (High Throughput):

pip install vllm

python -m vllm.entrypoints.openai.api_server \
    --model E:/huggingface/qwen2.5-vl-32b-instruct \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.9
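
Once the server above is running, it exposes an OpenAI-compatible chat endpoint; a minimal client sketch, assuming vLLM's default port 8000 and a placeholder image URL (the model name must match the --model value passed to vLLM):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="E:/huggingface/qwen2.5-vl-32b-instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/your_image.jpg"}},
            {"type": "text", "text": "Describe this image."}
        ]
    }],
    max_tokens=512
)
print(response.choices[0].message.content)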

Using Text Generation Inference:

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v E:/huggingface/qwen2.5-vl-32b-instruct:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id /data --dtype bfloat16

Fine-tuning

The model supports fine-tuning using standard Hugging Face training workflows:

from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./qwen-vl-finetuned",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,
    logging_steps=10,
    save_strategy="epoch"
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=data_collator
)

trainer.train()

Available Variants

  • Quantized Models: 25 quantized versions (INT8, INT4, GPTQ, AWQ)
  • Fine-tuned Adapters: 5 LoRA/QLoRA adapter models
  • Specialized Fine-tunes: 47 community fine-tuned variants for specific domains

License

This model is released under the Apache License 2.0 (base model license).

Copyright 2025 Alibaba Cloud (base model)

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

Terms of Use

  • ✅ Commercial use allowed
  • ✅ Modification and distribution permitted
  • ✅ Private and public use
  • ⚠️ Must include original license and copyright notice
  • ⚠️ Provided "as-is" without warranty

Important Notice - Abliterated Model

⚠️ This is an uncensored, abliterated variant with removed safety mechanisms:

  • User Responsibility: Users are solely responsible for appropriate use and ethical considerations
  • No Built-in Safety: This model does not include content filtering or safety guardrails
  • Intended Use: Research, development, and responsible applications with proper oversight
  • Not Endorsed: This abliterated variant is not officially endorsed or supported by Alibaba Cloud
  • Legal Compliance: Users must ensure compliance with all applicable laws and regulations

Citation

If you use Qwen2.5-VL-32B-Instruct in your research or applications, please cite:

@article{qwen2.5-vl,
  title={Qwen2.5-VL Technical Report},
  author={Bai, Shuai and others},
  journal={arXiv preprint arXiv:2502.13923},
  year={2025},
  url={https://arxiv.org/abs/2502.13923}
}

Contact and Resources

Related Models

  • Qwen2.5-VL-7B-Instruct: Smaller variant for resource-constrained environments
  • Qwen2.5-VL-72B-Instruct: Larger variant with enhanced capabilities
  • Qwen2-VL-72B-Instruct: Previous generation model
  • Qwen2.5-VL-32B-Instruct: Original censored version with safety guardrails

Base Model: Qwen Team, Alibaba Cloud
Abliteration: Community modification (uncensored variant)
Base Release Date: March 25, 2025
Model Version: 2.5 (abliterated)
README Version: v1.1
