Training dataset: ob11/VL-PRM300K-train
How to use ob11/Qwen-VL-PRM-3B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="ob11/Qwen-VL-PRM-3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
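The pipeline returns one result per input; a quick way to check the setup (a minimal sketch, assuming the standard image-text-to-text pipeline output with a "generated_text" field):

```python
# Each result is a dict; the model's reply is under "generated_text".
result = pipe(text=messages)
print(result[0]["generated_text"])
```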
# Load the model directly
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("ob11/Qwen-VL-PRM-3B")
model = AutoModelForImageTextToText.from_pretrained("ob11/Qwen-VL-PRM-3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
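Because this is a process reward model, the more typical use is scoring a candidate reasoning step rather than open-ended generation. The sketch below is an assumption-laden illustration, not the documented interface: it presumes the PRM is shown the question, the image, and one candidate step, and that a step score can be read from the next-token logits of two judgment tokens ("+" and "-" here are hypothetical labels; see the linked usage documentation for the actual prompt and label format).

```python
import torch

# Hypothetical judgment tokens; the real PRM label tokens may differ.
GOOD, BAD = "+", "-"

def score_step(question_messages, candidate_step):
    """Score one candidate reasoning step with the PRM (sketch)."""
    msgs = question_messages + [
        {"role": "assistant", "content": [{"type": "text", "text": candidate_step}]}
    ]
    inputs = processor.apply_chat_template(
        msgs, tokenize=True, return_dict=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    good_id = processor.tokenizer.convert_tokens_to_ids(GOOD)
    bad_id = processor.tokenizer.convert_tokens_to_ids(BAD)
    # Probability mass on the "good step" token among the two labels.
    probs = torch.softmax(logits[[good_id, bad_id]], dim=-1)
    return probs[0].item()
```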
How to use ob11/Qwen-VL-PRM-3B with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ob11/Qwen-VL-PRM-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
    "model": "ob11/Qwen-VL-PRM-3B",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ]
}'
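Any OpenAI-compatible client can call the server as well. A minimal sketch using the official openai Python package (the api_key value is a placeholder; vLLM does not check it by default):

```python
from openai import OpenAI

# Point the client at the local vLLM server's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ob11/Qwen-VL-PRM-3B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```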
How to use ob11/Qwen-VL-PRM-3B with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "ob11/Qwen-VL-PRM-3B" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
    "model": "ob11/Qwen-VL-PRM-3B",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ]
}'

# Alternatively, run the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "ob11/Qwen-VL-PRM-3B" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API) as shown above.

How to use ob11/Qwen-VL-PRM-3B with Docker Model Runner:
docker model run hf.co/ob11/Qwen-VL-PRM-3B
Qwen-VL-PRM-3B is a process reward model finetuned from Qwen2.5-VL-3B-Instruct on approximately 300,000 examples. Despite being trained mainly on abstract and elementary reasoning datasets, it yields strong test-time scaling improvements on advanced multimodal reasoning benchmarks when paired with Qwen2.5-VL and Gemma-3 models.
The model usage is documented here.
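To make the test-time scaling setup behind the tables below concrete: sample several candidate solutions from the policy model, score each solution's steps with the PRM, and keep the best candidate. A minimal best-of-N sketch; generation-side code is omitted, score_step is a hypothetical PRM scoring call (e.g., the one sketched above), and taking the minimum over step scores is one common aggregation, not necessarily the paper's exact recipe.

```python
def best_of_n(question, candidates, score_step):
    """Return the candidate whose weakest step scores highest under the PRM."""
    def solution_score(steps):
        # Aggregate per-step scores; min penalizes any single bad step.
        return min(score_step(question, step) for step in steps)
    return max(candidates, key=solution_score)

# candidates: list of solutions sampled from the policy model
# (e.g. Qwen2.5-VL or Gemma-3), each a list of reasoning-step strings.
```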
Proprietary baselines:
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| GPT-4o | 70.7 | 60.0 | 57.8 | 30.9 | 31.2 | 50.1 |
| o1 | 78.2 | 78.9 | 54.4 | 73.9 | 60.3 | 69.1 |
| o3 | 82.9 | 84.1 | 62.3 | 86.8 | -- | -- |
Test-time scaling of Qwen2.5-VL models with VL-PRMs:
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| Qwen-2.5-VL-3B | 51.7 | 34.5 | 25.7 | 60.0 | 21.2 | 38.6 |
| + VL-PRM-7B | 53.7 (+2.0) | 44.9 (+10.5) | 28.3 (+2.6) | 64.1 (+4.1) | 21.8 (+0.6) | 42.6 (+4.0) |
| Qwen-2.5-VL-7B | 55.0 | 48.0 | 29.1 | 67.8 | 24.2 | 44.8 |
| + VL-PRM-3B | 57.6 (+2.6) | 55.5 (+7.5) | 33.8 (+4.7) | 70.0 (+2.2) | 26.1 (+1.9) | 48.6 (+3.6) |
| + VL-PRM-7B | 57.4 (+2.4) | 54.8 (+6.8) | 35.3 (+6.2) | 71.0 (+3.2) | 26.2 (+2.0) | 48.9 (+4.1) |
| Qwen-2.5-VL-32B | 66.0 | 46.2 | 26.9 | 76.9 | 36.7 | 50.5 |
| + VL-PRM-3B | 67.0 (+1.0) | 67.1 (+20.8) | 41.6 (+14.7) | 77.7 (+0.8) | 40.5 (+3.8) | 58.7 (+8.2) |
| + VL-PRM-7B | 67.6 (+1.6) | 66.8 (+20.6) | 44.2 (+17.3) | 78.3 (+1.4) | 40.1 (+3.2) | 59.4 (+8.9) |
Test-time scaling of Gemma-3 models with VL-PRMs:
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| Gemma-3-12B | 57.6 | 45.0 | 29.1 | 58.9 | 28.1 | 43.7 |
| + VL-PRM-3B | 60.4 (+2.8) | 57.7 (+12.7) | 39.7 (+10.6) | 60.3 (+1.4) | 33.8 (+5.7) | 50.4 (+6.7) |
| + VL-PRM-7B | 60.2 (+2.6) | 59.0 (+14.0) | 41.1 (+12.0) | 63.3 (+4.4) | 33.9 (+5.8) | 51.5 (+7.8) |
| Gemma-3-27B | 62.9 | 50.8 | 29.9 | 61.6 | 32.4 | 47.5 |
| + VL-PRM-3B | 65.5 (+2.6) | 67.4 (+16.6) | 40.3 (+10.4) | 65.4 (+3.8) | 39.8 (+7.4) | 55.7 (+8.2) |
| + VL-PRM-7B | 64.5 (+1.6) | 67.6 (+16.8) | 41.1 (+11.2) | 65.2 (+3.6) | 40.9 (+8.5) | 55.9 (+8.4) |
Citation:
@misc{ong2025vlprms,
  title={Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned},
  author={Brandon Ong and Tej Deep Pala and Vernon Toh and William Chandra Tjhi and Soujanya Poria},
  year={2025},
  eprint={2509.23250},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/pdf/2509.23250},
}