Training dataset: ob11/VL-PRM300K-train
How to use ob11/Qwen-VL-PRM-3B with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="ob11/Qwen-VL-PRM-3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)
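The pipeline returns one result per input; a quick way to check the setup (a minimal sketch, assuming the standard image-text-to-text pipeline output with a "generated_text" field):

```python
# Each result is a dict; the model's reply is under "generated_text".
result = pipe(text=messages)
print(result[0]["generated_text"])
```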
# Load the model directly
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("ob11/Qwen-VL-PRM-3B")
model = AutoModelForImageTextToText.from_pretrained("ob11/Qwen-VL-PRM-3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
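Because this is a process reward model, the more typical use is scoring a candidate reasoning step rather than open-ended generation. The sketch below is an assumption-laden illustration, not the documented interface: it presumes the PRM is shown the question, the image, and one candidate step, and that a step score can be read from the next-token logits of two judgment tokens ("+" and "-" here are hypothetical labels; see the linked usage documentation for the actual prompt and label format).

```python
import torch

# Hypothetical judgment tokens; the real PRM label tokens may differ.
GOOD, BAD = "+", "-"

def score_step(question_messages, candidate_step):
    """Score one candidate reasoning step with the PRM (sketch)."""
    msgs = question_messages + [
        {"role": "assistant", "content": [{"type": "text", "text": candidate_step}]}
    ]
    inputs = processor.apply_chat_template(
        msgs, tokenize=True, return_dict=True, return_tensors="pt"
    ).to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    good_id = processor.tokenizer.convert_tokens_to_ids(GOOD)
    bad_id = processor.tokenizer.convert_tokens_to_ids(BAD)
    # Probability mass on the "good step" token among the two labels.
    probs = torch.softmax(logits[[good_id, bad_id]], dim=-1)
    return probs[0].item()
```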
How to use ob11/Qwen-VL-PRM-3B with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ob11/Qwen-VL-PRM-3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
    "model": "ob11/Qwen-VL-PRM-3B",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ]
}'
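Any OpenAI-compatible client can call the server as well. A minimal sketch using the official openai Python package (the api_key value is a placeholder; vLLM does not check it by default):

```python
from openai import OpenAI

# Point the client at the local vLLM server's OpenAI-compatible endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="ob11/Qwen-VL-PRM-3B",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in one sentence."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```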
How to use ob11/Qwen-VL-PRM-3B with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "ob11/Qwen-VL-PRM-3B" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
-H "Content-Type: application/json" \
--data '{
    "model": "ob11/Qwen-VL-PRM-3B",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe this image in one sentence."
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
                    }
                }
            ]
        }
    ]
}'

# Alternatively, run the SGLang server with Docker:
docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "ob11/Qwen-VL-PRM-3B" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API) as shown above.

How to use ob11/Qwen-VL-PRM-3B with Docker Model Runner:
docker model run hf.co/ob11/Qwen-VL-PRM-3B
Qwen-VL-PRM-3B is a process reward model finetuned from Qwen2.5-VL-3B-Instruct on approximately 300,000 examples. Despite being trained mainly on abstract and elementary reasoning datasets, it yields strong test-time scaling improvements on advanced multimodal reasoning benchmarks when paired with Qwen2.5-VL and Gemma-3 models.
The model usage is documented here.
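To make the test-time scaling setup behind the tables below concrete: sample several candidate solutions from the policy model, score each solution's steps with the PRM, and keep the best candidate. A minimal best-of-N sketch; generation-side code is omitted, score_step is a hypothetical PRM scoring call (e.g., the one sketched above), and taking the minimum over step scores is one common aggregation, not necessarily the paper's exact recipe.

```python
def best_of_n(question, candidates, score_step):
    """Return the candidate whose weakest step scores highest under the PRM."""
    def solution_score(steps):
        # Aggregate per-step scores; min penalizes any single bad step.
        return min(score_step(question, step) for step in steps)
    return max(candidates, key=solution_score)

# candidates: list of solutions sampled from the policy model
# (e.g. Qwen2.5-VL or Gemma-3), each a list of reasoning-step strings.
```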
Proprietary baselines:
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| GPT-4o | 70.7 | 60.0 | 57.8 | 30.9 | 31.2 | 50.1 |
| o1 | 78.2 | 78.9 | 54.4 | 73.9 | 60.3 | 69.1 |
| o3 | 82.9 | 84.1 | 62.3 | 86.8 | -- | -- |
Test-time scaling of Qwen2.5-VL models with VL-PRMs:
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| Qwen-2.5-VL-3B | 51.7 | 34.5 | 25.7 | 60.0 | 21.2 | 38.6 |
| + VL-PRM-7B | 53.7 (+2.0) | 44.9 (+10.5) | 28.3 (+2.6) | 64.1 (+4.1) | 21.8 (+0.6) | 42.6 (+4.0) |
| Qwen-2.5-VL-7B | 55.0 | 48.0 | 29.1 | 67.8 | 24.2 | 44.8 |
| + VL-PRM-3B | 57.6 (+2.6) | 55.5 (+7.5) | 33.8 (+4.7) | 70.0 (+2.2) | 26.1 (+1.9) | 48.6 (+3.6) |
| + VL-PRM-7B | 57.4 (+2.4) | 54.8 (+6.8) | 35.3 (+6.2) | 71.0 (+3.2) | 26.2 (+2.0) | 48.9 (+4.1) |
| Qwen-2.5-VL-32B | 66.0 | 46.2 | 26.9 | 76.9 | 36.7 | 50.5 |
| + VL-PRM-3B | 67.0 (+1.0) | 67.1 (+20.8) | 41.6 (+14.7) | 77.7 (+0.8) | 40.5 (+3.8) | 58.7 (+8.2) |
| + VL-PRM-7B | 67.6 (+1.6) | 66.8 (+20.6) | 44.2 (+17.3) | 78.3 (+1.4) | 40.1 (+3.2) | 59.4 (+8.9) |
Test-time scaling of Gemma-3 models with VL-PRMs:
| Model | MMMU | PuzzleVQA | AlgoPuzzleVQA | MathVista | MathVision | Overall |
|---|---|---|---|---|---|---|
| Gemma-3-12B | 57.6 | 45.0 | 29.1 | 58.9 | 28.1 | 43.7 |
| + VL-PRM-3B | 60.4 (+2.8) | 57.7 (+12.7) | 39.7 (+10.6) | 60.3 (+1.4) | 33.8 (+5.7) | 50.4 (+6.7) |
| + VL-PRM-7B | 60.2 (+2.6) | 59.0 (+14.0) | 41.1 (+12.0) | 63.3 (+4.4) | 33.9 (+5.8) | 51.5 (+7.8) |
| Gemma-3-27B | 62.9 | 50.8 | 29.9 | 61.6 | 32.4 | 47.5 |
| + VL-PRM-3B | 65.5 (+2.6) | 67.4 (+16.6) | 40.3 (+10.4) | 65.4 (+3.8) | 39.8 (+7.4) | 55.7 (+8.2) |
| + VL-PRM-7B | 64.5 (+1.6) | 67.6 (+16.8) | 41.1 (+11.2) | 65.2 (+3.6) | 40.9 (+8.5) | 55.9 (+8.4) |
Citation:
@misc{ong2025vlprms,
  title={Training Vision-Language Process Reward Models for Test-Time Scaling in Multimodal Reasoning: Key Insights and Lessons Learned},
  author={Brandon Ong and Tej Deep Pala and Vernon Toh and William Chandra Tjhi and Soujanya Poria},
  year={2025},
  eprint={2509.23250},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/pdf/2509.23250},
}