## Model Summary

VR-Thinker is the first multimodal reward model built on the Thinking-with-Image framework. The released checkpoint is a Qwen2.5-VL fine-tune with roughly 8B parameters, stored as FP16 safetensors.

For further details, please refer to the paper: [VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning](https://arxiv.org/abs/2510.10518).

## Quick Start

We provide a sample test script below:

```python
import json
import random
import torch
import tqdm
from PIL import Image
import warnings
import os
import requests
import cv2
import numpy as np
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info


warnings.filterwarnings("ignore")




# Load the VR-Thinker checkpoint (a Qwen2.5-VL fine-tune) and its processor
model_path = "qunwang13/vr-thinker"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_path)


video_urls = [
    "https://cdn.pixabay.com/video/2024/05/20/212623_large.mp4", # sample video 1
    "https://cdn.pixabay.com/video/2024/02/07/199320-912042274_large.mp4"  # sample video 2
]
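
# The "video" entries above are remote sample clips from Pixabay. Per the upstream
# Qwen2.5-VL / qwen_vl_utils usage (an assumption about the library, not this card),
# local file paths or lists of pre-extracted frame images should also work here.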


prompt_for_videos = "A cinematic shot of a waterfall in a lush forest."
dim_name_1, dim_explain_1 = "Temporal Alignment (TA)", "How well the video adheres to the temporal aspects of the prompt."
dim_name_2, dim_explain_2 = "Video Quality (VQ)", "The visual and aesthetic quality of the video."
dim_name_3, dim_explain_3 = "Motion Quality (MQ)", "The smoothness and realism of the motion in the video."
N = 96  # total "actual" frames referenced in the prompt (N/2 per video)




prompt_text = \
f"""Task Description: Your task is to compare two videos generated based on the same prompt by analyzing their frames in detail and provide an overall judgment along with a judgment for each dimension.
This involves:
- Iterative reasoning,
- Zooming in on details,
- Dynamically selecting frames for further analysis.

The provided frames are downsampled from these videos:
- Video 1: First four input frames.
- Video 2: Next four input frames.

The prompt is: {prompt_for_videos}

Evaluation Dimensions:
1. **{dim_name_1}**: {dim_explain_1}
2. **{dim_name_2}**: {dim_explain_2}
3. **{dim_name_3}**: {dim_explain_3}

Frames and Analysis Rules:
- 8 sampled frames are provided, evenly downsampled from {N} frames.
- First 4 input frames are sampled from the {N // 2} actual frames of Video 1; the next 4 input frames are sampled from the {N // 2} actual frames of Video 2.
- Insufficient frames? Request more using the tool.

Format Requirement:
1. Snapshot:
Every time you receive new visual information, summarize any information that might be useful for your final judgment within <snapshot></snapshot> tags.
2. Think:
Place all reasoning content within <think></think> tags.

3. Answer:
If the final answer can be determined, output the answer within <final answer></final answer> tags. If the answer is still uncertain, output the recommended answer and confidence level within <recommend answer></recommend answer> tags.
- For TA, MQ, VQ, and OA: 1 represents Video 1 is better, 2 represents Video 2 is better, and 0 represents a Tie.
- For CF (Confidence level): 1 (low), 2 (medium), 3 (high), 4 (very high), 5 (confirmed).

Examples:
<recommend answer>TA=0, VQ=1, MQ=0, OA=1, CF=2</recommend answer>
<final answer>TA=1, VQ=1, MQ=0, OA=1</final answer>."""


sys_prompt = \
"""You are a helpful assistant.\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:
<tools>\n{\"type\": \"function\", \"function\": {\"name\": \"select_frames\", \"description\": \"Select frames from a video.\", \"parameters\": {\"type\": \"object\", \"properties\":
{\"target_frames\": {\"type\": \"array\", \"description\": \"List of frame indices to select from the video (no more than 8 frames in total).\", \"items\": {\"type\": \"integer\", \"description\": \"Frame index from 1 to N.\"}}},
\"required\": [\"target_frames\"]}}}\n</tools>\n\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\n<tool_call>\n{\"name\": <function-name>, \"arguments\": <args-json-object>}</tool_call>"""


# Downsample each video to 4 frames (8 input frames in total, as stated in the prompt)
content_list = [{"type": "video", "video": url, "nframes": 4} for url in video_urls]
content_list.append({"type": "text", "text": prompt_text})

messages = [
    {
        "role": "system",
        "content": sys_prompt,
    },
    {
        "role": "user",
        "content": content_list,
    }
]
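
# --- Illustrative sketch (not part of the original script) ------------------------
# The system prompt declares a `select_frames` tool: in a full multi-turn run the
# model may reply with a <tool_call> listing frame indices (1-based, at most 8),
# which you would then extract from the videos and send back as a new user turn.
# A hypothetical helper for that step could look like:
def handle_select_frames(all_frames, target_frames):
    """Return the requested PIL frames from a list holding all N frames of one video."""
    return [all_frames[idx - 1] for idx in target_frames[:8]]
# -----------------------------------------------------------------------------------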

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs, video_kwargs = process_vision_info(
    messages, return_video_kwargs=True
)

inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)
inputs = inputs.to("cuda")


# Generate the model's evaluation, then decode only the newly produced tokens
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
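
The reply in `output_text[0]` follows the format requested in the prompt: it may contain a `<tool_call>` asking for additional frames, a `<recommend answer>` with a confidence level, or a `<final answer>` with the per-dimension verdicts. Below is a minimal parsing sketch; it is not part of the released code, and the helper name `parse_reply` is our own.

```python
import json
import re


def parse_reply(reply: str):
    """Illustrative parser for a VR-Thinker reply (structure and names are assumptions).

    TA/VQ/MQ/OA: 1 = Video 1 better, 2 = Video 2 better, 0 = tie; CF: confidence 1-5.
    """
    result = {"tool_call": None, "recommend": None, "final": None}

    # Optional tool call requesting extra frames, e.g.
    # <tool_call>{"name": "select_frames", "arguments": {"target_frames": [9, 17]}}</tool_call>
    m = re.search(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", reply, re.DOTALL)
    if m:
        result["tool_call"] = json.loads(m.group(1))

    # Score strings such as "TA=1, VQ=1, MQ=0, OA=1" (plus CF in recommendations)
    for key, tag in (("recommend", "recommend answer"), ("final", "final answer")):
        m = re.search(rf"<{tag}>(.*?)</{tag}>", reply, re.DOTALL)
        if m:
            result[key] = {k: int(v) for k, v in re.findall(r"([A-Z]{2})\s*=\s*(\d)", m.group(1))}

    return result


print(parse_reply(output_text[0]))
```

If `tool_call` is populated, the requested frames would be extracted from the source videos (for example with a helper like the `handle_select_frames` sketch above), appended to `messages` as a new user turn, and the generation step repeated; that multi-turn loop is beyond this quick-start example.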

## Citation

```bibtex
@misc{wang2025vrthinkerboostingvideoreward,
      title={VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning},
      author={Qunzhong Wang and Jie Liu and Jiajun Liang and Yilei Jiang and Yuanxing Zhang and Jinyuan Chen and Yaozhi Zheng and Xintao Wang and Pengfei Wan and Xiangyu Yue and Jiaheng Liu},
      year={2025},
      eprint={2510.10518},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.10518},
}
```