XenoZLH
/

Shuffle-R1-Qwen-3B

Safetensors

qwen2_5_vl

XenoZLH committed on Aug 28

Commit

1ce00fe

1 Parent(s): d13356a

Update README.md

README.md +101 -3

@@ -1,3 +1,101 @@
----
-license: apache-2.0
----

+---
+license: apache-2.0
+base_model:
+- Qwen/Qwen2.5-VL-3B-Instruct
+---
+# Shuffle-R1-Qwen-3B
+This is the model checkpoint of Shuffle-R1-Qwen-3B. It is trained from [**Qwen2.5-VL-3B-Instruct**](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct).
+## Model Performance
+| Model | MathVerse | MathVision | MathVista (mini) | WeMath (loose) | HallusionBench | ChartQA | Avg. |
+| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
+| Qwen2.5-VL-3B | 34.8 | 21.9 | 58.4 | 51.7 | 59.8 | 73.1 | 49.9 |
+| Qwen2.5-VL-7B | 42.6 | 25.8 | 67.4 | 63.5 | 65.2 | 79.8 | 57.4 |
+| Shuffle-R1-3B | 44.2 | 26.8 | 70.4 | 66.5 | 69.2 | 79.9 | 59.5 |
+| Shuffle-R1-7B | 53.9 | 30.0 | 77.0 | 72.3 | 71.0 | 84.1 | 64.7 |
+All models are evaluated with a chain-of-thought (CoT) prompt.
+## Inference
+### Using *Transformers*
+The process is the same as for [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL). For best results, add a "thinking prompt" at the beginning of the user query.
+```
+import torch
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info
+model_path = "path/to/your/checkpoint"
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    model_path,
+    torch_dtype=torch.bfloat16,
+    attn_implementation="flash_attention_2",
+    device_map="auto",
+)
+processor = AutoProcessor.from_pretrained(model_path)
+system_prompt = """
+You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}.
+"""
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "image": "path/to/your/image"},
+            {"type": "text", "text": system_prompt + "YOUR TEXT QUERY HERE"},
+        ],
+    }
+]
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+inputs = inputs.to(model.device)
+# Raise max_new_tokens if the <think> reasoning is cut off before the final answer.
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+generated_ids_trimmed = [
+    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
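Because the thinking prompt asks for reasoning inside `<think> </think>` tags and the final answer inside `\boxed{}`, it can be convenient to split the decoded text into those two parts. The following is a minimal sketch, not part of the official repository: the helper names `extract_boxed` and `split_reasoning_and_answer` are hypothetical, and it assumes the model follows the requested output format.

```python
import re
from typing import Optional, Tuple


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent.

    Braces are matched manually so nested LaTeX such as \\frac{1}{2} survives.
    """
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars).strip()
        chars.append(ch)
    # Unbalanced braces, e.g. the generation was truncated.
    return None


def split_reasoning_and_answer(response: str) -> Tuple[Optional[str], Optional[str]]:
    """Split a response into (reasoning, answer); either may be None if missing."""
    match = re.search(r"<think>(.*?)</think>", response, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else None
    return reasoning, extract_boxed(response)


if __name__ == "__main__":
    # Hypothetical response string, used only to illustrate the helpers.
    sample = "<think>Half of one is 1/2.</think> The answer is \\boxed{\\frac{1}{2}}."
    print(split_reasoning_and_answer(sample))
```

With the snippet above, `split_reasoning_and_answer(output_text[0])` would be applied to the first decoded string. If either part is `None`, the generation was probably truncated or did not follow the prompt format.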
+### Using *vLLM*
+Our model also supports inference using [**vLLM**](https://github.com/vllm-project/vllm).
+Please refer to our [**Official Repo**](https://github.com/xiaomi-research/shuffle-r1) for detailed instructions.
+## Citation
+If you find our work useful for your research, please consider citing:
+```
+@misc{zhu2025shuffler1,
+      title={Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle},
+      author={Linghao Zhu and Yiran Guan and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Bin Qin and Jian Luan and Yuliang Liu and Xiang Bai},
+      year={2025},
+      eprint={2508.05612},
+      archivePrefix={arXiv},
+      primaryClass={cs.LG},
+      url={https://arxiv.org/abs/2508.05612},
+}
+```