---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-VL-3B-Instruct
---

# Shuffle-R1-Qwen-3B

This is the model checkpoint of Shuffle-R1-Qwen-3B, trained from [**Qwen2.5-VL-3B-Instruct**](https://huggingface.co/Qwen/Qwen2.5-VL-3B-Instruct).

## Model Performance

| Model | MathVerse | MathVision | MathVista (mini) | WeMath (loose) | HallusionBench | ChartQA | Avg. |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Qwen2.5-VL-3B | 34.8 | 21.9 | 58.4 | 51.7 | 59.8 | 73.1 | 49.9 |
| Qwen2.5-VL-7B | 42.6 | 25.8 | 67.4 | 63.5 | 65.2 | 79.8 | 57.4 |
| Shuffle-R1-3B | 44.2 | 26.8 | 70.4 | 66.5 | 69.2 | 79.9 | 59.5 |
| Shuffle-R1-7B | 53.9 | 30.0 | 77.0 | 72.3 | 71.0 | 84.1 | 64.7 |

All models are evaluated with a CoT prompt.

## Inference

### Using *Transformers*

The process is the same as for [Qwen2.5-VL](https://github.com/QwenLM/Qwen2.5-VL). Note that it is better to prepend a "thinking prompt" to the user query, as in the example below.

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "path/to/your/checkpoint"

# Load the model in bfloat16 with FlashAttention-2 and automatic device placement.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

processor = AutoProcessor.from_pretrained(model_path)

# Thinking prompt: reason inside <think> </think>, then answer in \boxed{}.
system_prompt = """
You FIRST think about the reasoning process as an internal monologue and then provide the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. The final answer MUST BE put in \\boxed{}.
"""

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your/image"},
            {"type": "text", "text": system_prompt + "YOUR TEXT QUERY HERE"},
        ],
    }
]

# Build the chat-formatted prompt and collect the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to(model.device)

# Generate and decode only the newly generated tokens.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
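
The thinking prompt above asks the model to wrap its reasoning in `<think> </think>` tags and to put the final answer in `\boxed{}`, so downstream code usually needs to separate the two. The snippet below is a minimal post-processing sketch; `strip_think` and `extract_boxed_answer` are hypothetical helpers for illustration, not utilities from the official repo.

```python
import re


def strip_think(response: str) -> str:
    """Hypothetical helper: drop the <think> ... </think> reasoning block."""
    return re.sub(r"<think>.*?</think>", "", response, flags=re.DOTALL).strip()


def extract_boxed_answer(response: str) -> str:
    """Hypothetical helper: return the content of the last \\boxed{...} in the response.

    Scans characters to balance nested braces; falls back to the stripped response
    if no \\boxed{...} is present.
    """
    start = response.rfind("\\boxed{")
    if start == -1:
        return response.strip()
    i = start + len("\\boxed{")
    depth, chars = 1, []
    while i < len(response) and depth > 0:
        ch = response[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if depth > 0:
            chars.append(ch)
        i += 1
    return "".join(chars).strip()


# Example usage with the decoded output from above:
# answer = extract_boxed_answer(output_text[0])
# reasoning_free = strip_think(output_text[0])
```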

### Using *vLLM*

Our model also supports inference using [**vLLM**](https://github.com/vllm-project/vllm).

Please refer to our [**Official Repo**](https://github.com/xiaomi-research/shuffle-r1) for detailed instructions.
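
For a quick start, the following is a minimal sketch of offline inference with vLLM's multimodal API. It is not taken from the official repo: the sampling settings, `limit_mm_per_prompt`, and the way the prompt is rendered are assumptions here, so check the repo for the recommended setup.

```python
from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams

model_path = "path/to/your/checkpoint"
image_path = "path/to/your/image"

# The processor is only used to render the chat template (with image placeholder tokens).
processor = AutoProcessor.from_pretrained(model_path)

# Same thinking prompt as in the Transformers example above.
system_prompt = (
    "You FIRST think about the reasoning process as an internal monologue and then provide "
    "the final answer. The reasoning process MUST BE enclosed within <think> </think> tags. "
    "The final answer MUST BE put in \\boxed{}."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_path},
            {"type": "text", "text": system_prompt + "YOUR TEXT QUERY HERE"},
        ],
    }
]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# One image per prompt is assumed; raise limit_mm_per_prompt for multi-image queries.
llm = LLM(model=model_path, limit_mm_per_prompt={"image": 1})
sampling_params = SamplingParams(temperature=0.0, max_tokens=2048)

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": Image.open(image_path)}}],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```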

## Citation

If you find our work useful for your research, please consider citing:
```bibtex
@misc{zhu2025shuffler1,
      title={Shuffle-R1: Efficient RL framework for Multimodal Large Language Models via Data-centric Dynamic Shuffle},
      author={Linghao Zhu and Yiran Guan and Dingkang Liang and Jianzhong Ju and Zhenbo Luo and Bin Qin and Jian Luan and Yuliang Liu and Xiang Bai},
      year={2025},
      eprint={2508.05612},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2508.05612},
}
```