Multimodal Speculative Decoding
Collection
10 items • Updated • 5
How to use Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("image-text-to-text", model="Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct") # Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct")
model = AutoModelForImageTextToText.from_pretrained("Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct")How to use Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'docker model run hf.co/Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct
How to use Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'How to use Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct with Docker Model Runner:
docker model run hf.co/Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct
This model repo is part of a multimodal speculative decoding benchmark suite.
We maintain a unified benchmark codebase that includes multiple methods (Baseline, EAGLE, EAGLE2, Lookahead, MSD, ViSpec) so users can run training/evaluation more easily under one setup.
If you use this checkpoint and benchmark, please cite ViSpec and the original methods you compare against.
@inproceedings{vispec,
title={ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding},
author={Kang, Jialiang and Shu, Han and Li, Wenshuo and Zhai, Yingjie and Chen, Xinghao},
booktitle={Annual Conference on Neural Information Processing Systems},
year={2025}
}
@inproceedings{li2024eagle,
author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
title = {{EAGLE}: Speculative Sampling Requires Rethinking Feature Uncertainty},
booktitle = {International Conference on Machine Learning},
year = {2024}
}
@inproceedings{li2024eagle2,
author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
title = {{EAGLE-2}: Faster Inference of Language Models with Dynamic Draft Trees},
booktitle = {Empirical Methods in Natural Language Processing},
year = {2024}
}
@inproceedings{li2025eagle3,
author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
title = {{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
booktitle = {Annual Conference on Neural Information Processing Systems},
year = {2025}
}
Base model
Qwen/Qwen2.5-VL-7B-Instruct