Instructions to use Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct")

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct")
model = AutoModelForImageTextToText.from_pretrained("Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct")

Notebooks
Google Colab
Kaggle
Local Apps

vLLM

How to use Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker

docker model run hf.co/Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct

SGLang

How to use Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct with Docker Model Runner:
```
docker model run hf.co/Cloudriver/ViSpec-Qwen2.5-VL-7B-Instruct
```

ViSpec-Qwen2.5-VL-7B-Instruct (Benchmark Release)

This model repo is part of a multimodal speculative decoding benchmark suite.

Why this repo exists

We maintain a unified benchmark codebase that includes multiple methods (Baseline, EAGLE, EAGLE2, Lookahead, MSD, ViSpec) so users can run training/evaluation more easily under one setup.

The methods are aggregated here for user convenience (shared dataset format, scripts, and metrics).
The original ideas and implementations belong to their respective authors.
This specific Hugging Face repo hosts the ViSpec-Qwen2.5-VL-7B-Instruct checkpoint used in our benchmark runs.

Citation

If you use this checkpoint and benchmark, please cite ViSpec and the original methods you compare against.

ViSpec

@inproceedings{vispec,
  title={ViSpec: Accelerating Vision-Language Models with Vision-Aware Speculative Decoding},
  author={Kang, Jialiang and Shu, Han and Li, Wenshuo and Zhai, Yingjie and Chen, Xinghao},
  booktitle={Annual Conference on Neural Information Processing Systems},
  year={2025}
}

EAGLE / EAGLE2 / EAGLE3

@inproceedings{li2024eagle,
  author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
  title = {{EAGLE}: Speculative Sampling Requires Rethinking Feature Uncertainty},
  booktitle = {International Conference on Machine Learning},
  year = {2024}
}

@inproceedings{li2024eagle2,
  author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
  title = {{EAGLE-2}: Faster Inference of Language Models with Dynamic Draft Trees},
  booktitle = {Empirical Methods in Natural Language Processing},
  year = {2024}
}

@inproceedings{li2025eagle3,
  author = {Yuhui Li and Fangyun Wei and Chao Zhang and Hongyang Zhang},
  title = {{EAGLE-3}: Scaling up Inference Acceleration of Large Language Models via Training-Time Test},
  booktitle = {Annual Conference on Neural Information Processing Systems},
  year = {2025}
}

Other integrated baselines (links)

Lookahead Decoding: https://lmsys.org/blog/2023-11-21-lookahead-decoding/
MSD-LLaVA1.5-7B: https://huggingface.co/lucylyn/MSD-LLaVA1.5-7B
Medusa: https://github.com/FasterDecoding/Medusa

Notes

This model card focuses on benchmark usage and attribution.
For full benchmark code and scripts, please refer to the benchmark repository used in your experiment setup.