---
license: apple-amlr
library_name: ml-fastvlm
tags:
- transformers
---
# FastVLM: Efficient Vision Encoding for Vision Language Models
FastVLM was introduced in the CVPR 2025 paper *FastVLM: Efficient Vision Encoding for Vision Language Models*.
## Highlights
- We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
- Our smallest variant outperforms LLaVA-OneVision-0.5B, with an 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
- Our larger variants, which use the Qwen2-7B LLM, outperform recent works like Cambrian-1-8B while using a single image encoder, with a 7.9x faster TTFT.
## Evaluations
| Benchmark | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
|---|---|---|---|
| AI2D | 68.0 | 77.4 | 83.6 |
| ScienceQA | 85.2 | 94.4 | 96.7 |
| MMMU | 33.9 | 37.8 | 45.4 |
| VQAv2 | 76.3 | 79.1 | 80.8 |
| ChartQA | 76.0 | 80.1 | 85.0 |
| TextVQA | 64.5 | 70.4 | 74.9 |
| InfoVQA | 46.4 | 59.7 | 75.8 |
| DocVQA | 82.5 | 88.3 | 93.2 |
| OCRBench | 63.9 | 70.2 | 73.1 |
| RealWorldQA | 56.1 | 61.2 | 67.2 |
| SeedBench-Img | 71.0 | 74.2 | 75.4 |
## Usage Example
To run inference with the PyTorch checkpoint, follow the instructions in the [official repo](https://github.com/apple/ml-fastvlm):
Download the model:

```bash
huggingface-cli download apple/FastVLM-0.5B
```
Then run inference using `predict.py` from the official repo:

```bash
python predict.py --model-path /path/to/checkpoint-dir \
    --image-file /path/to/image.png \
    --prompt "Describe the image."
```
### Run inference with Transformers (Remote Code)
To run inference with the `transformers` library, load the model and processor with `trust_remote_code=True` and use the following snippet:
```python
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "apple/FastVLM-0.5B"

# The processor and model definitions live in the Hub repo, hence trust_remote_code=True.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Tokenize the chat template and preprocess the image in a single call.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)

out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=150,
)
print(processor.tokenizer.decode(out[0], skip_special_tokens=False))
```
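The snippet above runs on the default device (CPU). Below is a minimal sketch for GPU inference in half precision, assuming a CUDA device, the `accelerate` package for `device_map="auto"`, and that the checkpoint's remote code supports it:

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "apple/FastVLM-0.5B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to cut memory; bfloat16 also works on recent GPUs
    device_map="auto",          # requires `pip install accelerate`
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)  # move input tensors to the same device as the model

out = model.generate(**inputs, do_sample=False, max_new_tokens=150)
# skip_special_tokens=True strips chat-template markers from the printed output.
print(processor.tokenizer.decode(out[0], skip_special_tokens=True))
```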
## Citation
If you found this model useful, please cite the following paper:
```bibtex
@InProceedings{fastvlm2025,
  author    = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title     = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
```