---
license: apple-amlr
library_name: ml-fastvlm
tags:
- transformers
---
# FastVLM: Efficient Vision Encoding for Vision Language Models

FastVLM was introduced in
**[FastVLM: Efficient Vision Encoding for Vision Language Models](https://www.arxiv.org/abs/2412.13303)** (CVPR 2025).

<p align="center">
<img src="acc_vs_latency_qwen-2.png" alt="Accuracy vs latency figure." width="400"/>
</p>

### Highlights
* We introduce FastViTHD, a novel hybrid vision encoder designed to output fewer tokens and significantly reduce encoding time for high-resolution images.
* Our smallest variant outperforms LLaVA-OneVision-0.5B with an 85x faster Time-to-First-Token (TTFT) and a 3.4x smaller vision encoder.
* Our larger variants, which use the Qwen2-7B LLM, outperform recent works like Cambrian-1-8B while using a single image encoder with a 7.9x faster TTFT.
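
TTFT is the delay between submitting a request and receiving the first generated token, and an "Nx faster" figure is the ratio of two such delays. The sketch below shows one generic way to time it in Python. The names `time_to_first_token`, `speedup`, and `fake_stream` are hypothetical, the stand-in stream only simulates latency with `time.sleep`, and this is not the measurement protocol used in the paper.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def time_to_first_token(stream_factory: Callable[[], Iterable[Any]]) -> Tuple[Any, float]:
    """Return the first item of a token stream and the seconds it took to arrive."""
    start = time.perf_counter()
    stream = iter(stream_factory())
    first = next(stream)  # blocks until the first token is produced
    elapsed = time.perf_counter() - start
    return first, elapsed


def speedup(baseline_seconds: float, candidate_seconds: float) -> float:
    """How many times faster the candidate is than the baseline."""
    return baseline_seconds / candidate_seconds


def fake_stream():
    # Hypothetical stand-in for a real model's streaming generation.
    time.sleep(0.05)  # simulates image encoding plus prompt prefill
    yield "A"
    time.sleep(0.05)  # simulates decoding the next token
    yield " rabbit"


first, ttft = time_to_first_token(fake_stream)
print(f"first token: {first!r}, TTFT: {ttft * 1000:.1f} ms")
```

To time a real model, `stream_factory` would wrap whatever streaming generation call is in use. Only the time up to the first yielded token is counted.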


### Evaluations
| Benchmark     | FastVLM-0.5B | FastVLM-1.5B | FastVLM-7B |
|:--------------|:------------:|:------------:|:----------:|
| Ai2D          |     68.0     |     77.4     |    83.6    |
| ScienceQA     |     85.2     |     94.4     |    96.7    |
| MMMU          |     33.9     |     37.8     |    45.4    |
| VQAv2         |     76.3     |     79.1     |    80.8    |
| ChartQA       |     76.0     |     80.1     |    85.0    |
| TextVQA       |     64.5     |     70.4     |    74.9    |
| InfoVQA       |     46.4     |     59.7     |    75.8    |
| DocVQA        |     82.5     |     88.3     |    93.2    |
| OCRBench      |     63.9     |     70.2     |    73.1    |
| RealWorldQA   |     56.1     |     61.2     |    67.2    |
| SeedBench-Img |     71.0     |     74.2     |    75.4    |


### Usage Example
To run inference with the PyTorch checkpoint, follow the instructions in the official repo.

Download the model:
```bash
huggingface-cli download apple/FastVLM-0.5B
```

Run inference using `predict.py` from the official repo:
```bash
python predict.py --model-path /path/to/checkpoint-dir \
                  --image-file /path/to/image.png \
                  --prompt "Describe the image."
```
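
The two steps above can also be scripted from Python. The following is a minimal sketch, not part of the official repo: `download_checkpoint` and `build_predict_command` are hypothetical helper names, `snapshot_download` comes from the `huggingface_hub` package, and the sketch assumes the downloaded snapshot directory is what `--model-path` expects.

```python
import sys
from typing import List


def download_checkpoint(repo_id: str = "apple/FastVLM-0.5B") -> str:
    """Download the model repo into the local Hugging Face cache and return its directory."""
    # Imported lazily so the rest of this sketch works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id)


def build_predict_command(model_path: str, image_file: str, prompt: str) -> List[str]:
    """Build the argument list for predict.py using the flags shown above."""
    return [
        sys.executable, "predict.py",
        "--model-path", model_path,
        "--image-file", image_file,
        "--prompt", prompt,
    ]


cmd = build_predict_command(
    "/path/to/checkpoint-dir", "/path/to/image.png", "Describe the image."
)
print(cmd[1:])

# To actually run it from a checkout of the official repo (not executed here):
#   import subprocess
#   subprocess.run(
#       build_predict_command(download_checkpoint(), "/path/to/image.png", "Describe the image."),
#       check=True,
#   )
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or quotes.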

### Run inference with Transformers (Remote Code)
To run inference with Transformers, load the processor and model with `trust_remote_code=True`, as in the following snippet:

```python
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "apple/FastVLM-0.5B"

# trust_remote_code=True executes the model code shipped in the Hub repo.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
)

image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
# A single user turn containing one image and one text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image_url},
            {"type": "text", "text": "Describe this image in detail."},
        ]
    }
]

# Render the chat template and tokenize it into model-ready tensors.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
)

# Greedy decoding, capped at 150 new tokens.
out = model.generate(
    **inputs,
    do_sample=False,
    max_new_tokens=150,
)

# skip_special_tokens=False keeps special tokens in the printed text.
print(processor.tokenizer.decode(out[0], skip_special_tokens=False))
```
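
The snippet above decodes everything in `out[0]`. Whether that sequence echoes the prompt tokens before the generated ones depends on the model's remote `generate` implementation, which this card does not specify. The sketch below therefore strips the prompt only when the output actually starts with it. `strip_prompt` is a hypothetical helper, and the token ids are toy values rather than real tokenizer output.

```python
from typing import List, Sequence


def strip_prompt(output_ids: Sequence[int], prompt_ids: Sequence[int]) -> List[int]:
    """Return only the generated ids, dropping a leading copy of the prompt if present."""
    n = len(prompt_ids)
    if list(output_ids[:n]) == list(prompt_ids):
        return list(output_ids[n:])
    # The output does not start with the prompt, so it is already generation-only.
    return list(output_ids)


# Toy ids standing in for the tokenized prompt and for out[0].
prompt_ids = [101, 7, 8, 9]
echoed_output = prompt_ids + [42, 43, 2]
generation_only_output = [42, 43, 2]

print(strip_prompt(echoed_output, prompt_ids))
print(strip_prompt(generation_only_output, prompt_ids))
```

With the snippet above, and assuming `inputs` exposes an `input_ids` tensor, the call would look like `strip_prompt(out[0].tolist(), inputs["input_ids"][0].tolist())`. Decoding the result with `skip_special_tokens=True` additionally drops special tokens from the printed text.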

## Citation
If you find this model useful, please cite the following paper:
```
@InProceedings{fastvlm2025,
  author = {Pavan Kumar Anasosalu Vasu and Fartash Faghri and Chun-Liang Li and Cem Koc and Nate True and Albert Antony and Gokul Santhanam and James Gabriel and Peter Grasch and Oncel Tuzel and Hadi Pouransari},
  title = {FastVLM: Efficient Vision Encoding for Vision Language Models},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month = {June},
  year = {2025},
}
```