Model Card for CASA-Helium1-VL-2B
CASA (Project Page · arXiv · GitHub) stands for Cross-Attention via Self-Attention. CASA is a vision-language fusion paradigm that aims to improve on cross-attention while preserving its practical benefits.
Specifically, CASA layers inject visual tokens into a text stream through image-to-text cross-attention while additionally enabling text-to-text self-interaction in the same layer, constrained to smaller local attention windows. This simple modification enables natural gating in the cross-attention mechanism, improving its performance and substantially closing the gap to standard token insertion methods. For qualitative samples of CASA used for live video captioning, please check the associated Hugging Face space.
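To make the mechanism concrete, below is a minimal, single-head sketch in PyTorch of what a CASA-style attention step looks like: text queries attend jointly over visual keys/values and a local causal window of text keys/values inside one softmax, so the normalization itself gates between the two modalities. This is an illustrative simplification (unbatched, random projection weights, no multi-head or positional details) and not the released implementation.

```python
import torch
import torch.nn.functional as F


def casa_attention(text_h, image_h, w_q, w_k, w_v, window=64):
    """Single-head CASA-style attention: text queries attend jointly over
    visual tokens and a local causal window of text tokens."""
    T, d = text_h.shape
    V = image_h.shape[0]
    q = text_h @ w_q                                   # (T, d) text queries
    k_txt, v_txt = text_h @ w_k, text_h @ w_v          # text keys/values
    k_img, v_img = image_h @ w_k, image_h @ w_v        # visual keys/values

    # One score matrix over [visual tokens ; text tokens] for every text query.
    scores = torch.cat([q @ k_img.T, q @ k_txt.T], dim=-1) / d**0.5   # (T, V+T)

    # Text-to-text attention is causal and limited to a small local window;
    # all visual tokens stay visible to every text query.
    i = torch.arange(T)
    local_causal = (i[:, None] >= i[None, :]) & (i[:, None] - i[None, :] < window)
    mask = torch.cat([torch.ones(T, V, dtype=torch.bool), local_causal], dim=-1)
    scores = scores.masked_fill(~mask, float("-inf"))

    # A single softmax spans both modalities, so its normalization acts as a
    # soft gate between image information and the local text context.
    attn = F.softmax(scores, dim=-1)                   # (T, V+T)
    return attn @ torch.cat([v_img, v_txt], dim=0)     # (T, d)


d = 64
out = casa_attention(torch.randn(16, d), torch.randn(32, d),
                     torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([16, 64])
```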
Model Details
Model Description
This model page contains the model weights for CASA trained from a pretrained text-only Helium1-2B backbone and the image encoder from Qwen2.5-VL-3B. In the collection, we also provide weights for:
- CASA-Qwen2_5-VL-3B: A CASA model adapted from the full pretrained Qwen2.5-VL-3B (the backbone LLM weights are kept frozen).
- CASA-Qwen2_5-VL-3B-LiveCC: A CASA model adapted from the full pretrained Qwen2.5-VL-3B and further finetuned for live video captioning.
- Helium1-VL-2B: A reference VLM trained from Helium1-2B with a standard token insertion mechanism in the same setting as CASA-Helium1-VL-2B.
Model Summary:
- Developed by: Kyutai
- Model type: Multimodal vision+text model based on Cross-Attention
- Language(s) (NLP): English
- License: CC-BY-NC-SA-4.0
- LLM Backbone from: Helium1 2B
- Image Encoder from: Qwen2.5-VL 3B
- Terms of use: As the released models include frozen weights of the Qwen2.5-VL-3B image encoder, these weights are subject to the Qwen RESEARCH LICENSE AGREEMENT
Model Sources
- Project Page: kyutai.org/casa
- Preprint: arXiv
- Repository: GitHub kyutai-labs/casa
Uses
Direct Use
The intended use of these models is research and development of vision-language systems, including but not limited to image or video understanding.
CASA-Helium1-VL-2B, Helium1-VL-2B and CASA-Qwen2_5-VL-3B can be used as vision-language models to analyze or interpret images given as input.
CASA-Qwen2_5-VL-3B-LiveCC can be used as a vision-language model on streaming videos as inputs at 2 fps.
The models are primarily intended for use in English. For most downstream use cases, the model should be aligned with supervised fine-tuning, RLHF or related methods.
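As a side note on the 2 fps streaming setup mentioned above, frame sampling itself can be done with standard tooling. The snippet below is a minimal sketch using torchvision and is independent of the model-specific processing code; the file name is a placeholder.

```python
import torchvision

# Decode a clip and keep roughly two frames per second before handing them
# to the vision-language pipeline. "clip.mp4" is a placeholder path.
frames, _, info = torchvision.io.read_video("clip.mp4", pts_unit="sec")  # (T, H, W, C) uint8
step = max(int(round(info["video_fps"] / 2.0)), 1)
frames_2fps = frames[::step]
```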
Out-of-Scope Use
The model should not be used in languages other than the ones it was trained on. The model is not intended to be used to impersonate other people or for any malicious use of any kind.
Bias, Risks, and Limitations
Our CASA-Helium1 model was not aligned to human preferences. As such, the model can generate incorrect, biased, harmful or generally unhelpful content. Thus, the model should not be used for downstream applications without further alignment, evaluations and mitigations of risks.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
See our github repository for additional scripts to perform benchmark evaluation and live video captioning.
Below is a short snippet to show you how to load our models, process inputs, and run inference, using a standard HuggingFace transformers pipeline and chat template.
# Minimal requirements:
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "rich",
# "einops>=0.8.1",
# "torch==2.7.0",
# "transformers==4.51.3",
# "torchvision==0.22.0",
# "flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp310-cp310-linux_x86_64.whl"
# ]
# ///
import torch
from transformers.models.auto.modeling_auto import AutoModel
from transformers.models.auto.processing_auto import AutoProcessor

model_id = "kyutai/CASA-Helium1-VL-2B"

# Load the model (custom code shipped with the checkpoint) and its processor.
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).cuda()
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# Build a chat-style conversation with one image and a text prompt.
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "assets/casa_model.png",
            },
            {
                "type": "text",
                "text": "Describe this image.",
            },
        ],
    },
]

# Tokenize the conversation and move the inputs to the model device.
inputs = processor.tokenize_messages(messages=conversation)
inputs = inputs.to(model.device)
input_len = inputs["input_ids"].shape[1]

# Generate a response and decode only the newly generated tokens.
output_ids = model.generate_from_image(
    **inputs,
    max_new_tokens=512,
    pre_image_tokens=processor.pre_image_tokens,
    post_image_tokens=processor.post_image_tokens,
    eos_token_id=model.generation_config.eos_token_id,
)[0, input_len:]
response = processor.tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
Training Details
Please have a look at our associated research paper for details on the training pipeline.
Training Data
To train our CASA-Helium models, we use the FineVision dataset as well as a small, non-overlapping subset of Llava-OneVision-1.5-Instruct.
Evaluation
We evaluate our models on a range of benchmarks covering document understanding (DocVQA), chart understanding (ChartQA, InfoVQA),
visual text reading (TextVQA, OCRBench), and general QA (RealWorldQA, AI2D, GQA, MME). Results are reported below. Please refer to our project page and arXiv paper for additional evaluations.
| Model | ChartQA | DocVQA | InfoVQA | OCRBench | TextVQA | RealWorldQA | AI2D | GQA | MME |
|---|---|---|---|---|---|---|---|---|---|
| Helium1-VL-2B | 81.6 | 89.1 | 61.8 | 728 | 75.5 | 59.9 | 67.7 | 55.5 | 1732 |
| CASA-Helium1-VL-2B | 73.4 | 83.7 | 48.6 | 723 | 71.0 | 58.3 | 63.3 | 54.6 | 1572 |
| mPLUG-Owl3 8B | 59.2† | 55.9† | 36.8† | 527† | 69.0 | 63.9† | 73.4 | 65.0 | 1940† |
| mPLUG-Owl3 2B | 48.5† | 48.2† | 28.1† | 450† | 62.6 | 56.9† | 62.6 | 61.0 | 1551† |
† Reproduced with the publicly available models on Hugging Face.
Results for CASA-Helium1-VL-2B compared to a recent cross-attention baseline (mPLUG-Owl3) and to our token-insertion reference (Helium1-VL-2B) trained in the same conditions. CASA outperforms current SoTA cross-attention-based VLMs, narrowing the gap to insertion-based approaches.
| Model | ChartQA | DocVQA | InfoVQA | OCRBench | TextVQA | RealWorldQA | AI2D | GQA | MME |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | 84.0 | 93.6 | 77.1 | 797 | 79.3 | 62.2† | 81.6 | 61.0† | 2249† |
| CASA-Qwen2_5-VL-3B | 82.4 | 88.9 | 59.6 | 790 | 77.4 | 62.5 | 75.1 | 59.4 | 1918 |
† Reproduced with the publicly available models on Hugging Face.
Results for CASA-Qwen2_5-VL-3B, adapted from the frozen Qwen2.5-VL backbone. CASA reaches performance close to the original insertion-based model while training only the CASA layers and the last blocks of the image encoder.
Technical Specifications
Compute Infrastructure
CASA-Helium1-VL-2B was trained starting from a Helium1-2B LLM and the image encoder from Qwen2.5-VL-3B.
We finetune the whole LLM backbone as well as the last four blocks of the image encoder.
The currently released model was trained on four DGX nodes with 8 H100 GPUs each.
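As a rough illustration of this partial-finetuning setup, the sketch below shows one way to freeze an image encoder except its last four blocks while keeping the language backbone trainable. The attribute names (`image_encoder`, `blocks`, `llm`) are hypothetical placeholders and do not necessarily match the released code.

```python
import torch.nn as nn


def configure_trainable(model: nn.Module, n_last_vision_blocks: int = 4) -> None:
    """Freeze the image encoder except its last few transformer blocks;
    the LLM backbone stays fully trainable (as in the CASA-Helium setting)."""
    # Freeze the whole image encoder first.
    for p in model.image_encoder.parameters():
        p.requires_grad = False
    # Re-enable gradients for the last `n_last_vision_blocks` blocks.
    for block in model.image_encoder.blocks[-n_last_vision_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
    # The language backbone is finetuned end to end in this setting.
    for p in model.llm.parameters():
        p.requires_grad = True
```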
Software
Our training and inference code was implemented in PyTorch.
Citation
@article{kyutai2025casa,
author = {Moritz Böhle and Amélie Royer and Juliette Marrie and Edouard Grave and Patrick Pérez},
year = {2025},
title = {CASA: Cross-Attention via Self-Attention},
journal = {ArXiv},
url = {https://arxiv.org/abs/2512.19535}
}
Model Card Authors and Contact
- Amelie Royer
- Moritz Boehle
- Juliette Marrie