Model Card for CASA-Helium1-VL-2B
CASA (Project Page · arXiv · GitHub) stands for Cross-Attention via Self-Attention. CASA is a vision-language fusion paradigm that aims to improve on cross-attention while preserving its practical benefits.
Specifically, CASA layers inject visual tokens into a text stream through image-to-text cross-attention while additionally enabling text-to-text self-interaction in the same layer, constrained to smaller local attention windows. This simple modification enables natural gating in the cross-attention mechanism, improving its performance and substantially closing the gap to standard token insertion methods. For qualitative samples of CASA used for live video captioning, please check the associated Hugging Face space.
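To make the mechanism concrete, below is a minimal, single-head sketch in PyTorch of what a CASA-style attention step looks like: text queries attend jointly over visual keys/values and a local causal window of text keys/values inside one softmax, so the normalization itself gates between the two modalities. This is an illustrative simplification (unbatched, random projection weights, no multi-head or positional details) and not the released implementation.

```python
import torch
import torch.nn.functional as F


def casa_attention(text_h, image_h, w_q, w_k, w_v, window=64):
    """Single-head CASA-style attention: text queries attend jointly over
    visual tokens and a local causal window of text tokens."""
    T, d = text_h.shape
    V = image_h.shape[0]
    q = text_h @ w_q                                   # (T, d) text queries
    k_txt, v_txt = text_h @ w_k, text_h @ w_v          # text keys/values
    k_img, v_img = image_h @ w_k, image_h @ w_v        # visual keys/values

    # One score matrix over [visual tokens ; text tokens] for every text query.
    scores = torch.cat([q @ k_img.T, q @ k_txt.T], dim=-1) / d**0.5   # (T, V+T)

    # Text-to-text attention is causal and limited to a small local window;
    # all visual tokens stay visible to every text query.
    i = torch.arange(T)
    local_causal = (i[:, None] >= i[None, :]) & (i[:, None] - i[None, :] < window)
    mask = torch.cat([torch.ones(T, V, dtype=torch.bool), local_causal], dim=-1)
    scores = scores.masked_fill(~mask, float("-inf"))

    # A single softmax spans both modalities, so its normalization acts as a
    # soft gate between image information and the local text context.
    attn = F.softmax(scores, dim=-1)                   # (T, V+T)
    return attn @ torch.cat([v_img, v_txt], dim=0)     # (T, d)


d = 64
out = casa_attention(torch.randn(16, d), torch.randn(32, d),
                     torch.randn(d, d), torch.randn(d, d), torch.randn(d, d))
print(out.shape)  # torch.Size([16, 64])
```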
Model Details
Model Description
This model page contains the model weights for CASA trained from a pretrained text-only Helium1-2B backbone and the image encoder from Qwen2.5-VL-3B. In the collection, we also provide weights for:
- CASA-Qwen2_5-VL-3B: A CASA model adapted from the full pretrained Qwen2.5-VL-3B (the backbone LLM weights are kept frozen).
- CASA-Qwen2_5-VL-3B-LiveCC: A CASA model adapted from the full pretrained Qwen2.5-VL-3B and further finetuned for live video captioning.
- Helium1-VL-2B: A reference VLM trained from Helium1-2B with a standard token insertion mechanism in the same setting as CASA-Helium1-VL-2B.
Model Summary:
- Developed by: Kyutai
- Model type: Multimodal vision+text model based on Cross-Attention
- Language(s) (NLP): English
- License: CC-BY-NC-SA-4.0
- LLM Backbone from: Helium1 2B
- Image Encoder from: Qwen2.5-VL 3B
- Terms of use: As the released models include frozen weights of the Qwen2.5-VL-3B image encoder, these weights are subject to the Qwen RESEARCH LICENSE AGREEMENT
Model Sources
- Project Page: kyutai.org/casa
- Preprint: arXiv
- Repository: GitHub kyutai-labs/casa
Uses
Direct Use
The intended use of these models is research and development of vision-language systems, including but not limited to image or video understanding.
CASA-Helium1-VL-2B, Helium1-VL-2B and CASA-Qwen2_5-VL-3B can be used as vision-language models to analyze or interpret images given as input.
CASA-Qwen2_5-VL-3B-LiveCC can be used as a vision-language model on streaming videos as inputs at 2 fps.
The models are primarily intended for use in English. For most downstream use cases, the model should be aligned with supervised fine-tuning, RLHF or related methods.
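As a side note on the 2 fps streaming setup mentioned above, frame sampling itself can be done with standard tooling. The snippet below is a minimal sketch using torchvision and is independent of the model-specific processing code; the file name is a placeholder.

```python
import torchvision

# Decode a clip and keep roughly two frames per second before handing them
# to the vision-language pipeline. "clip.mp4" is a placeholder path.
frames, _, info = torchvision.io.read_video("clip.mp4", pts_unit="sec")  # (T, H, W, C) uint8
step = max(int(round(info["video_fps"] / 2.0)), 1)
frames_2fps = frames[::step]
```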
Out-of-Scope Use
The model should not be used in languages other than the ones it was trained on. The model is not intended to be used to impersonate other people or for any malicious use of any kind.
Bias, Risks, and Limitations
Our CASA-Helium1 model was not aligned to human preferences. As such, the model can generate incorrect, biased, harmful or generally unhelpful content. Thus, the model should not be used for downstream applications without further alignment, evaluations and mitigations of risks.
Recommendations
Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
How to Get Started with the Model
See our github repository for additional scripts to perform benchmark evaluation and live video captioning.
Below is a short snippet to show you how to load our models, process inputs, and run inference, using a standard HuggingFace transformers pipeline and chat template.
# Minimal requirements:
# /// script
# requires-python = ">=3.10"
# dependencies = [
# "rich",
# "einops>=0.8.1",
# "torch==2.7.0",
# "transformers==4.51.3",
# "torchvision==0.22.0",
# "flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.0.post2/flash_attn-2.8.0.post2+cu12torch2.7cxx11abiTRUE-cp310-cp310-linux_x86_64.whl"
# ]
# ///
import torch
from transformers.models.auto.modeling_auto import AutoModel
from transformers.models.auto.processing_auto import AutoProcessor

model_id = "kyutai/CASA-Helium1-VL-2B"

# Load the model (custom code shipped with the checkpoint) and its processor.
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
).cuda()
processor = AutoProcessor.from_pretrained(
    model_id,
    trust_remote_code=True,
)

# Build a chat-style conversation with one image and a text prompt.
conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "assets/casa_model.png",
            },
            {
                "type": "text",
                "text": "Describe this image.",
            },
        ],
    },
]

# Tokenize the conversation and move the inputs to the model device.
inputs = processor.tokenize_messages(messages=conversation)
inputs = inputs.to(model.device)
input_len = inputs["input_ids"].shape[1]

# Generate a response and decode only the newly generated tokens.
output_ids = model.generate_from_image(
    **inputs,
    max_new_tokens=512,
    pre_image_tokens=processor.pre_image_tokens,
    post_image_tokens=processor.post_image_tokens,
    eos_token_id=model.generation_config.eos_token_id,
)[0, input_len:]
response = processor.tokenizer.decode(output_ids, skip_special_tokens=True)
print(response)
Training Details
Please have a look at our associated research paper for details on the training pipeline.
Training Data
To train our CASA-Helium models, we use the FineVision dataset as well as a small, non-overlapping subset of Llava-OneVision-1.5-Instruct.
Evaluation
We evaluate our models on a range of benchmarks covering document understanding (DocVQA), chart understanding (ChartQA, InfoVQA),
visual text reading (TextVQA, OCRBench), and general QA (RealWorldQA, AI2D, GQA, MME). Results are reported below. Please refer to our project page and arXiv paper for additional evaluations.
| Model | ChartQA | DocVQA | InfoVQA | OCRBench | TextVQA | RealWorldQA | AI2D | GQA | MME |
|---|---|---|---|---|---|---|---|---|---|
| Helium1-VL-2B | 81.6 | 89.1 | 61.8 | 728 | 75.5 | 59.9 | 67.7 | 55.5 | 1732 |
| CASA-Helium1-VL-2B | 73.4 | 83.7 | 48.6 | 723 | 71.0 | 58.3 | 63.3 | 54.6 | 1572 |
| mPLUG-Owl3 8B | 59.2† | 55.9† | 36.8† | 527† | 69.0 | 63.9† | 73.4 | 65.0 | 1940† |
| mPLUG-Owl3 2B | 48.5† | 48.2† | 28.1† | 450† | 62.6 | 56.9† | 62.6 | 61.0 | 1551† |
† Reproduced with the publicly available models on Hugging Face.
Results for CASA-Helium1-VL-2B compared to a recent cross-attention baseline (mPLUG-Owl3) and to our token-insertion reference (Helium1-VL-2B) trained in the same conditions. CASA outperforms current SoTA cross-attention-based VLMs, narrowing the gap to insertion-based approaches.
| Model | ChartQA | DocVQA | InfoVQA | OCRBench | TextVQA | RealWorldQA | AI2D | GQA | MME |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-3B | 84.0 | 93.6 | 77.1 | 797 | 79.3 | 62.2† | 81.6 | 61.0† | 2249† |
| CASA-Qwen2_5-VL-3B | 82.4 | 88.9 | 59.6 | 790 | 77.4 | 62.5 | 75.1 | 59.4 | 1918 |
† Reproduced with the publicly available models on Hugging Face.
Results for CASA-Qwen2_5-VL-3B, adapted from the frozen Qwen2.5-VL backbone. CASA reaches performance close to the original insertion-based model while training only the CASA layers and the last blocks of the image encoder.
Technical Specifications
Compute Infrastructure
CASA-Helium1-VL-2B was trained starting from a Helium1-2B LLM and the image encoder from Qwen2.5-VL-3B.
We finetune the whole LLM backbone as well as the last four blocks of the image encoder.
The currently released model was trained on four DGX nodes with 8 H100 GPUs each.
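As a rough illustration of this partial-finetuning setup, the sketch below shows one way to freeze an image encoder except its last four blocks while keeping the language backbone trainable. The attribute names (`image_encoder`, `blocks`, `llm`) are hypothetical placeholders and do not necessarily match the released code.

```python
import torch.nn as nn


def configure_trainable(model: nn.Module, n_last_vision_blocks: int = 4) -> None:
    """Freeze the image encoder except its last few transformer blocks;
    the LLM backbone stays fully trainable (as in the CASA-Helium setting)."""
    # Freeze the whole image encoder first.
    for p in model.image_encoder.parameters():
        p.requires_grad = False
    # Re-enable gradients for the last `n_last_vision_blocks` blocks.
    for block in model.image_encoder.blocks[-n_last_vision_blocks:]:
        for p in block.parameters():
            p.requires_grad = True
    # The language backbone is finetuned end to end in this setting.
    for p in model.llm.parameters():
        p.requires_grad = True
```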
Software
Our training and inference code was implemented in PyTorch.
Citation
@article{kyutai2025casa,
author = {Moritz Böhle and Amélie Royer and Juliette Marrie and Edouard Grave and Patrick Pérez},
year = {2025},
title = {CASA: Cross-Attention via Self-Attention},
journal = {ArXiv},
url = {https://arxiv.org/abs/2512.19535}
}
Model Card Authors and Contact
- Amelie Royer
- Moritz Boehle
- Juliette Marrie