---
license: apache-2.0
base_model: Qwen/Qwen2.5-3B-Instruct
library_name: transformers
pipeline_tag: image-to-text
tags:
- multimodal
- video-understanding
- spatial-reasoning
- vision-language
datasets:
- nyu-visionx/VSI-590K
language:
- en
---
# Cambrian-S-3B
Website | Paper | GitHub | Cambrian-S Family
Authors: Shusheng Yang*, Jihan Yang*, Pinzhi Huang†, Ellis Brown†, et al.
Cambrian-S-3B is a spatially grounded multimodal large language model that excels at spatial reasoning in video understanding. It achieves state-of-the-art results on visual-spatial benchmarks while remaining competitive on general video understanding tasks.
## Model Details
- Architecture: Qwen2.5-3B-Instruct + SigLIP2-SO400M vision encoder + 2-layer MLP adapter (a minimal connector sketch follows this list)
- Parameters: 3B
- Vision Encoder: SigLIP2-SO400M at 384×384 input resolution
- Training: 4-stage pipeline (image alignment → image instruction tuning → video instruction tuning → spatial instruction tuning)
- Training Data: VSI-590K (spatial reasoning) plus general video instruction data
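To make the architecture bullet concrete, below is a minimal sketch of a 2-layer MLP connector that projects SigLIP2 patch features into the Qwen2.5 embedding space. The feature widths (1152 for SigLIP2-SO400M, 2048 for Qwen2.5-3B), the GELU activation, and the 27×27 patch grid are assumptions made for illustration only; the actual module is defined in the GitHub repository.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Illustrative 2-layer MLP adapter: vision patch features -> LLM embedding space."""
    def __init__(self, vision_dim: int = 1152, llm_dim: int = 2048):  # dims assumed, not from this card
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),                      # activation assumed; check the repo for the real choice
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from the SigLIP2 encoder
        return self.proj(vision_feats)      # (batch, num_patches, llm_dim), spliced into the LLM input

# Example with a dummy batch of 729 patch tokens (27x27 grid at 384 px, assumed)
connector = VisionLanguageConnector()
tokens = connector(torch.randn(1, 729, 1152))
print(tokens.shape)  # torch.Size([1, 729, 2048])
```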
## Usage
```python
from PIL import Image

from cambrian.conversation import conv_templates
from cambrian.mm_utils import process_images, tokenizer_image_token
from cambrian.model.builder import load_pretrained_model

model_path = "nyu-visionx/Cambrian-S-3B"
tokenizer, model, image_processor, _ = load_pretrained_model(model_path, None, "cambrian-s-3b", device_map="cuda")

# Process image/video frames (a single image shown here)
image = Image.open("example.jpg").convert("RGB")
image_tensor = process_images([image], image_processor, model.config)  # move to model dtype/device if needed
image_sizes = [image.size]

# Build the prompt; <image> marks where visual features are inserted
conv = conv_templates["qwen_2"].copy()
conv.append_message(conv.roles[0], "<image>\nWhat objects are in this scene?")
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()
input_ids = tokenizer_image_token(prompt, tokenizer, return_tensors="pt").unsqueeze(0).to(model.device)

# Generate
output_ids = model.generate(input_ids, images=image_tensor, image_sizes=image_sizes, max_new_tokens=512)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```
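The snippet above handles a single image. For video, frames are sampled and passed through `process_images` in the same way; the sketch below shows one way to extract frames, assuming `decord` for decoding and uniform sampling of a fixed frame count (neither the frame count nor the repository's actual video loader is specified on this card, so treat `sample_frames` as a hypothetical helper).

```python
import numpy as np
from decord import VideoReader, cpu
from PIL import Image

def sample_frames(video_path: str, num_frames: int = 32) -> list[Image.Image]:
    """Uniformly sample frames from a video and return them as PIL images (illustrative helper)."""
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int).tolist()
    frames = vr.get_batch(indices).asnumpy()          # (num_frames, H, W, 3) uint8
    return [Image.fromarray(f) for f in frames]

frames = sample_frames("example_video.mp4")
# image_tensor = process_images(frames, image_processor, model.config)
# image_sizes = [f.size for f in frames]
```

See the GitHub repository for the exact video preprocessing used at training and inference time.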
## Citation
```bibtex
@article{yang2025cambrian,
  title={Cambrian-S: Towards Spatial Supersensing in Video},
  author={Yang, Shusheng and Yang, Jihan and Huang, Pinzhi and Brown, Ellis and Yang, Zihao and Yu, Yue and Tong, Shengbang and Zheng, Zihan and Xu, Yifan and Wang, Muhan and Lu, Danhao and Fergus, Rob and LeCun, Yann and Fei-Fei, Li and Xie, Saining},
  journal={arXiv preprint arXiv:2511.04670},
  year={2025}
}
```