---
license: apache-2.0
tags:
- video LLM
---

# Tarsier Model Card

## Introduction

We propose Tarsier2-7B(-0115) as the latest member of the Tarsier series. Tarsier2-7B sets new state-of-the-art results across 16 public video understanding benchmarks, spanning tasks such as video captioning, video question answering, video grounding, and hallucination tests. On the Tarsier series' signature capability, detailed video description, Tarsier2-7B consistently outperforms leading proprietary models, including GPT-4o and Gemini 1.5 Pro, in both automatic metrics and human evaluation.

Compared to [Tarsier-7B](https://huggingface.co/omni-research/Tarsier-7b), Tarsier2-7B is comprehensively upgraded in its base model ([Qwen2-VL-7B](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)) and its **training data & stages**:

- Pre-train: We scale up the training data to 40M video-text pairs, increasing both volume and diversity.
- SFT: Fine-grained temporal alignment is performed during supervised fine-tuning.
- DPO: Preference data is constructed automatically via model-based sampling and used for DPO training.

## Model details

- Base Model: [Qwen2-VL-7B-Instruct](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct)
- Training Data:
  - Pre-train: Over 40M samples mixing video, image, and text data, with 20.4M open-source and 19.8M in-house. Details as follows:

Figure 1: Summary of datasets used in the pre-training stage of Tarsier2.
  - Post-train: 150K human-annotated detailed video descriptions for SFT and 20K automatically sampled and filtered preference pairs for DPO.

**Model date:** Tarsier2-Recap-7b was trained in December 2024.

**Paper or resources for more information:**
- online demo: https://huggingface.co/spaces/omni-research/Tarsier2-7b
- github repo: https://github.com/bytedance/tarsier/tree/tarsier2
- paper link: https://arxiv.org/abs/2501.07888
- leaderboard: https://tarsier-vlm.github.io/

## Performance

Tarsier2-7B excels in various video understanding tasks, including video captioning, video question answering, video grounding, hallucination tests, etc.

Figure 2: Performance comparison of Tarsier2 with previous SOTA models at 7B-scale and GPT-4o.
## License

This model is released under the Qwen/Qwen2-VL-7B-Instruct license.

## Intended use

**Primary intended uses:** The primary use of Tarsier is research on large multimodal models, especially video description.

**Primary intended users:** The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## How to Use

See https://github.com/bytedance/tarsier?tab=readme-ov-file#usage for the officially supported inference code; a minimal, hedged quick-start sketch is also included at the end of this card.

**Where to send questions or comments about the model:** https://github.com/bytedance/tarsier/issues

## Citation

If you find our work helpful, feel free to cite us as:

```BibTeX
@misc{yuan2025tarsier2advancinglargevisionlanguage,
      title={Tarsier2: Advancing Large Vision-Language Models from Detailed Video Description to Comprehensive Video Understanding},
      author={Liping Yuan and Jiawei Wang and Haomiao Sun and Yuchen Zhang and Yuan Lin},
      year={2025},
      eprint={2501.07888},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2501.07888},
}
```
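## Quick start (unofficial sketch)

The snippet below is a minimal sketch, not the official pipeline from the GitHub repo linked in *How to Use*. It assumes the released checkpoint loads with the standard Qwen2-VL classes in `transformers` and that the `qwen-vl-utils` helper package is installed; the repository id `omni-research/Tarsier2-7b`, the video path, the sampling fps, and the prompt are placeholder assumptions. Consult the linked usage guide for the confirmed setup.

```python
# Hedged quick-start sketch: assumes drop-in compatibility with the standard
# Qwen2-VL classes in `transformers` and an installed `qwen-vl-utils` package.
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

MODEL_ID = "omni-research/Tarsier2-7b"  # assumed id; check the repo/demo for the exact name

model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# One video plus one instruction, in the Qwen2-VL chat format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "path/to/your_video.mp4", "fps": 1.0},
            {"type": "text", "text": "Describe the video in detail."},
        ],
    }
]

# Build the prompt string and extract frames, then pack everything into tensors.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, strip the prompt tokens, and decode the description.
output_ids = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```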