CircleRadon committed
Commit 5974a35 · verified · 1 Parent(s): f0de86e

Update README.md

Files changed (1):
  1. README.md +79 -3
README.md CHANGED

---
license: apache-2.0
language:
- en
metrics:
- accuracy
library_name: transformers
pipeline_tag: video-text-to-text
tags:
- multimodal large language model
- large video-language model
base_model:
- DAMO-NLP-SG/VideoLLaMA3-2B-Image
---

<p align="center">
    <img src="https://cdn-uploads.huggingface.co/production/uploads/64a3fe3dde901eb01df12398/CiwESNwyy7VTooOifRgiQ.png" width="70%" style="margin-bottom: 0.2;"/>
</p>

<h3 align="center"><a href="http://arxiv.org/abs/2510.23603" style="color:#4D2B24">
PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity</a></h3>

<div align="center">

![Static Badge](https://img.shields.io/badge/PixelRefer-v1-F7C97E)
[![arXiv preprint](https://img.shields.io/badge/arxiv-2510.23603-ECA8A7?logo=arxiv)](https://arxiv.org/abs/2510.23603)
[![Dataset](https://img.shields.io/badge/Dataset-Hugging_Face-E59FB6)](https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-700K)
[![Model](https://img.shields.io/badge/Model-Hugging_Face-CFAFD4)](https://huggingface.co/collections/Alibaba-DAMO-Academy/pixelrefer)
[![Benchmark](https://img.shields.io/badge/Benchmark-Hugging_Face-96D03A)](https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-Bench)

[![Homepage](https://img.shields.io/badge/Homepage-visit-9DC3E6)](https://circleradon.github.io/PixelRefer/)
[![Huggingface](https://img.shields.io/badge/Demo-HuggingFace-E6A151)](https://huggingface.co/spaces/lixin4ever/PixelRefer)
</div>

## 📰 News
* **[2025.10.28]** 🔥 We release PixelRefer.
* **[2025.6.19]** 🔥 We release the [demo](https://huggingface.co/spaces/lixin4ever/VideoRefer-VideoLLaMA3) of VideoRefer-VideoLLaMA3, hosted on HuggingFace. Feel free to try it!
* **[2025.6.18]** 🔥 We release a new version of VideoRefer ([VideoRefer-VideoLLaMA3-7B](https://huggingface.co/DAMO-NLP-SG/VideoRefer-VideoLLaMA3-7B) and [VideoRefer-VideoLLaMA3-2B](https://huggingface.co/DAMO-NLP-SG/VideoRefer-VideoLLaMA3-2B)), trained on top of [VideoLLaMA3](https://github.com/DAMO-NLP-SG/VideoLLaMA3).
* **[2025.4.22]** 🔥 Our VideoRefer-Bench has been adopted by the [Describe Anything Model](https://arxiv.org/pdf/2504.16072) (NVIDIA & UC Berkeley).
* **[2025.2.27]** 🔥 VideoRefer Suite has been accepted to CVPR 2025!
* **[2025.2.18]** 🔥 We release the [VideoRefer-700K dataset](https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-700K) on HuggingFace.
* **[2025.1.1]** 🔥 We release [VideoRefer-7B](https://huggingface.co/DAMO-NLP-SG/VideoRefer-7B), the code of VideoRefer, and the [VideoRefer-Bench](https://huggingface.co/datasets/DAMO-NLP-SG/VideoRefer-Bench).

## 🌏 Model Zoo

| Model Name | Visual Encoder | Language Decoder |
|:----------------|:----------------|:------------------|
| [PixelRefer-7B](https://huggingface.co/Alibaba-DAMO-Academy/PixelRefer-7B) | [VL3-SigLIP-NaViT](https://huggingface.co/DAMO-NLP-SG/VL3-SigLIP-NaViT) | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| [PixelRefer-2B](https://huggingface.co/Alibaba-DAMO-Academy/PixelRefer-2B) | [VL3-SigLIP-NaViT](https://huggingface.co/DAMO-NLP-SG/VL3-SigLIP-NaViT) | [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |
| [PixelRefer-Lite-7B](https://huggingface.co/Alibaba-DAMO-Academy/PixelRefer-Lite-7B) | [VL3-SigLIP-NaViT](https://huggingface.co/DAMO-NLP-SG/VL3-SigLIP-NaViT) | [Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct) |
| [PixelRefer-Lite-2B](https://huggingface.co/Alibaba-DAMO-Academy/PixelRefer-Lite-2B) | [VL3-SigLIP-NaViT](https://huggingface.co/DAMO-NLP-SG/VL3-SigLIP-NaViT) | [Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |
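
The card metadata declares `library_name: transformers` and `pipeline_tag: video-text-to-text`, so the checkpoints above are intended to be loaded through the `transformers` auto classes. The snippet below is only a minimal, hedged sketch: it assumes the repositories ship custom modeling/processing code loadable with `trust_remote_code=True`, and it does not cover the prompt format or how object regions and masks are supplied; see the homepage and demo linked above for the full inference pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Assumption: the PixelRefer repositories expose custom code that the
# transformers auto classes can load via trust_remote_code.
model_id = "Alibaba-DAMO-Academy/PixelRefer-2B"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,  # pick a dtype supported by your hardware
    device_map="auto",           # requires the `accelerate` package
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
```

The 2B checkpoint is used here only because it is the lightest to download; under these assumptions, any model in the table above should load the same way.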

## 📑 Citation

If you find PixelRefer or VideoRefer Suite useful for your research and applications, please cite using this BibTeX:
```bibtex
@article{yuan2025pixelrefer,
  title   = {PixelRefer: A Unified Framework for Spatio-Temporal Object Referring with Arbitrary Granularity},
  author  = {Yuqian Yuan and Wenqiao Zhang and Xin Li and Shihao Wang and Kehan Li and Wentong Li and Jun Xiao and Lei Zhang and Beng Chin Ooi},
  year    = {2025},
  journal = {arXiv preprint arXiv:2510.23603},
}

@inproceedings{yuan2025videorefer,
  title     = {VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM},
  author    = {Yuqian Yuan and Hang Zhang and Wentong Li and Zesen Cheng and Boqiang Zhang and Long Li and Xin Li and Deli Zhao and Wenqiao Zhang and Yueting Zhuang and others},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference},
  pages     = {18970--18980},
  year      = {2025},
}
```