Add pipeline tag and fix image path in model card (#1)
Commit: 1523c27143502a99ce97b59f5b2bb1758f13a7a7
Co-authored-by: Niels Rogge <[email protected]>
README.md CHANGED

@@ -1,27 +1,27 @@
 ---
-license: other
-license_name: other
-license_link: https://github.com/TencentARC/TokLIP/blob/main/LICENSE
-language:
-- en
 base_model:
 - google/siglip2-so400m-patch16-384
 - google/siglip2-so400m-patch16-256
+language:
+- en
+license: other
+license_name: other
+license_link: https://github.com/TencentARC/TokLIP/blob/main/LICENSE
+pipeline_tag: image-text-to-text
 tags:
 - Tokenizer
 - CLIP
 - UnifiedMLLM
 ---
 
-
 # TokLIP: Marry Visual Tokens to CLIP for Multimodal Comprehension and Generation
 
 <h5 align="center">
 
 [](https://arxiv.org/abs/2505.05422)
 [](https://github.com/TencentARC/TokLIP)
+[](https://huggingface.co/TencentARC/TokLIP)
+[](https://github.com/TencentARC/TokLIP/blob/main/LICENSE)
 <br>
 
 </h5>
@@ -41,6 +41,7 @@
 
 ## 👀 Introduction
 
+<img src="https://raw.githubusercontent.com/TencentARC/TokLIP/main/docs/TokLIP.png" alt="TokLIP" style="zoom:50%;" />
 
 - We introduce TokLIP, a visual tokenizer that enhances comprehension by **semanticizing** vector-quantized (VQ) tokens and **incorporating CLIP-level semantics** while enabling end-to-end multimodal autoregressive training with standard VQ tokens.
 
@@ -137,4 +137,4 @@ Please cite our work if you use our code or discuss our findings in your own research:
 journal={arXiv preprint arXiv:2505.05422},
 year={2025}
 }
-```
+```
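A quick way to see what the new `pipeline_tag` does: the Hub parses this YAML front matter into structured model-card data, and `image-text-to-text` is what places the model under that task filter. Below is a minimal sketch (not part of this commit), assuming the `huggingface_hub` Python client and taking the repo id `TencentARC/TokLIP` from the badge links above:

```python
# Minimal sketch: read back the front matter edited in this commit.
# Assumes the huggingface_hub client; the repo id is taken from the
# badge links above, not stated explicitly in this diff.
from huggingface_hub import ModelCard

card = ModelCard.load("TencentARC/TokLIP")

# The YAML front matter is exposed as structured card data.
print(card.data.pipeline_tag)  # "image-text-to-text" after this commit
print(card.data.license)       # "other"
print(card.data.base_model)    # the two SigLIP2 checkpoints listed above
```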
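On the image-path side of the fix: a model card rendered on the Hub resolves relative image paths against the Hub repo's own files, while `docs/TokLIP.png` lives in the GitHub repo, so the card needs the absolute `raw.githubusercontent.com` URL (the pre-commit path is not recoverable from this diff). A small sketch, assuming the `requests` library, to confirm the new URL resolves:

```python
# Minimal sketch: confirm the absolute image URL used by the card is reachable.
import requests

URL = "https://raw.githubusercontent.com/TencentARC/TokLIP/main/docs/TokLIP.png"

resp = requests.head(URL, allow_redirects=True)
print(resp.status_code)  # expect 200 if the image resolves
```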