Update pipeline tag and fix paper link in model card
This PR improves the model card by:
- Updating the `pipeline_tag` from `zero-shot-image-classification` to `feature-extraction`. This better reflects the model's primary function as a multimodal embedding model and helps users discover it under the correct category on the Hub (https://huggingface.co/models?pipeline_tag=feature-extraction); a quick programmatic check is sketched below.
- Correcting the placeholder paper link in the content to the official arXiv link: https://arxiv.org/abs/2506.23115.
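
Not part of the PR itself, but a quick way to verify the effect of the new tag once the change is merged. A minimal sketch, assuming a recent `huggingface_hub` release in which `list_models()` accepts `pipeline_tag` and `search` filters:

```python
from huggingface_hub import HfApi

# Hypothetical sanity check: after the tag update, the model should be listed
# under the feature-extraction filter on the Hub. Assumes list_models()
# supports the pipeline_tag and search parameters (recent huggingface_hub).
api = HfApi()
for m in api.list_models(pipeline_tag="feature-extraction", search="MoCa-Qwen25VL", limit=5):
    print(m.id, m.pipeline_tag)
```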
README.md (CHANGED)

````diff
@@ -1,22 +1,22 @@
 ---
-tags:
-- mmeb
-- transformers
 language:
 - en
 library_name: transformers
 license: mit
-pipeline_tag: zero-shot-image-classification
+pipeline_tag: feature-extraction
+tags:
+- mmeb
+- transformers
 ---
 
 ## MoCa-Qwen25VL-7B
 
-[MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings](https://arxiv.org/abs/
+[MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings](https://arxiv.org/abs/2506.23115). Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou, arXiv 2025
 
 This repo presents the `MoCa-Qwen25VL` series of **multimodal embedding models**.
 The model is trained based on [Qwen2.5-7B-VL-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-VL-Instruct).
 
-[Homepage](https://haon-chen.github.io/MoCa/) | [Code](https://github.com/haon-chen/MoCa) | [MoCa-Qwen25VL-7B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-7B) | [MoCa-Qwen25VL-3B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-3B) | [Datasets](https://huggingface.co/moca-embed/datasets) | [Paper](https://arxiv.org/abs/2506.23115)
+[Homepage](https://haon-chen.github.io/MoCa/) | [Code](https://github.com/haon-chen/MoCa) | [MoCa-Qwen25VL-7B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-7B) | [MoCa-Qwen25VL-3B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-3B) | [Datasets](https://huggingface.co/datasets/moca-embed/datasets) | [Paper](https://arxiv.org/abs/2506.23115)
 
 **Highlights**
 - SOTA performance on MMEB (General Multimodal) and surpassing many strong baselines on ViDoRe-v2 (Document Retrieval).
@@ -81,7 +81,8 @@ model = Qwen2_5ForEmbedding.from_pretrained(
 ).to("cuda")
 model.eval()
 # Image + Text -> Text
-inputs = processor(text='<|vision_start|><|image_pad|><|vision_end|>Represent the given image with the following question: What is in the image\n', images=[Image.open(
+inputs = processor(text='<|vision_start|><|image_pad|><|vision_end|>Represent the given image with the following question: What is in the image
+', images=[Image.open(
 'figures/example.jpg')], return_tensors="pt").to("cuda")
 qry_output = F.normalize(model(**inputs, return_dict=True, output_hidden_states=True), dim=-1)
 
@@ -98,24 +99,27 @@ print(string, '=', compute_similarity(qry_output, tgt_output))
 ## A cat and a tiger = tensor([[0.4551]], device='cuda:0', dtype=torch.bfloat16)
 
 # Text -> Image
-inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a dog.\n', return_tensors="pt").to("cuda")
+inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a dog.
+', return_tensors="pt").to("cuda")
 qry_output = F.normalize(model(**inputs, return_dict=True, output_hidden_states=True), dim=-1)
 
-string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.\n'
+string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.
+'
 tgt_inputs = processor(text=string, images=[Image.open('figures/example.jpg')], return_tensors="pt").to("cuda")
 tgt_output = F.normalize(model(**tgt_inputs, return_dict=True, output_hidden_states=True), dim=-1)
 print(string, '=', compute_similarity(qry_output, tgt_output))
 ## <|vision_start|><|image_pad|><|vision_end|>Represent the given image. = tensor([[0.4395]], device='cuda:0', dtype=torch.bfloat16)
 
-inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a tiger.\n', return_tensors="pt").to("cuda")
+inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a tiger.
+', return_tensors="pt").to("cuda")
 qry_output = F.normalize(model(**inputs, return_dict=True, output_hidden_states=True), dim=-1)
 
-string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.\n'
+string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.
+'
 tgt_inputs = processor(text=string, images=[Image.open('figures/example.jpg')], return_tensors="pt").to("cuda")
 tgt_output = F.normalize(model(**tgt_inputs, return_dict=True, output_hidden_states=True), dim=-1)
 print(string, '=', compute_similarity(qry_output, tgt_output))
 ## <|vision_start|><|image_pad|><|vision_end|>Represent the given image. = tensor([[0.3242]], device='cuda:0', dtype=torch.bfloat16)
-```
 
 
 ## Citation
````
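
The usage hunks above call `compute_similarity(qry_output, tgt_output)`, but its definition sits earlier in the model card, outside the changed lines. A minimal sketch of what it plausibly computes, given that both embeddings have already been L2-normalized with `F.normalize`:

```python
import torch

def compute_similarity(query_emb: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
    # Inputs are L2-normalized, so this matrix product is cosine similarity.
    # Shapes: (num_queries, dim) @ (dim, num_targets) -> (num_queries, num_targets)
    return torch.matmul(query_emb, target_emb.transpose(0, 1))
```

With unit-norm vectors the inner product and cosine similarity coincide, which is consistent with the 1x1 similarity tensors printed in the examples above.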