Update pipeline tag and fix paper link in model card
This PR improves the model card by:
- Updating the `pipeline_tag` from `zero-shot-image-classification` to `feature-extraction`. This better reflects the model's primary function as a multimodal embedding model and helps users discover it under the correct category on the Hub (https://huggingface.co/models?pipeline_tag=feature-extraction); a quick programmatic check is sketched below.
- Correcting the placeholder paper link in the content to the official arXiv link: https://arxiv.org/abs/2506.23115.
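
Not part of the PR itself, but a quick way to verify the effect of the new tag once the change is merged. A minimal sketch, assuming a recent `huggingface_hub` release in which `list_models()` accepts `pipeline_tag` and `search` filters:

```python
from huggingface_hub import HfApi

# Hypothetical sanity check: after the tag update, the model should be listed
# under the feature-extraction filter on the Hub. Assumes list_models()
# supports the pipeline_tag and search parameters (recent huggingface_hub).
api = HfApi()
for m in api.list_models(pipeline_tag="feature-extraction", search="MoCa-Qwen25VL", limit=5):
    print(m.id, m.pipeline_tag)
```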
README.md (CHANGED)

````diff
@@ -1,22 +1,22 @@
 ---
-tags:
-- mmeb
-- transformers
 language:
 - en
 library_name: transformers
 license: mit
-pipeline_tag: zero-shot-image-classification
+pipeline_tag: feature-extraction
+tags:
+- mmeb
+- transformers
 ---
 
 ## MoCa-Qwen25VL-7B
 
-[MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings](https://arxiv.org/abs/
+[MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings](https://arxiv.org/abs/2506.23115). Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou, arXiv 2025
 
 This repo presents the `MoCa-Qwen25VL` series of **multimodal embedding models**.
 The model is trained based on [Qwen2.5-7B-VL-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-VL-Instruct).
 
-[Homepage](https://haon-chen.github.io/MoCa/) | [Code](https://github.com/haon-chen/MoCa) | [MoCa-Qwen25VL-7B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-7B) | [MoCa-Qwen25VL-3B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-3B) | [Datasets](https://huggingface.co/moca-embed/datasets) | [Paper](https://arxiv.org/abs/2506.23115)
+[Homepage](https://haon-chen.github.io/MoCa/) | [Code](https://github.com/haon-chen/MoCa) | [MoCa-Qwen25VL-7B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-7B) | [MoCa-Qwen25VL-3B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-3B) | [Datasets](https://huggingface.co/datasets/moca-embed/datasets) | [Paper](https://arxiv.org/abs/2506.23115)
 
 **Highlights**
 - SOTA performance on MMEB (General Multimodal) and surpassing many strong baselines on ViDoRe-v2 (Document Retrieval).
@@ -81,7 +81,8 @@ model = Qwen2_5ForEmbedding.from_pretrained(
 ).to("cuda")
 model.eval()
 # Image + Text -> Text
-inputs = processor(text='<|vision_start|><|image_pad|><|vision_end|>Represent the given image with the following question: What is in the image\n', images=[Image.open(
+inputs = processor(text='<|vision_start|><|image_pad|><|vision_end|>Represent the given image with the following question: What is in the image
+', images=[Image.open(
 'figures/example.jpg')], return_tensors="pt").to("cuda")
 qry_output = F.normalize(model(**inputs, return_dict=True, output_hidden_states=True), dim=-1)
 
@@ -98,24 +99,27 @@ print(string, '=', compute_similarity(qry_output, tgt_output))
 ## A cat and a tiger = tensor([[0.4551]], device='cuda:0', dtype=torch.bfloat16)
 
 # Text -> Image
-inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a dog.\n', return_tensors="pt").to("cuda")
+inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a dog.
+', return_tensors="pt").to("cuda")
 qry_output = F.normalize(model(**inputs, return_dict=True, output_hidden_states=True), dim=-1)
 
-string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.\n'
+string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.
+'
 tgt_inputs = processor(text=string, images=[Image.open('figures/example.jpg')], return_tensors="pt").to("cuda")
 tgt_output = F.normalize(model(**tgt_inputs, return_dict=True, output_hidden_states=True), dim=-1)
 print(string, '=', compute_similarity(qry_output, tgt_output))
 ## <|vision_start|><|image_pad|><|vision_end|>Represent the given image. = tensor([[0.4395]], device='cuda:0', dtype=torch.bfloat16)
 
-inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a tiger.\n', return_tensors="pt").to("cuda")
+inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a tiger.
+', return_tensors="pt").to("cuda")
 qry_output = F.normalize(model(**inputs, return_dict=True, output_hidden_states=True), dim=-1)
 
-string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.\n'
+string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.
+'
 tgt_inputs = processor(text=string, images=[Image.open('figures/example.jpg')], return_tensors="pt").to("cuda")
 tgt_output = F.normalize(model(**tgt_inputs, return_dict=True, output_hidden_states=True), dim=-1)
 print(string, '=', compute_similarity(qry_output, tgt_output))
 ## <|vision_start|><|image_pad|><|vision_end|>Represent the given image. = tensor([[0.3242]], device='cuda:0', dtype=torch.bfloat16)
-```
 
 
 ## Citation
````
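
The usage hunks above call `compute_similarity(qry_output, tgt_output)`, but its definition sits earlier in the model card, outside the changed lines. A minimal sketch of what it plausibly computes, given that both embeddings have already been L2-normalized with `F.normalize`:

```python
import torch

def compute_similarity(query_emb: torch.Tensor, target_emb: torch.Tensor) -> torch.Tensor:
    # Inputs are L2-normalized, so this matrix product is cosine similarity.
    # Shapes: (num_queries, dim) @ (dim, num_targets) -> (num_queries, num_targets)
    return torch.matmul(query_emb, target_emb.transpose(0, 1))
```

With unit-norm vectors the inner product and cosine similarity coincide, which is consistent with the 1x1 similarity tensors printed in the examples above.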