nielsr (HF Staff) committed
Commit 891a5f4 Β· verified Β· 1 parent: 611a540

Update pipeline tag and fix paper link in model card


This PR improves the model card by:
- Updating the `pipeline_tag` from `zero-shot-image-classification` to `feature-extraction`. This better reflects the model's primary function as a multimodal embedding model and helps users discover it under the correct category on the Hub (https://huggingface.co/models?pipeline_tag=feature-extraction); the updated metadata can also be verified programmatically, as sketched below.
- Correcting the placeholder paper link in the content to the official arXiv link: https://arxiv.org/abs/2506.23115.
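
As a quick sanity check once this change is merged, the updated metadata can be read back from the Hub. This is a minimal sketch using the `huggingface_hub` ModelCard API; the repo id and expected values simply mirror the diff below:

```python
from huggingface_hub import ModelCard

# Load the model card for this repository and inspect its YAML metadata.
card = ModelCard.load("moca-embed/MoCa-Qwen25VL-7B")
print(card.data.pipeline_tag)  # expected: "feature-extraction" after this change
print(card.data.tags)          # expected: ["mmeb", "transformers"]
```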

Files changed (1)
  1. README.md +16 -12
README.md CHANGED
@@ -1,22 +1,22 @@
 ---
-tags:
-- mmeb
-- transformers
 language:
 - en
 library_name: transformers
 license: mit
-pipeline_tag: zero-shot-image-classification
+pipeline_tag: feature-extraction
+tags:
+- mmeb
+- transformers
 ---

 ## MoCa-Qwen25VL-7B

-[MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings](https://arxiv.org/abs/xxxx.pdf). Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou, arXiv 2025
+[MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings](https://arxiv.org/abs/2506.23115). Haonan Chen, Hong Liu, Yuping Luo, Liang Wang, Nan Yang, Furu Wei, Zhicheng Dou, arXiv 2025

 This repo presents the `MoCa-Qwen25VL` series of **multimodal embedding models**.
 The model is trained based on [Qwen2.5-7B-VL-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-VL-Instruct).

-[🏠 Homepage](https://haon-chen.github.io/MoCa/) | [πŸ’» Code](https://github.com/haon-chen/MoCa) | [πŸ€– MoCa-Qwen25VL-7B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-7B) | [πŸ€– MoCa-Qwen25VL-3B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-3B) | [πŸ“š Datasets](https://huggingface.co/moca-embed/datasets) | [πŸ“„ Paper](https://arxiv.org/abs/2506.23115)
+[🏠 Homepage](https://haon-chen.github.io/MoCa/) | [πŸ’» Code](https://github.com/haon-chen/MoCa) | [πŸ€– MoCa-Qwen25VL-7B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-7B) | [πŸ€– MoCa-Qwen25VL-3B](https://huggingface.co/moca-embed/MoCa-Qwen25VL-3B) | [πŸ“š Datasets](https://huggingface.co/datasets/moca-embed/datasets) | [πŸ“„ Paper](https://arxiv.org/abs/2506.23115)

 **Highlights**
 - SOTA performance on MMEB (General Multimodal) and surpassing many strong baselines on ViDoRe-v2 (Document Retrieval).
@@ -81,7 +81,8 @@ model = Qwen2_5ForEmbedding.from_pretrained(
 ).to("cuda")
 model.eval()
 # Image + Text -> Text
-inputs = processor(text='<|vision_start|><|image_pad|><|vision_end|>Represent the given image with the following question: What is in the image\n', images=[Image.open(
+inputs = processor(text='<|vision_start|><|image_pad|><|vision_end|>Represent the given image with the following question: What is in the image
+', images=[Image.open(
 'figures/example.jpg')], return_tensors="pt").to("cuda")
 qry_output = F.normalize(model(**inputs, return_dict=True, output_hidden_states=True), dim=-1)

@@ -98,24 +99,27 @@ print(string, '=', compute_similarity(qry_output, tgt_output))
 ## A cat and a tiger = tensor([[0.4551]], device='cuda:0', dtype=torch.bfloat16)

 # Text -> Image
-inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a dog.\n', return_tensors="pt").to("cuda")
+inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a dog.
+', return_tensors="pt").to("cuda")
 qry_output = F.normalize(model(**inputs, return_dict=True, output_hidden_states=True), dim=-1)

-string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.\n'
+string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.
+'
 tgt_inputs = processor(text=string, images=[Image.open('figures/example.jpg')], return_tensors="pt").to("cuda")
 tgt_output = F.normalize(model(**tgt_inputs, return_dict=True, output_hidden_states=True), dim=-1)
 print(string, '=', compute_similarity(qry_output, tgt_output))
 ## <|vision_start|><|image_pad|><|vision_end|>Represent the given image. = tensor([[0.4395]], device='cuda:0', dtype=torch.bfloat16)

-inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a tiger.\n', return_tensors="pt").to("cuda")
+inputs = processor(text='Find me an everyday image that matches the given caption: A cat and a tiger.
+', return_tensors="pt").to("cuda")
 qry_output = F.normalize(model(**inputs, return_dict=True, output_hidden_states=True), dim=-1)

-string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.\n'
+string = '<|vision_start|><|image_pad|><|vision_end|>Represent the given image.
+'
 tgt_inputs = processor(text=string, images=[Image.open('figures/example.jpg')], return_tensors="pt").to("cuda")
 tgt_output = F.normalize(model(**tgt_inputs, return_dict=True, output_hidden_states=True), dim=-1)
 print(string, '=', compute_similarity(qry_output, tgt_output))
 ## <|vision_start|><|image_pad|><|vision_end|>Represent the given image. = tensor([[0.3242]], device='cuda:0', dtype=torch.bfloat16)
-```


 ## Citation
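
Note that `compute_similarity`, called in the README snippets above, is defined in a part of the model card this diff does not touch. A minimal stand-in, assuming it simply takes the dot product of the already L2-normalized embeddings (i.e. cosine similarity, consistent with the `tensor([[...]])` scores printed above):

```python
import torch

def compute_similarity(qry: torch.Tensor, tgt: torch.Tensor) -> torch.Tensor:
    # Illustrative assumption, not the card's exact helper: both inputs are
    # already normalized with F.normalize(..., dim=-1), so a plain dot product
    # yields the cosine-similarity matrix of shape (num_queries, num_targets).
    return qry @ tgt.transpose(-2, -1)
```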