TomoroAI/tomoro-colqwen3-embed-8b
Executive Summary
TomoroAI/tomoro-colqwen3-embed-8b is a state-of-the-art ColPali-style multimodal embedding model. It maps text queries, visual documents (images, PDFs), and short videos into aligned multi-vector embeddings.
Built by merging Qwen/Qwen3-VL-8B-Instruct with Qwen/Qwen3-Embedding-8B, the model inherits robust text retrieval capabilities while preserving a full vision stack. It has been fine-tuned on a curated mixture of VDR, ViDoRe-ColPali-Training, VisRAG-Ret-Train-Synthetic-data, and VisRAG-Ret-Train-In-domain-data. It achieves SOTA or competitive performance across ViDoRe V1-V3 (English and multilingual) while offering a significantly smaller embedding footprint than full-dimension ColPali-style alternatives.
Model Specifications
| Feature | Detail |
|---|---|
| Architecture | Qwen3-VL 8B (Encoder-only variant) + 320-dim Projection Head |
| Methodology | ColPali-style Late Interaction (MaxSim scoring) |
| Token Budget | Up to 1,280 visual tokens per page (text prompts constrained only by the base context window) |
| Context Window | 32k (inherited from base), typical usage < 2k tokens |
| Output | Multi-vector (Seq_Len × 320), L2-normalized |
| Supported Modalities | Text Queries, RGB Images, Synthetic Documents, Short Video (Frame-wise) |
| Precision | bfloat16 weights, FlashAttention 2 enabled |
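For intuition, the late-interaction (MaxSim) score sums, over the query tokens, each token's maximum similarity with any document token. Below is a minimal sketch for a single query/document pair; it is illustrative only, and in practice the batched `processor.score_multi_vector` shown in the Usage section computes this for you.

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """Late-interaction (MaxSim) score for one query/document pair.

    query_emb: (num_query_tokens, 320) L2-normalized token embeddings
    doc_emb:   (num_doc_tokens, 320)   L2-normalized token embeddings
    """
    # Dot products equal cosine similarities because the embeddings are L2-normalized.
    sim = query_emb @ doc_emb.T            # (num_query_tokens, num_doc_tokens)
    # Best-matching document token per query token, summed over the query.
    return sim.max(dim=-1).values.sum()
```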
Key Properties
- Merged Encoders: Combines the Qwen3-VL vision encoder (patch-grid tokens with spatial merge) and language encoder.
- Projection: A custom 320-dim head projects every token (text or visual) into a vector.
- Processing:
- Queries: Left-padded text sequences.
- Documents: Rendered with a lightweight vision prompt and flattened into image tokens.
- Video: Supports video retrieval by decoding videos into frames and processing via the vision stack (generalization capability, not explicitly fine-tuned; dedicated benchmark coming soon).
- Storage Efficiency:
- Baseline (NVIDIA Nemo-3B): Stores 1,802 tokens @ 3,072 dims (≈10.3 TB for 1M images).
- Tomoro ColQwen3: Stores max 1,280 tokens @ 320 dims (≈0.82 TB for 1M images).
- Result: ≈13× smaller footprint with higher performance.
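As a rough sanity check, these footprints can be approximated by assuming bfloat16 storage (2 bytes per value) and no compression; the exact figures depend on the storage format and per-page token counts.

```python
# Back-of-envelope multi-vector index size, assuming bfloat16 (2 bytes/value)
# and no compression; figures are illustrative, not exact.
BYTES_PER_VALUE = 2
PAGES = 1_000_000

def footprint_tb(tokens_per_page: int, dims: int) -> float:
    return tokens_per_page * dims * BYTES_PER_VALUE * PAGES / 1e12  # decimal TB

print(footprint_tb(1280, 320))   # Tomoro ColQwen3: ~0.82 TB
print(footprint_tb(1802, 3072))  # full-dim baseline: ~11 TB under these assumptions
```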
Evaluation Results
We report results on the ViDoRe benchmark suite. The model sets new standards on the English and multilingual splits of ViDoRe V2 and V3 while maintaining comparably high performance on ViDoRe V1.
ViDoRe V3 (Latest)
English nDCG@5
| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | Avg |
|---|---|---|---|---|---|---|---|---|---|
| tomoro-colqwen3-8b | 0.7443 | 0.6491 | 0.6823 | 0.4546 | 0.6421 | 0.5766 | 0.6665 | 0.4747 | 0.6113 |
| tomoro-colqwen3-4b | 0.7419 | 0.6023 | 0.6753 | 0.4202 | 0.6037 | 0.5787 | 0.6612 | 0.4640 | 0.5934 |
| nemo-colembed-3b | 0.7514 | 0.5838 | 0.6712 | 0.3730 | 0.6256 | 0.5447 | 0.6524 | 0.4128 | 0.5769 |
| jinaai/jina-embeddings-v4 | 0.7175 | 0.5842 | 0.6417 | 0.3859 | 0.6206 | 0.5443 | 0.6303 | 0.4191 | 0.5680 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.7528 | 0.5824 | 0.6041 | 0.3877 | 0.6060 | 0.5229 | 0.6226 | 0.4423 | 0.5651 |
Multilingual nDCG@5 (Excluding English Subsets)
| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | Avg |
|---|---|---|---|---|---|---|---|---|---|
| tomoro-colqwen3-8b | 0.7194 | 0.6619 | 0.6172 | 0.4570 | 0.6097 | 0.5164 | 0.6403 | 0.4706 | 0.5866 |
| tomoro-colqwen3-4b | 0.7213 | 0.6374 | 0.6019 | 0.4305 | 0.5637 | 0.5131 | 0.6351 | 0.4636 | 0.5708 |
| nemo-colembed-3b | 0.7216 | 0.5901 | 0.5646 | 0.4102 | 0.5504 | 0.4335 | 0.6170 | 0.4192 | 0.5383 |
| jinaai/jina-embeddings-v4 | 0.6843 | 0.6036 | 0.5482 | 0.4249 | 0.5542 | 0.4732 | 0.6059 | 0.4381 | 0.5416 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.7333 | 0.6160 | 0.5219 | 0.4169 | 0.5494 | 0.4764 | 0.5938 | 0.4449 | 0.5441 |
ViDoRe V2
English nDCG@5
| Model | BioMed | ESG HL | ESG Rpts | Economics | Avg |
|---|---|---|---|---|---|
| tomoro-colqwen3-8b | 0.6784 | 0.7598 | 0.6549 | 0.6159 | 0.6772 |
| tomoro-colqwen3-4b | 0.6718 | 0.7465 | 0.6300 | 0.5910 | 0.6598 |
| nemo-colembed-3b | 0.6518 | 0.7538 | 0.6030 | 0.6619 | 0.6676 |
| jinaai/jina-embeddings-v4 | 0.6359 | 0.6512 | 0.5194 | 0.5955 | 0.6005 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.6479 | 0.6871 | 0.5498 | 0.5955 | 0.6201 |
Multilingual nDCG@5
| Model | BioMed | ESG Rpts | Economics | Avg |
|---|---|---|---|---|
| tomoro-colqwen3-8b | 0.6467 | 0.5911 | 0.5875 | 0.6085 |
| tomoro-colqwen3-4b | 0.6478 | 0.6226 | 0.5536 | 0.6080 |
| nemo-colembed-3b | 0.6187 | 0.5640 | 0.5506 | 0.5778 |
| jinaai/jina-embeddings-v4 | 0.5994 | 0.5178 | 0.5364 | 0.5512 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.6224 | 0.5336 | 0.5433 | 0.5664 |
ViDoRe V1 (English nDCG@5)
| Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| tomoro-colqwen3-8b | 0.9115 | 0.6637 | 0.9448 | 0.8789 | 0.9926 | 0.9671 | 0.9758 | 0.9906 | 0.9423 | 0.8092 | 0.9076 |
| tomoro-colqwen3-4b | 0.9066 | 0.6624 | 0.9429 | 0.8739 | 0.9926 | 0.9691 | 0.9717 | 0.9963 | 0.9433 | 0.7983 | 0.9057 |
| nemo-colembed-3b | 0.8835 | 0.6621 | 0.9492 | 0.9070 | 0.9963 | 0.9663 | 0.9782 | 0.9926 | 0.9594 | 0.8057 | 0.9100 |
| jinaai/jina-embeddings-v4 | 0.8846 | 0.6014 | 0.9379 | 0.9293 | 0.9926 | 0.9726 | 0.9659 | 0.9913 | 0.9560 | 0.8035 | 0.9035 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.8832 | 0.6011 | 0.9221 | 0.8930 | 0.9876 | 0.9626 | 0.9592 | 0.9926 | 0.9596 | 0.8108 | 0.8972 |
Usage
The processor exposes process_texts, process_images, and score_multi_vector.
Prerequisites
pip install torch transformers pillow requests
The snippets below set attn_implementation="flash_attention_2", which additionally requires the flash-attn package and a supported GPU; if it is not installed, drop that argument or use "sdpa" instead.
Inference Code
import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO
# Configuration
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-8b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load Model & Processor
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
MODEL_ID,
dtype=DTYPE,
attn_implementation="flash_attention_2",
trust_remote_code=True,
device_map=DEVICE,
).eval()
# Sample Data
queries = [
"Retrieve the city of Singapore",
"Retrieve the city of Beijing",
"Retrieve the city of London",
]
docs = [
"https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
"https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
"https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
]
def load_image(url: str) -> Image.Image:
# Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
resp = requests.get(url, headers=headers, timeout=10)
if resp.status_code == 403:
continue
resp.raise_for_status()
try:
return Image.open(BytesIO(resp.content)).convert("RGB")
except UnidentifiedImageError as e:
raise RuntimeError(f"Failed to decode image from {url}") from e
raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")
# Helper Functions
def encode_queries(texts, batch_size=8):
outputs = []
for start in range(0, len(texts), batch_size):
batch = processor.process_texts(texts=texts[start : start + batch_size])
batch = {k: v.to(DEVICE) for k, v in batch.items()}
with torch.inference_mode():
out = model(**batch)
vecs = out.embeddings.to(torch.bfloat16).cpu()
outputs.extend(vecs)
return outputs
def encode_docs(urls, batch_size=4):
pil_images = [load_image(url) for url in urls]
outputs = []
for start in range(0, len(pil_images), batch_size):
batch_imgs = pil_images[start : start + batch_size]
features = processor.process_images(images=batch_imgs)
features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
with torch.inference_mode():
out = model(**features)
vecs = out.embeddings.to(torch.bfloat16).cpu()
outputs.extend(vecs)
return outputs
# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)
# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
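`score_multi_vector` is expected to return a `(num_queries, num_docs)` tensor of pairwise MaxSim scores, so ranking candidates per query is a simple sort (variable names reuse the example above):

```python
# Rank documents for each query from best to worst MaxSim score.
ranking = scores.argsort(dim=-1, descending=True)
for q_idx, query in enumerate(queries):
    best = ranking[q_idx, 0].item()
    print(f"{query!r} -> {docs[best]}")
```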
Lightweight Video Retrieval
ColQwen3 generalizes to short videos even though it was trained only on image-text retrieval. This minimal example decodes each clip through the processor's video pipeline, encodes queries and video frames into multi-vector embeddings, then applies MaxSim scoring.
from pathlib import Path
from typing import Any, Dict
import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import AutoModel, AutoProcessor
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-8b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(
MODEL_ID,
trust_remote_code=True,
max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
MODEL_ID,
dtype=DTYPE,
attn_implementation="flash_attention_2",
trust_remote_code=True,
device_map=DEVICE,
).eval()
queries = ["Retrieve the football video", "Find the basketball clip"]
videos = ["sample_videos/football.mp4", "sample_videos/basketball.mp4"]
def _pad_video_sequences(batch: Dict[str, Any], video_paths: list[str]) -> Dict[str, Any]:
"""Recompute and pad flattened video patches so they align with grid metadata."""
pixel_values = batch.get("pixel_values_videos")
grid = batch.get("video_grid_thw")
if not isinstance(pixel_values, torch.Tensor) or not isinstance(grid, torch.Tensor):
return batch
if pixel_values.ndim == 3:
return batch
rebuilt = processor.video_processor(
videos=video_paths,
return_tensors="pt",
return_metadata=True,
data_format="channels_first",
do_convert_rgb=True,
)
seq_grid = rebuilt["video_grid_thw"]
flat_pixels = rebuilt["pixel_values_videos"]
offsets = (seq_grid[:, 0] * seq_grid[:, 1] * seq_grid[:, 2]).tolist()
sequences = []
cursor = 0
for offset in offsets:
next_cursor = cursor + offset
sequences.append(flat_pixels[cursor:next_cursor])
cursor = next_cursor
batch["pixel_values_videos"] = pad_sequence(sequences, batch_first=True)
batch["video_grid_thw"] = seq_grid
if "video_metadata" in batch:
batch["video_metadata"] = rebuilt.get("video_metadata", batch["video_metadata"])
return batch
def encode_queries(texts):
batch = processor.process_texts(texts=texts)
batch = {k: v.to(DEVICE) for k, v in batch.items()}
with torch.inference_mode():
out = model(**batch)
return out.embeddings.to(torch.bfloat16).cpu()
def encode_videos(paths):
vids = [str(Path(p).expanduser()) for p in paths]
feats = processor(
videos=vids,
return_tensors="pt",
padding="longest",
videos_kwargs={"return_metadata": True},
)
feats = _pad_video_sequences(feats, vids)
feats.pop("video_metadata", None) # drop metadata before forwarding to the model
feats = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in feats.items()}
with torch.inference_mode():
out = model(**feats)
return out.embeddings.to(torch.bfloat16).cpu()
q_emb = encode_queries(queries)
v_emb = encode_videos(videos)
scores = processor.score_multi_vector(q_emb, v_emb)
print(scores)
Strengths & Limitations
Strengths
- Performance: State-of-the-art retrieval on the ViDoRe V2 and V3 benchmarks, with strong results across multimodal document retrieval.
- Complex Layouts: Excellent handling of chart-rich PDFs and domain-specific documents.
- End-to-end Retrieval: OCR-free retrieval over unseen multimodal documents, without an intermediate vision LLM generating summaries for retrieval.
- Retrieval Task Transfer: Inherits strong text retrieval performance from the merged Qwen3-Embedding-8B weights.
- Multilingualism: Strong performance on non-English document inputs.
Limitations
- Video Support: In our preliminary findings the model generalizes to video retrieval, but it has not been fine-tuned on large-scale video retrieval datasets; we plan to improve this in future releases.
- Storage Cost: Multi-vector indexes remain larger than single-vector baselines despite the reduced 320-dim token embeddings.
- Retrieval Instructions: The model is not yet fine-tuned with diverse retrieval instructions in the style of the Qwen3-Embedding models; we intend to improve this with additional synthetic data in the future.
License & Data
Distributed under Apache 2.0.
- Weights: Upstream Qwen checkpoints retain their community licenses; ensure compliance when mixing.
- Data: Training data includes ViDoRe/MTEB corpora and synthetic VisRAG assets.
Acknowledgement
We gratefully acknowledge the support of Tomoro AI, a leading AI engineering firm dedicated to delivering high-quality enterprise solutions that accelerate complex R&D and business transformation. This work directly enhances Tomoro's customized multimodal agentic RAG pipelines, empowering autonomous agents to parse, reason over, and retrieve from large-scale enterprise internal documentation. By bridging the gap between vision and language, this model supports Tomoro AI's mission to accelerate the delivery of high-quality enterprise multimodal solutions and deploy robust, production-grade intelligence across high-stakes industries.
Citation
If you use this model, please cite:
@misc{huang2025tomoro_colqwen3_embed,
title = {TomoroAI/tomoro-colqwen3-embed},
author = {Xin Huang and Kye Min Tan and Albert Phelps},
year = {2025},
url = {https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b}
}