TomoroAI/tomoro-colqwen3-embed-8b

⚡ Executive Summary

TomoroAI/tomoro-colqwen3-embed-8b is a state-of-the-art ColPali-style multimodal embedding model. It maps text queries, visual documents (images, PDFs), and short videos into aligned multi-vector embeddings.

Built by merging Qwen/Qwen3-VL-8B-Instruct with Qwen/Qwen3-Embedding-8B, the model inherits robust text retrieval capabilities while preserving a full vision stack. It has been fine-tuned on a curated mixture of VDR, ViDoRe-ColPali-Training, VisRAG-Ret-Train-Synthetic-data, and VisRAG-Ret-Train-In-domain-data. It achieves SOTA or competitive performance across ViDoRe V1-V3 (English and multilingual) while offering a significantly smaller embedding footprint than other full-dimension ColPali-style alternatives.

🛠️ Model Specifications

| Feature | Detail |
| --- | --- |
| Architecture | Qwen3-VL 8B (Encoder-only variant) + 320-dim Projection Head |
| Methodology | ColPali-style Late Interaction (MaxSim scoring) |
| Token Budget | Up to 1,280 visual tokens per page (text prompts constrained only by the base context window) |
| Context Window | 32k (inherited from base), typical usage < 2k tokens |
| Output | Multi-vector (Seq_Len × 320), L2-normalized |
| Supported Modalities | Text Queries, RGB Images, Synthetic Documents, Short Video (Frame-wise) |
| Precision | bfloat16 weights, FlashAttention 2 enabled |

Key Properties

  • Merged Encoders: Combines the Qwen3-VL vision encoder (patch-grid tokens with spatial merge) and language encoder.
  • Projection: A custom 320-dim head projects every token (text or visual) into the shared late-interaction embedding space.
  • Processing:
    • Queries: Left-padded text sequences.
    • Documents: Rendered with a lightweight vision prompt and flattened into image tokens.
    • Video: Supports video retrieval by decoding videos into frames and processing via the vision stack (generalization capability, not explicitly fine-tuned; dedicated benchmark coming soon).
  • Storage Efficiency:
    • Baseline (NVIDIA Nemo-3B): Stores 1,802 tokens @ 3,072 dims (≈10.3 TB for 1M images).
    • Tomoro ColQwen3: Stores at most 1,280 tokens @ 320 dims (≈0.82 TB for 1M images).
    • Result: roughly 13× smaller footprint with higher performance (the sketch after this list walks through the scoring and storage arithmetic).
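
For clarity, here is a minimal late-interaction (MaxSim) scoring sketch together with the storage arithmetic behind the figures above. It assumes two already-encoded, L2-normalized multi-vector embeddings; in practice processor.score_multi_vector (see Usage below) performs this computation in batch.

import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> torch.Tensor:
    """ColPali-style late-interaction score between one query and one document.

    query_emb: [num_query_tokens, 320] L2-normalized token embeddings.
    doc_emb:   [num_doc_tokens, 320]   L2-normalized token embeddings.
    """
    sim = query_emb @ doc_emb.T          # cosine similarities, [Nq, Nd]
    return sim.max(dim=-1).values.sum()  # best document token per query token, summed

# Storage arithmetic at bfloat16 (2 bytes per value):
#   1,280 tokens x 320 dims x 2 bytes ~= 0.82 MB per page -> ~0.82 TB for 1M pages.
#   Relative to a 1,802-token, 3,072-dim baseline:
#   (1,802 * 3,072) / (1,280 * 320) ~= 13.5x fewer stored values per page.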

📊 Evaluation Results

We report results on the ViDoRe benchmark suite. The model sets a new state of the art on the English and multilingual splits of ViDoRe V2 and V3 while maintaining comparably high performance on ViDoRe V1.

ViDoRe V3 (Latest)

English nDCG@5

| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tomoro-colqwen3-8b | 0.7443 | 0.6491 | 0.6823 | 0.4546 | 0.6421 | 0.5766 | 0.6665 | 0.4747 | 0.6113 |
| tomoro-colqwen3-4b | 0.7419 | 0.6023 | 0.6753 | 0.4202 | 0.6037 | 0.5787 | 0.6612 | 0.4640 | 0.5934 |
| nemo-colembed-3b | 0.7514 | 0.5838 | 0.6712 | 0.3730 | 0.6256 | 0.5447 | 0.6524 | 0.4128 | 0.5769 |
| jinaai/jina-embeddings-v4 | 0.7175 | 0.5842 | 0.6417 | 0.3859 | 0.6206 | 0.5443 | 0.6303 | 0.4191 | 0.5680 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.7528 | 0.5824 | 0.6041 | 0.3877 | 0.6060 | 0.5229 | 0.6226 | 0.4423 | 0.5651 |

Multilingual nDCG@5 (Excluding English Subsets)

| Model | CompSci | Energy | FinanceEn | FinanceFr | HR | Industrial | Pharma | Physics | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tomoro-colqwen3-8b | 0.7194 | 0.6619 | 0.6172 | 0.4570 | 0.6097 | 0.5164 | 0.6403 | 0.4706 | 0.5866 |
| tomoro-colqwen3-4b | 0.7213 | 0.6374 | 0.6019 | 0.4305 | 0.5637 | 0.5131 | 0.6351 | 0.4636 | 0.5708 |
| nemo-colembed-3b | 0.7216 | 0.5901 | 0.5646 | 0.4102 | 0.5504 | 0.4335 | 0.6170 | 0.4192 | 0.5383 |
| jinaai/jina-embeddings-v4 | 0.6843 | 0.6036 | 0.5482 | 0.4249 | 0.5542 | 0.4732 | 0.6059 | 0.4381 | 0.5416 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.7333 | 0.6160 | 0.5219 | 0.4169 | 0.5494 | 0.4764 | 0.5938 | 0.4449 | 0.5441 |

ViDoRe V2

English nDCG@5

| Model | BioMed | ESG HL | ESG Rpts | Economics | Avg |
| --- | --- | --- | --- | --- | --- |
| tomoro-colqwen3-8b | 0.6784 | 0.7598 | 0.6549 | 0.6159 | 0.6772 |
| tomoro-colqwen3-4b | 0.6718 | 0.7465 | 0.6300 | 0.5910 | 0.6598 |
| nemo-colembed-3b | 0.6518 | 0.7538 | 0.6030 | 0.6619 | 0.6676 |
| jinaai/jina-embeddings-v4 | 0.6359 | 0.6512 | 0.5194 | 0.5955 | 0.6005 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.6479 | 0.6871 | 0.5498 | 0.5955 | 0.6201 |

Multilingual nDCG@5

| Model | BioMed | ESG Rpts | Economics | Avg |
| --- | --- | --- | --- | --- |
| tomoro-colqwen3-8b | 0.6467 | 0.5911 | 0.5875 | 0.6085 |
| tomoro-colqwen3-4b | 0.6478 | 0.6226 | 0.5536 | 0.6080 |
| nemo-colembed-3b | 0.6187 | 0.5640 | 0.5506 | 0.5778 |
| jinaai/jina-embeddings-v4 | 0.5994 | 0.5178 | 0.5364 | 0.5512 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.6224 | 0.5336 | 0.5433 | 0.5664 |

ViDoRe V1 (English nDCG@5)

| Model | ArxivQA | DocVQA | InfoVQA | Shift | Syn-AI | Syn-Eng | Syn-Gov | Syn-Health | TabFQuAD | Tatdqa | Avg |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| tomoro-colqwen3-8b | 0.9115 | 0.6637 | 0.9448 | 0.8789 | 0.9926 | 0.9671 | 0.9758 | 0.9906 | 0.9423 | 0.8092 | 0.9076 |
| tomoro-colqwen3-4b | 0.9066 | 0.6624 | 0.9429 | 0.8739 | 0.9926 | 0.9691 | 0.9717 | 0.9963 | 0.9433 | 0.7983 | 0.9057 |
| nemo-colembed-3b | 0.8835 | 0.6621 | 0.9492 | 0.9070 | 0.9963 | 0.9663 | 0.9782 | 0.9926 | 0.9594 | 0.8057 | 0.9100 |
| jinaai/jina-embeddings-v4 | 0.8846 | 0.6014 | 0.9379 | 0.9293 | 0.9926 | 0.9726 | 0.9659 | 0.9913 | 0.9560 | 0.8035 | 0.9035 |
| nomic-ai/colnomic-embed-multimodal-7b | 0.8832 | 0.6011 | 0.9221 | 0.8930 | 0.9876 | 0.9626 | 0.9592 | 0.9926 | 0.9596 | 0.8108 | 0.8972 |

💻 Usage

The processor exposes process_texts, process_images, and score_multi_vector.

Prerequisites

pip install torch transformers pillow requests

The examples below pass attn_implementation="flash_attention_2", which additionally requires the flash-attn package; if it is not installed, drop that argument or use attn_implementation="sdpa" instead.

Inference Code

import torch
from transformers import AutoModel, AutoProcessor
from PIL import Image, UnidentifiedImageError
import requests
from io import BytesIO

# Configuration
MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-8b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load Model & Processor
processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

# Sample Data
queries = [
    "Retrieve the city of Singapore",
    "Retrieve the city of Beijing",
    "Retrieve the city of London",
]
docs = [
    "https://upload.wikimedia.org/wikipedia/commons/2/27/Singapore_skyline_2022.jpg",
    "https://upload.wikimedia.org/wikipedia/commons/6/61/Beijing_skyline_at_night.JPG",
    "https://upload.wikimedia.org/wikipedia/commons/4/49/London_skyline.jpg",
]

def load_image(url: str) -> Image.Image:
    # Some CDNs (e.g., Wikimedia) expect a browser-like UA to avoid 403s.
    for headers in ({}, {"User-Agent": "Mozilla/5.0 (compatible; ColQwen3-demo/1.0)"}):
        resp = requests.get(url, headers=headers, timeout=10)
        if resp.status_code == 403:
            continue
        resp.raise_for_status()
        try:
            return Image.open(BytesIO(resp.content)).convert("RGB")
        except UnidentifiedImageError as e:
            raise RuntimeError(f"Failed to decode image from {url}") from e
    raise RuntimeError(f"Could not fetch image (HTTP 403) from {url}; try downloading locally and loading from file path.")

# Helper Functions
def encode_queries(texts, batch_size=8):
    outputs = []
    for start in range(0, len(texts), batch_size):
        batch = processor.process_texts(texts=texts[start : start + batch_size])
        batch = {k: v.to(DEVICE) for k, v in batch.items()}
        with torch.inference_mode():
            out = model(**batch)
            vecs = out.embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

def encode_docs(urls, batch_size=4):
    pil_images = [load_image(url) for url in urls]
    outputs = []
    for start in range(0, len(pil_images), batch_size):
        batch_imgs = pil_images[start : start + batch_size]
        features = processor.process_images(images=batch_imgs)
        features = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in features.items()}
        with torch.inference_mode():
            out = model(**features)
            vecs = out.embeddings.to(torch.bfloat16).cpu()
        outputs.extend(vecs)
    return outputs

# Execution
query_embeddings = encode_queries(queries)
doc_embeddings = encode_docs(docs)

# MaxSim Scoring
scores = processor.score_multi_vector(query_embeddings, doc_embeddings)
print(scores)
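
score_multi_vector returns one MaxSim score per (query, document) pair. A small follow-up sketch, assuming scores comes back as a [num_queries, num_docs] torch tensor as in the snippet above, to turn the matrix into a ranked list per query:

# Rank documents for each query by descending MaxSim score.
top_k = min(3, len(docs))
values, indices = torch.topk(scores, k=top_k, dim=1)
for q_idx, query in enumerate(queries):
    ranked = [(docs[d_idx], float(values[q_idx, j])) for j, d_idx in enumerate(indices[q_idx].tolist())]
    print(query, "->", ranked)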

🎞️ Lightweight Video Retrieval

ColQwen3 generalizes to short videos even though it was fine-tuned only on image-text retrieval. The minimal example below decodes each clip through the processor's video pipeline, encodes queries and clips into multi-vector embeddings, and scores them with MaxSim.

from pathlib import Path
from typing import Any, Dict

import torch
from torch.nn.utils.rnn import pad_sequence
from transformers import AutoModel, AutoProcessor

MODEL_ID = "TomoroAI/tomoro-colqwen3-embed-8b"
DTYPE = torch.bfloat16
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    max_num_visual_tokens=1280,
)
model = AutoModel.from_pretrained(
    MODEL_ID,
    dtype=DTYPE,
    attn_implementation="flash_attention_2",
    trust_remote_code=True,
    device_map=DEVICE,
).eval()

queries = ["Retrieve the football video", "Find the basketball clip"]
videos = ["sample_videos/football.mp4", "sample_videos/basketball.mp4"]


def _pad_video_sequences(batch: Dict[str, Any], video_paths: list[str]) -> Dict[str, Any]:
    """Recompute and pad flattened video patches so they align with grid metadata."""
    pixel_values = batch.get("pixel_values_videos")
    grid = batch.get("video_grid_thw")
    if not isinstance(pixel_values, torch.Tensor) or not isinstance(grid, torch.Tensor):
        return batch
    if pixel_values.ndim == 3:
        return batch

    rebuilt = processor.video_processor(
        videos=video_paths,
        return_tensors="pt",
        return_metadata=True,
        data_format="channels_first",
        do_convert_rgb=True,
    )
    seq_grid = rebuilt["video_grid_thw"]
    flat_pixels = rebuilt["pixel_values_videos"]
    offsets = (seq_grid[:, 0] * seq_grid[:, 1] * seq_grid[:, 2]).tolist()

    sequences = []
    cursor = 0
    for offset in offsets:
        next_cursor = cursor + offset
        sequences.append(flat_pixels[cursor:next_cursor])
        cursor = next_cursor

    batch["pixel_values_videos"] = pad_sequence(sequences, batch_first=True)
    batch["video_grid_thw"] = seq_grid
    if "video_metadata" in batch:
        batch["video_metadata"] = rebuilt.get("video_metadata", batch["video_metadata"])
    return batch


def encode_queries(texts):
    batch = processor.process_texts(texts=texts)
    batch = {k: v.to(DEVICE) for k, v in batch.items()}
    with torch.inference_mode():
        out = model(**batch)
    return out.embeddings.to(torch.bfloat16).cpu()

def encode_videos(paths):
    vids = [str(Path(p).expanduser()) for p in paths]
    feats = processor(
        videos=vids,
        return_tensors="pt",
        padding="longest",
        videos_kwargs={"return_metadata": True},
    )
    feats = _pad_video_sequences(feats, vids)
    feats.pop("video_metadata", None)  # drop metadata before forwarding to the model
    feats = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in feats.items()}
    with torch.inference_mode():
        out = model(**feats)
    return out.embeddings.to(torch.bfloat16).cpu()

q_emb = encode_queries(queries)
v_emb = encode_videos(videos)
scores = processor.score_multi_vector(q_emb, v_emb)
print(scores)
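
The processor's video pipeline handles frame sampling internally. If you prefer to control sampling yourself, the same retrieval can be approximated frame-wise via process_images, as described under Key Properties. The sketch below is an illustrative alternative rather than the card's reference pipeline; it assumes torchvision is installed, the video paths above exist, and the frame count of 4 is an arbitrary choice.

from PIL import Image
from torchvision.io import read_video

def encode_video_framewise(path: str, num_frames: int = 4) -> torch.Tensor:
    # Decode the clip and sample a few evenly spaced frames (T, H, W, C, uint8).
    frames, _, _ = read_video(path, pts_unit="sec", output_format="THWC")
    idx = torch.linspace(0, frames.shape[0] - 1, steps=num_frames).long()
    pil_frames = [Image.fromarray(frames[i].numpy()) for i in idx]

    # Encode each sampled frame as a standalone image.
    feats = processor.process_images(images=pil_frames)
    feats = {k: v.to(DEVICE) if isinstance(v, torch.Tensor) else v for k, v in feats.items()}
    with torch.inference_mode():
        out = model(**feats)

    # Flatten [num_frames, seq_len, dim] -> [num_frames * seq_len, dim] so the
    # clip is treated as one long multi-vector document for MaxSim scoring.
    emb = out.embeddings.to(torch.bfloat16).cpu()
    return emb.reshape(-1, emb.shape[-1])

frame_embs = [encode_video_framewise(p) for p in videos]
frame_scores = processor.score_multi_vector(q_emb, frame_embs)
print(frame_scores)

Per-frame multi-vectors can also be pooled (for example with a per-dimension max across frames) to cap storage per clip; concatenation, as used here, is the simplest option.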

⚖️ Strengths & Limitations

Strengths

  • Performance: State-of-the-art retrieval on the ViDoRe V2 and V3 benchmarks, with strong results across multimodal document retrieval tasks.
  • Complex Layouts: Excellent handling of chart-rich PDFs and domain-specific documents.
  • End-to-end Retrieval: OCR-free retrieval over unseen multimodal documents, with no intermediate vision LLM needed to generate summaries for indexing.
  • Retrieval Task Transfer: Inherits strong text retrieval performance from the merged Qwen3-Embedding-8B weights.
  • Multilingualism: Strong performance on non-English document inputs.

Limitations

  • Video Support: In our preliminary findings the model generalizes to video retrieval, but it has not been fine-tuned on large-scale video retrieval datasets; we plan to improve this in future releases.
  • Storage Cost: Multi-vector storage is still larger than single-vector baselines despite the reduced per-token dimension.
  • Retrieval Instructions: The model is not yet fine-tuned with diverse retrieval instructions in the style of the Qwen3-Embedding models; we intend to improve this with more synthetic data in the future.

License & Data

Distributed under Apache 2.0.

  • Weights: Upstream Qwen checkpoints retain their community licenses; ensure compliance when mixing.
  • Data: Training data includes ViDoRe/MTEB corpora and synthetic VisRAG assets.

Acknowledgement

We gratefully acknowledge the support of Tomoro AI, a leading AI engineering firm dedicated to delivering high-quality enterprise solutions that accelerate complex R&D and business transformation. This work directly enhances Tomoro's customized multimodal agentic RAG pipelines, empowering autonomous agents to parse, reason over, and retrieve from large-scale internal enterprise documentation. By bridging the gap between vision and language, the model supports Tomoro AI's mission to deliver high-quality enterprise multimodal solutions and deploy robust, production-grade intelligence across high-stakes industries.

📚 Citation

If you use this model, please cite:

@misc{huang2025tomoro_colqwen3_embed,
  title  = {TomoroAI/tomoro-colqwen3-embed},
  author = {Xin Huang and Kye Min Tan and Albert Phelps},
  year   = {2025},
  url    = {https://huggingface.co/TomoroAI/tomoro-colqwen3-embed-8b}
}