arxiv:2512.04032

Jina-VLM: Small Multilingual Vision Language Model

Published on Dec 3 · Submitted by Han Xiao on Dec 4
Abstract

AI-generated summary

Jina-VLM, a 2.4B parameter vision-language model, achieves top performance in multilingual visual question answering using a SigLIP2 vision encoder and a Qwen3 language backbone with an attention-pooling connector.

We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
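The attention-pooling connector is the part of the architecture that keeps visual token counts manageable: a fixed set of learned queries cross-attends over however many patch embeddings the vision encoder produces, so the language backbone always receives a constant, small number of visual tokens regardless of image resolution. The PyTorch sketch below illustrates that general idea only; it is not the paper's implementation, and the class name, query count, and dimensions (1152 for SigLIP2-style patch embeddings, 2048 for a Qwen3-scale hidden size) are assumptions chosen for illustration.

# Minimal sketch of an attention-pooling connector between a vision encoder
# and a language model. NOT the paper's implementation; all names and sizes
# below are illustrative assumptions. A small set of learned query vectors
# cross-attends to the variable-length patch tokens from the vision encoder,
# producing a fixed number of visual embeddings for the language model.
import torch
import torch.nn as nn


class AttentionPoolingConnector(nn.Module):
    def __init__(self, vision_dim=1152, lm_dim=2048, num_queries=64, num_heads=8):
        super().__init__()
        # Learned queries: this sets how many visual tokens the LM receives.
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vision_dim)
        # Project pooled tokens into the language model's embedding space.
        self.proj = nn.Linear(vision_dim, lm_dim)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, vision_dim); num_patches varies
        # with image resolution and aspect ratio.
        batch = patch_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.attn(q, patch_tokens, patch_tokens)
        return self.proj(self.norm(pooled))  # (batch, num_queries, lm_dim)


# Usage: 1024 patch tokens from a high-resolution image are compressed to 64
# tokens before being handed to the language backbone alongside text embeddings.
connector = AttentionPoolingConnector()
patches = torch.randn(2, 1024, 1152)
visual_tokens = connector(patches)
print(visual_tokens.shape)  # torch.Size([2, 64, 2048])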

Community

Paper author · Paper submitter

Our latest multilingual VLM at 2B size, releasing soon.

We take two Apache-2.0 components, combine them, and release as CC-BY-NC.

Bravo.



Models citing this paper 1

Datasets citing this paper 0


Spaces citing this paper 0


Collections including this paper 2