Abstract
We present Jina-VLM, a 2.4B parameter vision-language model that achieves state-of-the-art multilingual visual question answering among open 2B-scale VLMs. The model couples a SigLIP2 vision encoder with a Qwen3 language backbone through an attention-pooling connector that enables token-efficient processing of arbitrary-resolution images. Across standard VQA benchmarks and multilingual evaluations, Jina-VLM outperforms comparable models while preserving competitive text-only performance.
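To make the connector idea concrete, below is a minimal PyTorch sketch of an attention-pooling connector that compresses a variable number of vision-encoder tokens into a fixed budget before projecting them into the language model's embedding space. The class name, dimensions (1152 for the vision encoder, 2048 for the LLM), and the 64-query budget are illustrative assumptions, not the released Jina-VLM implementation.

```python
# Sketch of an attention-pooling connector (assumed design, not the released code).
# It cross-attends a small set of learned query tokens over the vision tokens,
# so the LLM always receives a fixed number of image tokens regardless of
# image resolution.
import torch
import torch.nn as nn


class AttentionPoolingConnector(nn.Module):
    """Pools N vision tokens into `num_queries` tokens via cross-attention,
    then projects them into the language model's embedding dimension."""

    def __init__(self, vision_dim=1152, llm_dim=2048, num_queries=64, num_heads=8):
        super().__init__()
        # Learned queries act as pooled "slots" for image information.
        self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
        self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(vision_dim)
        # Project pooled tokens into the LLM embedding space.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vision_dim); num_patches may vary
        # with image resolution, but the output length is always num_queries.
        b = vision_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        pooled, _ = self.attn(q, vision_tokens, vision_tokens)
        return self.proj(self.norm(pooled))


if __name__ == "__main__":
    connector = AttentionPoolingConnector()
    # A high-resolution image might yield ~1024 patch tokens; the connector
    # emits 64 tokens either way, keeping the LLM context cost constant.
    patches = torch.randn(1, 1024, 1152)
    print(connector(patches).shape)  # torch.Size([1, 64, 2048])
```

The point of the design, as stated in the abstract, is token efficiency: the language model's context cost for an image stays constant even as input resolution grows.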
Community
Our latest multilingual VLM at 2B scale, releasing soon.
We take two Apache-2.0 components, combine them, and release the result as CC-BY-NC.
Bravo.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- NVIDIA Nemotron Nano V2 VL (2025)
- Text-Guided Semantic Image Encoder (2025)
- TowerVision: Understanding and Improving Multilinguality in Vision-Language Models (2025)
- From Pixels to Words - Towards Native Vision-Language Primitives at Scale (2025)
- EM-KD: Distilling Efficient Multimodal Large Language Model with Unbalanced Vision Tokens (2025)
- Topological Alignment of Shared Vision-Language Embedding Space (2025)
- Attention Guided Alignment in Efficient Vision-Language Models (2025)
If you want recommendations for any paper on Hugging Face, check out this Space.
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend