PaliGemma: A versatile 3B VLM for transfer
Paper
• 2407.07726
• Published • 72
Vision language models are blind
Paper
• 2407.06581
• Published • 84
PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper
• 2404.16994
• Published • 37
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper
• 2403.05525
• Published • 49
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Paper
• 2405.04434
• Published • 25
Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Paper
• 2404.19752
• Published • 24
InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
Paper
• 2404.06512
• Published • 30
Sigmoid Loss for Language Image Pre-Training
Paper
• 2303.15343
• Published • 11
CogVLM: Visual Expert for Pretrained Language Models
Paper
• 2311.03079
• Published • 27
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Paper
• 2401.16420
• Published • 55
What matters when building vision-language models?
Paper
• 2405.02246
• Published • 103
Multimodal Autoregressive Pre-training of Large Vision Encoders
Paper
• 2411.14402
• Published • 47
SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features
Paper
• 2502.14786
• Published • 161