Vision-Language Models
Visual Instruction Tuning
Paper • 2304.08485
Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities
Paper • 2308.12966
Improved Baselines with Visual Instruction Tuning
Paper • 2310.03744
SILC: Improving Vision Language Pretraining with Self-Distillation
Paper • 2310.13355
CogVLM: Visual Expert for Pretrained Language Models
Paper • 2311.03079
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
Paper • 2311.12793
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525
OmniFusion Technical Report
Paper • 2404.06212
Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Paper • 2404.12387
Pegasus-v1 Technical Report
Paper • 2404.14687
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
Paper • 2404.16821
PLLaVA: Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning
Paper • 2404.16994
What matters when building vision-language models?
Paper • 2405.02246
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Paper • 2407.03320
Understanding Alignment in Multimodal LLMs: A Comprehensive Study
Paper • 2407.02477
μ-Bench: A Vision-Language Benchmark for Microscopy Understanding
Paper • 2407.01791