ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning Paper • 2510.27492 • Published 4 days ago • 54
Revisiting Multimodal Positional Encoding in Vision-Language Models Paper • 2510.23095 • Published 8 days ago • 6
SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding Paper • 2408.14764 • Published Aug 27, 2024
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution Paper • 2409.12191 • Published Sep 18, 2024 • 78
CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy Paper • 2412.02210 • Published Dec 3, 2024
Qwen/Qwen3-VL-235B-A22B-Instruct Image-Text-to-Text • 236B • Updated about 1 month ago • 53.1k • 307