Collections
Discover the best community collections!
Collections including paper arxiv:2504.10465

- FocusedAD: Character-centric Movie Audio Description
  Paper • 2504.12157 • Published • 8
- Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
  Paper • 2504.10465 • Published • 27
- PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding
  Paper • 2504.13180 • Published • 19
- OS-Copilot/OS-Atlas-Base-7B
  Image-Text-to-Text • 8B • Updated • 562 • 42

- iVideoGPT: Interactive VideoGPTs are Scalable World Models
  Paper • 2405.15223 • Published • 17
- Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
  Paper • 2405.15574 • Published • 55
- An Introduction to Vision-Language Modeling
  Paper • 2405.17247 • Published • 90
- Matryoshka Multimodal Models
  Paper • 2405.17430 • Published • 34

- Seed-Coder: Let the Code Model Curate Data for Itself
  Paper • 2506.03524 • Published • 6
- Seed1.5-Thinking: Advancing Superb Reasoning Models with Reinforcement Learning
  Paper • 2504.13914 • Published • 4
- FlowTok: Flowing Seamlessly Across Text and Image Tokens
  Paper • 2503.10772 • Published • 19
- UVE: Are MLLMs Unified Evaluators for AI-Generated Videos?
  Paper • 2503.09949 • Published • 5

- MLLM-as-a-Judge for Image Safety without Human Labeling
  Paper • 2501.00192 • Published • 31
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 107
- Xmodel-2 Technical Report
  Paper • 2412.19638 • Published • 26
- HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs
  Paper • 2412.18925 • Published • 104

- BLINK: Multimodal Large Language Models Can See but Not Perceive
  Paper • 2404.12390 • Published • 26
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning
  Paper • 2404.12803 • Published • 30
- Groma: Localized Visual Tokenization for Grounding Multimodal Large Language Models
  Paper • 2404.13013 • Published • 31
- InternLM-XComposer2-4KHD: A Pioneering Large Vision-Language Model Handling Resolutions from 336 Pixels to 4K HD
  Paper • 2404.06512 • Published • 30

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23