Collections
Discover the best community collections!

Collections including paper arxiv:2501.09781

- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28 upvotes
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14 upvotes
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44 upvotes
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23 upvotes

- Contrastive Learning for Many-to-many Multilingual Neural Machine Translation
  Paper • 2105.09501 • Published
- Cross-modal Contrastive Learning for Speech Translation
  Paper • 2205.02444 • Published
- ByteTransformer: A High-Performance Transformer Boosted for Variable-Length Inputs
  Paper • 2210.03052 • Published
- Diffusion Glancing Transformer for Parallel Sequence to Sequence Learning
  Paper • 2212.10240 • Published • 1 upvote

- Evolving Deeper LLM Thinking
  Paper • 2501.09891 • Published • 115 upvotes
- PaSa: An LLM Agent for Comprehensive Academic Paper Search
  Paper • 2501.10120 • Published • 52 upvotes
- Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
  Paper • 2501.09775 • Published • 33 upvotes
- ComplexFuncBench: Exploring Multi-Step and Constrained Function Calling under Long-Context Scenario
  Paper • 2501.10132 • Published • 22 upvotes

- MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models
  Paper • 2501.02955 • Published • 44 upvotes
- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 107 upvotes
- MMVU: Measuring Expert-Level Multi-Discipline Video Understanding
  Paper • 2501.12380 • Published • 85 upvotes
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
  Paper • 2501.09781 • Published • 28 upvotes

- CompCap: Improving Multimodal Large Language Models with Composite Captions
  Paper • 2412.05243 • Published • 20 upvotes
- LiFT: Leveraging Human Feedback for Text-to-Video Model Alignment
  Paper • 2412.04814 • Published • 47 upvotes
- MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
  Paper • 2412.05237 • Published • 46 upvotes
- Exploring Multi-Grained Concept Annotations for Multimodal Large Language Models
  Paper • 2412.05939 • Published • 16 upvotes

- Multimodal Self-Instruct: Synthetic Abstract Image and Visual Reasoning Instruction Using Language Model
  Paper • 2407.07053 • Published • 47 upvotes
- LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
  Paper • 2407.12772 • Published • 35 upvotes
- VLMEvalKit: An Open-Source Toolkit for Evaluating Large Multi-Modality Models
  Paper • 2407.11691 • Published • 15 upvotes
- MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models
  Paper • 2408.02718 • Published • 62 upvotes

- Evolving Deeper LLM Thinking
  Paper • 2501.09891 • Published • 115 upvotes
- PaSa: An LLM Agent for Comprehensive Academic Paper Search
  Paper • 2501.10120 • Published • 52 upvotes
- Multiple Choice Questions: Reasoning Makes Large Language Models (LLMs) More Self-Confident Even When They Are Wrong
  Paper • 2501.09775 • Published • 33 upvotes
- VideoWorld: Exploring Knowledge Learning from Unlabeled Videos
  Paper • 2501.09781 • Published • 28 upvotes

- Video Creation by Demonstration
  Paper • 2412.09551 • Published • 9 upvotes
- DiffSensei: Bridging Multi-Modal LLMs and Diffusion Models for Customized Manga Generation
  Paper • 2412.07589 • Published • 48 upvotes
- Unraveling the Complexity of Memory in RL Agents: an Approach for Classification and Evaluation
  Paper • 2412.06531 • Published • 72 upvotes
- APOLLO: SGD-like Memory, AdamW-level Performance
  Paper • 2412.05270 • Published • 38 upvotes

- PUMA: Empowering Unified MLLM with Multi-granular Visual Generation
  Paper • 2410.13861 • Published • 56 upvotes
- JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation
  Paper • 2411.07975 • Published • 30 upvotes
- Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
  Paper • 2411.10442 • Published • 86 upvotes
- Multimodal Autoregressive Pre-training of Large Vision Encoders
  Paper • 2411.14402 • Published • 46 upvotes