- EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
  Paper • 2402.04252 • Published • 28
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models
  Paper • 2402.03749 • Published • 14
- ScreenAI: A Vision-Language Model for UI and Infographics Understanding
  Paper • 2402.04615 • Published • 44
- EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss
  Paper • 2402.05008 • Published • 23

Collections
Collections including paper arxiv:2501.04575 
						
					
- InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
  Paper • 2504.14239 • Published • 13
- InfiX-ai/InfiGUI-R1-3B
  Image-Text-to-Text • 4B • Updated • 217 • 6
- InfiX-ai/android_control_train
  Viewer • Updated • 13.6k • 41
- InfiX-ai/android_control_test
  Updated • 42 • 1

- rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
  Paper • 2501.04519 • Published • 285
- Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought
  Paper • 2501.04682 • Published • 99
- Search-o1: Agentic Search-Enhanced Large Reasoning Models
  Paper • 2501.05366 • Published • 102
- Agent Laboratory: Using LLM Agents as Research Assistants
  Paper • 2501.04227 • Published • 94

- xlangai/Aguvis-7B-720P
  8B • Updated • 50 • 9
- Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
  Paper • 2412.04454 • Published • 71
- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
  Paper • 2401.10935 • Published • 5
- cckevinn/SeeClick
  Text Generation • 10B • Updated • 166 • 18

- 2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining
  Paper • 2501.00958 • Published • 107
- CodeElo: Benchmarking Competition-level Code Generation of LLMs with Human-comparable Elo Ratings
  Paper • 2501.01257 • Published • 52
- Reconstruction vs. Generation: Taming Optimization Dilemma in Latent Diffusion Models
  Paper • 2501.01423 • Published • 43
- REDUCIO! Generating 1024×1024 Video within 16 Seconds using Extremely Compressed Motion Latents
  Paper • 2411.13552 • Published