DriTrove's Collections
Paper: Exploring the Potential of Encoder-free Architectures in 3D LMMs • arXiv:2502.09620 • 26 upvotes
Paper: The Evolution of Multimodal Model Architectures • arXiv:2405.17927 • 1 upvote
Paper: What matters when building vision-language models? • arXiv:2405.02246 • 103 upvotes
Paper: Efficient Architectures for High Resolution Vision-Language Models • arXiv:2501.02584
Paper: Building and better understanding vision-language models: insights and future directions • arXiv:2408.12637 • 133 upvotes
Paper: Improving Fine-grained Visual Understanding in VLMs through Text-Only Training • arXiv:2412.12940
Paper: VILA: On Pre-training for Visual Language Models • arXiv:2312.07533 • 23 upvotes
Paper: Renaissance: Investigating the Pretraining of Vision-Language Encoders • arXiv:2411.06657
Paper: Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions • arXiv:2404.07214
Paper: NanoVLMs: How small can we go and still make coherent Vision Language Models? • arXiv:2502.07838
Paper: POINTS: Improving Your Vision-language Model with Affordable Strategies • arXiv:2409.04828 • 24 upvotes
Paper: Unveiling Encoder-Free Vision-Language Models • arXiv:2406.11832 • 54 upvotes
Paper: Efficient Vision-Language Models by Summarizing Visual Tokens into Compact Registers • arXiv:2410.14072
Paper: LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token • arXiv:2501.03895 • 52 upvotes
Paper: MobileVLM V2: Faster and Stronger Baseline for Vision Language Model • arXiv:2402.03766 • 15 upvotes
Model: HuggingFaceTB/SmolVLM-256M-Instruct • Image-Text-to-Text • 0.3B params • 158k downloads • 296 likes
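Several checkpoints in this collection (SmolVLM-256M-Instruct, Qwen2.5-VL-3B-Instruct, paligemma2-3b-mix-448) are image-text-to-text models that load through the standard transformers auto-classes. A minimal sketch of running the smallest one, assuming a recent transformers release and a checkpoint that ships a chat template; the image URL, dtype, and generation length are placeholder choices, not part of the collection:

```python
# Hedged sketch: load a small image-text-to-text checkpoint and caption an image.
# Assumes a recent `transformers` with AutoModelForVision2Seq support for SmolVLM;
# the image URL below is a placeholder.
import requests
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "HuggingFaceTB/SmolVLM-256M-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Fetch an example image (placeholder URL).
image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)

# Build a chat-style prompt with one image slot plus a text instruction.
messages = [{"role": "user",
             "content": [{"type": "image"},
                         {"type": "text", "text": "Describe this image."}]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)

# Tokenize text and preprocess the image together, then generate.
inputs = processor(text=prompt, images=[image], return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```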
Model: Qwen/Qwen2.5-VL-3B-Instruct • Image-Text-to-Text • 4B params • 7.87M downloads • 545 likes
Paper: PaliGemma: A versatile 3B VLM for transfer • arXiv:2407.07726 • 72 upvotes
Model: MILVLG/imp-v1-3b • Text Generation • 3B params • 232 downloads • 201 likes
Model: marianna13/llava-phi-2-3b • Text Generation • 3B params • 45 downloads • 13 likes
Paper: BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices • arXiv:2411.10640 • 46 upvotes
Paper: Scalable Vision Language Model Training via High Quality Data Curation • arXiv:2501.05952 • 5 upvotes
Paper: Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models • arXiv:2403.18814 • 47 upvotes
Paper: VisionZip: Longer is Better but Not Necessary in Vision Language Models • arXiv:2412.04467 • 118 upvotes
Paper: Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution • arXiv:2409.12191 • 78 upvotes
Paper: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs • arXiv:2401.06209
Paper: Model Composition for Multimodal Large Language Models • arXiv:2402.12750
Paper: A Review of Multi-Modal Large Language and Vision Models • arXiv:2404.01322
Paper: The (R)Evolution of Multimodal Large Language Models: A Survey • arXiv:2402.12451
Paper: TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones • arXiv:2312.16862 • 31 upvotes
Paper: Survey of Large Multimodal Model Datasets, Application Categories and Taxonomy • arXiv:2412.17759
Paper: TinyLLaVA: A Framework of Small-scale Large Multimodal Models • arXiv:2402.14289 • 21 upvotes
Paper: Towards Multi-Modal Mastery: A 4.5B Parameter Truly Multi-Modal Small Language Model • arXiv:2411.05903
Paper: Boosting the Power of Small Multimodal Reasoning Models to Match Larger Models with Self-Consistency Training • arXiv:2311.14109
Paper: TinyLLaVA-Video: A Simple Framework of Small-scale Large Multimodal Models for Video Understanding • arXiv:2501.15513
Paper: LLaVA-φ: Efficient Multi-Modal Assistant with Small Language Model • arXiv:2401.02330 • 18 upvotes
Paper: MM-LLMs: Recent Advances in MultiModal Large Language Models • arXiv:2401.13601 • 48 upvotes
Paper: Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance • arXiv:2410.16261 • 4 upvotes
Paper: Vision-Language Models for Edge Networks: A Comprehensive Survey • arXiv:2502.07855
Paper: Self-Adapting Large Visual-Language Models to Edge Devices across Visual Modalities • arXiv:2403.04908
Model: google/paligemma2-3b-mix-448 • Image-Text-to-Text • 3B params • 5.08k downloads • 51 likes
Paper: LLaVA-o1: Let Vision Language Models Reason Step-by-Step • arXiv:2411.10440 • 129 upvotes
Paper: InfiR: Crafting Effective Small Language Models and Multimodal Small Language Models in Reasoning • arXiv:2502.11573 • 9 upvotes
Paper: Small Vision-Language Models: A Survey on Compact Architectures and Techniques • arXiv:2503.10665
Paper: TIPS: Text-Image Pretraining with Spatial Awareness • arXiv:2410.16512 • 3 upvotes