flow2023's Collections

MLLM
updated
TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones
Paper • 2312.16862 • Published • 31

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Paper • 2312.17172 • Published • 30

Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Paper • 2401.01974 • Published • 7

From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations
Paper • 2401.01885 • Published • 28

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models
Paper • 2401.01335 • Published • 68

Improving Text Embeddings with Large Language Models
Paper • 2401.00368 • Published • 82

Distilling Vision-Language Models on Millions of Videos
Paper • 2401.06129 • Published • 17

Bootstrapping LLM-based Task-Oriented Dialogue Agents via Self-Talk
Paper • 2401.05033 • Published • 18

LEGO: Language Enhanced Multi-modal Grounding Model
Paper • 2401.06071 • Published • 12

Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Paper • 2401.04575 • Published • 17

Narrowing the Knowledge Evaluation Gap: Open-Domain Question Answering with Multi-Granularity Answers
Paper • 2401.04695 • Published • 13

Mixtral of Experts
Paper • 2401.04088 • Published • 159

Open-Vocabulary SAM: Segment and Recognize Twenty-thousand Classes Interactively
Paper • 2401.02955 • Published • 23

Understanding LLMs: A Comprehensive Overview from Training to Inference
Paper • 2401.02038 • Published • 65

Can Large Language Models Understand Context?
Paper • 2402.00858 • Published • 23

StrokeNUWA: Tokenizing Strokes for Vector Graphic Synthesis
Paper • 2401.17093 • Published • 20

InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Paper • 2401.16420 • Published • 55

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Paper • 2401.15947 • Published • 53

Overcoming the Pitfalls of Vision-Language Model Finetuning for OOD Generalization
Paper • 2401.15914 • Published • 7

MM-LLMs: Recent Advances in MultiModal Large Language Models
Paper • 2401.13601 • Published • 48

Small Language Model Meets with Reinforced Vision Vocabulary
Paper • 2401.12503 • Published • 32

Large Language Models are Superpositions of All Characters: Attaining Arbitrary Role-play via Self-Alignment
Paper • 2401.12474 • Published • 36

Spotting LLMs With Binoculars: Zero-Shot Detection of Machine-Generated Text
Paper • 2401.12070 • Published • 45

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Paper • 2401.12168 • Published • 29

MM1: Methods, Analysis & Insights from Multimodal LLM Pre-training
Paper • 2403.09611 • Published • 129

DeepSeek-VL: Towards Real-World Vision-Language Understanding
Paper • 2403.05525 • Published • 46

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Paper • 2403.04132 • Published • 40

FinTral: A Family of GPT-4 Level Multimodal Financial Large Language Models
Paper • 2402.10986 • Published • 81

Linear Transformers with Learnable Kernel Functions are Better In-Context Models
Paper • 2402.10644 • Published • 81

TravelPlanner: A Benchmark for Real-World Planning with Language Agents
Paper • 2402.01622 • Published • 37

LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
Paper • 2403.15042 • Published • 27

When Do We Not Need Larger Vision Models?
Paper • 2403.13043 • Published • 26

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
Paper • 2404.07972 • Published • 50

Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
Paper • 2404.07973 • Published • 32

BRAVE: Broadening the visual encoding of vision-language models
Paper • 2404.07204 • Published • 19

SEED-X: Multimodal Models with Unified Multi-granularity Comprehension and Generation
Paper • 2404.14396 • Published • 19

PhysDreamer: Physics-Based Interaction with 3D Objects via Video Generation
Paper • 2404.13026 • Published • 24

Reka Core, Flash, and Edge: A Series of Powerful Multimodal Language Models
Paper • 2404.12387 • Published • 39

BLINK: Multimodal Large Language Models Can See but Not Perceive
Paper • 2404.12390 • Published • 26

Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation
Paper • 2404.19752 • Published • 24

SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension
Paper • 2404.16790 • Published • 10

Many-Shot In-Context Learning in Multimodal Foundation Models
Paper • 2405.09798 • Published • 32

ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
Paper • 2406.04325 • Published • 75

Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models
Paper • 2406.09403 • Published • 23

Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
Paper • 2406.06469 • Published • 29

Mixture-of-Agents Enhances Large Language Model Capabilities
Paper • 2406.04692 • Published • 59

GAMA: A Large Audio-Language Model with Advanced Audio Understanding and Complex Reasoning Abilities
Paper • 2406.11768 • Published • 24

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
Paper • 2406.19389 • Published • 54

SeaKR: Self-aware Knowledge Retrieval for Adaptive Retrieval Augmented Generation
Paper • 2406.19215 • Published • 31

Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning
Paper • 2406.15334 • Published • 9

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
Paper • 2407.03320 • Published • 95

FunAudioLLM: Voice Understanding and Generation Foundation Models for Natural Interaction Between Humans and LLMs
Paper • 2407.04051 • Published • 39

HEMM: Holistic Evaluation of Multimodal Foundation Models
Paper • 2407.03418 • Published • 12

Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models
Paper • 2407.01906 • Published • 43

VITA: Towards Open-Source Interactive Omni Multimodal LLM
Paper • 2408.05211 • Published • 50

Task-oriented Sequential Grounding in 3D Scenes
Paper • 2408.04034 • Published • 8

Show-o: One Single Transformer to Unify Multimodal Understanding and Generation
Paper • 2408.12528 • Published • 51

Law of Vision Representation in MLLMs
Paper • 2408.16357 • Published • 95

CogVLM2: Visual Language Models for Image and Video Understanding
Paper • 2408.16500 • Published • 57

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
Paper • 2408.15998 • Published • 87

LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Paper • 2408.15881 • Published • 21

Building and better understanding vision-language models: insights and future directions
Paper • 2408.12637 • Published • 133

OneGen: Efficient One-Pass Unified Generation and Retrieval for LLMs
Paper • 2409.05152 • Published • 32

LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Paper • 2409.02889 • Published • 54

OLMoE: Open Mixture-of-Experts Language Models
Paper • 2409.02060 • Published • 78

Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming
Paper • 2408.16725 • Published • 52

MIO: A Foundation Model on Multimodal Tokens
Paper • 2409.17692 • Published • 53

Aria: An Open Multimodal Native Mixture-of-Experts Model
Paper • 2410.05993 • Published • 111

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
Paper • 2410.19168 • Published • 23