One Forward is Enough for Neural Network Training via Likelihood Ratio Method Paper • 2305.08960 • Published May 15, 2023
Video Understanding with Large Language Models: A Survey Paper • 2312.17432 • Published Dec 29, 2023 • 3
VidComposition: Can MLLMs Analyze Compositions in Compiled Videos? Paper • 2411.10979 • Published Nov 17, 2024
Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See Paper • 2410.06169 • Published Oct 8, 2024
From 16-Bit to 1-Bit: Visual KV Cache Quantization for Memory-Efficient Multimodal Large Language Models Paper • 2502.14882 • Published Feb 15, 2025
Caption Anything in Video: Fine-grained Object-centric Captioning via Spatiotemporal Multimodal Prompting Paper • 2504.05541 • Published Apr 7, 2025 • 15
MMPerspective: Do MLLMs Understand Perspective? A Comprehensive Benchmark for Perspective Perception, Reasoning, and Robustness Paper • 2505.20426 • Published May 26, 2025 • 7
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models Paper • 2510.05034 • Published Oct 2025 • 46