Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence Paper • 2510.20470 • Published 11 days ago • 11
TimeChat-Online: 80% Visual Tokens are Naturally Redundant in Streaming Videos Paper • 2504.17343 • Published Apr 24 • 13
TEMPLE:Temporal Preference Learning of Video LLMs via Difficulty Scheduling and Pre-SFT Alignment Paper • 2503.16929 • Published Mar 21
RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction Paper • 2505.22613 • Published May 28 • 9
VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning? Paper • 2505.23359 • Published May 29 • 39
PunchBench: Benchmarking MLLMs in Multimodal Punchline Comprehension Paper • 2412.11906 • Published Dec 16, 2024
UVE: Are MLLMs Unified Evaluators for AI-Generated Videos? Paper • 2503.09949 • Published Mar 13 • 5
VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models Paper • 2311.17404 • Published Nov 29, 2023 • 1
TempCompass: Do Video LLMs Really Understand Videos? Paper • 2403.00476 • Published Mar 1, 2024 • 1
COST-EFF: Collaborative Optimization of Spatial and Temporal Efficiency with Slenderized Multi-exit Language Models Paper • 2210.15523 • Published Oct 27, 2022 • 1