ScienceBoard: Evaluating Multimodal Autonomous Agents in Realistic Scientific Workflows Paper • 2505.19897 • Published May 26 • 104
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era Paper • 2503.12329 • Published Mar 16 • 27
Vision-Language Models Can Self-Improve Reasoning via Reflection Paper • 2411.00855 • Published Oct 30, 2024 • 5