CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents Paper • 2511.02734 • Published 26 days ago • 20
CiteGuard: Faithful Citation Attribution for LLMs via Retrieval-Augmented Validation Paper • 2510.17853 • Published Oct 15 • 7
AdaCtrl: Towards Adaptive and Controllable Reasoning via Difficulty-Aware Budgeting Paper • 2505.18822 • Published May 24 • 15
UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios Paper • 2509.21766 • Published Sep 26 • 23
Thinking with Images for Multimodal Reasoning: Foundations, Methods, and Future Frontiers Paper • 2506.23918 • Published Jun 30 • 88
MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly Paper • 2505.10610 • Published May 15 • 54
VLM^2-Bench: A Closer Look at How Well VLMs Implicitly Link Explicit Matching Visual Cues Paper • 2502.12084 • Published Feb 17 • 32
MentalArena: Self-play Training of Language Models for Diagnosis and Treatment of Mental Health Disorders Paper • 2410.06845 • Published Oct 9, 2024 • 5