AMO-Bench: Large Language Models Still Struggle in High School Math Competitions • Paper • 2510.26768 • Published 1 day ago • 26
The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution • Paper • 2510.25726 • Published 2 days ago • 39
Tool-integrated Reinforcement Learning for Repo Deep Search • Paper • 2508.03012 • Published Aug 5 • 20
SoRFT: Issue Resolving with Subtask-oriented Reinforced Fine-Tuning • Paper • 2502.20127 • Published Feb 27 • 9