Budget-aware Test-time Scaling via Discriminative Verification Paper • 2510.14913 • Published 15 days ago • 4
Predicting Task Performance with Context-aware Scaling Laws Paper • 2510.14919 • Published 15 days ago • 3
JudgeBench: A Benchmark for Evaluating LLM-based Judges Paper • 2410.12784 • Published Oct 16, 2024 • 48