Evaluation - a kkish Collection

updated 9 days ago

NL2Repo-Bench: Towards Long-Horizon Repository Generation Evaluation of Coding Agents

Paper • 2512.12730 • Published 11 days ago • 43

GPT-Fathom: Benchmarking Large Language Models to Decipher the Evolutionary Path towards GPT-4 and Beyond

Paper • 2309.16583 • Published Sep 28, 2023 • 12