FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
Abstract
A contamination-free evaluation of large reasoning models is conducted using the ROME benchmark, which tests reasoning from visual clues in vision language models.
We conduct a moderate-scale, contamination-free (to some extent) evaluation of current large reasoning models (LRMs) and report preliminary findings. We also release ROME, our evaluation benchmark for vision language models, intended to test reasoning from visual clues. Links to the benchmark, evaluation data, and further updates are available on the project website: https://flageval-baai.github.io/LRM-Eval/
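For readers who want to try the released benchmark, below is a minimal sketch of how a Hugging Face-hosted evaluation set of this kind is typically loaded. The dataset identifier and split name are assumptions for illustration; consult the project page above for the actual release location.

```python
# Minimal sketch: loading an image-question evaluation set from the Hugging Face Hub.
# NOTE: "BAAI/ROME" is a hypothetical dataset ID used for illustration only; see the
# project page (https://flageval-baai.github.io/LRM-Eval/) for the actual release.
from datasets import load_dataset

rome = load_dataset("BAAI/ROME", split="test")  # hypothetical ID and split name

# Inspect a few examples; a visual-reasoning benchmark typically pairs an image
# with an automatically verifiable question and answer.
for example in rome.select(range(3)):
    print(example)
```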
Community
This is an evaluation framework covering capability × alignment × safety × efficiency, together with ROME, a new benchmark for visual reasoning, intended to guide decisions on model choice and risk.
The following similar papers were recommended by the Semantic Scholar API:
- Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction (2025)
- Safe Semantics, Unsafe Interpretations: Tackling Implicit Reasoning Safety in Large Vision-Language Models (2025)
- Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models (2025)
- BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning (2025)
- Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes (2025)
- What is an "Abstract Reasoner"? Revisiting Experiments and Arguments about Large Language Models (2025)
- Towards Reliable and Interpretable Document Question Answering via VLMs (2025)