arxiv:2509.17177

FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions

Published on Sep 21
· Submitted by Adina Yakefu on Sep 23
Abstract

A contamination-free evaluation of current large reasoning models is conducted, and ROME, a benchmark that tests reasoning from visual clues in vision-language models, is released.

AI-generated summary

We conduct a moderate-scale contamination-free (to some extent) evaluation of current large reasoning models (LRMs) with some preliminary findings. We also release ROME, our evaluation benchmark for vision language models intended to test reasoning from visual clues. We attach links to the benchmark, evaluation data, and other updates on this website: https://flageval-baai.github.io/LRM-Eval/
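For readers who want to try the released evaluation data, here is a minimal sketch of loading it with the Hugging Face `datasets` library. The repository id "BAAI/ROME" is an assumption for illustration only; consult the project page linked above for the actual dataset location and schema.

```python
# Minimal sketch (assumptions): load the ROME benchmark with the `datasets` library.
# The repository id "BAAI/ROME" is a placeholder guess; see
# https://flageval-baai.github.io/LRM-Eval/ for the real dataset location and splits.
from datasets import load_dataset

rome = load_dataset("BAAI/ROME")   # hypothetical repo id
print(rome)                        # shows the available splits and their sizes

# Peek at one example from the first split; field names depend on the actual release.
first_split = next(iter(rome))
print(rome[first_split][0])
```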

Community

Paper submitter

It's an evaluation framework covering capability × alignment × safety × efficiency, together with ROME, a new benchmark built for visual reasoning, intended to guide decisions on model choice and risk.




Datasets citing this paper 1


Collections including this paper 3