FlagEval Findings Report: A Preliminary Evaluation of Large Reasoning Models on Automatically Verifiable Textual and Visual Questions
Abstract
A contamination-free evaluation of large reasoning models is conducted using the ROME benchmark, which tests reasoning from visual clues in vision language models.
We conduct a moderate-scale, contamination-free (to some extent) evaluation of current large reasoning models (LRMs) and report preliminary findings. We also release ROME, our evaluation benchmark for vision language models, intended to test reasoning from visual clues. Links to the benchmark, evaluation data, and further updates are available on the project website: https://flageval-baai.github.io/LRM-Eval/
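For readers who want to try the released benchmark, below is a minimal sketch of how a Hugging Face-hosted evaluation set of this kind is typically loaded. The dataset identifier and split name are assumptions for illustration; consult the project page above for the actual release location.

```python
# Minimal sketch: loading an image-question evaluation set from the Hugging Face Hub.
# NOTE: "BAAI/ROME" is a hypothetical dataset ID used for illustration only; see the
# project page (https://flageval-baai.github.io/LRM-Eval/) for the actual release.
from datasets import load_dataset

rome = load_dataset("BAAI/ROME", split="test")  # hypothetical ID and split name

# Inspect a few examples; a visual-reasoning benchmark typically pairs an image
# with an automatically verifiable question and answer.
for example in rome.select(range(3)):
    print(example)
```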
Community
This is an evaluation framework covering capability × alignment × safety × efficiency, together with ROME, a new benchmark for visual reasoning, intended to guide decisions on model choice and risk.
The following similar papers were recommended by the Semantic Scholar API:
- Investigating Advanced Reasoning of Large Language Models via Black-Box Interaction (2025)
- Safe Semantics, Unsafe Interpretations: Tackling Implicit Reasoning Safety in Large Vision-Language Models (2025)
- Automated MCQA Benchmarking at Scale: Evaluating Reasoning Traces as Retrieval Sources for Domain Adaptation of Small Language Models (2025)
- BLUEX Revisited: Enhancing Benchmark Coverage with Automatic Captioning (2025)
- Dynamic Benchmark Construction for Evaluating Large Language Models on Real-World Codes (2025)
- What is an "Abstract Reasoner"? Revisiting Experiments and Arguments about Large Language Models (2025)
- Towards Reliable and Interpretable Document Question Answering via VLMs (2025)