Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
Abstract
Multi-Crit evaluates multimodal models on following diverse criteria with metrics for pluralistic adherence, criterion-switching flexibility, and recognizing preference conflicts, revealing gaps in model capabilities.
Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction-following ability and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria, especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses of reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.
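To make the idea of criterion-level judgment concrete, here is a minimal Python sketch of how a judge's per-criterion preferences could be scored against human annotations. The abstract does not give the formulas for the three metrics, so the functions below (`pluralistic_accuracy`, `conflict_recall`), the criterion names, and the toy labels are all illustrative assumptions, not the paper's definitions or data.

```python
# Illustrative sketch only: metric definitions and data are assumptions,
# not the formulas or annotations used by Multi-Crit.

# Each item maps a criterion name to the preferred response ("A" or "B").
human = [
    {"accuracy": "A", "clarity": "B"},  # human labels conflict across criteria
    {"accuracy": "A", "clarity": "A"},
    {"accuracy": "B", "clarity": "A"},  # human labels conflict across criteria
]
judge = [
    {"accuracy": "A", "clarity": "B"},
    {"accuracy": "A", "clarity": "B"},
    {"accuracy": "A", "clarity": "A"},
]


def pluralistic_accuracy(human, judge):
    """Fraction of items where the judge agrees with humans on every criterion."""
    hits = sum(all(j[c] == h[c] for c in h) for h, j in zip(human, judge))
    return hits / len(human)


def has_conflict(labels):
    """True if different criteria prefer different responses."""
    return len(set(labels.values())) > 1


def conflict_recall(human, judge):
    """Among items with conflicting human labels, fraction where the
    judge's criterion-level labels also conflict."""
    conflicted = [(h, j) for h, j in zip(human, judge) if has_conflict(h)]
    if not conflicted:
        return 0.0
    return sum(has_conflict(j) for _, j in conflicted) / len(conflicted)


print(pluralistic_accuracy(human, judge))
print(conflict_recall(human, judge))
```

Under these assumed definitions, a judge that collapses every criterion into a single holistic preference would never show a conflict, so it would score zero on the conflict measure even if its overall preferences were reasonable.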
Community
Happy to share our recent work on benchmarking large multimodal models (LMMs) as judges in their ability to follow pluralistic, fine-grained evaluation criteria. Our paper is titled “Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following.”
Paper: https://arxiv.org/abs/2511.21662
Project Page: https://multi-crit.github.io/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- MM-CRITIC: A Holistic Evaluation of Large Multimodal Models as Multimodal Critique (2025)
- Deploying Tiny LVLM Judges for Real-World Evaluation of Chart Models: Lessons Learned and Best Practices (2025)
- CLASH: A Benchmark for Cross-Modal Contradiction Detection (2025)
- Incentivizing Agentic Reasoning in LLM Judges via Tool-Integrated Reinforcement Learning (2025)
- Auto-Prompt Ensemble for LLM Judge (2025)
- MDSEval: A Meta-Evaluation Benchmark for Multimodal Dialogue Summarization (2025)
- mR3: Multilingual Rubric-Agnostic Reward Reasoning Models (2025)