Papers: Evaluation
• Skill-Mix: a Flexible and Expandable Family of Evaluations for AI models (arXiv:2310.17567)
• This is not a Dataset: A Large Negation Benchmark to Challenge Large Language Models (arXiv:2310.15941)
• Holistic Evaluation of Language Models (arXiv:2211.09110)
• INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (arXiv:2306.04757)
• EleutherAI: Going Beyond "Open Science" to "Science in the Open" (arXiv:2210.06413)
• Leveraging Word Guessing Games to Assess the Intelligence of Large Language Models (arXiv:2310.20499)
• MEGAVERSE: Benchmarking Large Language Models Across Languages, Modalities, Models and Tasks (arXiv:2311.07463)
• Judging LLM-as-a-judge with MT-Bench and Chatbot Arena (arXiv:2306.05685)
• TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization (arXiv:2402.13249)