Arabic AI Benchmarks and Leaderboards
Over the past year, numerous benchmarks have been released to test various aspects of Arabic AI technologies, including LLM performance, Multimodality/Vision, Embedding, Retrieval, RAG Generation, STT, and OCR. This post serves as a comprehensive record of the benchmarks and leaderboards within the Arabic AI ecosystem. Our goal is to provide a centralized resource the community can use to find the appropriate benchmark for an evaluation task or to choose the top model for a specific task.
Leaderboards
Below is a list of leaderboards testing various aspects of Arabic AI models.
LLM Performance
| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| Open Arabic LLM Leaderboard (OALL) v2 | General Knowledge, MMLU, Grammar, RAG Generation, Trust & Safety, Sentiment Analysis & Dialects | https://huggingface.co/spaces/OALL/Open-Arabic-LLM-Leaderboard | v1 legacy |
| Arabic-Leaderboards | IFEval, Question Answering, Orthographic and Grammatical Analysis, Reasoning, Safety | https://huggingface.co/spaces/inceptionai/Arabic-Leaderboards | Closed datasets (except IFEval) |
| Scale Seal | Coding, Creative, Educational Support, Idea Development, Writing & Communication, and others | https://scale.com/leaderboard/arabic | Closed datasets, evaluated manually by human experts |
| Arabic Broad Leaderboard (ABL) | Comprehensive evaluation of the Arabic language through testing proficiency in 22 skills and categories | https://huggingface.co/spaces/silma-ai/Arabic-LLM-Broad-Leaderboard | Includes visualizations, analytical capabilities, model skill breakdowns, speed comparisons, and contamination detection mechanisms |
| Islamic MMLU | Islamic knowledge | https://huggingface.co/spaces/islamicmmlu/leaderboard | |
Embeddings
| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| MTEB (Legacy) | General embedding (sentence-to-sentence similarity) | https://huggingface.co/spaces/mteb/leaderboard_legacy | Click STS -> Other, then sort the STS17 (ar-ar) column in descending order |
| The Arabic RAG Leaderboard | Retrieval and re-ranking | https://huggingface.co/spaces/Navid-AI/The-Arabic-Rag-Leaderboard | Adding a RAG generation component is planned |
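As context for what these embedding leaderboards measure: STS-style tasks (such as STS17 ar-ar) embed each sentence pair, take the cosine similarity of the two vectors, and report the Spearman rank correlation against human similarity judgments. The sketch below illustrates that scoring loop with toy 2-d vectors and hypothetical gold scores; a real evaluation would use the `mteb` package and an actual embedding model over Arabic sentence pairs.

```python
# Sketch of STS-style embedding evaluation: cosine similarity per sentence
# pair, then Spearman correlation against human-annotated similarity scores.
# Vectors and gold scores below are toy/hypothetical stand-ins.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def spearman(xs: list[float], ys: list[float]) -> float:
    # Spearman rank correlation without tie handling (enough for illustration).
    def ranks(vs: list[float]) -> list[int]:
        order = sorted(range(len(vs)), key=lambda i: vs[i])
        r = [0] * len(vs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - (6 * d2) / (n * (n * n - 1))

# Toy embeddings standing in for model(sentence_a), model(sentence_b):
pairs = [([1.0, 0.0], [1.0, 0.0]),   # near-identical sentences
         ([1.0, 0.0], [0.0, 1.0]),   # unrelated sentences
         ([1.0, 1.0], [1.0, 0.0])]   # partially related sentences
gold = [5.0, 0.0, 3.5]               # hypothetical human similarity ratings
preds = [cosine(a, b) for a, b in pairs]
score = spearman(preds, gold)        # rankings agree perfectly here -> 1.0
```

The leaderboard number for a model on STS17 (ar-ar) is exactly this Spearman score, averaged over the task's annotated pairs.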
Vision / OCR
| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| CAMEL-Bench | Vision understanding, OCR, chart understanding, video, medical imaging, and more | https://huggingface.co/spaces/ahmedheakl/CAMEL-Bench-leaderboard | |
Speech
| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| Open Universal Arabic ASR Leaderboard | Multi-dialect Arabic ASR | https://huggingface.co/spaces/elmresearchcenter/open_universal_arabic_asr_leaderboard | |
| Arabic TTS Benchmark | Arabic TTS Models | https://huggingface.co/spaces/silma-ai/arabic-tts-benchmark | |
| Open-source Arabic TTS Benchmark | Open-source Arabic TTS Models | https://huggingface.co/spaces/silma-ai/opensource-arabic-tts-benchmark | |
| Arabic TTS Arena | Arabic TTS Models | https://huggingface.co/spaces/Navid-AI/Arabic-TTS-Arena | |
Tokenizers
| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| Arabic Tokenizers Leaderboard | Tokenizer efficiency via fertility score | https://huggingface.co/spaces/MohamedRashad/arabic-tokenizers-leaderboard | |
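For reference, the fertility score used to compare tokenizers is the average number of subword tokens produced per word; lower means the tokenizer represents Arabic text more compactly. A minimal sketch, with a hypothetical fixed-width tokenizer standing in for a real tokenizer's `encode()`:

```python
# Fertility score sketch: average subword tokens per whitespace-separated word.
# `toy_tokenize` is a hypothetical stand-in for a real tokenizer.

def toy_tokenize(text: str) -> list[str]:
    # Hypothetical tokenizer: splits every word into chunks of up to 3 chars.
    tokens = []
    for word in text.split():
        tokens.extend(word[i:i + 3] for i in range(0, len(word), 3))
    return tokens

def fertility(texts: list[str], tokenize) -> float:
    total_tokens = sum(len(tokenize(t)) for t in texts)
    total_words = sum(len(t.split()) for t in texts)
    return total_tokens / total_words

corpus = ["مرحبا بالعالم", "النماذج اللغوية الكبيرة"]
score = fertility(corpus, toy_tokenize)  # 14 tokens over 5 words -> 2.8
```

Swapping `toy_tokenize` for a real tokenizer (e.g. one loaded via a tokenization library) over a large Arabic corpus yields the kind of number the leaderboard reports.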
Tashkeel
| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| Arabic Tashkeel Space | Arabic Tashkeel (diacritization) Models | https://huggingface.co/spaces/MohamedRashad/arabic-auto-tashkeel | |
Benchmarking datasets
Below is a non-comprehensive list of benchmarking datasets; it will grow over time.
Note: There are numerous research datasets available for benchmarking, but this list focuses on the most popular ones and the datasets commonly used in research papers to evaluate Arabic models.
General purpose
| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| Balsam Index | A broad range of Arabic language tasks | https://benchmarks.ksaa.gov.sa/b/balsam/tasks | Data quality issues |
RAG
| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| SILMA RAGQA v1.0 | 17 bilingual datasets in Arabic and English, spanning various domains | https://huggingface.co/datasets/silma-ai/silma-rag-qa-benchmark-v1.0 | |
OCR
| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| KITAB-Bench | Handwritten text, structured tables, and specialized coverage of 21 chart types for business intelligence | https://huggingface.co/collections/ahmedheakl/kitab-bench-677dd5d88d5db344d5595b78 | |
MMLU Arabic
| Name | What does it evaluate? | Link | Comments |
|---|---|---|---|
| Global MMLU | MMLU | https://huggingface.co/datasets/CohereForAI/Global-MMLU/viewer/ar | |
| Arabic MMLU | Multi-task language understanding for Arabic, sourced from school exams across diverse educational levels in countries spanning North Africa, the Levant, and the Gulf | https://huggingface.co/datasets/MBZUAI/ArabicMMLU?row=0 | |
Missing a benchmark?
If you believe that a benchmark or leaderboard is not included in the list, please leave a comment below so we can consider adding it.
