Update content.py
content.py (+2 -2)
@@ -12,8 +12,8 @@ Here you can compare models on tasks in the Czech language and/or submit your own model
 - See the **About** page for a brief description of our evaluation protocol & win score mechanism, citation information, and future directions for this benchmark.
 - __How scoring works__:
 - On each task, the __Duel Win Score__ reports the proportion of won duels.
-- Category scores are obtained by averaging across category tasks.
-
+- Category scores are obtained by averaging across category tasks. When selecting a category (other than Overall), the "Average" column shows Category Duel Win Scores.
+- __Overall__ Duel Win Scores are an average over category scores. When selecting the Overall category, the "Average" column shows the Overall Duel Win Score.
 - All public submissions are shared in the [CZLC/LLM_benchmark_data](https://huggingface.co/datasets/CZLC/LLM_benchmark_data) dataset.
 - On the submission page, __you can obtain results on the leaderboard without publishing them__.
 - The first step is "pre-submission"; after this is done (significance tests can take up to an hour), the results can be submitted if you'd like to.
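
To make the aggregation described in the diff concrete, here is a minimal Python sketch of how the averages compose. The category names, task counts, and duel counts are hypothetical, and this is a sketch of the stated scoring rules, not the leaderboard's actual code:

```python
# Per-task Duel Win Score: the proportion of duels a model won on that task.
def duel_win_score(duels_won: int, duels_total: int) -> float:
    return duels_won / duels_total

# Category score: plain average of the task scores within the category.
def category_score(task_scores: list[float]) -> float:
    return sum(task_scores) / len(task_scores)

# Overall Duel Win Score: average over the category scores, so each
# category weighs equally no matter how many tasks it contains.
def overall_score(category_scores: list[float]) -> float:
    return sum(category_scores) / len(category_scores)

# Hypothetical example: two categories with different task counts.
reading_tasks = [duel_win_score(70, 100), duel_win_score(55, 100)]  # [0.70, 0.55]
math_tasks = [duel_win_score(40, 100)]                              # [0.40]

category_scores = [category_score(reading_tasks), category_score(math_tasks)]  # [0.625, 0.40]
print(overall_score(category_scores))  # 0.5125 -> the "Average" column under Overall
```

Note the design consequence of averaging over categories rather than over all tasks: a single-task category moves the Overall score as much as a many-task category does.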