🔥 Starting benchmark for meta-llama_Llama-3.2-1B-Instruct Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 2 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/home/jaymin/Documents/llm/llm_models/meta-llama_Llama-3.2-1B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (2) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.3380|± |0.0150| |anli_r2 | 1|none | 0|acc |↑ | 0.3340|± |0.0149| |anli_r3 | 1|none | 0|acc |↑ | 0.3725|± |0.0140| |arc_challenge | 1|none | 0|acc |↑ | 0.3567|± |0.0140| | | |none | 0|acc_norm |↑ | 0.3805|± |0.0142| |bbh | 3|get-answer | |exact_match|↑ | 0.3781|± |0.0055| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.7600|± |0.0271| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.4225|± |0.0362| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.5320|± |0.0316| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.4760|± |0.0316| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0280|± |0.0105| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5200|± |0.0317| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.2640|± |0.0279| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.5760|± |0.0313| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2320|± |0.0268| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.1440|± |0.0222| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.3680|± |0.0306| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.4720|± |0.0316| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.2400|± |0.0271| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.6080|± |0.0309| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.5320|± |0.0316| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.4315|± |0.0411| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.2720|± |0.0282| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.4800|± |0.0317| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.2320|± |0.0268| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.4607|± |0.0375| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.7320|± |0.0281| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.2160|± |0.0261| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1840|± |0.0246| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.0960|± |0.0187| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.3080|± |0.0293| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.6040|± |0.0310| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0760|± |0.0168| |boolq | 2|none | 0|acc |↑ | 0.6948|± |0.0081| |drop | 3|none | 0|em |↑ | 0.0497|± 
|0.0022| | | |none | 0|f1 |↑ | 0.1635|± |0.0029| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1111|± |0.0224| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1010|± |0.0215| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2727|± |0.0317| | | |strict-match | 0|exact_match|↑ | 0.0051|± |0.0051| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2576|± |0.0312| | | |none | 0|acc_norm |↑ | 0.2576|± |0.0312| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2121|± |0.0291| | | |none | 0|acc_norm |↑ | 0.2121|± |0.0291| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1007|± |0.0129| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0934|± |0.0125| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2198|± |0.0177| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2656|± |0.0189| | | |none | 0|acc_norm |↑ | 0.2656|± |0.0189| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2802|± |0.0192| | | |none | 0|acc_norm |↑ | 0.2802|± |0.0192| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0982|± |0.0141| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0982|± |0.0141| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2321|± |0.0200| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2790|± |0.0212| | | |none | 0|acc_norm |↑ | 0.2790|± |0.0212| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2746|± |0.0211| | | |none | 0|acc_norm |↑ | 0.2746|± |0.0211| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.3374|± |0.0130| | | |strict-match | 5|exact_match|↑ | 0.3374|± |0.0130| |hellaswag | 1|none | 0|acc |↑ | 0.4512|± |0.0050| | | |none | 0|acc_norm |↑ | 0.6088|± |0.0049| |mmlu | 2|none | |acc |↑ | 0.4589|± |0.0041| | - humanities | 2|none | |acc |↑ | 0.4389|± |0.0071| | - formal_logic | 1|none | 0|acc |↑ | 0.3095|± |0.0413| | - high_school_european_history | 1|none | 0|acc |↑ | 0.6242|± |0.0378| | - high_school_us_history | 1|none | 0|acc |↑ | 0.5784|± |0.0347| | - high_school_world_history | 1|none | 0|acc |↑ | 0.6540|± |0.0310| | - international_law | 1|none | 0|acc |↑ | 0.5950|± |0.0448| | - jurisprudence | 1|none | 0|acc |↑ | 0.5185|± |0.0483| | - logical_fallacies | 1|none | 0|acc |↑ | 0.4540|± |0.0391| | - moral_disputes | 1|none | 0|acc |↑ | 0.4595|± |0.0268| | - moral_scenarios | 1|none | 0|acc |↑ | 0.3318|± |0.0157| | - philosophy | 1|none | 0|acc |↑ | 0.5177|± |0.0284| | - prehistory | 1|none | 0|acc |↑ | 0.5278|± |0.0278| | - professional_law | 1|none | 0|acc |↑ | 0.3651|± |0.0123| | - world_religions | 1|none | 0|acc |↑ | 0.5848|± |0.0378| | - other | 2|none | |acc |↑ | 0.5182|± |0.0088| | - business_ethics | 1|none | 0|acc |↑ | 0.4500|± |0.0500| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.4679|± |0.0307| | - college_medicine | 1|none | 0|acc |↑ | 0.3815|± |0.0370| | - global_facts | 1|none | 0|acc |↑ | 0.3200|± |0.0469| | - human_aging | 1|none | 0|acc |↑ | 0.5381|± |0.0335| | - management | 1|none | 0|acc |↑ | 0.5340|± |0.0494| | - marketing | 1|none | 0|acc |↑ | 0.6795|± |0.0306| | - 
medical_genetics | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - miscellaneous | 1|none | 0|acc |↑ | 0.6003|± |0.0175| | - nutrition | 1|none | 0|acc |↑ | 0.5588|± |0.0284| | - professional_accounting | 1|none | 0|acc |↑ | 0.3546|± |0.0285| | - professional_medicine | 1|none | 0|acc |↑ | 0.5588|± |0.0302| | - virology | 1|none | 0|acc |↑ | 0.4157|± |0.0384| | - social sciences | 2|none | |acc |↑ | 0.5080|± |0.0088| | - econometrics | 1|none | 0|acc |↑ | 0.2281|± |0.0395| | - high_school_geography | 1|none | 0|acc |↑ | 0.5556|± |0.0354| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.5181|± |0.0361| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.4103|± |0.0249| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.4538|± |0.0323| | - high_school_psychology | 1|none | 0|acc |↑ | 0.6294|± |0.0207| | - human_sexuality | 1|none | 0|acc |↑ | 0.5344|± |0.0437| | - professional_psychology | 1|none | 0|acc |↑ | 0.4265|± |0.0200| | - public_relations | 1|none | 0|acc |↑ | 0.4727|± |0.0478| | - security_studies | 1|none | 0|acc |↑ | 0.5388|± |0.0319| | - sociology | 1|none | 0|acc |↑ | 0.6468|± |0.0338| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.7100|± |0.0456| | - stem | 2|none | |acc |↑ | 0.3825|± |0.0085| | - abstract_algebra | 1|none | 0|acc |↑ | 0.2400|± |0.0429| | - anatomy | 1|none | 0|acc |↑ | 0.4815|± |0.0432| | - astronomy | 1|none | 0|acc |↑ | 0.5395|± |0.0406| | - college_biology | 1|none | 0|acc |↑ | 0.4931|± |0.0418| | - college_chemistry | 1|none | 0|acc |↑ | 0.3700|± |0.0485| | - college_computer_science | 1|none | 0|acc |↑ | 0.3600|± |0.0482| | - college_mathematics | 1|none | 0|acc |↑ | 0.2800|± |0.0451| | - college_physics | 1|none | 0|acc |↑ | 0.2549|± |0.0434| | - computer_security | 1|none | 0|acc |↑ | 0.4800|± |0.0502| | - conceptual_physics | 1|none | 0|acc |↑ | 0.4340|± |0.0324| | - electrical_engineering | 1|none | 0|acc |↑ | 0.5448|± |0.0415| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.2910|± |0.0234| | - high_school_biology | 1|none | 0|acc |↑ | 0.4968|± |0.0284| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.3547|± |0.0337| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.4600|± |0.0501| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.2630|± |0.0268| | - high_school_physics | 1|none | 0|acc |↑ | 0.2980|± |0.0373| | - high_school_statistics | 1|none | 0|acc |↑ | 0.3472|± |0.0325| | - machine_learning | 1|none | 0|acc |↑ | 0.3125|± |0.0440| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0565|± |0.0038| |openbookqa | 1|none | 0|acc |↑ | 0.2440|± |0.0192| | | |none | 0|acc_norm |↑ | 0.3460|± |0.0213| |piqa | 1|none | 0|acc |↑ | 0.7437|± |0.0102| | | |none | 0|acc_norm |↑ | 0.7421|± |0.0102| |qnli | 1|none | 0|acc |↑ | 0.4946|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9390|± |0.0076| | | |none | 0|acc_norm |↑ | 0.8970|± |0.0096| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.2499|± |0.0032| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3647|± |0.0169| | | |none | 0|bleu_diff |↑ |-2.3149|± |0.6149| | | |none | 0|bleu_max |↑ |17.3007|± |0.6267| | | |none | 0|rouge1_acc |↑ | 0.3550|± |0.0168| | | |none | 0|rouge1_diff|↑ |-4.7074|± |0.9203| | | |none | 0|rouge1_max |↑ |38.8035|± |0.8508| | | |none | 0|rouge2_acc |↑ | 0.2411|± |0.0150| | | |none | 0|rouge2_diff|↑ |-4.5701|± |0.9279| | | |none | 0|rouge2_max |↑ |22.7998|± |0.8999| | | |none | 0|rougeL_acc |↑ | 0.3672|± |0.0169| | | |none | 0|rougeL_diff|↑ |-4.7792|± |0.9300| | | |none | 0|rougeL_max |↑ |36.3004|± |0.8466| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2717|± |0.0156| 
|truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4383|± |0.0144| |winogrande | 1|none | 0|acc |↑ | 0.6014|± |0.0138| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.3781|± |0.0055| |mmlu | 2|none | |acc |↑ |0.4589|± |0.0041| | - humanities | 2|none | |acc |↑ |0.4389|± |0.0071| | - other | 2|none | |acc |↑ |0.5182|± |0.0088| | - social sciences| 2|none | |acc |↑ |0.5080|± |0.0088| | - stem | 2|none | |acc |↑ |0.3825|± |0.0085| meta-llama_Llama-3.2-1B-Instruct: 3h 31m 2s ✅ Benchmark completed for meta-llama_Llama-3.2-1B-Instruct 🔥 Starting benchmark for meta-llama_Llama-3.2-3B-Instruct Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 2 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/home/jaymin/Documents/llm/llm_models/meta-llama_Llama-3.2-3B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (2) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4470|± |0.0157| |anli_r2 | 1|none | 0|acc |↑ | 0.4180|± |0.0156| |anli_r3 | 1|none | 0|acc |↑ | 0.4308|± |0.0143| |arc_challenge | 1|none | 0|acc |↑ | 0.4326|± |0.0145| | | |none | 0|acc_norm |↑ | 0.4590|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.5564|± |0.0056| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8840|± |0.0203| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5134|± |0.0366| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.7080|± |0.0288| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5560|± |0.0315| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0600|± |0.0151| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.3800|± |0.0308| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3840|± |0.0308| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.7120|± |0.0287| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4640|± |0.0316| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.3880|± |0.0309| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.7880|± |0.0259| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.6520|± |0.0302| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.5640|± |0.0314| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.7720|± |0.0266| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.8280|± |0.0239| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.5822|± |0.0410| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.6560|± |0.0301| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.5640|± |0.0314| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.4480|± |0.0315| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5899|± |0.0370| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.8440|± |0.0230| | - 
bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.4960|± |0.0317| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.3600|± |0.0304| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.2040|± |0.0255| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.5440|± |0.0316| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.8480|± |0.0228| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.2440|± |0.0272| |boolq | 2|none | 0|acc |↑ | 0.7847|± |0.0072| |drop | 3|none | 0|em |↑ | 0.0259|± |0.0016| | | |none | 0|f1 |↑ | 0.1554|± |0.0025| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0960|± |0.0210| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1162|± |0.0228| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2172|± |0.0294| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3030|± |0.0327| | | |none | 0|acc_norm |↑ | 0.3030|± |0.0327| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2828|± |0.0321| | | |none | 0|acc_norm |↑ | 0.2828|± |0.0321| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1136|± |0.0136| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1374|± |0.0147| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2326|± |0.0181| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2802|± |0.0192| | | |none | 0|acc_norm |↑ | 0.2802|± |0.0192| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3407|± |0.0203| | | |none | 0|acc_norm |↑ | 0.3407|± |0.0203| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1317|± |0.0160| | | |strict-match | 0|exact_match|↑ | 0.0045|± |0.0032| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1183|± |0.0153| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2344|± |0.0200| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3147|± |0.0220| | | |none | 0|acc_norm |↑ | 0.3147|± |0.0220| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3281|± |0.0222| | | |none | 0|acc_norm |↑ | 0.3281|± |0.0222| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.6497|± |0.0131| | | |strict-match | 5|exact_match|↑ | 0.6422|± |0.0132| |hellaswag | 1|none | 0|acc |↑ | 0.5225|± |0.0050| | | |none | 0|acc_norm |↑ | 0.7054|± |0.0045| |mmlu | 2|none | |acc |↑ | 0.6052|± |0.0040| | - humanities | 2|none | |acc |↑ | 0.5949|± |0.0070| | - formal_logic | 1|none | 0|acc |↑ | 0.3968|± |0.0438| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7333|± |0.0345| | - high_school_us_history | 1|none | 0|acc |↑ | 0.7500|± |0.0304| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7975|± |0.0262| | - international_law | 1|none | 0|acc |↑ | 0.7355|± |0.0403| | - jurisprudence | 1|none | 0|acc |↑ | 0.6204|± |0.0469| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7239|± |0.0351| | - moral_disputes | 1|none | 0|acc |↑ | 0.6503|± |0.0257| | - moral_scenarios | 1|none | 0|acc |↑ | 0.5955|± |0.0164| | - philosophy | 
1|none | 0|acc |↑ | 0.6559|± |0.0270| | - prehistory | 1|none | 0|acc |↑ | 0.6512|± |0.0265| | - professional_law | 1|none | 0|acc |↑ | 0.4622|± |0.0127| | - world_religions | 1|none | 0|acc |↑ | 0.7602|± |0.0327| | - other | 2|none | |acc |↑ | 0.6598|± |0.0082| | - business_ethics | 1|none | 0|acc |↑ | 0.5600|± |0.0499| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.6226|± |0.0298| | - college_medicine | 1|none | 0|acc |↑ | 0.5896|± |0.0375| | - global_facts | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - human_aging | 1|none | 0|acc |↑ | 0.5830|± |0.0331| | - management | 1|none | 0|acc |↑ | 0.7670|± |0.0419| | - marketing | 1|none | 0|acc |↑ | 0.8761|± |0.0216| | - medical_genetics | 1|none | 0|acc |↑ | 0.7400|± |0.0441| | - miscellaneous | 1|none | 0|acc |↑ | 0.7535|± |0.0154| | - nutrition | 1|none | 0|acc |↑ | 0.6634|± |0.0271| | - professional_accounting | 1|none | 0|acc |↑ | 0.4752|± |0.0298| | - professional_medicine | 1|none | 0|acc |↑ | 0.7463|± |0.0264| | - virology | 1|none | 0|acc |↑ | 0.4518|± |0.0387| | - social sciences | 2|none | |acc |↑ | 0.6675|± |0.0083| | - econometrics | 1|none | 0|acc |↑ | 0.3947|± |0.0460| | - high_school_geography | 1|none | 0|acc |↑ | 0.7273|± |0.0317| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.7513|± |0.0312| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.5590|± |0.0252| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.6218|± |0.0315| | - high_school_psychology | 1|none | 0|acc |↑ | 0.7651|± |0.0182| | - human_sexuality | 1|none | 0|acc |↑ | 0.6794|± |0.0409| | - professional_psychology | 1|none | 0|acc |↑ | 0.6111|± |0.0197| | - public_relations | 1|none | 0|acc |↑ | 0.6091|± |0.0467| | - security_studies | 1|none | 0|acc |↑ | 0.6612|± |0.0303| | - sociology | 1|none | 0|acc |↑ | 0.8109|± |0.0277| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8200|± |0.0386| | - stem | 2|none | |acc |↑ | 0.5059|± |0.0086| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3100|± |0.0465| | - anatomy | 1|none | 0|acc |↑ | 0.6000|± |0.0423| | - astronomy | 1|none | 0|acc |↑ | 0.6776|± |0.0380| | - college_biology | 1|none | 0|acc |↑ | 0.7083|± |0.0380| | - college_chemistry | 1|none | 0|acc |↑ | 0.3600|± |0.0482| | - college_computer_science | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - college_mathematics | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - college_physics | 1|none | 0|acc |↑ | 0.3529|± |0.0476| | - computer_security | 1|none | 0|acc |↑ | 0.7000|± |0.0461| | - conceptual_physics | 1|none | 0|acc |↑ | 0.5106|± |0.0327| | - electrical_engineering | 1|none | 0|acc |↑ | 0.5793|± |0.0411| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.4127|± |0.0254| | - high_school_biology | 1|none | 0|acc |↑ | 0.7065|± |0.0259| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.5369|± |0.0351| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.6100|± |0.0490| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.3667|± |0.0294| | - high_school_physics | 1|none | 0|acc |↑ | 0.4040|± |0.0401| | - high_school_statistics | 1|none | 0|acc |↑ | 0.4167|± |0.0336| | - machine_learning | 1|none | 0|acc |↑ | 0.5000|± |0.0475| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1391|± |0.0058| |openbookqa | 1|none | 0|acc |↑ | 0.2740|± |0.0200| | | |none | 0|acc_norm |↑ | 0.3580|± |0.0215| |piqa | 1|none | 0|acc |↑ | 0.7552|± |0.0100| | | |none | 0|acc_norm |↑ | 0.7552|± |0.0100| |qnli | 1|none | 0|acc |↑ | 0.5451|± |0.0067| |sciq | 1|none | 0|acc |↑ | 0.9520|± |0.0068| | | |none | 0|acc_norm |↑ | 0.9320|± |0.0080| |triviaqa | 3|remove_whitespace| 
0|exact_match|↑ | 0.3389|± |0.0035| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.5581|± |0.0174| | | |none | 0|bleu_diff |↑ |13.1234|± |1.1545| | | |none | 0|bleu_max |↑ |35.5541|± |0.8709| | | |none | 0|rouge1_acc |↑ | 0.5508|± |0.0174| | | |none | 0|rouge1_diff|↑ |18.8892|± |1.6130| | | |none | 0|rouge1_max |↑ |60.6706|± |0.9979| | | |none | 0|rouge2_acc |↑ | 0.5067|± |0.0175| | | |none | 0|rouge2_diff|↑ |19.4222|± |1.7185| | | |none | 0|rouge2_max |↑ |47.9947|± |1.2171| | | |none | 0|rougeL_acc |↑ | 0.5361|± |0.0175| | | |none | 0|rougeL_diff|↑ |18.4665|± |1.6326| | | |none | 0|rougeL_max |↑ |58.5696|± |1.0349| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3268|± |0.0164| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4976|± |0.0148| |winogrande | 1|none | 0|acc |↑ | 0.6709|± |0.0132| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.5564|± |0.0056| |mmlu | 2|none | |acc |↑ |0.6052|± |0.0040| | - humanities | 2|none | |acc |↑ |0.5949|± |0.0070| | - other | 2|none | |acc |↑ |0.6598|± |0.0082| | - social sciences| 2|none | |acc |↑ |0.6675|± |0.0083| | - stem | 2|none | |acc |↑ |0.5059|± |0.0086| meta-llama_Llama-3.2-3B-Instruct: 7h 12m 29s ✅ Benchmark completed for meta-llama_Llama-3.2-3B-Instruct 🔥 Starting benchmark for meta-llama_Llama-3.1-8B-Instruct Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 1 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/home/jaymin/Documents/llm/llm_models/meta-llama_Llama-3.1-8B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4820|± |0.0158| |anli_r2 | 1|none | 0|acc |↑ | 0.4670|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.4433|± |0.0143| |arc_challenge | 1|none | 0|acc |↑ | 0.5179|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5503|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.7156|± |0.0051| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9160|± |0.0176| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5775|± |0.0362| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.8000|± |0.0253| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.6160|± |0.0308| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.1200|± |0.0206| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.4920|± |0.0317| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.5160|± |0.0317| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.8480|± |0.0228| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.5400|± |0.0316| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.4280|± |0.0314| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.8400|± |0.0232| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7800|± |0.0263| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.6680|± |0.0298| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.9080|± |0.0183| 
| - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.8760|± |0.0209| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.7945|± |0.0336| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.7600|± |0.0271| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.7680|± |0.0268| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5720|± |0.0314| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.6517|± |0.0358| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9400|± |0.0151| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.8840|± |0.0203| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.8280|± |0.0239| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.8000|± |0.0253| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.8400|± |0.0232| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.5360|± |0.0316| |boolq | 2|none | 0|acc |↑ | 0.8416|± |0.0064| |drop | 3|none | 0|em |↑ | 0.0448|± |0.0021| | | |none | 0|f1 |↑ | 0.1937|± |0.0028| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1010|± |0.0215| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1313|± |0.0241| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3182|± |0.0332| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3535|± |0.0341| | | |none | 0|acc_norm |↑ | 0.3535|± |0.0341| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3232|± |0.0333| | | |none | 0|acc_norm |↑ | 0.3232|± |0.0333| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1374|± |0.0147| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1703|± |0.0161| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2894|± |0.0194| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3150|± |0.0199| | | |none | 0|acc_norm |↑ | 0.3150|± |0.0199| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3132|± |0.0199| | | |none | 0|acc_norm |↑ | 0.3132|± |0.0199| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1362|± |0.0162| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1272|± |0.0158| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2723|± |0.0211| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3393|± |0.0224| | | |none | 0|acc_norm |↑ | 0.3393|± |0.0224| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3438|± |0.0225| | | |none | 0|acc_norm |↑ | 0.3438|± |0.0225| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.7779|± |0.0114| | | |strict-match | 5|exact_match|↑ | 0.7544|± |0.0119| |hellaswag | 1|none | 0|acc |↑ | 0.5909|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7921|± |0.0040| |mmlu | 2|none | |acc |↑ 
| 0.6793|± |0.0038| | - humanities | 2|none | |acc |↑ | 0.6427|± |0.0067| | - formal_logic | 1|none | 0|acc |↑ | 0.4762|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7636|± |0.0332| | - high_school_us_history | 1|none | 0|acc |↑ | 0.8431|± |0.0255| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8565|± |0.0228| | - international_law | 1|none | 0|acc |↑ | 0.8182|± |0.0352| | - jurisprudence | 1|none | 0|acc |↑ | 0.7778|± |0.0402| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7914|± |0.0319| | - moral_disputes | 1|none | 0|acc |↑ | 0.7457|± |0.0234| | - moral_scenarios | 1|none | 0|acc |↑ | 0.5721|± |0.0165| | - philosophy | 1|none | 0|acc |↑ | 0.7203|± |0.0255| | - prehistory | 1|none | 0|acc |↑ | 0.7438|± |0.0243| | - professional_law | 1|none | 0|acc |↑ | 0.5039|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.8363|± |0.0284| | - other | 2|none | |acc |↑ | 0.7422|± |0.0075| | - business_ethics | 1|none | 0|acc |↑ | 0.6800|± |0.0469| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7849|± |0.0253| | - college_medicine | 1|none | 0|acc |↑ | 0.6936|± |0.0351| | - global_facts | 1|none | 0|acc |↑ | 0.3800|± |0.0488| | - human_aging | 1|none | 0|acc |↑ | 0.7040|± |0.0306| | - management | 1|none | 0|acc |↑ | 0.8155|± |0.0384| | - marketing | 1|none | 0|acc |↑ | 0.8932|± |0.0202| | - medical_genetics | 1|none | 0|acc |↑ | 0.7800|± |0.0416| | - miscellaneous | 1|none | 0|acc |↑ | 0.8404|± |0.0131| | - nutrition | 1|none | 0|acc |↑ | 0.7549|± |0.0246| | - professional_accounting | 1|none | 0|acc |↑ | 0.5532|± |0.0297| | - professional_medicine | 1|none | 0|acc |↑ | 0.7831|± |0.0250| | - virology | 1|none | 0|acc |↑ | 0.5181|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.7689|± |0.0074| | - econometrics | 1|none | 0|acc |↑ | 0.5000|± |0.0470| | - high_school_geography | 1|none | 0|acc |↑ | 0.7929|± |0.0289| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8756|± |0.0238| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.6795|± |0.0237| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.7941|± |0.0263| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8587|± |0.0149| | - human_sexuality | 1|none | 0|acc |↑ | 0.7863|± |0.0360| | - professional_psychology | 1|none | 0|acc |↑ | 0.7173|± |0.0182| | - public_relations | 1|none | 0|acc |↑ | 0.6818|± |0.0446| | - security_studies | 1|none | 0|acc |↑ | 0.7510|± |0.0277| | - sociology | 1|none | 0|acc |↑ | 0.8607|± |0.0245| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8700|± |0.0338| | - stem | 2|none | |acc |↑ | 0.5845|± |0.0084| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - anatomy | 1|none | 0|acc |↑ | 0.6889|± |0.0400| | - astronomy | 1|none | 0|acc |↑ | 0.7566|± |0.0349| | - college_biology | 1|none | 0|acc |↑ | 0.8125|± |0.0326| | - college_chemistry | 1|none | 0|acc |↑ | 0.4600|± |0.0501| | - college_computer_science | 1|none | 0|acc |↑ | 0.5800|± |0.0496| | - college_mathematics | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - college_physics | 1|none | 0|acc |↑ | 0.4314|± |0.0493| | - computer_security | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - conceptual_physics | 1|none | 0|acc |↑ | 0.6000|± |0.0320| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6552|± |0.0396| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.4868|± |0.0257| | - high_school_biology | 1|none | 0|acc |↑ | 0.8065|± |0.0225| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.6404|± |0.0338| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - high_school_mathematics | 1|none 
| 0|acc |↑ | 0.4148|± |0.0300| | - high_school_physics | 1|none | 0|acc |↑ | 0.4636|± |0.0407| | - high_school_statistics | 1|none | 0|acc |↑ | 0.5463|± |0.0340| | - machine_learning | 1|none | 0|acc |↑ | 0.4643|± |0.0473| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1776|± |0.0064| |openbookqa | 1|none | 0|acc |↑ | 0.3360|± |0.0211| | | |none | 0|acc_norm |↑ | 0.4320|± |0.0222| |piqa | 1|none | 0|acc |↑ | 0.8009|± |0.0093| | | |none | 0|acc_norm |↑ | 0.8063|± |0.0092| |qnli | 1|none | 0|acc |↑ | 0.5014|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9670|± |0.0057| | | |none | 0|acc_norm |↑ | 0.9620|± |0.0060| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.5182|± |0.0037| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.6255|± |0.0169| | | |none | 0|bleu_diff |↑ |15.3392|± |1.0952| | | |none | 0|bleu_max |↑ |36.1393|± |0.8796| | | |none | 0|rouge1_acc |↑ | 0.6083|± |0.0171| | | |none | 0|rouge1_diff|↑ |21.4366|± |1.5980| | | |none | 0|rouge1_max |↑ |60.7499|± |0.9981| | | |none | 0|rouge2_acc |↑ | 0.5606|± |0.0174| | | |none | 0|rouge2_diff|↑ |23.0331|± |1.6393| | | |none | 0|rouge2_max |↑ |48.3161|± |1.2088| | | |none | 0|rougeL_acc |↑ | 0.6083|± |0.0171| | | |none | 0|rougeL_diff|↑ |21.4950|± |1.6133| | | |none | 0|rougeL_max |↑ |58.8871|± |1.0298| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3660|± |0.0169| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5412|± |0.0150| |winogrande | 1|none | 0|acc |↑ | 0.7388|± |0.0123| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.7156|± |0.0051| |mmlu | 2|none | |acc |↑ |0.6793|± |0.0038| | - humanities | 2|none | |acc |↑ |0.6427|± |0.0067| | - other | 2|none | |acc |↑ |0.7422|± |0.0075| | - social sciences| 2|none | |acc |↑ |0.7689|± |0.0074| | - stem | 2|none | |acc |↑ |0.5845|± |0.0084| meta-llama_Llama-3.1-8B-Instruct: 12h 19m 31s ✅ Benchmark completed for meta-llama_Llama-3.1-8B-Instruct 🔥 Starting benchmark for meta-llama_Llama-2-7b-chat-hf Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 4 Passed argument batch_size = auto. 
Detecting largest batch size Determined Largest batch size: 4 hf (pretrained=/home/jaymin/Documents/llm/llm_models/meta-llama_Llama-2-7b-chat-hf), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (4) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4170|± |0.0156| |anli_r2 | 1|none | 0|acc |↑ | 0.4100|± |0.0156| |anli_r3 | 1|none | 0|acc |↑ | 0.4075|± |0.0142| |arc_challenge | 1|none | 0|acc |↑ | 0.4411|± |0.0145| | | |none | 0|acc_norm |↑ | 0.4428|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.4013|± |0.0055| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.6520|± |0.0302| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5455|± |0.0365| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.5680|± |0.0314| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.3920|± |0.0309| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0360|± |0.0118| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5280|± |0.0316| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3280|± |0.0298| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.5440|± |0.0316| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.3560|± |0.0303| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.1960|± |0.0252| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.6080|± |0.0309| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.6800|± |0.0296| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.0200|± |0.0089| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.6040|± |0.0310| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.4960|± |0.0317| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.3288|± |0.0390| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.3840|± |0.0308| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.4720|± |0.0316| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.4320|± |0.0314| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5056|± |0.0376| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9040|± |0.0187| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.1160|± |0.0203| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1680|± |0.0237| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1560|± |0.0230| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.3040|± |0.0292| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.4840|± |0.0317| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0640|± |0.0155| |boolq | 2|none | 0|acc |↑ | 0.7979|± |0.0070| |drop | 3|none | 0|em |↑ | 0.0358|± |0.0019| | | |none | 0|f1 |↑ | 0.1175|± |0.0025| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1717|± |0.0269| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| 
|gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1717|± |0.0269| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2980|± |0.0326| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2929|± |0.0324| | | |none | 0|acc_norm |↑ | 0.2929|± |0.0324| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2525|± |0.0310| | | |none | 0|acc_norm |↑ | 0.2525|± |0.0310| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2216|± |0.0178| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2106|± |0.0175| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2546|± |0.0187| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2637|± |0.0189| | | |none | 0|acc_norm |↑ | 0.2637|± |0.0189| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2692|± |0.0190| | | |none | 0|acc_norm |↑ | 0.2692|± |0.0190| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1629|± |0.0175| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1786|± |0.0181| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2522|± |0.0205| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2946|± |0.0216| | | |none | 0|acc_norm |↑ | 0.2946|± |0.0216| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2612|± |0.0208| | | |none | 0|acc_norm |↑ | 0.2612|± |0.0208| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.2320|± |0.0116| | | |strict-match | 5|exact_match|↑ | 0.2320|± |0.0116| |hellaswag | 1|none | 0|acc |↑ | 0.5779|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7548|± |0.0043| |mmlu | 2|none | |acc |↑ | 0.4636|± |0.0040| | - humanities | 2|none | |acc |↑ | 0.4332|± |0.0069| | - formal_logic | 1|none | 0|acc |↑ | 0.2381|± |0.0381| | - high_school_european_history | 1|none | 0|acc |↑ | 0.5818|± |0.0385| | - high_school_us_history | 1|none | 0|acc |↑ | 0.6618|± |0.0332| | - high_school_world_history | 1|none | 0|acc |↑ | 0.6203|± |0.0316| | - international_law | 1|none | 0|acc |↑ | 0.5950|± |0.0448| | - jurisprudence | 1|none | 0|acc |↑ | 0.5741|± |0.0478| | - logical_fallacies | 1|none | 0|acc |↑ | 0.5767|± |0.0388| | - moral_disputes | 1|none | 0|acc |↑ | 0.5058|± |0.0269| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2425|± |0.0143| | - philosophy | 1|none | 0|acc |↑ | 0.5273|± |0.0284| | - prehistory | 1|none | 0|acc |↑ | 0.5463|± |0.0277| | - professional_law | 1|none | 0|acc |↑ | 0.3592|± |0.0123| | - world_religions | 1|none | 0|acc |↑ | 0.6901|± |0.0355| | - other | 2|none | |acc |↑ | 0.5488|± |0.0086| | - business_ethics | 1|none | 0|acc |↑ | 0.4500|± |0.0500| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.5509|± |0.0306| | - college_medicine | 1|none | 0|acc |↑ | 0.3815|± |0.0370| | - global_facts | 1|none | 0|acc |↑ | 0.4000|± |0.0492| | - human_aging | 1|none | 0|acc |↑ | 0.5830|± |0.0331| | - management | 1|none | 0|acc |↑ | 0.6796|± |0.0462| | - marketing | 1|none | 0|acc |↑ | 0.7564|± |0.0281| | - medical_genetics | 1|none | 0|acc |↑ | 0.4800|± |0.0502| | - miscellaneous | 1|none | 0|acc |↑ | 0.6897|± |0.0165| | - nutrition | 1|none | 0|acc |↑ | 0.4902|± |0.0286| | - 
professional_accounting | 1|none | 0|acc |↑ | 0.3652|± |0.0287| | - professional_medicine | 1|none | 0|acc |↑ | 0.4154|± |0.0299| | - virology | 1|none | 0|acc |↑ | 0.4639|± |0.0388| | - social sciences | 2|none | |acc |↑ | 0.5304|± |0.0087| | - econometrics | 1|none | 0|acc |↑ | 0.2982|± |0.0430| | - high_school_geography | 1|none | 0|acc |↑ | 0.5909|± |0.0350| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.6839|± |0.0336| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.4000|± |0.0248| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.3613|± |0.0312| | - high_school_psychology | 1|none | 0|acc |↑ | 0.6349|± |0.0206| | - human_sexuality | 1|none | 0|acc |↑ | 0.5649|± |0.0435| | - professional_psychology | 1|none | 0|acc |↑ | 0.4673|± |0.0202| | - public_relations | 1|none | 0|acc |↑ | 0.5364|± |0.0478| | - security_studies | 1|none | 0|acc |↑ | 0.4980|± |0.0320| | - sociology | 1|none | 0|acc |↑ | 0.7413|± |0.0310| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.7100|± |0.0456| | - stem | 2|none | |acc |↑ | 0.3600|± |0.0084| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3000|± |0.0461| | - anatomy | 1|none | 0|acc |↑ | 0.4444|± |0.0429| | - astronomy | 1|none | 0|acc |↑ | 0.4934|± |0.0407| | - college_biology | 1|none | 0|acc |↑ | 0.4514|± |0.0416| | - college_chemistry | 1|none | 0|acc |↑ | 0.2500|± |0.0435| | - college_computer_science | 1|none | 0|acc |↑ | 0.3200|± |0.0469| | - college_mathematics | 1|none | 0|acc |↑ | 0.3100|± |0.0465| | - college_physics | 1|none | 0|acc |↑ | 0.1961|± |0.0395| | - computer_security | 1|none | 0|acc |↑ | 0.6000|± |0.0492| | - conceptual_physics | 1|none | 0|acc |↑ | 0.4000|± |0.0320| | - electrical_engineering | 1|none | 0|acc |↑ | 0.4483|± |0.0414| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.2751|± |0.0230| | - high_school_biology | 1|none | 0|acc |↑ | 0.4935|± |0.0284| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.3350|± |0.0332| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.4000|± |0.0492| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.2704|± |0.0271| | - high_school_physics | 1|none | 0|acc |↑ | 0.2781|± |0.0366| | - high_school_statistics | 1|none | 0|acc |↑ | 0.2685|± |0.0302| | - machine_learning | 1|none | 0|acc |↑ | 0.3571|± |0.0455| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0668|± |0.0042| |openbookqa | 1|none | 0|acc |↑ | 0.3340|± |0.0211| | | |none | 0|acc_norm |↑ | 0.4380|± |0.0222| |piqa | 1|none | 0|acc |↑ | 0.7644|± |0.0099| | | |none | 0|acc_norm |↑ | 0.7715|± |0.0098| |qnli | 1|none | 0|acc |↑ | 0.5801|± |0.0067| |sciq | 1|none | 0|acc |↑ | 0.9400|± |0.0075| | | |none | 0|acc_norm |↑ | 0.8780|± |0.0104| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.1904|± |0.0029| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4517|± |0.0174| | | |none | 0|bleu_diff |↑ |-1.6714|± |0.6136| | | |none | 0|bleu_max |↑ |20.5268|± |0.7001| | | |none | 0|rouge1_acc |↑ | 0.4468|± |0.0174| | | |none | 0|rouge1_diff|↑ |-1.6626|± |0.7580| | | |none | 0|rouge1_max |↑ |45.4458|± |0.8003| | | |none | 0|rouge2_acc |↑ | 0.3807|± |0.0170| | | |none | 0|rouge2_diff|↑ |-3.1513|± |0.8822| | | |none | 0|rouge2_max |↑ |30.2564|± |0.8906| | | |none | 0|rougeL_acc |↑ | 0.4480|± |0.0174| | | |none | 0|rougeL_diff|↑ |-1.9429|± |0.7563| | | |none | 0|rougeL_max |↑ |42.1653|± |0.8032| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3023|± |0.0161| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4532|± |0.0156| |winogrande | 1|none | 0|acc |↑ | 0.6646|± |0.0133| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| 
|------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.4013|± |0.0055| |mmlu | 2|none | |acc |↑ |0.4636|± |0.0040| | - humanities | 2|none | |acc |↑ |0.4332|± |0.0069| | - other | 2|none | |acc |↑ |0.5488|± |0.0086| | - social sciences| 2|none | |acc |↑ |0.5304|± |0.0087| | - stem | 2|none | |acc |↑ |0.3600|± |0.0084| meta-llama_Llama-2-7b-chat-hf: 6h 58m 9s ✅ Benchmark completed for meta-llama_Llama-2-7b-chat-hf 🔥 Starting benchmark for mistralai_Mistral-Nemo-Instruct-2407 Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 1 mistralai_Mistral-Nemo-Instruct-2407: 0h 7m 53s ✅ Benchmark completed for mistralai_Mistral-Nemo-Instruct-2407 🔥 Starting benchmark for mistralai_Ministral-8B-Instruct-2410 Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 1 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/home/jaymin/Documents/llm/llm_models/mistralai_Ministral-8B-Instruct-2410), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4880|± |0.0158| |anli_r2 | 1|none | 0|acc |↑ | 0.4870|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.4658|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.5452|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5623|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.6925|± |0.0051| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8920|± |0.0197| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.6096|± |0.0358| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.8200|± |0.0243| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.6000|± |0.0310| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0560|± |0.0146| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5400|± |0.0316| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.6240|± |0.0307| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.9600|± |0.0124| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.5800|± |0.0313| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.4280|± |0.0314| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.9080|± |0.0183| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7200|± |0.0285| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.6360|± |0.0305| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8160|± |0.0246| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.9000|± |0.0190| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.7260|± |0.0370| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.8000|± |0.0253| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.6400|± |0.0304| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.4800|± |0.0317| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.6517|± |0.0358| | - 
bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9240|± |0.0168| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.5760|± |0.0313| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.8000|± |0.0253| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.7720|± |0.0266| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.8000|± |0.0253| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.4200|± |0.0313| |boolq | 2|none | 0|acc |↑ | 0.8602|± |0.0061| |drop | 3|none | 0|em |↑ | 0.0229|± |0.0015| | | |none | 0|f1 |↑ | 0.0714|± |0.0021| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1566|± |0.0259| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2121|± |0.0291| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2828|± |0.0321| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3081|± |0.0329| | | |none | 0|acc_norm |↑ | 0.3081|± |0.0329| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3384|± |0.0337| | | |none | 0|acc_norm |↑ | 0.3384|± |0.0337| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1978|± |0.0171| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2546|± |0.0187| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3095|± |0.0198| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3645|± |0.0206| | | |none | 0|acc_norm |↑ | 0.3645|± |0.0206| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2985|± |0.0196| | | |none | 0|acc_norm |↑ | 0.2985|± |0.0196| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2299|± |0.0199| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2254|± |0.0198| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3058|± |0.0218| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3125|± |0.0219| | | |none | 0|acc_norm |↑ | 0.3125|± |0.0219| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3415|± |0.0224| | | |none | 0|acc_norm |↑ | 0.3415|± |0.0224| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.7786|± |0.0114| | | |strict-match | 5|exact_match|↑ | 0.7748|± |0.0115| |hellaswag | 1|none | 0|acc |↑ | 0.5959|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7911|± |0.0041| |mmlu | 2|none | |acc |↑ | 0.6407|± |0.0038| | - humanities | 2|none | |acc |↑ | 0.5792|± |0.0068| | - formal_logic | 1|none | 0|acc |↑ | 0.4365|± |0.0444| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7515|± |0.0337| | - high_school_us_history | 1|none | 0|acc |↑ | 0.8480|± |0.0252| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8186|± |0.0251| | - international_law | 1|none | 0|acc |↑ | 0.7851|± |0.0375| | - jurisprudence | 1|none | 0|acc |↑ | 0.7870|± |0.0396| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7546|± |0.0338| | - moral_disputes | 1|none | 0|acc |↑ | 
0.6792|± |0.0251| | - moral_scenarios | 1|none | 0|acc |↑ | 0.3453|± |0.0159| | - philosophy | 1|none | 0|acc |↑ | 0.7042|± |0.0259| | - prehistory | 1|none | 0|acc |↑ | 0.6821|± |0.0259| | - professional_law | 1|none | 0|acc |↑ | 0.4922|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.8012|± |0.0306| | - other | 2|none | |acc |↑ | 0.7123|± |0.0079| | - business_ethics | 1|none | 0|acc |↑ | 0.6200|± |0.0488| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.6792|± |0.0287| | - college_medicine | 1|none | 0|acc |↑ | 0.6590|± |0.0361| | - global_facts | 1|none | 0|acc |↑ | 0.4400|± |0.0499| | - human_aging | 1|none | 0|acc |↑ | 0.6906|± |0.0310| | - management | 1|none | 0|acc |↑ | 0.7767|± |0.0412| | - marketing | 1|none | 0|acc |↑ | 0.8675|± |0.0222| | - medical_genetics | 1|none | 0|acc |↑ | 0.7100|± |0.0456| | - miscellaneous | 1|none | 0|acc |↑ | 0.8186|± |0.0138| | - nutrition | 1|none | 0|acc |↑ | 0.7549|± |0.0246| | - professional_accounting | 1|none | 0|acc |↑ | 0.5142|± |0.0298| | - professional_medicine | 1|none | 0|acc |↑ | 0.7206|± |0.0273| | - virology | 1|none | 0|acc |↑ | 0.5542|± |0.0387| | - social sciences | 2|none | |acc |↑ | 0.7439|± |0.0077| | - econometrics | 1|none | 0|acc |↑ | 0.4386|± |0.0467| | - high_school_geography | 1|none | 0|acc |↑ | 0.7980|± |0.0286| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8860|± |0.0229| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.6641|± |0.0239| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.7017|± |0.0297| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8532|± |0.0152| | - human_sexuality | 1|none | 0|acc |↑ | 0.7481|± |0.0381| | - professional_psychology | 1|none | 0|acc |↑ | 0.6667|± |0.0191| | - public_relations | 1|none | 0|acc |↑ | 0.6818|± |0.0446| | - security_studies | 1|none | 0|acc |↑ | 0.7469|± |0.0278| | - sociology | 1|none | 0|acc |↑ | 0.8358|± |0.0262| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8700|± |0.0338| | - stem | 2|none | |acc |↑ | 0.5614|± |0.0085| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3500|± |0.0479| | - anatomy | 1|none | 0|acc |↑ | 0.6741|± |0.0405| | - astronomy | 1|none | 0|acc |↑ | 0.7237|± |0.0364| | - college_biology | 1|none | 0|acc |↑ | 0.7708|± |0.0351| | - college_chemistry | 1|none | 0|acc |↑ | 0.4600|± |0.0501| | - college_computer_science | 1|none | 0|acc |↑ | 0.5900|± |0.0494| | - college_mathematics | 1|none | 0|acc |↑ | 0.3700|± |0.0485| | - college_physics | 1|none | 0|acc |↑ | 0.3725|± |0.0481| | - computer_security | 1|none | 0|acc |↑ | 0.7000|± |0.0461| | - conceptual_physics | 1|none | 0|acc |↑ | 0.5745|± |0.0323| | - electrical_engineering | 1|none | 0|acc |↑ | 0.5517|± |0.0414| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.4577|± |0.0257| | - high_school_biology | 1|none | 0|acc |↑ | 0.7968|± |0.0229| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.5320|± |0.0351| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.7200|± |0.0451| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.3889|± |0.0297| | - high_school_physics | 1|none | 0|acc |↑ | 0.4371|± |0.0405| | - high_school_statistics | 1|none | 0|acc |↑ | 0.5972|± |0.0334| | - machine_learning | 1|none | 0|acc |↑ | 0.5179|± |0.0474| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1576|± |0.0061| |openbookqa | 1|none | 0|acc |↑ | 0.3640|± |0.0215| | | |none | 0|acc_norm |↑ | 0.4660|± |0.0223| |piqa | 1|none | 0|acc |↑ | 0.8096|± |0.0092| | | |none | 0|acc_norm |↑ | 0.8232|± |0.0089| |qnli | 1|none | 0|acc |↑ | 0.4950|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9680|± 
|0.0056| | | |none | 0|acc_norm |↑ | 0.9560|± |0.0065| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.5278|± |0.0037| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.0734|± |0.0091| | | |none | 0|bleu_diff |↑ |-1.3264|± |0.3132| | | |none | 0|bleu_max |↑ | 4.1401|± |0.4048| | | |none | 0|rouge1_acc |↑ | 0.0759|± |0.0093| | | |none | 0|rouge1_diff|↑ |-2.7245|± |0.4932| | | |none | 0|rouge1_max |↑ | 9.7318|± |0.7309| | | |none | 0|rouge2_acc |↑ | 0.0661|± |0.0087| | | |none | 0|rouge2_diff|↑ |-2.5882|± |0.5097| | | |none | 0|rouge2_max |↑ | 6.3193|± |0.5845| | | |none | 0|rougeL_acc |↑ | 0.0722|± |0.0091| | | |none | 0|rougeL_diff|↑ |-2.8073|± |0.4921| | | |none | 0|rougeL_max |↑ | 9.1139|± |0.6993| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3256|± |0.0164| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4867|± |0.0147| |winogrande | 1|none | 0|acc |↑ | 0.7380|± |0.0124| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.6925|± |0.0051| |mmlu | 2|none | |acc |↑ |0.6407|± |0.0038| | - humanities | 2|none | |acc |↑ |0.5792|± |0.0068| | - other | 2|none | |acc |↑ |0.7123|± |0.0079| | - social sciences| 2|none | |acc |↑ |0.7439|± |0.0077| | - stem | 2|none | |acc |↑ |0.5614|± |0.0085| mistralai_Ministral-8B-Instruct-2410: 10h 46m 2s ✅ Benchmark completed for mistralai_Ministral-8B-Instruct-2410
🔥 Starting benchmark for google_gemma-3-4b-it Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 4 Passed argument batch_size = auto.
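The repeated "Passed argument batch_size = auto:1 ... Determined largest batch size: N" messages are lm-evaluation-harness probing for the largest per-device batch size before each run, and the "hf (pretrained=..., gen_kwargs: (None), ...)" header records the model and settings it settled on. As a rough sketch of how a single run like the google_gemma-3-4b-it one being started here could be reproduced (the wrapper script and its full task list are not shown in this log, so the tasks and flags below are an illustrative assumption, not the exact invocation used):

import subprocess

# Hypothetical reproduction of one run from this log; the task list is an
# illustrative subset of the benchmarks that appear in the tables above.
cmd = [
    "lm_eval",
    "--model", "hf",
    "--model_args", "pretrained=/home/jaymin/Documents/llm/llm_models/google_gemma-3-4b-it",
    "--tasks", "mmlu,bbh,gsm8k,arc_challenge,hellaswag,winogrande",
    "--batch_size", "auto",  # let the harness detect the largest batch size, as logged here
]
subprocess.run(cmd, check=True)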
Detecting largest batch size Determined Largest batch size: 4 hf (pretrained=/home/jaymin/Documents/llm/llm_models/google_gemma-3-4b-it), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (4) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4920|± |0.0158| |anli_r2 | 1|none | 0|acc |↑ | 0.4710|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.4683|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.5341|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5708|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.7094|± |0.0050| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9000|± |0.0190| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5775|± |0.0362| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.7760|± |0.0264| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5640|± |0.0314| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.5440|± |0.0316| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.3440|± |0.0301| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.2280|± |0.0266| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.9760|± |0.0097| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.6040|± |0.0310| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.4880|± |0.0317| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.9080|± |0.0183| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7000|± |0.0290| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.8400|± |0.0232| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8680|± |0.0215| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.9120|± |0.0180| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.7466|± |0.0361| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.7240|± |0.0283| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.6760|± |0.0297| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5240|± |0.0316| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.6404|± |0.0361| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.8880|± |0.0200| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.8600|± |0.0220| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.8080|± |0.0250| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.7840|± |0.0261| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.9760|± |0.0097| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9960|± |0.0040| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.2640|± |0.0279| |boolq | 2|none | 0|acc |↑ | 0.8398|± |0.0064| |drop | 3|none | 0|em |↑ | 0.0055|± |0.0008| | | |none | 0|f1 |↑ | 0.0893|± |0.0018| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1667|± |0.0266| | | |strict-match | 0|exact_match|↑ | 0.0152|± |0.0087| 
|gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1515|± |0.0255| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3384|± |0.0337| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3434|± |0.0338| | | |none | 0|acc_norm |↑ | 0.3434|± |0.0338| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3535|± |0.0341| | | |none | 0|acc_norm |↑ | 0.3535|± |0.0341| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1740|± |0.0162| | | |strict-match | 0|exact_match|↑ | 0.0147|± |0.0051| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1612|± |0.0158| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2985|± |0.0196| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2711|± |0.0190| | | |none | 0|acc_norm |↑ | 0.2711|± |0.0190| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3407|± |0.0203| | | |none | 0|acc_norm |↑ | 0.3407|± |0.0203| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1496|± |0.0169| | | |strict-match | 0|exact_match|↑ | 0.0067|± |0.0039| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1496|± |0.0169| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3125|± |0.0219| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3080|± |0.0218| | | |none | 0|acc_norm |↑ | 0.3080|± |0.0218| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2879|± |0.0214| | | |none | 0|acc_norm |↑ | 0.2879|± |0.0214| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.7665|± |0.0117| | | |strict-match | 5|exact_match|↑ | 0.7619|± |0.0117| |hellaswag | 1|none | 0|acc |↑ | 0.5599|± |0.0050| | | |none | 0|acc_norm |↑ | 0.7414|± |0.0044| |mmlu | 2|none | |acc |↑ | 0.5756|± |0.0039| | - humanities | 2|none | |acc |↑ | 0.5163|± |0.0068| | - formal_logic | 1|none | 0|acc |↑ | 0.3571|± |0.0429| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7455|± |0.0340| | - high_school_us_history | 1|none | 0|acc |↑ | 0.7451|± |0.0306| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7468|± |0.0283| | - international_law | 1|none | 0|acc |↑ | 0.7438|± |0.0398| | - jurisprudence | 1|none | 0|acc |↑ | 0.7037|± |0.0441| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7178|± |0.0354| | - moral_disputes | 1|none | 0|acc |↑ | 0.6301|± |0.0260| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2480|± |0.0144| | - philosophy | 1|none | 0|acc |↑ | 0.6592|± |0.0269| | - prehistory | 1|none | 0|acc |↑ | 0.6821|± |0.0259| | - professional_law | 1|none | 0|acc |↑ | 0.4237|± |0.0126| | - world_religions | 1|none | 0|acc |↑ | 0.7778|± |0.0319| | - other | 2|none | |acc |↑ | 0.6369|± |0.0083| | - business_ethics | 1|none | 0|acc |↑ | 0.6000|± |0.0492| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.6566|± |0.0292| | - college_medicine | 1|none | 0|acc |↑ | 0.5723|± |0.0377| | - global_facts | 1|none | 0|acc |↑ | 0.2900|± |0.0456| | - human_aging | 1|none | 0|acc |↑ | 0.6278|± |0.0324| | - management | 1|none | 0|acc |↑ | 0.7282|± |0.0441| | - marketing | 1|none | 0|acc |↑ | 0.8462|± |0.0236| | - medical_genetics | 1|none | 0|acc |↑ | 0.6300|± |0.0485| | - miscellaneous | 1|none | 0|acc |↑ | 0.7573|± |0.0153| | - nutrition | 1|none | 0|acc |↑ | 0.6438|± |0.0274| | - 
professional_accounting | 1|none | 0|acc |↑ | 0.3901|± |0.0291| | - professional_medicine | 1|none | 0|acc |↑ | 0.5772|± |0.0300| | - virology | 1|none | 0|acc |↑ | 0.5060|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.6744|± |0.0083| | - econometrics | 1|none | 0|acc |↑ | 0.4649|± |0.0469| | - high_school_geography | 1|none | 0|acc |↑ | 0.7020|± |0.0326| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8135|± |0.0281| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.5718|± |0.0251| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.6387|± |0.0312| | - high_school_psychology | 1|none | 0|acc |↑ | 0.7780|± |0.0178| | - human_sexuality | 1|none | 0|acc |↑ | 0.6641|± |0.0414| | - professional_psychology | 1|none | 0|acc |↑ | 0.5948|± |0.0199| | - public_relations | 1|none | 0|acc |↑ | 0.6455|± |0.0458| | - security_studies | 1|none | 0|acc |↑ | 0.6980|± |0.0294| | - sociology | 1|none | 0|acc |↑ | 0.7612|± |0.0301| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8100|± |0.0394| | - stem | 2|none | |acc |↑ | 0.5071|± |0.0086| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3200|± |0.0469| | - anatomy | 1|none | 0|acc |↑ | 0.5481|± |0.0430| | - astronomy | 1|none | 0|acc |↑ | 0.6908|± |0.0376| | - college_biology | 1|none | 0|acc |↑ | 0.6875|± |0.0388| | - college_chemistry | 1|none | 0|acc |↑ | 0.4000|± |0.0492| | - college_computer_science | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - college_mathematics | 1|none | 0|acc |↑ | 0.3700|± |0.0485| | - college_physics | 1|none | 0|acc |↑ | 0.3725|± |0.0481| | - computer_security | 1|none | 0|acc |↑ | 0.6600|± |0.0476| | - conceptual_physics | 1|none | 0|acc |↑ | 0.5404|± |0.0326| | - electrical_engineering | 1|none | 0|acc |↑ | 0.5310|± |0.0416| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.4841|± |0.0257| | - high_school_biology | 1|none | 0|acc |↑ | 0.7065|± |0.0259| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.5074|± |0.0352| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.7200|± |0.0451| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.3815|± |0.0296| | - high_school_physics | 1|none | 0|acc |↑ | 0.3245|± |0.0382| | - high_school_statistics | 1|none | 0|acc |↑ | 0.4074|± |0.0335| | - machine_learning | 1|none | 0|acc |↑ | 0.3571|± |0.0455| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1094|± |0.0052| |openbookqa | 1|none | 0|acc |↑ | 0.3640|± |0.0215| | | |none | 0|acc_norm |↑ | 0.4660|± |0.0223| |piqa | 1|none | 0|acc |↑ | 0.7628|± |0.0099| | | |none | 0|acc_norm |↑ | 0.7720|± |0.0098| |qnli | 1|none | 0|acc |↑ | 0.5660|± |0.0067| |sciq | 1|none | 0|acc |↑ | 0.9550|± |0.0066| | | |none | 0|acc_norm |↑ | 0.9310|± |0.0080| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.3148|± |0.0035| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4333|± |0.0173| | | |none | 0|bleu_diff |↑ |-1.4479|± |0.5140| | | |none | 0|bleu_max |↑ |18.0994|± |0.6738| | | |none | 0|rouge1_acc |↑ | 0.4235|± |0.0173| | | |none | 0|rouge1_diff|↑ |-2.6851|± |0.6899| | | |none | 0|rouge1_max |↑ |41.5023|± |0.8412| | | |none | 0|rouge2_acc |↑ | 0.3195|± |0.0163| | | |none | 0|rouge2_diff|↑ |-4.2870|± |0.7901| | | |none | 0|rouge2_max |↑ |25.1379|± |0.9049| | | |none | 0|rougeL_acc |↑ | 0.4247|± |0.0173| | | |none | 0|rougeL_diff|↑ |-2.9853|± |0.6819| | | |none | 0|rougeL_max |↑ |38.8207|± |0.8446| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3488|± |0.0167| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5188|± |0.0160| |winogrande | 1|none | 0|acc |↑ | 0.7009|± |0.0129| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| 
|------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.7094|± |0.0050| |mmlu | 2|none | |acc |↑ |0.5756|± |0.0039| | - humanities | 2|none | |acc |↑ |0.5163|± |0.0068| | - other | 2|none | |acc |↑ |0.6369|± |0.0083| | - social sciences| 2|none | |acc |↑ |0.6744|± |0.0083| | - stem | 2|none | |acc |↑ |0.5071|± |0.0086| google_gemma-3-4b-it: 4h 51m 14s ✅ Benchmark completed for google_gemma-3-4b-it 🔥 Starting benchmark for google_gemma-3-1b-it Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 1 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/home/jaymin/Documents/llm/llm_models/google_gemma-3-1b-it), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|-------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.3320|± |0.0149| |anli_r2 | 1|none | 0|acc |↑ | 0.3540|± |0.0151| |anli_r3 | 1|none | 0|acc |↑ | 0.3567|± |0.0138| |arc_challenge | 1|none | 0|acc |↑ | 0.3532|± |0.0140| | | |none | 0|acc_norm |↑ | 0.3805|± |0.0142| |bbh | 3|get-answer | |exact_match|↑ | 0.3823|± |0.0053| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8320|± |0.0237| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5134|± |0.0366| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.2680|± |0.0281| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.3800|± |0.0308| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0920|± |0.0183| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5040|± |0.0317| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.1240|± |0.0209| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.5240|± |0.0316| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2280|± |0.0266| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.1560|± |0.0230| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.3960|± |0.0310| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.3080|± |0.0293| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.6560|± |0.0301| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.7440|± |0.0277| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.5800|± |0.0313| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.1712|± |0.0313| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.1600|± |0.0232| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.3640|± |0.0305| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.1160|± |0.0203| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5112|± |0.0376| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.6120|± |0.0309| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.3040|± |0.0292| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2240|± |0.0264| | - 
bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1600|± |0.0232| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.5040|± |0.0317| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.8360|± |0.0235| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0360|± |0.0118| |boolq | 2|none | 0|acc |↑ | 0.7581|± |0.0075| |drop | 3|none | 0|em |↑ | 0.0018|± |0.0004| | | |none | 0|f1 |↑ | 0.0762|± |0.0017| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1263|± |0.0237| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1364|± |0.0245| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2172|± |0.0294| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2626|± |0.0314| | | |none | 0|acc_norm |↑ | 0.2626|± |0.0314| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2374|± |0.0303| | | |none | 0|acc_norm |↑ | 0.2374|± |0.0303| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1300|± |0.0144| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1520|± |0.0154| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2601|± |0.0188| | | |strict-match | 0|exact_match|↑ | 0.0037|± |0.0026| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2491|± |0.0185| | | |none | 0|acc_norm |↑ | 0.2491|± |0.0185| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2637|± |0.0189| | | |none | 0|acc_norm |↑ | 0.2637|± |0.0189| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1183|± |0.0153| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1607|± |0.0174| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2344|± |0.0200| | | |strict-match | 0|exact_match|↑ | 0.0045|± |0.0032| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2679|± |0.0209| | | |none | 0|acc_norm |↑ | 0.2679|± |0.0209| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2656|± |0.0209| | | |none | 0|acc_norm |↑ | 0.2656|± |0.0209| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.2502|± |0.0119| | | |strict-match | 5|exact_match|↑ | 0.2472|± |0.0119| |hellaswag | 1|none | 0|acc |↑ | 0.4338|± |0.0049| | | |none | 0|acc_norm |↑ | 0.5783|± |0.0049| |mmlu | 2|none | |acc |↑ | 0.3859|± |0.0040| | - humanities | 2|none | |acc |↑ | 0.3626|± |0.0069| | - formal_logic | 1|none | 0|acc |↑ | 0.3492|± |0.0426| | - high_school_european_history | 1|none | 0|acc |↑ | 0.4909|± |0.0390| | - high_school_us_history | 1|none | 0|acc |↑ | 0.4706|± |0.0350| | - high_school_world_history | 1|none | 0|acc |↑ | 0.4726|± |0.0325| | - international_law | 1|none | 0|acc |↑ | 0.5372|± |0.0455| | - jurisprudence | 1|none | 0|acc |↑ | 0.4722|± |0.0483| | - logical_fallacies | 1|none | 0|acc |↑ | 0.4417|± |0.0390| | - moral_disputes | 1|none | 0|acc |↑ | 0.4220|± |0.0266| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2335|± |0.0141| | - philosophy | 1|none | 0|acc |↑ | 0.4244|± |0.0281| | - prehistory | 1|none | 0|acc |↑ | 0.4414|± |0.0276| | - professional_law | 1|none | 0|acc |↑ | 0.3057|± |0.0118| | - world_religions | 1|none | 0|acc |↑ | 
0.5029|± |0.0383| | - other | 2|none | |acc |↑ | 0.4335|± |0.0087| | - business_ethics | 1|none | 0|acc |↑ | 0.3700|± |0.0485| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.4226|± |0.0304| | - college_medicine | 1|none | 0|acc |↑ | 0.3815|± |0.0370| | - global_facts | 1|none | 0|acc |↑ | 0.3600|± |0.0482| | - human_aging | 1|none | 0|acc |↑ | 0.4529|± |0.0334| | - management | 1|none | 0|acc |↑ | 0.5631|± |0.0491| | - marketing | 1|none | 0|acc |↑ | 0.6239|± |0.0317| | - medical_genetics | 1|none | 0|acc |↑ | 0.4100|± |0.0494| | - miscellaneous | 1|none | 0|acc |↑ | 0.5147|± |0.0179| | - nutrition | 1|none | 0|acc |↑ | 0.4150|± |0.0282| | - professional_accounting | 1|none | 0|acc |↑ | 0.2730|± |0.0266| | - professional_medicine | 1|none | 0|acc |↑ | 0.2941|± |0.0277| | - virology | 1|none | 0|acc |↑ | 0.3795|± |0.0378| | - social sciences | 2|none | |acc |↑ | 0.4482|± |0.0088| | - econometrics | 1|none | 0|acc |↑ | 0.2193|± |0.0389| | - high_school_geography | 1|none | 0|acc |↑ | 0.5000|± |0.0356| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.4870|± |0.0361| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.3282|± |0.0238| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.3782|± |0.0315| | - high_school_psychology | 1|none | 0|acc |↑ | 0.5321|± |0.0214| | - human_sexuality | 1|none | 0|acc |↑ | 0.5038|± |0.0439| | - professional_psychology | 1|none | 0|acc |↑ | 0.3644|± |0.0195| | - public_relations | 1|none | 0|acc |↑ | 0.4727|± |0.0478| | - security_studies | 1|none | 0|acc |↑ | 0.5388|± |0.0319| | - sociology | 1|none | 0|acc |↑ | 0.5970|± |0.0347| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.6000|± |0.0492| | - stem | 2|none | |acc |↑ | 0.3130|± |0.0081| | - abstract_algebra | 1|none | 0|acc |↑ | 0.2700|± |0.0446| | - anatomy | 1|none | 0|acc |↑ | 0.4593|± |0.0430| | - astronomy | 1|none | 0|acc |↑ | 0.3684|± |0.0393| | - college_biology | 1|none | 0|acc |↑ | 0.3403|± |0.0396| | - college_chemistry | 1|none | 0|acc |↑ | 0.3100|± |0.0465| | - college_computer_science | 1|none | 0|acc |↑ | 0.2600|± |0.0441| | - college_mathematics | 1|none | 0|acc |↑ | 0.2300|± |0.0423| | - college_physics | 1|none | 0|acc |↑ | 0.2059|± |0.0402| | - computer_security | 1|none | 0|acc |↑ | 0.4600|± |0.0501| | - conceptual_physics | 1|none | 0|acc |↑ | 0.3745|± |0.0316| | - electrical_engineering | 1|none | 0|acc |↑ | 0.4345|± |0.0413| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.2302|± |0.0217| | - high_school_biology | 1|none | 0|acc |↑ | 0.4323|± |0.0282| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.2857|± |0.0318| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.3100|± |0.0465| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.2370|± |0.0259| | - high_school_physics | 1|none | 0|acc |↑ | 0.2252|± |0.0341| | - high_school_statistics | 1|none | 0|acc |↑ | 0.2083|± |0.0277| | - machine_learning | 1|none | 0|acc |↑ | 0.3750|± |0.0460| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0357|± |0.0031| |openbookqa | 1|none | 0|acc |↑ | 0.3020|± |0.0206| | | |none | 0|acc_norm |↑ | 0.3880|± |0.0218| |piqa | 1|none | 0|acc |↑ | 0.7182|± |0.0105| | | |none | 0|acc_norm |↑ | 0.7209|± |0.0105| |qnli | 1|none | 0|acc |↑ | 0.4941|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9040|± |0.0093| | | |none | 0|acc_norm |↑ | 0.8580|± |0.0110| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.1897|± |0.0029| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3060|± |0.0161| | | |none | 0|bleu_diff |↑ | -7.1778|± |0.7355| | | |none | 0|bleu_max |↑ | 23.2944|± |0.7624| | | |none | 
0|rouge1_acc |↑ | 0.2644|± |0.0154| | | |none | 0|rouge1_diff|↑ |-10.0231|± |0.7875| | | |none | 0|rouge1_max |↑ | 46.4515|± |0.9083| | | |none | 0|rouge2_acc |↑ | 0.2044|± |0.0141| | | |none | 0|rouge2_diff|↑ |-11.5180|± |0.9589| | | |none | 0|rouge2_max |↑ | 30.6640|± |0.9977| | | |none | 0|rougeL_acc |↑ | 0.2570|± |0.0153| | | |none | 0|rougeL_diff|↑ |-10.3014|± |0.7848| | | |none | 0|rougeL_max |↑ | 43.9439|± |0.9131| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2460|± |0.0151| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.3875|± |0.0152| |winogrande | 1|none | 0|acc |↑ | 0.5896|± |0.0138| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.3823|± |0.0053| |mmlu | 2|none | |acc |↑ |0.3859|± |0.0040| | - humanities | 2|none | |acc |↑ |0.3626|± |0.0069| | - other | 2|none | |acc |↑ |0.4335|± |0.0087| | - social sciences| 2|none | |acc |↑ |0.4482|± |0.0088| | - stem | 2|none | |acc |↑ |0.3130|± |0.0081| google_gemma-3-1b-it: 6h 50m 53s ✅ Benchmark completed for google_gemma-3-1b-it 🔥 Starting benchmark for google_gemma-3-12b-it Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 2 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 2 hf (pretrained=/home/jaymin/Documents/llm/llm_models/google_gemma-3-12b-it), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (2) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.6030|± |0.0155| |anli_r2 | 1|none | 0|acc |↑ | 0.5600|± |0.0157| |anli_r3 | 1|none | 0|acc |↑ | 0.5958|± |0.0142| |arc_challenge | 1|none | 0|acc |↑ | 0.6084|± |0.0143| | | |none | 0|acc_norm |↑ | 0.6109|± |0.0142| |bbh | 3|get-answer | |exact_match|↑ | 0.8019|± |0.0044| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9160|± |0.0176| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5829|± |0.0362| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.8880|± |0.0200| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.7640|± |0.0269| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.7080|± |0.0288| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.6000|± |0.0310| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3440|± |0.0301| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.9960|± |0.0040| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.7960|± |0.0255| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.4800|± |0.0317| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.9600|± |0.0124| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.8480|± |0.0228| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.9520|± |0.0135| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.9760|± |0.0097| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.9680|± |0.0112| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.8836|± |0.0266| | - bbh_cot_fewshot_reasoning_about_colored_objects | 
4|get-answer | 3|exact_match|↑ | 0.9160|± |0.0176| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.7920|± |0.0257| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5360|± |0.0316| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.8371|± |0.0278| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9680|± |0.0112| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.6760|± |0.0297| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.9720|± |0.0105| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.7320|± |0.0281| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.5480|± |0.0315| |boolq | 2|none | 0|acc |↑ | 0.8746|± |0.0058| |drop | 3|none | 0|em |↑ | 0.0214|± |0.0015| | | |none | 0|f1 |↑ | 0.1396|± |0.0023| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1616|± |0.0262| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0909|± |0.0205| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2374|± |0.0303| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3434|± |0.0338| | | |none | 0|acc_norm |↑ | 0.3434|± |0.0338| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3384|± |0.0337| | | |none | 0|acc_norm |↑ | 0.3384|± |0.0337| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1575|± |0.0156| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1410|± |0.0149| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2436|± |0.0184| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3278|± |0.0201| | | |none | 0|acc_norm |↑ | 0.3278|± |0.0201| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3077|± |0.0198| | | |none | 0|acc_norm |↑ | 0.3077|± |0.0198| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1763|± |0.0180| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1518|± |0.0170| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2277|± |0.0198| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3371|± |0.0224| | | |none | 0|acc_norm |↑ | 0.3371|± |0.0224| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3371|± |0.0224| | | |none | 0|acc_norm |↑ | 0.3371|± |0.0224| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.8848|± |0.0088| | | |strict-match | 5|exact_match|↑ | 0.8772|± |0.0090| |hellaswag | 1|none | 0|acc |↑ | 0.6266|± |0.0048| | | |none | 0|acc_norm |↑ | 0.8188|± |0.0038| |mmlu | 2|none | |acc |↑ | 0.7161|± |0.0036| | - humanities | 2|none | |acc |↑ | 0.6387|± |0.0065| | - formal_logic | 1|none | 0|acc |↑ | 0.5556|± |0.0444| | - high_school_european_history | 1|none | 0|acc |↑ | 0.8606|± |0.0270| | - high_school_us_history | 
1|none | 0|acc |↑ | 0.8431|± |0.0255| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8945|± |0.0200| | - international_law | 1|none | 0|acc |↑ | 0.8595|± |0.0317| | - jurisprudence | 1|none | 0|acc |↑ | 0.8056|± |0.0383| | - logical_fallacies | 1|none | 0|acc |↑ | 0.8344|± |0.0292| | - moral_disputes | 1|none | 0|acc |↑ | 0.7717|± |0.0226| | - moral_scenarios | 1|none | 0|acc |↑ | 0.3676|± |0.0161| | - philosophy | 1|none | 0|acc |↑ | 0.7910|± |0.0231| | - prehistory | 1|none | 0|acc |↑ | 0.8148|± |0.0216| | - professional_law | 1|none | 0|acc |↑ | 0.5424|± |0.0127| | - world_religions | 1|none | 0|acc |↑ | 0.8421|± |0.0280| | - other | 2|none | |acc |↑ | 0.7692|± |0.0073| | - business_ethics | 1|none | 0|acc |↑ | 0.7700|± |0.0423| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7962|± |0.0248| | - college_medicine | 1|none | 0|acc |↑ | 0.7225|± |0.0341| | - global_facts | 1|none | 0|acc |↑ | 0.4800|± |0.0502| | - human_aging | 1|none | 0|acc |↑ | 0.7668|± |0.0284| | - management | 1|none | 0|acc |↑ | 0.8350|± |0.0368| | - marketing | 1|none | 0|acc |↑ | 0.9017|± |0.0195| | - medical_genetics | 1|none | 0|acc |↑ | 0.8300|± |0.0378| | - miscellaneous | 1|none | 0|acc |↑ | 0.8608|± |0.0124| | - nutrition | 1|none | 0|acc |↑ | 0.7680|± |0.0242| | - professional_accounting | 1|none | 0|acc |↑ | 0.5461|± |0.0297| | - professional_medicine | 1|none | 0|acc |↑ | 0.8088|± |0.0239| | - virology | 1|none | 0|acc |↑ | 0.5723|± |0.0385| | - social sciences | 2|none | |acc |↑ | 0.8213|± |0.0068| | - econometrics | 1|none | 0|acc |↑ | 0.6053|± |0.0460| | - high_school_geography | 1|none | 0|acc |↑ | 0.8535|± |0.0252| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.9326|± |0.0181| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7821|± |0.0209| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8487|± |0.0233| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8954|± |0.0131| | - human_sexuality | 1|none | 0|acc |↑ | 0.8321|± |0.0328| | - professional_psychology | 1|none | 0|acc |↑ | 0.7712|± |0.0170| | - public_relations | 1|none | 0|acc |↑ | 0.7091|± |0.0435| | - security_studies | 1|none | 0|acc |↑ | 0.7633|± |0.0272| | - sociology | 1|none | 0|acc |↑ | 0.8806|± |0.0229| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.9100|± |0.0288| | - stem | 2|none | |acc |↑ | 0.6768|± |0.0080| | - abstract_algebra | 1|none | 0|acc |↑ | 0.4300|± |0.0498| | - anatomy | 1|none | 0|acc |↑ | 0.7037|± |0.0394| | - astronomy | 1|none | 0|acc |↑ | 0.8487|± |0.0292| | - college_biology | 1|none | 0|acc |↑ | 0.8819|± |0.0270| | - college_chemistry | 1|none | 0|acc |↑ | 0.4600|± |0.0501| | - college_computer_science | 1|none | 0|acc |↑ | 0.5900|± |0.0494| | - college_mathematics | 1|none | 0|acc |↑ | 0.5100|± |0.0502| | - college_physics | 1|none | 0|acc |↑ | 0.6275|± |0.0481| | - computer_security | 1|none | 0|acc |↑ | 0.8000|± |0.0402| | - conceptual_physics | 1|none | 0|acc |↑ | 0.7745|± |0.0273| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6690|± |0.0392| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.6455|± |0.0246| | - high_school_biology | 1|none | 0|acc |↑ | 0.8677|± |0.0193| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.6847|± |0.0327| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.8300|± |0.0378| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4926|± |0.0305| | - high_school_physics | 1|none | 0|acc |↑ | 0.5430|± |0.0407| | - high_school_statistics | 1|none | 0|acc |↑ | 0.6713|± |0.0320| | - machine_learning | 1|none | 0|acc |↑ | 0.5893|± |0.0467| |nq_open | 
4|remove_whitespace| 0|exact_match|↑ | 0.1571|± |0.0061| |openbookqa | 1|none | 0|acc |↑ | 0.4220|± |0.0221| | | |none | 0|acc_norm |↑ | 0.4980|± |0.0224| |piqa | 1|none | 0|acc |↑ | 0.8014|± |0.0093| | | |none | 0|acc_norm |↑ | 0.7807|± |0.0097| |qnli | 1|none | 0|acc |↑ | 0.7457|± |0.0059| |sciq | 1|none | 0|acc |↑ | 0.9720|± |0.0052| | | |none | 0|acc_norm |↑ | 0.9540|± |0.0066| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.2752|± |0.0033| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4786|± |0.0175| | | |none | 0|bleu_diff |↑ |-0.4518|± |0.3853| | | |none | 0|bleu_max |↑ |12.5016|± |0.5371| | | |none | 0|rouge1_acc |↑ | 0.5141|± |0.0175| | | |none | 0|rouge1_diff|↑ |-0.2991|± |0.5781| | | |none | 0|rouge1_max |↑ |35.1025|± |0.7280| | | |none | 0|rouge2_acc |↑ | 0.4125|± |0.0172| | | |none | 0|rouge2_diff|↑ |-1.6691|± |0.6548| | | |none | 0|rouge2_max |↑ |20.2480|± |0.7443| | | |none | 0|rougeL_acc |↑ | 0.4957|± |0.0175| | | |none | 0|rougeL_diff|↑ |-0.8130|± |0.5698| | | |none | 0|rougeL_max |↑ |31.3688|± |0.7201| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.4051|± |0.0172| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5812|± |0.0160| |winogrande | 1|none | 0|acc |↑ | 0.7443|± |0.0123| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.8019|± |0.0044| |mmlu | 2|none | |acc |↑ |0.7161|± |0.0036| | - humanities | 2|none | |acc |↑ |0.6387|± |0.0065| | - other | 2|none | |acc |↑ |0.7692|± |0.0073| | - social sciences| 2|none | |acc |↑ |0.8213|± |0.0068| | - stem | 2|none | |acc |↑ |0.6768|± |0.0080| google_gemma-3-12b-it: 15h 46m 6s ✅ Benchmark completed for google_gemma-3-12b-it 🔥 Starting benchmark for meta-llama_Llama-2-13b-chat-hf Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 1 Passed argument batch_size = auto. 
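Throughout these tables, the number after each ± is the standard error of the metric, which allows a quick significance screen when comparing two models on the same benchmark. A minimal sketch, assuming the two evaluations can be treated as independent (they actually share the same test items, so a paired comparison would be more precise; this is only a coarse check), using the aggregate MMLU accuracies reported above for google_gemma-3-12b-it and mistralai_Ministral-8B-Instruct-2410:

from math import hypot

# Aggregate MMLU accuracy and standard error, copied from the Groups tables above.
acc_gemma12b, se_gemma12b = 0.7161, 0.0036      # google_gemma-3-12b-it
acc_ministral, se_ministral = 0.6407, 0.0038    # mistralai_Ministral-8B-Instruct-2410

diff = acc_gemma12b - acc_ministral         # 0.0754
se_diff = hypot(se_gemma12b, se_ministral)  # ~0.0052 under the independence assumption
z = diff / se_diff                          # ~14 standard errors apart

print(f"MMLU gap: {diff:.4f} +/- {se_diff:.4f} (z ~ {z:.1f})")

By this screen the roughly 7.5-point aggregate MMLU gap sits about 14 standard errors from zero and is far outside noise, whereas many per-subtask gaps, whose standard errors are around 0.03-0.05, are not.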
Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/home/jaymin/Documents/llm/llm_models/meta-llama_Llama-2-13b-chat-hf), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4300|± |0.0157| |anli_r2 | 1|none | 0|acc |↑ | 0.4300|± |0.0157| |anli_r3 | 1|none | 0|acc |↑ | 0.4142|± |0.0142| |arc_challenge | 1|none | 0|acc |↑ | 0.4616|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5017|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.4780|± |0.0055| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.7200|± |0.0285| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5829|± |0.0362| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.6600|± |0.0300| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.4560|± |0.0316| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0640|± |0.0155| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5240|± |0.0316| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.4200|± |0.0313| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.5840|± |0.0312| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.3960|± |0.0310| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.3480|± |0.0302| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.7320|± |0.0281| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.6640|± |0.0299| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.0320|± |0.0112| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.6440|± |0.0303| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.5640|± |0.0314| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.4452|± |0.0413| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.5440|± |0.0316| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.4120|± |0.0312| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.4400|± |0.0315| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.7191|± |0.0338| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9280|± |0.0164| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.2000|± |0.0253| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2160|± |0.0261| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1840|± |0.0246| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.3520|± |0.0303| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9480|± |0.0141| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.2080|± |0.0257| |boolq | 2|none | 0|acc |↑ | 0.8165|± |0.0068| |drop | 3|none | 0|em |↑ | 0.0073|± |0.0009| | | |none | 0|f1 |↑ | 0.0915|± |0.0020| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1212|± |0.0233| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| 
|gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1414|± |0.0248| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2071|± |0.0289| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2424|± |0.0305| | | |none | 0|acc_norm |↑ | 0.2424|± |0.0305| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2222|± |0.0296| | | |none | 0|acc_norm |↑ | 0.2222|± |0.0296| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1630|± |0.0158| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2033|± |0.0172| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1960|± |0.0170| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2875|± |0.0194| | | |none | 0|acc_norm |↑ | 0.2875|± |0.0194| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2766|± |0.0192| | | |none | 0|acc_norm |↑ | 0.2766|± |0.0192| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1741|± |0.0179| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1786|± |0.0181| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1607|± |0.0174| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3192|± |0.0220| | | |none | 0|acc_norm |↑ | 0.3192|± |0.0220| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2991|± |0.0217| | | |none | 0|acc_norm |↑ | 0.2991|± |0.0217| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.3556|± |0.0132| | | |strict-match | 5|exact_match|↑ | 0.3472|± |0.0131| |hellaswag | 1|none | 0|acc |↑ | 0.6071|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7967|± |0.0040| |mmlu | 2|none | |acc |↑ | 0.5313|± |0.0040| | - humanities | 2|none | |acc |↑ | 0.4978|± |0.0068| | - formal_logic | 1|none | 0|acc |↑ | 0.2381|± |0.0381| | - high_school_european_history | 1|none | 0|acc |↑ | 0.6667|± |0.0368| | - high_school_us_history | 1|none | 0|acc |↑ | 0.7402|± |0.0308| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7257|± |0.0290| | - international_law | 1|none | 0|acc |↑ | 0.7190|± |0.0410| | - jurisprudence | 1|none | 0|acc |↑ | 0.6944|± |0.0445| | - logical_fallacies | 1|none | 0|acc |↑ | 0.6871|± |0.0364| | - moral_disputes | 1|none | 0|acc |↑ | 0.6012|± |0.0264| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2760|± |0.0150| | - philosophy | 1|none | 0|acc |↑ | 0.6463|± |0.0272| | - prehistory | 1|none | 0|acc |↑ | 0.6265|± |0.0269| | - professional_law | 1|none | 0|acc |↑ | 0.4003|± |0.0125| | - world_religions | 1|none | 0|acc |↑ | 0.7719|± |0.0322| | - other | 2|none | |acc |↑ | 0.6061|± |0.0084| | - business_ethics | 1|none | 0|acc |↑ | 0.5400|± |0.0501| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.5887|± |0.0303| | - college_medicine | 1|none | 0|acc |↑ | 0.4220|± |0.0377| | - global_facts | 1|none | 0|acc |↑ | 0.3100|± |0.0465| | - human_aging | 1|none | 0|acc |↑ | 0.6233|± |0.0325| | - management | 1|none | 0|acc |↑ | 0.6893|± |0.0458| | - marketing | 1|none | 0|acc |↑ | 0.7991|± |0.0262| | - medical_genetics | 1|none | 0|acc |↑ | 0.5800|± |0.0496| | - miscellaneous | 1|none | 0|acc |↑ | 0.7663|± |0.0151| | - nutrition | 1|none | 0|acc |↑ | 0.6111|± |0.0279| | - 
professional_accounting | 1|none | 0|acc |↑ | 0.4078|± |0.0293| | - professional_medicine | 1|none | 0|acc |↑ | 0.4963|± |0.0304| | - virology | 1|none | 0|acc |↑ | 0.4639|± |0.0388| | - social sciences | 2|none | |acc |↑ | 0.6136|± |0.0085| | - econometrics | 1|none | 0|acc |↑ | 0.2456|± |0.0405| | - high_school_geography | 1|none | 0|acc |↑ | 0.6515|± |0.0339| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.7565|± |0.0310| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.5026|± |0.0254| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.5126|± |0.0325| | - high_school_psychology | 1|none | 0|acc |↑ | 0.7156|± |0.0193| | - human_sexuality | 1|none | 0|acc |↑ | 0.6412|± |0.0421| | - professional_psychology | 1|none | 0|acc |↑ | 0.5408|± |0.0202| | - public_relations | 1|none | 0|acc |↑ | 0.6273|± |0.0463| | - security_studies | 1|none | 0|acc |↑ | 0.6571|± |0.0304| | - sociology | 1|none | 0|acc |↑ | 0.7512|± |0.0306| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8100|± |0.0394| | - stem | 2|none | |acc |↑ | 0.4272|± |0.0085| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3000|± |0.0461| | - anatomy | 1|none | 0|acc |↑ | 0.5185|± |0.0432| | - astronomy | 1|none | 0|acc |↑ | 0.5789|± |0.0402| | - college_biology | 1|none | 0|acc |↑ | 0.5833|± |0.0412| | - college_chemistry | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - college_computer_science | 1|none | 0|acc |↑ | 0.4400|± |0.0499| | - college_mathematics | 1|none | 0|acc |↑ | 0.2900|± |0.0456| | - college_physics | 1|none | 0|acc |↑ | 0.2745|± |0.0444| | - computer_security | 1|none | 0|acc |↑ | 0.6600|± |0.0476| | - conceptual_physics | 1|none | 0|acc |↑ | 0.4128|± |0.0322| | - electrical_engineering | 1|none | 0|acc |↑ | 0.5448|± |0.0415| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.3201|± |0.0240| | - high_school_biology | 1|none | 0|acc |↑ | 0.6419|± |0.0273| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.4286|± |0.0348| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.5500|± |0.0500| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.2741|± |0.0272| | - high_school_physics | 1|none | 0|acc |↑ | 0.3377|± |0.0386| | - high_school_statistics | 1|none | 0|acc |↑ | 0.3426|± |0.0324| | - machine_learning | 1|none | 0|acc |↑ | 0.3304|± |0.0446| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1030|± |0.0051| |openbookqa | 1|none | 0|acc |↑ | 0.3520|± |0.0214| | | |none | 0|acc_norm |↑ | 0.4400|± |0.0222| |piqa | 1|none | 0|acc |↑ | 0.7780|± |0.0097| | | |none | 0|acc_norm |↑ | 0.7933|± |0.0094| |qnli | 1|none | 0|acc |↑ | 0.5438|± |0.0067| |sciq | 1|none | 0|acc |↑ | 0.9510|± |0.0068| | | |none | 0|acc_norm |↑ | 0.9050|± |0.0093| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.2725|± |0.0033| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4088|± |0.0172| | | |none | 0|bleu_diff |↑ |-2.0148|± |0.7195| | | |none | 0|bleu_max |↑ |26.0719|± |0.7783| | | |none | 0|rouge1_acc |↑ | 0.4235|± |0.0173| | | |none | 0|rouge1_diff|↑ |-3.1237|± |0.8531| | | |none | 0|rouge1_max |↑ |51.9853|± |0.8214| | | |none | 0|rouge2_acc |↑ | 0.3501|± |0.0167| | | |none | 0|rouge2_diff|↑ |-4.0918|± |0.9904| | | |none | 0|rouge2_max |↑ |36.4465|± |0.9660| | | |none | 0|rougeL_acc |↑ | 0.4186|± |0.0173| | | |none | 0|rougeL_diff|↑ |-3.1432|± |0.8645| | | |none | 0|rougeL_max |↑ |49.1291|± |0.8443| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2803|± |0.0157| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4396|± |0.0157| |winogrande | 1|none | 0|acc |↑ | 0.7119|± |0.0127| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| 
|------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.4780|± |0.0055| |mmlu | 2|none | |acc |↑ |0.5313|± |0.0040| | - humanities | 2|none | |acc |↑ |0.4978|± |0.0068| | - other | 2|none | |acc |↑ |0.6061|± |0.0084| | - social sciences| 2|none | |acc |↑ |0.6136|± |0.0085| | - stem | 2|none | |acc |↑ |0.4272|± |0.0085| meta-llama_Llama-2-13b-chat-hf: 17h 9m 0s ✅ Benchmark completed for meta-llama_Llama-2-13b-chat-hf 🔥 Starting benchmark for meta-llama_Llama-2-13b-hf Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 1 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/home/jaymin/Documents/llm/llm_models/meta-llama_Llama-2-13b-hf), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|-------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.3770|± |0.0153| |anli_r2 | 1|none | 0|acc |↑ | 0.3900|± |0.0154| |anli_r3 | 1|none | 0|acc |↑ | 0.3850|± |0.0141| |arc_challenge | 1|none | 0|acc |↑ | 0.4829|± |0.0146| | | |none | 0|acc_norm |↑ | 0.4898|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.4777|± |0.0054| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.7640|± |0.0269| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5401|± |0.0365| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.7240|± |0.0283| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.4040|± |0.0311| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0440|± |0.0130| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5120|± |0.0317| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.4040|± |0.0311| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.5520|± |0.0315| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4640|± |0.0316| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.4160|± |0.0312| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.6520|± |0.0302| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7280|± |0.0282| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.0240|± |0.0097| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.7400|± |0.0278| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.5360|± |0.0316| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.4932|± |0.0415| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.6040|± |0.0310| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.4080|± |0.0311| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.2760|± |0.0283| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5506|± |0.0374| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9600|± |0.0124| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.2600|± |0.0278| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2160|± |0.0261| | - 
bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1680|± |0.0237| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.3560|± |0.0303| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9480|± |0.0141| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.1960|± |0.0252| |boolq | 2|none | 0|acc |↑ | 0.8064|± |0.0069| |drop | 3|none | 0|em |↑ | 0.0033|± |0.0006| | | |none | 0|f1 |↑ | 0.0301|± |0.0011| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1111|± |0.0224| | | |strict-match | 0|exact_match|↑ | 0.0152|± |0.0087| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1061|± |0.0219| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2626|± |0.0314| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2727|± |0.0317| | | |none | 0|acc_norm |↑ | 0.2727|± |0.0317| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2525|± |0.0310| | | |none | 0|acc_norm |↑ | 0.2525|± |0.0310| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1282|± |0.0143| | | |strict-match | 0|exact_match|↑ | 0.0183|± |0.0057| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1410|± |0.0149| | | |strict-match | 0|exact_match|↑ | 0.0055|± |0.0032| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2985|± |0.0196| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2912|± |0.0195| | | |none | 0|acc_norm |↑ | 0.2912|± |0.0195| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2784|± |0.0192| | | |none | 0|acc_norm |↑ | 0.2784|± |0.0192| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1362|± |0.0162| | | |strict-match | 0|exact_match|↑ | 0.0134|± |0.0054| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1094|± |0.0148| | | |strict-match | 0|exact_match|↑ | 0.0045|± |0.0032| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2879|± |0.0214| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2879|± |0.0214| | | |none | 0|acc_norm |↑ | 0.2879|± |0.0214| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2545|± |0.0206| | | |none | 0|acc_norm |↑ | 0.2545|± |0.0206| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.2328|± |0.0116| | | |strict-match | 5|exact_match|↑ | 0.2297|± |0.0116| |hellaswag | 1|none | 0|acc |↑ | 0.6005|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7939|± |0.0040| |mmlu | 2|none | |acc |↑ | 0.5209|± |0.0040| | - humanities | 2|none | |acc |↑ | 0.4795|± |0.0069| | - formal_logic | 1|none | 0|acc |↑ | 0.2857|± |0.0404| | - high_school_european_history | 1|none | 0|acc |↑ | 0.6182|± |0.0379| | - high_school_us_history | 1|none | 0|acc |↑ | 0.6716|± |0.0330| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7089|± |0.0296| | - international_law | 1|none | 0|acc |↑ | 0.7190|± |0.0410| | - jurisprudence | 1|none | 0|acc |↑ | 0.6481|± |0.0462| | - logical_fallacies | 1|none | 0|acc |↑ | 0.6319|± |0.0379| | - moral_disputes | 1|none | 0|acc |↑ | 0.5318|± |0.0269| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2469|± |0.0144| | - philosophy | 1|none | 0|acc |↑ | 0.6431|± |0.0272| | - prehistory | 1|none | 0|acc |↑ | 0.6080|± |0.0272| | - professional_law | 1|none | 0|acc |↑ | 0.4048|± |0.0125| | - world_religions | 1|none | 0|acc |↑ | 
0.7602|± |0.0327| | - other | 2|none | |acc |↑ | 0.5935|± |0.0085| | - business_ethics | 1|none | 0|acc |↑ | 0.5100|± |0.0502| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.5887|± |0.0303| | - college_medicine | 1|none | 0|acc |↑ | 0.4971|± |0.0381| | - global_facts | 1|none | 0|acc |↑ | 0.3200|± |0.0469| | - human_aging | 1|none | 0|acc |↑ | 0.5650|± |0.0333| | - management | 1|none | 0|acc |↑ | 0.7379|± |0.0435| | - marketing | 1|none | 0|acc |↑ | 0.7564|± |0.0281| | - medical_genetics | 1|none | 0|acc |↑ | 0.5500|± |0.0500| | - miscellaneous | 1|none | 0|acc |↑ | 0.7229|± |0.0160| | - nutrition | 1|none | 0|acc |↑ | 0.6209|± |0.0278| | - professional_accounting | 1|none | 0|acc |↑ | 0.4043|± |0.0293| | - professional_medicine | 1|none | 0|acc |↑ | 0.5257|± |0.0303| | - virology | 1|none | 0|acc |↑ | 0.4337|± |0.0386| | - social sciences | 2|none | |acc |↑ | 0.6113|± |0.0085| | - econometrics | 1|none | 0|acc |↑ | 0.2281|± |0.0395| | - high_school_geography | 1|none | 0|acc |↑ | 0.6818|± |0.0332| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.7565|± |0.0310| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.4949|± |0.0253| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.5378|± |0.0324| | - high_school_psychology | 1|none | 0|acc |↑ | 0.7083|± |0.0195| | - human_sexuality | 1|none | 0|acc |↑ | 0.6641|± |0.0414| | - professional_psychology | 1|none | 0|acc |↑ | 0.5278|± |0.0202| | - public_relations | 1|none | 0|acc |↑ | 0.6091|± |0.0467| | - security_studies | 1|none | 0|acc |↑ | 0.6449|± |0.0306| | - sociology | 1|none | 0|acc |↑ | 0.7512|± |0.0306| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8100|± |0.0394| | - stem | 2|none | |acc |↑ | 0.4231|± |0.0085| | - abstract_algebra | 1|none | 0|acc |↑ | 0.2700|± |0.0446| | - anatomy | 1|none | 0|acc |↑ | 0.4815|± |0.0432| | - astronomy | 1|none | 0|acc |↑ | 0.5724|± |0.0403| | - college_biology | 1|none | 0|acc |↑ | 0.5208|± |0.0418| | - college_chemistry | 1|none | 0|acc |↑ | 0.4400|± |0.0499| | - college_computer_science | 1|none | 0|acc |↑ | 0.3600|± |0.0482| | - college_mathematics | 1|none | 0|acc |↑ | 0.3200|± |0.0469| | - college_physics | 1|none | 0|acc |↑ | 0.2451|± |0.0428| | - computer_security | 1|none | 0|acc |↑ | 0.6500|± |0.0479| | - conceptual_physics | 1|none | 0|acc |↑ | 0.4043|± |0.0321| | - electrical_engineering | 1|none | 0|acc |↑ | 0.5172|± |0.0416| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.3122|± |0.0239| | - high_school_biology | 1|none | 0|acc |↑ | 0.6516|± |0.0271| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.4680|± |0.0351| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.5300|± |0.0502| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.2556|± |0.0266| | - high_school_physics | 1|none | 0|acc |↑ | 0.3179|± |0.0380| | - high_school_statistics | 1|none | 0|acc |↑ | 0.4352|± |0.0338| | - machine_learning | 1|none | 0|acc |↑ | 0.2589|± |0.0416| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.2363|± |0.0071| |openbookqa | 1|none | 0|acc |↑ | 0.3520|± |0.0214| | | |none | 0|acc_norm |↑ | 0.4520|± |0.0223| |piqa | 1|none | 0|acc |↑ | 0.7900|± |0.0095| | | |none | 0|acc_norm |↑ | 0.8052|± |0.0092| |qnli | 1|none | 0|acc |↑ | 0.4953|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9460|± |0.0072| | | |none | 0|acc_norm |↑ | 0.9350|± |0.0078| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.6088|± |0.0036| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3011|± |0.0161| | | |none | 0|bleu_diff |↑ |-10.3037|± |0.8896| | | |none | 0|bleu_max |↑ | 29.5100|± |0.8236| | | |none | 
0|rouge1_acc |↑ | 0.3072|± |0.0162| | | |none | 0|rouge1_diff|↑ |-12.4090|± |0.8679| | | |none | 0|rouge1_max |↑ | 55.4793|± |0.8343| | | |none | 0|rouge2_acc |↑ | 0.2791|± |0.0157| | | |none | 0|rouge2_diff|↑ |-14.9613|± |1.1075| | | |none | 0|rouge2_max |↑ | 39.8908|± |1.0021| | | |none | 0|rougeL_acc |↑ | 0.2950|± |0.0160| | | |none | 0|rougeL_diff|↑ |-12.8909|± |0.8812| | | |none | 0|rougeL_max |↑ | 52.5536|± |0.8487| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2595|± |0.0153| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.3690|± |0.0136| |winogrande | 1|none | 0|acc |↑ | 0.7222|± |0.0126| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.4777|± |0.0054| |mmlu | 2|none | |acc |↑ |0.5209|± |0.0040| | - humanities | 2|none | |acc |↑ |0.4795|± |0.0069| | - other | 2|none | |acc |↑ |0.5935|± |0.0085| | - social sciences| 2|none | |acc |↑ |0.6113|± |0.0085| | - stem | 2|none | |acc |↑ |0.4231|± |0.0085| meta-llama_Llama-2-13b-hf: 19h 21m 36s ✅ Benchmark completed for meta-llama_Llama-2-13b-hf 🔥 Starting benchmark for Qwen_Qwen2-7B-Instruct Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 1 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/home/jaymin/Documents/llm/llm_models/Qwen_Qwen2-7B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.5730|± |0.0156| |anli_r2 | 1|none | 0|acc |↑ | 0.5250|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.5225|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.5085|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5401|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.5775|± |0.0054| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9360|± |0.0155| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.4492|± |0.0365| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.5480|± |0.0315| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.6400|± |0.0304| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0480|± |0.0135| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5440|± |0.0316| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3320|± |0.0298| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.9000|± |0.0190| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4160|± |0.0312| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.2560|± |0.0277| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.8240|± |0.0241| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.5680|± |0.0314| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.8320|± |0.0237| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.9000|± |0.0190| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.6280|± |0.0306| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.1918|± |0.0327| | - 
bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.4800|± |0.0317| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.4720|± |0.0316| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.4960|± |0.0317| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.6124|± |0.0366| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9080|± |0.0183| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.5680|± |0.0314| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.5160|± |0.0317| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.4560|± |0.0316| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.6840|± |0.0295| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9760|± |0.0097| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.2280|± |0.0266| |boolq | 2|none | 0|acc |↑ | 0.8563|± |0.0061| |drop | 3|none | 0|em |↑ | 0.0001|± |0.0001| | | |none | 0|f1 |↑ | 0.0520|± |0.0012| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1364|± |0.0245| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1364|± |0.0245| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2222|± |0.0296| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3737|± |0.0345| | | |none | 0|acc_norm |↑ | 0.3737|± |0.0345| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3182|± |0.0332| | | |none | 0|acc_norm |↑ | 0.3182|± |0.0332| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1703|± |0.0161| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1410|± |0.0149| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2015|± |0.0172| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3480|± |0.0204| | | |none | 0|acc_norm |↑ | 0.3480|± |0.0204| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3352|± |0.0202| | | |none | 0|acc_norm |↑ | 0.3352|± |0.0202| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1429|± |0.0166| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1384|± |0.0163| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1875|± |0.0185| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3125|± |0.0219| | | |none | 0|acc_norm |↑ | 0.3125|± |0.0219| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3147|± |0.0220| | | |none | 0|acc_norm |↑ | 0.3147|± |0.0220| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.7362|± |0.0121| | | |strict-match | 5|exact_match|↑ | 0.6467|± |0.0132| |hellaswag | 1|none | 0|acc |↑ | 0.6118|± |0.0049| | | |none | 0|acc_norm |↑ | 0.8060|± |0.0039| |mmlu | 2|none | |acc |↑ | 0.6994|± |0.0037| | - humanities | 2|none | |acc |↑ | 0.6338|± |0.0066| | - formal_logic | 1|none | 0|acc |↑ | 0.5079|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 
0.8061|± |0.0309| | - high_school_us_history | 1|none | 0|acc |↑ | 0.8725|± |0.0234| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8397|± |0.0239| | - international_law | 1|none | 0|acc |↑ | 0.8264|± |0.0346| | - jurisprudence | 1|none | 0|acc |↑ | 0.8519|± |0.0343| | - logical_fallacies | 1|none | 0|acc |↑ | 0.8037|± |0.0312| | - moral_disputes | 1|none | 0|acc |↑ | 0.7717|± |0.0226| | - moral_scenarios | 1|none | 0|acc |↑ | 0.4324|± |0.0166| | - philosophy | 1|none | 0|acc |↑ | 0.7814|± |0.0235| | - prehistory | 1|none | 0|acc |↑ | 0.7840|± |0.0229| | - professional_law | 1|none | 0|acc |↑ | 0.5163|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.8304|± |0.0288| | - other | 2|none | |acc |↑ | 0.7586|± |0.0074| | - business_ethics | 1|none | 0|acc |↑ | 0.7800|± |0.0416| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7849|± |0.0253| | - college_medicine | 1|none | 0|acc |↑ | 0.6879|± |0.0353| | - global_facts | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - human_aging | 1|none | 0|acc |↑ | 0.7489|± |0.0291| | - management | 1|none | 0|acc |↑ | 0.7961|± |0.0399| | - marketing | 1|none | 0|acc |↑ | 0.9017|± |0.0195| | - medical_genetics | 1|none | 0|acc |↑ | 0.8300|± |0.0378| | - miscellaneous | 1|none | 0|acc |↑ | 0.8570|± |0.0125| | - nutrition | 1|none | 0|acc |↑ | 0.7778|± |0.0238| | - professional_accounting | 1|none | 0|acc |↑ | 0.5887|± |0.0294| | - professional_medicine | 1|none | 0|acc |↑ | 0.7353|± |0.0268| | - virology | 1|none | 0|acc |↑ | 0.5241|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.8021|± |0.0071| | - econometrics | 1|none | 0|acc |↑ | 0.5877|± |0.0463| | - high_school_geography | 1|none | 0|acc |↑ | 0.8788|± |0.0233| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.9275|± |0.0187| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7692|± |0.0214| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8319|± |0.0243| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8642|± |0.0147| | - human_sexuality | 1|none | 0|acc |↑ | 0.7710|± |0.0369| | - professional_psychology | 1|none | 0|acc |↑ | 0.7418|± |0.0177| | - public_relations | 1|none | 0|acc |↑ | 0.7364|± |0.0422| | - security_studies | 1|none | 0|acc |↑ | 0.7388|± |0.0281| | - sociology | 1|none | 0|acc |↑ | 0.8756|± |0.0233| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8600|± |0.0349| | - stem | 2|none | |acc |↑ | 0.6388|± |0.0083| | - abstract_algebra | 1|none | 0|acc |↑ | 0.5200|± |0.0502| | - anatomy | 1|none | 0|acc |↑ | 0.6000|± |0.0423| | - astronomy | 1|none | 0|acc |↑ | 0.7697|± |0.0343| | - college_biology | 1|none | 0|acc |↑ | 0.7917|± |0.0340| | - college_chemistry | 1|none | 0|acc |↑ | 0.5100|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.6200|± |0.0488| | - college_mathematics | 1|none | 0|acc |↑ | 0.4000|± |0.0492| | - college_physics | 1|none | 0|acc |↑ | 0.4020|± |0.0488| | - computer_security | 1|none | 0|acc |↑ | 0.7200|± |0.0451| | - conceptual_physics | 1|none | 0|acc |↑ | 0.7064|± |0.0298| | - electrical_engineering | 1|none | 0|acc |↑ | 0.7103|± |0.0378| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.6376|± |0.0248| | - high_school_biology | 1|none | 0|acc |↑ | 0.8387|± |0.0209| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.6207|± |0.0341| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.7800|± |0.0416| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4963|± |0.0305| | - high_school_physics | 1|none | 0|acc |↑ | 0.5099|± |0.0408| | - high_school_statistics | 1|none | 0|acc |↑ | 0.6759|± |0.0319| | - machine_learning | 
1|none | 0|acc |↑ | 0.4732|± |0.0474| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0133|± |0.0019| |openbookqa | 1|none | 0|acc |↑ | 0.3460|± |0.0213| | | |none | 0|acc_norm |↑ | 0.4620|± |0.0223| |piqa | 1|none | 0|acc |↑ | 0.7954|± |0.0094| | | |none | 0|acc_norm |↑ | 0.8058|± |0.0092| |qnli | 1|none | 0|acc |↑ | 0.5471|± |0.0067| |sciq | 1|none | 0|acc |↑ | 0.9540|± |0.0066| | | |none | 0|acc_norm |↑ | 0.9160|± |0.0088| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.0081|± |0.0007| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4774|± |0.0175| | | |none | 0|bleu_diff |↑ | 4.0052|± |0.6796| | | |none | 0|bleu_max |↑ |19.4152|± |0.7487| | | |none | 0|rouge1_acc |↑ | 0.5043|± |0.0175| | | |none | 0|rouge1_diff|↑ | 5.0515|± |0.9714| | | |none | 0|rouge1_max |↑ |42.5509|± |0.9066| | | |none | 0|rouge2_acc |↑ | 0.4186|± |0.0173| | | |none | 0|rouge2_diff|↑ | 5.1321|± |1.0491| | | |none | 0|rouge2_max |↑ |29.4151|± |0.9889| | | |none | 0|rougeL_acc |↑ | 0.4908|± |0.0175| | | |none | 0|rougeL_diff|↑ | 5.0408|± |0.9758| | | |none | 0|rougeL_max |↑ |39.6681|± |0.9155| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.4051|± |0.0172| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5734|± |0.0154| |winogrande | 1|none | 0|acc |↑ | 0.6985|± |0.0129| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.5775|± |0.0054| |mmlu | 2|none | |acc |↑ |0.6994|± |0.0037| | - humanities | 2|none | |acc |↑ |0.6338|± |0.0066| | - other | 2|none | |acc |↑ |0.7586|± |0.0074| | - social sciences| 2|none | |acc |↑ |0.8021|± |0.0071| | - stem | 2|none | |acc |↑ |0.6388|± |0.0083| Qwen_Qwen2-7B-Instruct: 11h 30m 41s ✅ Benchmark completed for Qwen_Qwen2-7B-Instruct 🔥 Starting benchmark for deepseek-ai_DeepSeek-R1-0528-Qwen3-8B Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 1 Passed argument batch_size = auto. 
Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/home/jaymin/Documents/llm/llm_models/deepseek-ai_DeepSeek-R1-0528-Qwen3-8B), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.5110|± |0.0158| |anli_r2 | 1|none | 0|acc |↑ | 0.4640|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.4767|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.5137|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5495|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.5841|± |0.0052| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8920|± |0.0197| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5348|± |0.0366| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.5000|± |0.0317| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5520|± |0.0315| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.1240|± |0.0209| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.3280|± |0.0298| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3360|± |0.0299| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.7440|± |0.0277| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.0560|± |0.0146| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.7080|± |0.0288| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.6200|± |0.0308| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.9520|± |0.0135| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8240|± |0.0241| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.8640|± |0.0217| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.6918|± |0.0383| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.6000|± |0.0310| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.4880|± |0.0317| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5840|± |0.0312| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5506|± |0.0374| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.8680|± |0.0215| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.4960|± |0.0317| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.7600|± |0.0271| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.5240|± |0.0316| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.8400|± |0.0232| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9920|± |0.0056| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.3640|± |0.0305| |boolq | 2|none | 0|acc |↑ | 0.8483|± |0.0063| |drop | 3|none | 0|em |↑ | 0.0018|± |0.0004| | | |none | 0|f1 |↑ | 0.0533|± |0.0013| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0556|± |0.0163| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| 
|gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0556|± |0.0163| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1667|± |0.0266| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3434|± |0.0338| | | |none | 0|acc_norm |↑ | 0.3434|± |0.0338| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3283|± |0.0335| | | |none | 0|acc_norm |↑ | 0.3283|± |0.0335| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0806|± |0.0117| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0861|± |0.0120| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2106|± |0.0175| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3516|± |0.0205| | | |none | 0|acc_norm |↑ | 0.3516|± |0.0205| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3755|± |0.0207| | | |none | 0|acc_norm |↑ | 0.3755|± |0.0207| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0848|± |0.0132| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0737|± |0.0124| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2299|± |0.0199| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3683|± |0.0228| | | |none | 0|acc_norm |↑ | 0.3683|± |0.0228| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3728|± |0.0229| | | |none | 0|acc_norm |↑ | 0.3728|± |0.0229| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.8241|± |0.0105| | | |strict-match | 5|exact_match|↑ | 0.8127|± |0.0107| |hellaswag | 1|none | 0|acc |↑ | 0.5781|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7564|± |0.0043| |mmlu | 2|none | |acc |↑ | 0.6830|± |0.0037| | - humanities | 2|none | |acc |↑ | 0.5690|± |0.0066| | - formal_logic | 1|none | 0|acc |↑ | 0.6349|± |0.0431| | - high_school_european_history | 1|none | 0|acc |↑ | 0.8121|± |0.0305| | - high_school_us_history | 1|none | 0|acc |↑ | 0.7941|± |0.0284| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8354|± |0.0241| | - international_law | 1|none | 0|acc |↑ | 0.7521|± |0.0394| | - jurisprudence | 1|none | 0|acc |↑ | 0.7685|± |0.0408| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7607|± |0.0335| | - moral_disputes | 1|none | 0|acc |↑ | 0.7139|± |0.0243| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2793|± |0.0150| | - philosophy | 1|none | 0|acc |↑ | 0.7267|± |0.0253| | - prehistory | 1|none | 0|acc |↑ | 0.7747|± |0.0232| | - professional_law | 1|none | 0|acc |↑ | 0.4531|± |0.0127| | - world_religions | 1|none | 0|acc |↑ | 0.7953|± |0.0309| | - other | 2|none | |acc |↑ | 0.7399|± |0.0076| | - business_ethics | 1|none | 0|acc |↑ | 0.7400|± |0.0441| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7509|± |0.0266| | - college_medicine | 1|none | 0|acc |↑ | 0.7110|± |0.0346| | - global_facts | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - human_aging | 1|none | 0|acc |↑ | 0.7085|± |0.0305| | - management | 1|none | 0|acc |↑ | 0.8835|± |0.0318| | - marketing | 1|none | 0|acc |↑ | 0.8632|± |0.0225| | - medical_genetics | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - miscellaneous | 1|none | 0|acc |↑ | 0.8327|± |0.0133| | - nutrition | 1|none | 0|acc |↑ | 0.7614|± |0.0244| | - 
professional_accounting | 1|none | 0|acc |↑ | 0.5567|± |0.0296| | - professional_medicine | 1|none | 0|acc |↑ | 0.7610|± |0.0259| | - virology | 1|none | 0|acc |↑ | 0.4880|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.7927|± |0.0072| | - econometrics | 1|none | 0|acc |↑ | 0.5965|± |0.0462| | - high_school_geography | 1|none | 0|acc |↑ | 0.8384|± |0.0262| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.9016|± |0.0215| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7513|± |0.0219| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8739|± |0.0216| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8716|± |0.0143| | - human_sexuality | 1|none | 0|acc |↑ | 0.8397|± |0.0322| | - professional_psychology | 1|none | 0|acc |↑ | 0.7059|± |0.0184| | - public_relations | 1|none | 0|acc |↑ | 0.7182|± |0.0431| | - security_studies | 1|none | 0|acc |↑ | 0.7551|± |0.0275| | - sociology | 1|none | 0|acc |↑ | 0.8060|± |0.0280| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8700|± |0.0338| | - stem | 2|none | |acc |↑ | 0.6898|± |0.0079| | - abstract_algebra | 1|none | 0|acc |↑ | 0.4900|± |0.0502| | - anatomy | 1|none | 0|acc |↑ | 0.6815|± |0.0402| | - astronomy | 1|none | 0|acc |↑ | 0.8816|± |0.0263| | - college_biology | 1|none | 0|acc |↑ | 0.8681|± |0.0283| | - college_chemistry | 1|none | 0|acc |↑ | 0.5600|± |0.0499| | - college_computer_science | 1|none | 0|acc |↑ | 0.6100|± |0.0490| | - college_mathematics | 1|none | 0|acc |↑ | 0.4800|± |0.0502| | - college_physics | 1|none | 0|acc |↑ | 0.5588|± |0.0494| | - computer_security | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - conceptual_physics | 1|none | 0|acc |↑ | 0.8170|± |0.0253| | - electrical_engineering | 1|none | 0|acc |↑ | 0.7517|± |0.0360| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.6614|± |0.0244| | - high_school_biology | 1|none | 0|acc |↑ | 0.9000|± |0.0171| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.6700|± |0.0331| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.8200|± |0.0386| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4778|± |0.0305| | - high_school_physics | 1|none | 0|acc |↑ | 0.6424|± |0.0391| | - high_school_statistics | 1|none | 0|acc |↑ | 0.6852|± |0.0317| | - machine_learning | 1|none | 0|acc |↑ | 0.5000|± |0.0475| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0183|± |0.0022| |openbookqa | 1|none | 0|acc |↑ | 0.3080|± |0.0207| | | |none | 0|acc_norm |↑ | 0.4300|± |0.0222| |piqa | 1|none | 0|acc |↑ | 0.7633|± |0.0099| | | |none | 0|acc_norm |↑ | 0.7568|± |0.0100| |qnli | 1|none | 0|acc |↑ | 0.5578|± |0.0067| |sciq | 1|none | 0|acc |↑ | 0.9600|± |0.0062| | | |none | 0|acc_norm |↑ | 0.9410|± |0.0075| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.0295|± |0.0013| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.5398|± |0.0174| | | |none | 0|bleu_diff |↑ | 5.8931|± |0.7222| | | |none | 0|bleu_max |↑ |19.7647|± |0.7053| | | |none | 0|rouge1_acc |↑ | 0.5569|± |0.0174| | | |none | 0|rouge1_diff|↑ | 9.9292|± |1.0724| | | |none | 0|rouge1_max |↑ |45.0401|± |0.8645| | | |none | 0|rouge2_acc |↑ | 0.4627|± |0.0175| | | |none | 0|rouge2_diff|↑ | 9.8762|± |1.1402| | | |none | 0|rouge2_max |↑ |30.6518|± |0.9760| | | |none | 0|rougeL_acc |↑ | 0.5435|± |0.0174| | | |none | 0|rougeL_diff|↑ | 9.8078|± |1.0753| | | |none | 0|rougeL_max |↑ |41.9636|± |0.8847| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3574|± |0.0168| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5590|± |0.0152| |winogrande | 1|none | 0|acc |↑ | 0.6756|± |0.0132| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| 
|------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.5841|± |0.0052| |mmlu | 2|none | |acc |↑ |0.6830|± |0.0037| | - humanities | 2|none | |acc |↑ |0.5690|± |0.0066| | - other | 2|none | |acc |↑ |0.7399|± |0.0076| | - social sciences| 2|none | |acc |↑ |0.7927|± |0.0072| | - stem | 2|none | |acc |↑ |0.6898|± |0.0079| deepseek-ai_DeepSeek-R1-0528-Qwen3-8B: 17h 58m 4s ✅ Benchmark completed for deepseek-ai_DeepSeek-R1-0528-Qwen3-8B 🔥 Starting benchmark for 01-ai_Yi-1.5-9B-Chat 2025-07-27:12:03:18 INFO [loggers.evaluation_tracker:209] Saving results aggregated hf (pretrained=/home/jaymin/Documents/llm/llm_models/01-ai_Yi-1.5-9B-Chat), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 2 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.5350|± |0.0158| |anli_r2 | 1|none | 0|acc |↑ | 0.5090|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.5258|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.5572|± |0.0145| | | |none | 0|acc_norm |↑ | 0.5870|± |0.0144| |bbh | 3|get-answer | |exact_match|↑ | 0.6107|± |0.0053| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8960|± |0.0193| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5508|± |0.0365| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.7760|± |0.0264| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.3040|± |0.0292| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.3640|± |0.0305| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.0240|± |0.0097| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.5880|± |0.0312| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.6040|± |0.0310| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.5560|± |0.0315| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.3640|± |0.0305| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.9120|± |0.0180| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7560|± |0.0272| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.7600|± |0.0271| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.2680|± |0.0281| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.8600|± |0.0220| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.8356|± |0.0308| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.6640|± |0.0299| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.3760|± |0.0307| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5000|± |0.0317| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.6854|± |0.0349| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9160|± |0.0176| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.7160|± |0.0286| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.7280|± |0.0282| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.5800|± |0.0313| | - 
bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.6600|± |0.0300| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9720|± |0.0105| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.3720|± |0.0306| |boolq | 2|none | 0|acc |↑ | 0.8682|± |0.0059| |drop | 3|none | 0|em |↑ | 0.0149|± |0.0012| | | |none | 0|f1 |↑ | 0.1253|± |0.0021| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1313|± |0.0241| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1414|± |0.0248| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1667|± |0.0266| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3384|± |0.0337| | | |none | 0|acc_norm |↑ | 0.3384|± |0.0337| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3131|± |0.0330| | | |none | 0|acc_norm |↑ | 0.3131|± |0.0330| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1795|± |0.0164| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1777|± |0.0164| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1960|± |0.0170| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3187|± |0.0200| | | |none | 0|acc_norm |↑ | 0.3187|± |0.0200| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3315|± |0.0202| | | |none | 0|acc_norm |↑ | 0.3315|± |0.0202| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1942|± |0.0187| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1920|± |0.0186| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2098|± |0.0193| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3237|± |0.0221| | | |none | 0|acc_norm |↑ | 0.3237|± |0.0221| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3036|± |0.0217| | | |none | 0|acc_norm |↑ | 0.3036|± |0.0217| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.6732|± |0.0129| | | |strict-match | 5|exact_match|↑ | 0.7081|± |0.0125| |hellaswag | 1|none | 0|acc |↑ | 0.5964|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7873|± |0.0041| |mmlu | 2|none | |acc |↑ | 0.6841|± |0.0037| | - humanities | 2|none | |acc |↑ | 0.6172|± |0.0067| | - formal_logic | 1|none | 0|acc |↑ | 0.5556|± |0.0444| | - high_school_european_history | 1|none | 0|acc |↑ | 0.8303|± |0.0293| | - high_school_us_history | 1|none | 0|acc |↑ | 0.8676|± |0.0238| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8354|± |0.0241| | - international_law | 1|none | 0|acc |↑ | 0.8099|± |0.0358| | - jurisprudence | 1|none | 0|acc |↑ | 0.7778|± |0.0402| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7975|± |0.0316| | - moral_disputes | 1|none | 0|acc |↑ | 0.7283|± |0.0239| | - moral_scenarios | 1|none | 0|acc |↑ | 0.4547|± |0.0167| | - philosophy | 1|none | 0|acc |↑ | 0.7267|± |0.0253| | - prehistory | 1|none | 0|acc |↑ | 0.7191|± |0.0250| | - professional_law | 1|none | 0|acc |↑ | 0.4922|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.8012|± |0.0306| | - other | 2|none | |acc |↑ | 0.7300|± |0.0077| | - business_ethics | 1|none | 0|acc |↑ | 0.7500|± 
|0.0435| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7132|± |0.0278| | - college_medicine | 1|none | 0|acc |↑ | 0.6705|± |0.0358| | - global_facts | 1|none | 0|acc |↑ | 0.4200|± |0.0496| | - human_aging | 1|none | 0|acc |↑ | 0.7085|± |0.0305| | - management | 1|none | 0|acc |↑ | 0.8447|± |0.0359| | - marketing | 1|none | 0|acc |↑ | 0.8974|± |0.0199| | - medical_genetics | 1|none | 0|acc |↑ | 0.7600|± |0.0429| | - miscellaneous | 1|none | 0|acc |↑ | 0.8199|± |0.0137| | - nutrition | 1|none | 0|acc |↑ | 0.7418|± |0.0251| | - professional_accounting | 1|none | 0|acc |↑ | 0.5993|± |0.0292| | - professional_medicine | 1|none | 0|acc |↑ | 0.6801|± |0.0283| | - virology | 1|none | 0|acc |↑ | 0.5542|± |0.0387| | - social sciences | 2|none | |acc |↑ | 0.7813|± |0.0073| | - econometrics | 1|none | 0|acc |↑ | 0.6316|± |0.0454| | - high_school_geography | 1|none | 0|acc |↑ | 0.8333|± |0.0266| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8808|± |0.0234| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7846|± |0.0208| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8277|± |0.0245| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8716|± |0.0143| | - human_sexuality | 1|none | 0|acc |↑ | 0.6870|± |0.0407| | - professional_psychology | 1|none | 0|acc |↑ | 0.6977|± |0.0186| | - public_relations | 1|none | 0|acc |↑ | 0.6455|± |0.0458| | - security_studies | 1|none | 0|acc |↑ | 0.7469|± |0.0278| | - sociology | 1|none | 0|acc |↑ | 0.7910|± |0.0287| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8900|± |0.0314| | - stem | 2|none | |acc |↑ | 0.6438|± |0.0082| | - abstract_algebra | 1|none | 0|acc |↑ | 0.4500|± |0.0500| | - anatomy | 1|none | 0|acc |↑ | 0.6889|± |0.0400| | - astronomy | 1|none | 0|acc |↑ | 0.7500|± |0.0352| | - college_biology | 1|none | 0|acc |↑ | 0.7917|± |0.0340| | - college_chemistry | 1|none | 0|acc |↑ | 0.5100|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.6100|± |0.0490| | - college_mathematics | 1|none | 0|acc |↑ | 0.3900|± |0.0490| | - college_physics | 1|none | 0|acc |↑ | 0.4902|± |0.0497| | - computer_security | 1|none | 0|acc |↑ | 0.8400|± |0.0368| | - conceptual_physics | 1|none | 0|acc |↑ | 0.7191|± |0.0294| | - electrical_engineering | 1|none | 0|acc |↑ | 0.7172|± |0.0375| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.6270|± |0.0249| | - high_school_biology | 1|none | 0|acc |↑ | 0.8516|± |0.0202| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.6305|± |0.0340| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.8000|± |0.0402| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4407|± |0.0303| | - high_school_physics | 1|none | 0|acc |↑ | 0.4636|± |0.0407| | - high_school_statistics | 1|none | 0|acc |↑ | 0.6759|± |0.0319| | - machine_learning | 1|none | 0|acc |↑ | 0.5536|± |0.0472| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0094|± |0.0016| |openbookqa | 1|none | 0|acc |↑ | 0.3200|± |0.0209| | | |none | 0|acc_norm |↑ | 0.4360|± |0.0222| |piqa | 1|none | 0|acc |↑ | 0.7965|± |0.0094| | | |none | 0|acc_norm |↑ | 0.8036|± |0.0093| |qnli | 1|none | 0|acc |↑ | 0.7877|± |0.0055| |sciq | 1|none | 0|acc |↑ | 0.9590|± |0.0063| | | |none | 0|acc_norm |↑ | 0.9540|± |0.0066| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.3387|± |0.0035| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4541|± |0.0174| | | |none | 0|bleu_diff |↑ |-0.7696|± |0.5079| | | |none | 0|bleu_max |↑ |18.9123|± |0.6279| | | |none | 0|rouge1_acc |↑ | 0.4602|± |0.0174| | | |none | 0|rouge1_diff|↑ |-1.1341|± |0.6159| | | |none | 0|rouge1_max |↑ 
|44.4829|± |0.7546| | | |none | 0|rouge2_acc |↑ | 0.4027|± |0.0172| | | |none | 0|rouge2_diff|↑ |-1.7922|± |0.7369| | | |none | 0|rouge2_max |↑ |30.3176|± |0.8139| | | |none | 0|rougeL_acc |↑ | 0.4517|± |0.0174| | | |none | 0|rougeL_diff|↑ |-1.6275|± |0.6211| | | |none | 0|rougeL_max |↑ |40.9909|± |0.7553| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3745|± |0.0169| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5479|± |0.0159| |winogrande | 1|none | 0|acc |↑ | 0.7466|± |0.0122| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.6107|± |0.0053| |mmlu | 2|none | |acc |↑ |0.6841|± |0.0037| | - humanities | 2|none | |acc |↑ |0.6172|± |0.0067| | - other | 2|none | |acc |↑ |0.7300|± |0.0077| | - social sciences| 2|none | |acc |↑ |0.7813|± |0.0073| | - stem | 2|none | |acc |↑ |0.6438|± |0.0082| 01-ai_Yi-1.5-9B-Chat: 13h 54m 25s ✅ Benchmark completed for 01-ai_Yi-1.5-9B-Chat 🔥 Starting benchmark for 01-ai_Yi-1.5-6B-Chat fatal: not a git repository (or any of the parent directories): .git 2025-07-27:20:07:27 INFO [loggers.evaluation_tracker:209] Saving results aggregated hf (pretrained=/home/jaymin/Documents/llm/llm_models/01-ai_Yi-1.5-6B-Chat), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 2 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4770|± |0.0158| |anli_r2 | 1|none | 0|acc |↑ | 0.4530|± |0.0157| |anli_r3 | 1|none | 0|acc |↑ | 0.4600|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.5077|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5392|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.5478|± |0.0055| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9120|± |0.0180| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5080|± |0.0367| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.7040|± |0.0289| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5160|± |0.0317| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.1000|± |0.0190| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5760|± |0.0313| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.4760|± |0.0316| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.8040|± |0.0252| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4040|± |0.0311| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.3240|± |0.0297| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.7480|± |0.0275| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.5680|± |0.0314| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.7120|± |0.0287| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8800|± |0.0206| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.7280|± |0.0282| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.6096|± |0.0405| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.7000|± |0.0290| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.3880|± |0.0309| | - 
bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.4880|± |0.0317| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.4663|± |0.0375| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.8880|± |0.0200| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.4960|± |0.0317| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.3040|± |0.0292| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.3720|± |0.0306| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.1680|± |0.0237| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.7960|± |0.0255| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.1480|± |0.0225| |boolq | 2|none | 0|acc |↑ | 0.8474|± |0.0063| |drop | 3|none | 0|em |↑ | 0.0071|± |0.0009| | | |none | 0|f1 |↑ | 0.1161|± |0.0020| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1465|± |0.0252| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1212|± |0.0233| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2475|± |0.0307| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3333|± |0.0336| | | |none | 0|acc_norm |↑ | 0.3333|± |0.0336| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3182|± |0.0332| | | |none | 0|acc_norm |↑ | 0.3182|± |0.0332| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1392|± |0.0148| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1520|± |0.0154| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2198|± |0.0177| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3095|± |0.0198| | | |none | 0|acc_norm |↑ | 0.3095|± |0.0198| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3205|± |0.0200| | | |none | 0|acc_norm |↑ | 0.3205|± |0.0200| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1384|± |0.0163| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1540|± |0.0171| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1942|± |0.0187| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3125|± |0.0219| | | |none | 0|acc_norm |↑ | 0.3125|± |0.0219| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3571|± |0.0227| | | |none | 0|acc_norm |↑ | 0.3571|± |0.0227| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.6785|± |0.0129| | | |strict-match | 5|exact_match|↑ | 0.6702|± |0.0129| |hellaswag | 1|none | 0|acc |↑ | 0.5852|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7675|± |0.0042| |mmlu | 2|none | |acc |↑ | 0.6179|± |0.0039| | - humanities | 2|none | |acc |↑ | 0.5392|± |0.0068| | - formal_logic | 1|none | 0|acc |↑ | 0.4444|± |0.0444| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7697|± |0.0329| | - high_school_us_history | 1|none | 0|acc |↑ | 0.7696|± |0.0296| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7848|± |0.0268| | - international_law | 1|none | 
0|acc |↑ | 0.7355|± |0.0403| | - jurisprudence | 1|none | 0|acc |↑ | 0.7222|± |0.0433| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7853|± |0.0323| | - moral_disputes | 1|none | 0|acc |↑ | 0.6763|± |0.0252| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2469|± |0.0144| | - philosophy | 1|none | 0|acc |↑ | 0.6752|± |0.0266| | - prehistory | 1|none | 0|acc |↑ | 0.6512|± |0.0265| | - professional_law | 1|none | 0|acc |↑ | 0.4641|± |0.0127| | - world_religions | 1|none | 0|acc |↑ | 0.7485|± |0.0333| | - other | 2|none | |acc |↑ | 0.6794|± |0.0081| | - business_ethics | 1|none | 0|acc |↑ | 0.7400|± |0.0441| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7132|± |0.0278| | - college_medicine | 1|none | 0|acc |↑ | 0.6647|± |0.0360| | - global_facts | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - human_aging | 1|none | 0|acc |↑ | 0.6413|± |0.0322| | - management | 1|none | 0|acc |↑ | 0.8544|± |0.0349| | - marketing | 1|none | 0|acc |↑ | 0.8590|± |0.0228| | - medical_genetics | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - miscellaneous | 1|none | 0|acc |↑ | 0.7778|± |0.0149| | - nutrition | 1|none | 0|acc |↑ | 0.6732|± |0.0269| | - professional_accounting | 1|none | 0|acc |↑ | 0.4716|± |0.0298| | - professional_medicine | 1|none | 0|acc |↑ | 0.6066|± |0.0297| | - virology | 1|none | 0|acc |↑ | 0.4759|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.7221|± |0.0079| | - econometrics | 1|none | 0|acc |↑ | 0.5702|± |0.0466| | - high_school_geography | 1|none | 0|acc |↑ | 0.8232|± |0.0272| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8238|± |0.0275| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7308|± |0.0225| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8067|± |0.0256| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8073|± |0.0169| | - human_sexuality | 1|none | 0|acc |↑ | 0.6107|± |0.0428| | - professional_psychology | 1|none | 0|acc |↑ | 0.6046|± |0.0198| | - public_relations | 1|none | 0|acc |↑ | 0.6182|± |0.0465| | - security_studies | 1|none | 0|acc |↑ | 0.6449|± |0.0306| | - sociology | 1|none | 0|acc |↑ | 0.7960|± |0.0285| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8200|± |0.0386| | - stem | 2|none | |acc |↑ | 0.5728|± |0.0086| | - abstract_algebra | 1|none | 0|acc |↑ | 0.5100|± |0.0502| | - anatomy | 1|none | 0|acc |↑ | 0.5778|± |0.0427| | - astronomy | 1|none | 0|acc |↑ | 0.6711|± |0.0382| | - college_biology | 1|none | 0|acc |↑ | 0.7153|± |0.0377| | - college_chemistry | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.5100|± |0.0502| | - college_mathematics | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - college_physics | 1|none | 0|acc |↑ | 0.4510|± |0.0495| | - computer_security | 1|none | 0|acc |↑ | 0.7100|± |0.0456| | - conceptual_physics | 1|none | 0|acc |↑ | 0.6340|± |0.0315| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6345|± |0.0401| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.5212|± |0.0257| | - high_school_biology | 1|none | 0|acc |↑ | 0.7968|± |0.0229| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.5468|± |0.0350| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.7100|± |0.0456| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4259|± |0.0301| | - high_school_physics | 1|none | 0|acc |↑ | 0.3974|± |0.0400| | - high_school_statistics | 1|none | 0|acc |↑ | 0.5694|± |0.0338| | - machine_learning | 1|none | 0|acc |↑ | 0.4018|± |0.0465| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0271|± |0.0027| |openbookqa | 1|none | 0|acc |↑ | 0.3240|± |0.0210| | | |none | 0|acc_norm |↑ | 
0.4360|± |0.0222| |piqa | 1|none | 0|acc |↑ | 0.7835|± |0.0096| | | |none | 0|acc_norm |↑ | 0.7878|± |0.0095| |qnli | 1|none | 0|acc |↑ | 0.6795|± |0.0063| |sciq | 1|none | 0|acc |↑ | 0.9620|± |0.0060| | | |none | 0|acc_norm |↑ | 0.9340|± |0.0079| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.3310|± |0.0035| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4088|± |0.0172| | | |none | 0|bleu_diff |↑ |-1.4065|± |0.6448| | | |none | 0|bleu_max |↑ |22.2331|± |0.7119| | | |none | 0|rouge1_acc |↑ | 0.4125|± |0.0172| | | |none | 0|rouge1_diff|↑ |-1.8289|± |0.7973| | | |none | 0|rouge1_max |↑ |48.2058|± |0.7942| | | |none | 0|rouge2_acc |↑ | 0.3525|± |0.0167| | | |none | 0|rouge2_diff|↑ |-2.8970|± |0.9086| | | |none | 0|rouge2_max |↑ |33.2543|± |0.9010| | | |none | 0|rougeL_acc |↑ | 0.4039|± |0.0172| | | |none | 0|rougeL_diff|↑ |-1.9691|± |0.8088| | | |none | 0|rougeL_max |↑ |44.8699|± |0.8098| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3770|± |0.0170| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5344|± |0.0159| |winogrande | 1|none | 0|acc |↑ | 0.7096|± |0.0128| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.5478|± |0.0055| |mmlu | 2|none | |acc |↑ |0.6179|± |0.0039| | - humanities | 2|none | |acc |↑ |0.5392|± |0.0068| | - other | 2|none | |acc |↑ |0.6794|± |0.0081| | - social sciences| 2|none | |acc |↑ |0.7221|± |0.0079| | - stem | 2|none | |acc |↑ |0.5728|± |0.0086| 01-ai_Yi-1.5-6B-Chat: 8h 4m 9s ✅ Benchmark completed for 01-ai_Yi-1.5-6B-Chat 🔥 Starting benchmark for 01-ai_Yi-1.5-9B fatal: not a git repository (or any of the parent directories): .git 2025-07-28:07:51:08 INFO [loggers.evaluation_tracker:209] Saving results aggregated hf (pretrained=/home/jaymin/Documents/llm/llm_models/01-ai_Yi-1.5-9B), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 2 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.5320|± |0.0158| |anli_r2 | 1|none | 0|acc |↑ | 0.4800|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.4392|± |0.0143| |arc_challenge | 1|none | 0|acc |↑ | 0.5290|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5469|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.7120|± |0.0052| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8800|± |0.0206| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5668|± |0.0363| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.8240|± |0.0241| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.7440|± |0.0277| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.3320|± |0.0298| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5120|± |0.0317| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.5440|± |0.0316| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.8800|± |0.0206| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.5320|± |0.0316| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.4440|± |0.0315| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.8560|± |0.0222| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.8320|± |0.0237| | - 
bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.7120|± |0.0287| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.9240|± |0.0168| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.8800|± |0.0206| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.7123|± |0.0376| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.7280|± |0.0282| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.7560|± |0.0272| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5400|± |0.0316| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5618|± |0.0373| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9800|± |0.0089| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.8880|± |0.0200| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.7000|± |0.0290| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.6880|± |0.0294| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.7240|± |0.0283| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.4040|± |0.0311| |boolq | 2|none | 0|acc |↑ | 0.8581|± |0.0061| |drop | 3|none | 0|em |↑ | 0.4148|± |0.0050| | | |none | 0|f1 |↑ | 0.4457|± |0.0049| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1263|± |0.0237| | | |strict-match | 0|exact_match|↑ | 0.1010|± |0.0215| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1667|± |0.0266| | | |strict-match | 0|exact_match|↑ | 0.0556|± |0.0163| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3283|± |0.0335| | | |strict-match | 0|exact_match|↑ | 0.0202|± |0.0100| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3434|± |0.0338| | | |none | 0|acc_norm |↑ | 0.3434|± |0.0338| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3384|± |0.0337| | | |none | 0|acc_norm |↑ | 0.3384|± |0.0337| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1667|± |0.0160| | | |strict-match | 0|exact_match|↑ | 0.0916|± |0.0124| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1703|± |0.0161| | | |strict-match | 0|exact_match|↑ | 0.0440|± |0.0088| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2637|± |0.0189| | | |strict-match | 0|exact_match|↑ | 0.0183|± |0.0057| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3187|± |0.0200| | | |none | 0|acc_norm |↑ | 0.3187|± |0.0200| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3535|± |0.0205| | | |none | 0|acc_norm |↑ | 0.3535|± |0.0205| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1562|± |0.0172| | | |strict-match | 0|exact_match|↑ | 0.0893|± |0.0135| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1875|± |0.0185| | | |strict-match | 0|exact_match|↑ | 0.0402|± |0.0093| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2790|± |0.0212| | | |strict-match | 0|exact_match|↑ | 0.0201|± |0.0066| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3594|± |0.0227| | | |none | 0|acc_norm |↑ | 0.3594|± |0.0227| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2946|± |0.0216| | | |none | 0|acc_norm |↑ | 0.2946|± |0.0216| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.6558|± |0.0131| | 
| |strict-match | 5|exact_match|↑ | 0.6391|± |0.0132| |hellaswag | 1|none | 0|acc |↑ | 0.5922|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7789|± |0.0041| |mmlu | 2|none | |acc |↑ | 0.6893|± |0.0037| | - humanities | 2|none | |acc |↑ | 0.6142|± |0.0066| | - formal_logic | 1|none | 0|acc |↑ | 0.5873|± |0.0440| | - high_school_european_history | 1|none | 0|acc |↑ | 0.8242|± |0.0297| | - high_school_us_history | 1|none | 0|acc |↑ | 0.8186|± |0.0270| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8650|± |0.0222| | - international_law | 1|none | 0|acc |↑ | 0.8099|± |0.0358| | - jurisprudence | 1|none | 0|acc |↑ | 0.8056|± |0.0383| | - logical_fallacies | 1|none | 0|acc |↑ | 0.8221|± |0.0300| | - moral_disputes | 1|none | 0|acc |↑ | 0.7399|± |0.0236| | - moral_scenarios | 1|none | 0|acc |↑ | 0.3419|± |0.0159| | - philosophy | 1|none | 0|acc |↑ | 0.7814|± |0.0235| | - prehistory | 1|none | 0|acc |↑ | 0.7716|± |0.0234| | - professional_law | 1|none | 0|acc |↑ | 0.5156|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.8363|± |0.0284| | - other | 2|none | |acc |↑ | 0.7451|± |0.0075| | - business_ethics | 1|none | 0|acc |↑ | 0.7700|± |0.0423| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7358|± |0.0271| | - college_medicine | 1|none | 0|acc |↑ | 0.6821|± |0.0355| | - global_facts | 1|none | 0|acc |↑ | 0.3900|± |0.0490| | - human_aging | 1|none | 0|acc |↑ | 0.7489|± |0.0291| | - management | 1|none | 0|acc |↑ | 0.8155|± |0.0384| | - marketing | 1|none | 0|acc |↑ | 0.9231|± |0.0175| | - medical_genetics | 1|none | 0|acc |↑ | 0.7900|± |0.0409| | - miscellaneous | 1|none | 0|acc |↑ | 0.8506|± |0.0127| | - nutrition | 1|none | 0|acc |↑ | 0.7680|± |0.0242| | - professional_accounting | 1|none | 0|acc |↑ | 0.5603|± |0.0296| | - professional_medicine | 1|none | 0|acc |↑ | 0.7243|± |0.0271| | - virology | 1|none | 0|acc |↑ | 0.5060|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.7956|± |0.0071| | - econometrics | 1|none | 0|acc |↑ | 0.5526|± |0.0468| | - high_school_geography | 1|none | 0|acc |↑ | 0.8485|± |0.0255| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.9119|± |0.0205| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7538|± |0.0218| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8529|± |0.0230| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8716|± |0.0143| | - human_sexuality | 1|none | 0|acc |↑ | 0.7786|± |0.0364| | - professional_psychology | 1|none | 0|acc |↑ | 0.7288|± |0.0180| | - public_relations | 1|none | 0|acc |↑ | 0.6818|± |0.0446| | - security_studies | 1|none | 0|acc |↑ | 0.7633|± |0.0272| | - sociology | 1|none | 0|acc |↑ | 0.8308|± |0.0265| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.9200|± |0.0273| | - stem | 2|none | |acc |↑ | 0.6426|± |0.0082| | - abstract_algebra | 1|none | 0|acc |↑ | 0.5000|± |0.0503| | - anatomy | 1|none | 0|acc |↑ | 0.6889|± |0.0400| | - astronomy | 1|none | 0|acc |↑ | 0.7632|± |0.0346| | - college_biology | 1|none | 0|acc |↑ | 0.8264|± |0.0317| | - college_chemistry | 1|none | 0|acc |↑ | 0.5200|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.6900|± |0.0465| | - college_mathematics | 1|none | 0|acc |↑ | 0.5500|± |0.0500| | - college_physics | 1|none | 0|acc |↑ | 0.4706|± |0.0497| | - computer_security | 1|none | 0|acc |↑ | 0.7900|± |0.0409| | - conceptual_physics | 1|none | 0|acc |↑ | 0.7447|± |0.0285| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6552|± |0.0396| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.6270|± |0.0249| | - high_school_biology | 1|none | 0|acc |↑ | 0.8355|± |0.0211| 
| - high_school_chemistry | 1|none | 0|acc |↑ | 0.6601|± |0.0333| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.7800|± |0.0416| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4185|± |0.0301| | - high_school_physics | 1|none | 0|acc |↑ | 0.4503|± |0.0406| | - high_school_statistics | 1|none | 0|acc |↑ | 0.6019|± |0.0334| | - machine_learning | 1|none | 0|acc |↑ | 0.5000|± |0.0475| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1532|± |0.0060| |openbookqa | 1|none | 0|acc |↑ | 0.3580|± |0.0215| | | |none | 0|acc_norm |↑ | 0.4560|± |0.0223| |piqa | 1|none | 0|acc |↑ | 0.7943|± |0.0094| | | |none | 0|acc_norm |↑ | 0.8063|± |0.0092| |qnli | 1|none | 0|acc |↑ | 0.5087|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9580|± |0.0063| | | |none | 0|acc_norm |↑ | 0.9520|± |0.0068| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.5438|± |0.0037| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4296|± |0.0173| | | |none | 0|bleu_diff |↑ | 0.5808|± |0.8548| | | |none | 0|bleu_max |↑ |27.3910|± |0.8218| | | |none | 0|rouge1_acc |↑ | 0.4198|± |0.0173| | | |none | 0|rouge1_diff|↑ | 0.7303|± |1.0868| | | |none | 0|rouge1_max |↑ |52.5810|± |0.9006| | | |none | 0|rouge2_acc |↑ | 0.3635|± |0.0168| | | |none | 0|rouge2_diff|↑ |-0.1698|± |1.2219| | | |none | 0|rouge2_max |↑ |37.0078|± |1.0709| | | |none | 0|rougeL_acc |↑ | 0.4186|± |0.0173| | | |none | 0|rougeL_diff|↑ | 0.4753|± |1.0982| | | |none | 0|rougeL_max |↑ |49.7474|± |0.9202| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3219|± |0.0164| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4676|± |0.0149| |winogrande | 1|none | 0|acc |↑ | 0.7261|± |0.0125| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.7120|± |0.0052| |mmlu | 2|none | |acc |↑ |0.6893|± |0.0037| | - humanities | 2|none | |acc |↑ |0.6142|± |0.0066| | - other | 2|none | |acc |↑ |0.7451|± |0.0075| | - social sciences| 2|none | |acc |↑ |0.7956|± |0.0071| | - stem | 2|none | |acc |↑ |0.6426|± |0.0082| 01-ai_Yi-1.5-9B: 11h 43m 41s ✅ Benchmark completed for 01-ai_Yi-1.5-9B 🔥 Starting benchmark for 01-ai_Yi-1.5-6B, Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 8 Passed argument batch_size = auto. 
Detecting largest batch size Determined Largest batch size: 9 hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/01-ai_Yi-1.5-6B,), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (8) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4480|± |0.0157| |anli_r2 | 1|none | 0|acc |↑ | 0.4070|± |0.0155| |anli_r3 | 1|none | 0|acc |↑ | 0.4067|± |0.0142| |arc_challenge | 1|none | 0|acc |↑ | 0.4667|± |0.0146| | | |none | 0|acc_norm |↑ | 0.4966|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.5755|± |0.0055| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9200|± |0.0172| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5936|± |0.0360| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.7320|± |0.0281| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.6040|± |0.0310| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.1000|± |0.0190| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.4880|± |0.0317| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3720|± |0.0306| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.9040|± |0.0187| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4520|± |0.0315| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.3400|± |0.0300| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.7160|± |0.0286| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7600|± |0.0271| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.6280|± |0.0306| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8360|± |0.0235| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.6800|± |0.0296| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.5890|± |0.0409| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.5640|± |0.0314| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.4280|± |0.0314| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.4400|± |0.0315| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.6236|± |0.0364| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.8920|± |0.0197| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.5200|± |0.0317| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4320|± |0.0314| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.4040|± |0.0311| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.3880|± |0.0309| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9080|± |0.0183| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.2480|± |0.0274| |boolq | 2|none | 0|acc |↑ | 0.8015|± |0.0070| |drop | 3|none | 0|em |↑ | 0.3668|± |0.0049| | | |none | 0|f1 |↑ | 0.3995|± |0.0049| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0859|± |0.0200| | | |strict-match | 0|exact_match|↑ | 0.0404|± |0.0140| 
|gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1212|± |0.0233| | | |strict-match | 0|exact_match|↑ | 0.0354|± |0.0132| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2323|± |0.0301| | | |strict-match | 0|exact_match|↑ | 0.0051|± |0.0051| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3586|± |0.0342| | | |none | 0|acc_norm |↑ | 0.3586|± |0.0342| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3232|± |0.0333| | | |none | 0|acc_norm |↑ | 0.3232|± |0.0333| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1886|± |0.0168| | | |strict-match | 0|exact_match|↑ | 0.0714|± |0.0110| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2198|± |0.0177| | | |strict-match | 0|exact_match|↑ | 0.1062|± |0.0132| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2601|± |0.0188| | | |strict-match | 0|exact_match|↑ | 0.0055|± |0.0032| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3315|± |0.0202| | | |none | 0|acc_norm |↑ | 0.3315|± |0.0202| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2985|± |0.0196| | | |none | 0|acc_norm |↑ | 0.2985|± |0.0196| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1473|± |0.0168| | | |strict-match | 0|exact_match|↑ | 0.0714|± |0.0122| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1741|± |0.0179| | | |strict-match | 0|exact_match|↑ | 0.0871|± |0.0133| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2277|± |0.0198| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3147|± |0.0220| | | |none | 0|acc_norm |↑ | 0.3147|± |0.0220| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2902|± |0.0215| | | |none | 0|acc_norm |↑ | 0.2902|± |0.0215| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.5262|± |0.0138| | | |strict-match | 5|exact_match|↑ | 0.5224|± |0.0138| |hellaswag | 1|none | 0|acc |↑ | 0.5668|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7541|± |0.0043| |mmlu | 2|none | |acc |↑ | 0.6243|± |0.0038| | - humanities | 2|none | |acc |↑ | 0.5528|± |0.0067| | - formal_logic | 1|none | 0|acc |↑ | 0.4841|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7576|± |0.0335| | - high_school_us_history | 1|none | 0|acc |↑ | 0.7990|± |0.0281| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7806|± |0.0269| | - international_law | 1|none | 0|acc |↑ | 0.7851|± |0.0375| | - jurisprudence | 1|none | 0|acc |↑ | 0.7222|± |0.0433| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7423|± |0.0344| | - moral_disputes | 1|none | 0|acc |↑ | 0.6792|± |0.0251| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2425|± |0.0143| | - philosophy | 1|none | 0|acc |↑ | 0.7203|± |0.0255| | - prehistory | 1|none | 0|acc |↑ | 0.6883|± |0.0258| | - professional_law | 1|none | 0|acc |↑ | 0.4824|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.7836|± |0.0316| | - other | 2|none | |acc |↑ | 0.6849|± |0.0081| | - business_ethics | 1|none | 0|acc |↑ | 0.7000|± |0.0461| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.6717|± |0.0289| | - college_medicine | 1|none | 0|acc |↑ | 0.6358|± |0.0367| | - global_facts | 1|none | 0|acc |↑ | 0.3800|± |0.0488| | - human_aging | 1|none | 0|acc |↑ | 0.6502|± |0.0320| | - management | 1|none | 0|acc |↑ | 0.8058|± |0.0392| | - marketing | 1|none | 0|acc |↑ | 0.8675|± |0.0222| | - medical_genetics | 1|none | 0|acc |↑ | 0.6600|± |0.0476| | - miscellaneous | 1|none | 0|acc |↑ | 0.8008|± |0.0143| | - nutrition | 1|none | 0|acc |↑ | 0.6895|± |0.0265| | - 
professional_accounting | 1|none | 0|acc |↑ | 0.4965|± |0.0298| | - professional_medicine | 1|none | 0|acc |↑ | 0.6287|± |0.0293| | - virology | 1|none | 0|acc |↑ | 0.5181|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.7407|± |0.0077| | - econometrics | 1|none | 0|acc |↑ | 0.4386|± |0.0467| | - high_school_geography | 1|none | 0|acc |↑ | 0.7929|± |0.0289| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8601|± |0.0250| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7026|± |0.0232| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8025|± |0.0259| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8367|± |0.0158| | - human_sexuality | 1|none | 0|acc |↑ | 0.6794|± |0.0409| | - professional_psychology | 1|none | 0|acc |↑ | 0.6552|± |0.0192| | - public_relations | 1|none | 0|acc |↑ | 0.6545|± |0.0455| | - security_studies | 1|none | 0|acc |↑ | 0.7061|± |0.0292| | - sociology | 1|none | 0|acc |↑ | 0.8159|± |0.0274| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8600|± |0.0349| | - stem | 2|none | |acc |↑ | 0.5576|± |0.0085| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3500|± |0.0479| | - anatomy | 1|none | 0|acc |↑ | 0.5852|± |0.0426| | - astronomy | 1|none | 0|acc |↑ | 0.6382|± |0.0391| | - college_biology | 1|none | 0|acc |↑ | 0.7222|± |0.0375| | - college_chemistry | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.5400|± |0.0501| | - college_mathematics | 1|none | 0|acc |↑ | 0.4500|± |0.0500| | - college_physics | 1|none | 0|acc |↑ | 0.4020|± |0.0488| | - computer_security | 1|none | 0|acc |↑ | 0.7700|± |0.0423| | - conceptual_physics | 1|none | 0|acc |↑ | 0.6596|± |0.0310| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6414|± |0.0400| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.5159|± |0.0257| | - high_school_biology | 1|none | 0|acc |↑ | 0.7935|± |0.0230| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.5567|± |0.0350| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.7100|± |0.0456| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.3481|± |0.0290| | - high_school_physics | 1|none | 0|acc |↑ | 0.4238|± |0.0403| | - high_school_statistics | 1|none | 0|acc |↑ | 0.4583|± |0.0340| | - machine_learning | 1|none | 0|acc |↑ | 0.4375|± |0.0471| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1781|± |0.0064| |openbookqa | 1|none | 0|acc |↑ | 0.3200|± |0.0209| | | |none | 0|acc_norm |↑ | 0.4220|± |0.0221| |piqa | 1|none | 0|acc |↑ | 0.7856|± |0.0096| | | |none | 0|acc_norm |↑ | 0.8014|± |0.0093| |qnli | 1|none | 0|acc |↑ | 0.5986|± |0.0066| |sciq | 1|none | 0|acc |↑ | 0.9540|± |0.0066| | | |none | 0|acc_norm |↑ | 0.9410|± |0.0075| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.4952|± |0.0037| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.5129|± |0.0175| | | |none | 0|bleu_diff |↑ | 8.5287|± |1.1119| | | |none | 0|bleu_max |↑ |33.0037|± |0.8799| | | |none | 0|rouge1_acc |↑ | 0.4835|± |0.0175| | | |none | 0|rouge1_diff|↑ |12.2235|± |1.5196| | | |none | 0|rouge1_max |↑ |57.2896|± |1.0008| | | |none | 0|rouge2_acc |↑ | 0.4370|± |0.0174| | | |none | 0|rouge2_diff|↑ |12.2809|± |1.6323| | | |none | 0|rouge2_max |↑ |44.2123|± |1.2026| | | |none | 0|rougeL_acc |↑ | 0.4663|± |0.0175| | | |none | 0|rougeL_diff|↑ |12.0600|± |1.5359| | | |none | 0|rougeL_max |↑ |55.2797|± |1.0310| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2999|± |0.0160| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4408|± |0.0148| |winogrande | 1|none | 0|acc |↑ | 0.7206|± |0.0126| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| 
|------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.5755|± |0.0055| |mmlu | 2|none | |acc |↑ |0.6243|± |0.0038| | - humanities | 2|none | |acc |↑ |0.5528|± |0.0067| | - other | 2|none | |acc |↑ |0.6849|± |0.0081| | - social sciences| 2|none | |acc |↑ |0.7407|± |0.0077| | - stem | 2|none | |acc |↑ |0.5576|± |0.0085| 01-ai_Yi-1.5-6B,: 4h 28m 24s ✅ Benchmark completed for 01-ai_Yi-1.5-6B, 🔥 Starting benchmark for Qwen_Qwen2.5-7B-Instruct-1M, Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 1 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-7B-Instruct-1M,), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.5850|± |0.0156| |anli_r2 | 1|none | 0|acc |↑ | 0.5330|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.5567|± |0.0143| |arc_challenge | 1|none | 0|acc |↑ | 0.5503|± |0.0145| | | |none | 0|acc_norm |↑ | 0.5853|± |0.0144| |bbh | 3|get-answer | |exact_match|↑ | 0.2772|± |0.0043| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5668|± |0.0363| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.7200|± |0.0285| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0440|± |0.0130| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.0480|± |0.0135| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.1480|± |0.0225| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.0040|± |0.0040| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4240|± |0.0313| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.3320|± |0.0298| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.8440|± |0.0230| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.5440|± |0.0316| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.0040|± |0.0040| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.3280|± |0.0298| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.5616|± |0.0412| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.0120|± |0.0069| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.2920|± |0.0288| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5120|± |0.0317| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.7247|± |0.0336| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.8800|± |0.0206| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2200|± |0.0263| | - 
bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.0920|± |0.0183| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.3280|± |0.0298| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.1760|± |0.0241| |boolq | 2|none | 0|acc |↑ | 0.8526|± |0.0062| |drop | 3|none | 0|em |↑ | 0.0023|± |0.0005| | | |none | 0|f1 |↑ | 0.0570|± |0.0014| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1414|± |0.0248| | | |strict-match | 0|exact_match|↑ | 0.0101|± |0.0071| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1111|± |0.0224| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2374|± |0.0303| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3030|± |0.0327| | | |none | 0|acc_norm |↑ | 0.3030|± |0.0327| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2980|± |0.0326| | | |none | 0|acc_norm |↑ | 0.2980|± |0.0326| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1758|± |0.0163| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1190|± |0.0139| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2143|± |0.0176| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3388|± |0.0203| | | |none | 0|acc_norm |↑ | 0.3388|± |0.0203| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3388|± |0.0203| | | |none | 0|acc_norm |↑ | 0.3388|± |0.0203| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1719|± |0.0178| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1138|± |0.0150| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2299|± |0.0199| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3326|± |0.0223| | | |none | 0|acc_norm |↑ | 0.3326|± |0.0223| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3393|± |0.0224| | | |none | 0|acc_norm |↑ | 0.3393|± |0.0224| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.8234|± |0.0105| | | |strict-match | 5|exact_match|↑ | 0.7953|± |0.0111| |hellaswag | 1|none | 0|acc |↑ | 0.5987|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7900|± |0.0041| |mmlu | 2|none | |acc |↑ | 0.7166|± |0.0036| | - humanities | 2|none | |acc |↑ | 0.6361|± |0.0066| | - formal_logic | 1|none | 0|acc |↑ | 0.5000|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.8182|± |0.0301| | - high_school_us_history | 1|none | 0|acc |↑ | 0.8627|± |0.0242| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8650|± |0.0222| | - international_law | 1|none | 0|acc |↑ | 0.8347|± |0.0339| | - jurisprudence | 1|none | 0|acc |↑ | 0.7778|± |0.0402| | - logical_fallacies | 1|none | 0|acc |↑ | 0.8344|± |0.0292| | - moral_disputes | 1|none | 0|acc |↑ | 0.7832|± |0.0222| | - moral_scenarios | 1|none | 0|acc |↑ | 0.4201|± |0.0165| | - philosophy | 1|none | 0|acc |↑ | 0.7492|± |0.0246| | - prehistory | 1|none | 0|acc |↑ | 0.8086|± |0.0219| | - professional_law | 1|none | 0|acc |↑ | 0.5248|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 
0.8538|± |0.0271| | - other | 2|none | |acc |↑ | 0.7634|± |0.0074| | - business_ethics | 1|none | 0|acc |↑ | 0.7600|± |0.0429| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7736|± |0.0258| | - college_medicine | 1|none | 0|acc |↑ | 0.7052|± |0.0348| | - global_facts | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - human_aging | 1|none | 0|acc |↑ | 0.7220|± |0.0301| | - management | 1|none | 0|acc |↑ | 0.8641|± |0.0339| | - marketing | 1|none | 0|acc |↑ | 0.9188|± |0.0179| | - medical_genetics | 1|none | 0|acc |↑ | 0.8500|± |0.0359| | - miscellaneous | 1|none | 0|acc |↑ | 0.8493|± |0.0128| | - nutrition | 1|none | 0|acc |↑ | 0.7843|± |0.0236| | - professional_accounting | 1|none | 0|acc |↑ | 0.6064|± |0.0291| | - professional_medicine | 1|none | 0|acc |↑ | 0.7831|± |0.0250| | - virology | 1|none | 0|acc |↑ | 0.5000|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.8229|± |0.0068| | - econometrics | 1|none | 0|acc |↑ | 0.6754|± |0.0440| | - high_school_geography | 1|none | 0|acc |↑ | 0.8636|± |0.0245| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.9223|± |0.0193| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7821|± |0.0209| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8655|± |0.0222| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8936|± |0.0132| | - human_sexuality | 1|none | 0|acc |↑ | 0.8092|± |0.0345| | - professional_psychology | 1|none | 0|acc |↑ | 0.7598|± |0.0173| | - public_relations | 1|none | 0|acc |↑ | 0.7091|± |0.0435| | - security_studies | 1|none | 0|acc |↑ | 0.7796|± |0.0265| | - sociology | 1|none | 0|acc |↑ | 0.8955|± |0.0216| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8800|± |0.0327| | - stem | 2|none | |acc |↑ | 0.6870|± |0.0080| | - abstract_algebra | 1|none | 0|acc |↑ | 0.5600|± |0.0499| | - anatomy | 1|none | 0|acc |↑ | 0.6963|± |0.0397| | - astronomy | 1|none | 0|acc |↑ | 0.8158|± |0.0315| | - college_biology | 1|none | 0|acc |↑ | 0.8542|± |0.0295| | - college_chemistry | 1|none | 0|acc |↑ | 0.5500|± |0.0500| | - college_computer_science | 1|none | 0|acc |↑ | 0.7200|± |0.0451| | - college_mathematics | 1|none | 0|acc |↑ | 0.4400|± |0.0499| | - college_physics | 1|none | 0|acc |↑ | 0.5392|± |0.0496| | - computer_security | 1|none | 0|acc |↑ | 0.8000|± |0.0402| | - conceptual_physics | 1|none | 0|acc |↑ | 0.7277|± |0.0291| | - electrical_engineering | 1|none | 0|acc |↑ | 0.7172|± |0.0375| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.6534|± |0.0245| | - high_school_biology | 1|none | 0|acc |↑ | 0.8806|± |0.0184| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.6601|± |0.0333| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.8400|± |0.0368| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.5630|± |0.0302| | - high_school_physics | 1|none | 0|acc |↑ | 0.5695|± |0.0404| | - high_school_statistics | 1|none | 0|acc |↑ | 0.6898|± |0.0315| | - machine_learning | 1|none | 0|acc |↑ | 0.5625|± |0.0471| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1576|± |0.0061| |openbookqa | 1|none | 0|acc |↑ | 0.3580|± |0.0215| | | |none | 0|acc_norm |↑ | 0.4800|± |0.0224| |piqa | 1|none | 0|acc |↑ | 0.8009|± |0.0093| | | |none | 0|acc_norm |↑ | 0.8161|± |0.0090| |qnli | 1|none | 0|acc |↑ | 0.6782|± |0.0063| |sciq | 1|none | 0|acc |↑ | 0.9630|± |0.0060| | | |none | 0|acc_norm |↑ | 0.9500|± |0.0069| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.4205|± |0.0037| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4908|± |0.0175| | | |none | 0|bleu_diff |↑ | 0.0819|± |0.3040| | | |none | 0|bleu_max |↑ |10.5933|± |0.4922| | | |none | 
0|rouge1_acc |↑ | 0.5067|± |0.0175| | | |none | 0|rouge1_diff|↑ | 0.1716|± |0.4296| | | |none | 0|rouge1_max |↑ |31.0117|± |0.6908| | | |none | 0|rouge2_acc |↑ | 0.3953|± |0.0171| | | |none | 0|rouge2_diff|↑ |-0.6488|± |0.4941| | | |none | 0|rouge2_max |↑ |18.4106|± |0.6992| | | |none | 0|rougeL_acc |↑ | 0.4982|± |0.0175| | | |none | 0|rougeL_diff|↑ |-0.0253|± |0.4232| | | |none | 0|rougeL_max |↑ |27.9487|± |0.6871| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.4259|± |0.0173| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.6001|± |0.0154| |winogrande | 1|none | 0|acc |↑ | 0.7277|± |0.0125| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.2772|± |0.0043| |mmlu | 2|none | |acc |↑ |0.7166|± |0.0036| | - humanities | 2|none | |acc |↑ |0.6361|± |0.0066| | - other | 2|none | |acc |↑ |0.7634|± |0.0074| | - social sciences| 2|none | |acc |↑ |0.8229|± |0.0068| | - stem | 2|none | |acc |↑ |0.6870|± |0.0080| Qwen_Qwen2.5-7B-Instruct-1M,: 11h 17m 22s ✅ Benchmark completed for Qwen_Qwen2.5-7B-Instruct-1M, 🔥 Starting benchmark for Qwen_Qwen3-8B, Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 1 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen3-8B,), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (1) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.6690|± |0.0149| |anli_r2 | 1|none | 0|acc |↑ | 0.5420|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.5558|± |0.0143| |arc_challenge | 1|none | 0|acc |↑ | 0.5546|± |0.0145| | | |none | 0|acc_norm |↑ | 0.5623|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.7976|± |0.0045| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9800|± |0.0089| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5936|± |0.0360| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.8520|± |0.0225| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5960|± |0.0311| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.4440|± |0.0315| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.6800|± |0.0296| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.5920|± |0.0311| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.9480|± |0.0141| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.7160|± |0.0286| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.4080|± |0.0311| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.9640|± |0.0118| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7640|± |0.0269| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.9840|± |0.0080| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.9600|± |0.0124| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.9000|± |0.0190| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.8836|± |0.0266| | - bbh_cot_fewshot_reasoning_about_colored_objects | 
4|get-answer | 3|exact_match|↑ | 0.8640|± |0.0217| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.7960|± |0.0255| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5880|± |0.0312| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.7079|± |0.0342| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.8640|± |0.0217| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.9280|± |0.0164| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.9640|± |0.0118| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.8960|± |0.0193| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.9960|± |0.0040| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.6240|± |0.0307| |boolq | 2|none | 0|acc |↑ | 0.8657|± |0.0060| |drop | 3|none | 0|em |↑ | 0.0034|± |0.0006| | | |none | 0|f1 |↑ | 0.1099|± |0.0020| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1162|± |0.0228| | | |strict-match | 0|exact_match|↑ | 0.0051|± |0.0051| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0354|± |0.0132| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2828|± |0.0321| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3434|± |0.0338| | | |none | 0|acc_norm |↑ | 0.3434|± |0.0338| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3434|± |0.0338| | | |none | 0|acc_norm |↑ | 0.3434|± |0.0338| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1227|± |0.0141| | | |strict-match | 0|exact_match|↑ | 0.0073|± |0.0037| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0733|± |0.0112| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2692|± |0.0190| | | |strict-match | 0|exact_match|↑ | 0.0037|± |0.0026| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3608|± |0.0206| | | |none | 0|acc_norm |↑ | 0.3608|± |0.0206| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3828|± |0.0208| | | |none | 0|acc_norm |↑ | 0.3828|± |0.0208| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1295|± |0.0159| | | |strict-match | 0|exact_match|↑ | 0.0134|± |0.0054| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0647|± |0.0116| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2634|± |0.0208| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3884|± |0.0231| | | |none | 0|acc_norm |↑ | 0.3884|± |0.0231| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3504|± |0.0226| | | |none | 0|acc_norm |↑ | 0.3504|± |0.0226| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.8787|± |0.0090| | | |strict-match | 5|exact_match|↑ | 0.8726|± |0.0092| |hellaswag | 1|none | 0|acc |↑ | 0.5711|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7487|± |0.0043| |mmlu | 2|none | |acc |↑ | 0.7290|± |0.0035| | - humanities | 2|none | |acc |↑ | 0.6383|± |0.0065| | - formal_logic | 1|none | 0|acc |↑ | 0.6032|± |0.0438| | - high_school_european_history | 1|none | 0|acc |↑ | 0.8788|± |0.0255| | - high_school_us_history | 
1|none | 0|acc |↑ | 0.8824|± |0.0226| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8734|± |0.0216| | - international_law | 1|none | 0|acc |↑ | 0.8182|± |0.0352| | - jurisprudence | 1|none | 0|acc |↑ | 0.8056|± |0.0383| | - logical_fallacies | 1|none | 0|acc |↑ | 0.8405|± |0.0288| | - moral_disputes | 1|none | 0|acc |↑ | 0.7399|± |0.0236| | - moral_scenarios | 1|none | 0|acc |↑ | 0.4101|± |0.0164| | - philosophy | 1|none | 0|acc |↑ | 0.7878|± |0.0232| | - prehistory | 1|none | 0|acc |↑ | 0.8395|± |0.0204| | - professional_law | 1|none | 0|acc |↑ | 0.5111|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.8655|± |0.0262| | - other | 2|none | |acc |↑ | 0.7702|± |0.0072| | - business_ethics | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7925|± |0.0250| | - college_medicine | 1|none | 0|acc |↑ | 0.7572|± |0.0327| | - global_facts | 1|none | 0|acc |↑ | 0.4100|± |0.0494| | - human_aging | 1|none | 0|acc |↑ | 0.7309|± |0.0298| | - management | 1|none | 0|acc |↑ | 0.8835|± |0.0318| | - marketing | 1|none | 0|acc |↑ | 0.9274|± |0.0170| | - medical_genetics | 1|none | 0|acc |↑ | 0.8100|± |0.0394| | - miscellaneous | 1|none | 0|acc |↑ | 0.8557|± |0.0126| | - nutrition | 1|none | 0|acc |↑ | 0.7810|± |0.0237| | - professional_accounting | 1|none | 0|acc |↑ | 0.5745|± |0.0295| | - professional_medicine | 1|none | 0|acc |↑ | 0.8199|± |0.0233| | - virology | 1|none | 0|acc |↑ | 0.5422|± |0.0388| | - social sciences | 2|none | |acc |↑ | 0.8294|± |0.0067| | - econometrics | 1|none | 0|acc |↑ | 0.6754|± |0.0440| | - high_school_geography | 1|none | 0|acc |↑ | 0.8535|± |0.0252| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.9326|± |0.0181| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7949|± |0.0205| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.9160|± |0.0180| | - high_school_psychology | 1|none | 0|acc |↑ | 0.9064|± |0.0125| | - human_sexuality | 1|none | 0|acc |↑ | 0.8473|± |0.0315| | - professional_psychology | 1|none | 0|acc |↑ | 0.7533|± |0.0174| | - public_relations | 1|none | 0|acc |↑ | 0.7091|± |0.0435| | - security_studies | 1|none | 0|acc |↑ | 0.7755|± |0.0267| | - sociology | 1|none | 0|acc |↑ | 0.8856|± |0.0225| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8600|± |0.0349| | - stem | 2|none | |acc |↑ | 0.7257|± |0.0077| | - abstract_algebra | 1|none | 0|acc |↑ | 0.5700|± |0.0498| | - anatomy | 1|none | 0|acc |↑ | 0.7037|± |0.0394| | - astronomy | 1|none | 0|acc |↑ | 0.8684|± |0.0275| | - college_biology | 1|none | 0|acc |↑ | 0.8542|± |0.0295| | - college_chemistry | 1|none | 0|acc |↑ | 0.5900|± |0.0494| | - college_computer_science | 1|none | 0|acc |↑ | 0.7300|± |0.0446| | - college_mathematics | 1|none | 0|acc |↑ | 0.5900|± |0.0494| | - college_physics | 1|none | 0|acc |↑ | 0.5686|± |0.0493| | - computer_security | 1|none | 0|acc |↑ | 0.8200|± |0.0386| | - conceptual_physics | 1|none | 0|acc |↑ | 0.8255|± |0.0248| | - electrical_engineering | 1|none | 0|acc |↑ | 0.7310|± |0.0370| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.7011|± |0.0236| | - high_school_biology | 1|none | 0|acc |↑ | 0.9129|± |0.0160| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.7241|± |0.0314| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.8800|± |0.0327| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.5074|± |0.0305| | - high_school_physics | 1|none | 0|acc |↑ | 0.7020|± |0.0373| | - high_school_statistics | 1|none | 0|acc |↑ | 0.7315|± |0.0302| | - machine_learning | 1|none | 0|acc |↑ | 0.5893|± |0.0467| |nq_open | 
4|remove_whitespace| 0|exact_match|↑ | 0.0737|± |0.0043| |openbookqa | 1|none | 0|acc |↑ | 0.3160|± |0.0208| | | |none | 0|acc_norm |↑ | 0.4180|± |0.0221| |piqa | 1|none | 0|acc |↑ | 0.7644|± |0.0099| | | |none | 0|acc_norm |↑ | 0.7753|± |0.0097| |qnli | 1|none | 0|acc |↑ | 0.7818|± |0.0056| |sciq | 1|none | 0|acc |↑ | 0.9670|± |0.0057| | | |none | 0|acc_norm |↑ | 0.9580|± |0.0063| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.3206|± |0.0035| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.6022|± |0.0171| | | |none | 0|bleu_diff |↑ |16.3978|± |1.0966| | | |none | 0|bleu_max |↑ |35.5543|± |0.8937| | | |none | 0|rouge1_acc |↑ | 0.6083|± |0.0171| | | |none | 0|rouge1_diff|↑ |23.2877|± |1.5315| | | |none | 0|rouge1_max |↑ |62.0759|± |0.9426| | | |none | 0|rouge2_acc |↑ | 0.5704|± |0.0173| | | |none | 0|rouge2_diff|↑ |23.9536|± |1.6740| | | |none | 0|rouge2_max |↑ |50.5507|± |1.1698| | | |none | 0|rougeL_acc |↑ | 0.6120|± |0.0171| | | |none | 0|rougeL_diff|↑ |23.4988|± |1.5413| | | |none | 0|rougeL_max |↑ |59.9677|± |0.9917| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3635|± |0.0168| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5431|± |0.0158| |winogrande | 1|none | 0|acc |↑ | 0.6803|± |0.0131| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.7976|± |0.0045| |mmlu | 2|none | |acc |↑ |0.7290|± |0.0035| | - humanities | 2|none | |acc |↑ |0.6383|± |0.0065| | - other | 2|none | |acc |↑ |0.7702|± |0.0072| | - social sciences| 2|none | |acc |↑ |0.8294|± |0.0067| | - stem | 2|none | |acc |↑ |0.7257|± |0.0077| Qwen_Qwen3-8B,: 15h 32m 7s ✅ Benchmark completed for Qwen_Qwen3-8B, 🔥 Starting benchmark for Qwen_Qwen3-8B-FP8, Passed argument batch_size = auto:1. Detecting largest batch size Qwen_Qwen3-8B-FP8,: 0h 5m 29s ✅ Benchmark completed for Qwen_Qwen3-8B-FP8, 🔥 Starting benchmark for Qwen_Qwen2.5-Math-7B-Instruct, Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 4 Passed argument batch_size = auto. 
Detecting largest batch size Determined Largest batch size: 4 hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-Math-7B-Instruct,), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (4) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4310|± |0.0157| |anli_r2 | 1|none | 0|acc |↑ | 0.4150|± |0.0156| |anli_r3 | 1|none | 0|acc |↑ | 0.4292|± |0.0143| |arc_challenge | 1|none | 0|acc |↑ | 0.4061|± |0.0144| | | |none | 0|acc_norm |↑ | 0.4309|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.6140|± |0.0051| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9040|± |0.0187| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.4545|± |0.0365| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.6680|± |0.0298| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5800|± |0.0313| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0480|± |0.0135| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5840|± |0.0312| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.4480|± |0.0315| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.6520|± |0.0302| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.5440|± |0.0316| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.3960|± |0.0310| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.9160|± |0.0176| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.5360|± |0.0316| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.9480|± |0.0141| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.9320|± |0.0160| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.7960|± |0.0255| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.7671|± |0.0351| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.8200|± |0.0243| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.3800|± |0.0308| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.3240|± |0.0297| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5674|± |0.0372| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.6360|± |0.0305| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.2560|± |0.0277| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.7640|± |0.0269| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.6600|± |0.0300| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.9280|± |0.0164| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9960|± |0.0040| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0840|± |0.0176| |boolq | 2|none | 0|acc |↑ | 0.6061|± |0.0085| |drop | 3|none | 0|em |↑ | 0.0001|± |0.0001| | | |none | 0|f1 |↑ | 0.0273|± |0.0008| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1010|± |0.0215| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| 
|gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0909|± |0.0205| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0960|± |0.0210| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3182|± |0.0332| | | |none | 0|acc_norm |↑ | 0.3182|± |0.0332| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3182|± |0.0332| | | |none | 0|acc_norm |↑ | 0.3182|± |0.0332| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1154|± |0.0137| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1227|± |0.0141| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1136|± |0.0136| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3278|± |0.0201| | | |none | 0|acc_norm |↑ | 0.3278|± |0.0201| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3498|± |0.0204| | | |none | 0|acc_norm |↑ | 0.3498|± |0.0204| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1116|± |0.0149| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1205|± |0.0154| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1295|± |0.0159| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3147|± |0.0220| | | |none | 0|acc_norm |↑ | 0.3147|± |0.0220| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2879|± |0.0214| | | |none | 0|acc_norm |↑ | 0.2879|± |0.0214| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.8931|± |0.0085| | | |strict-match | 5|exact_match|↑ | 0.8901|± |0.0086| |hellaswag | 1|none | 0|acc |↑ | 0.4395|± |0.0050| | | |none | 0|acc_norm |↑ | 0.5881|± |0.0049| |mmlu | 2|none | |acc |↑ | 0.5372|± |0.0041| | - humanities | 2|none | |acc |↑ | 0.4389|± |0.0070| | - formal_logic | 1|none | 0|acc |↑ | 0.4921|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.5818|± |0.0385| | - high_school_us_history | 1|none | 0|acc |↑ | 0.5343|± |0.0350| | - high_school_world_history | 1|none | 0|acc |↑ | 0.6160|± |0.0317| | - international_law | 1|none | 0|acc |↑ | 0.6529|± |0.0435| | - jurisprudence | 1|none | 0|acc |↑ | 0.5833|± |0.0477| | - logical_fallacies | 1|none | 0|acc |↑ | 0.6196|± |0.0381| | - moral_disputes | 1|none | 0|acc |↑ | 0.5318|± |0.0269| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2771|± |0.0150| | - philosophy | 1|none | 0|acc |↑ | 0.5563|± |0.0282| | - prehistory | 1|none | 0|acc |↑ | 0.5031|± |0.0278| | - professional_law | 1|none | 0|acc |↑ | 0.3677|± |0.0123| | - world_religions | 1|none | 0|acc |↑ | 0.4503|± |0.0382| | - other | 2|none | |acc |↑ | 0.5340|± |0.0087| | - business_ethics | 1|none | 0|acc |↑ | 0.6000|± |0.0492| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.5132|± |0.0308| | - college_medicine | 1|none | 0|acc |↑ | 0.4913|± |0.0381| | - global_facts | 1|none | 0|acc |↑ | 0.2400|± |0.0429| | - human_aging | 1|none | 0|acc |↑ | 0.5471|± |0.0334| | - management | 1|none | 0|acc |↑ | 0.7379|± |0.0435| | - marketing | 1|none | 0|acc |↑ | 0.7607|± |0.0280| | - medical_genetics | 1|none | 0|acc |↑ | 0.5000|± |0.0503| | - miscellaneous | 1|none | 0|acc |↑ | 0.6054|± |0.0175| | - nutrition | 1|none | 0|acc |↑ | 0.5065|± |0.0286| | - 
professional_accounting | 1|none | 0|acc |↑ | 0.4113|± |0.0294| | - professional_medicine | 1|none | 0|acc |↑ | 0.3934|± |0.0297| | - virology | 1|none | 0|acc |↑ | 0.4578|± |0.0388| | - social sciences | 2|none | |acc |↑ | 0.6233|± |0.0086| | - econometrics | 1|none | 0|acc |↑ | 0.5263|± |0.0470| | - high_school_geography | 1|none | 0|acc |↑ | 0.6111|± |0.0347| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.6114|± |0.0352| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.6410|± |0.0243| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.7479|± |0.0282| | - high_school_psychology | 1|none | 0|acc |↑ | 0.7193|± |0.0193| | - human_sexuality | 1|none | 0|acc |↑ | 0.4962|± |0.0439| | - professional_psychology | 1|none | 0|acc |↑ | 0.5049|± |0.0202| | - public_relations | 1|none | 0|acc |↑ | 0.5818|± |0.0472| | - security_studies | 1|none | 0|acc |↑ | 0.6286|± |0.0309| | - sociology | 1|none | 0|acc |↑ | 0.6866|± |0.0328| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.6900|± |0.0465| | - stem | 2|none | |acc |↑ | 0.6032|± |0.0086| | - abstract_algebra | 1|none | 0|acc |↑ | 0.4800|± |0.0502| | - anatomy | 1|none | 0|acc |↑ | 0.4519|± |0.0430| | - astronomy | 1|none | 0|acc |↑ | 0.6382|± |0.0391| | - college_biology | 1|none | 0|acc |↑ | 0.5000|± |0.0418| | - college_chemistry | 1|none | 0|acc |↑ | 0.5200|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.5700|± |0.0498| | - college_mathematics | 1|none | 0|acc |↑ | 0.4900|± |0.0502| | - college_physics | 1|none | 0|acc |↑ | 0.4412|± |0.0494| | - computer_security | 1|none | 0|acc |↑ | 0.6500|± |0.0479| | - conceptual_physics | 1|none | 0|acc |↑ | 0.7234|± |0.0292| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6276|± |0.0403| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.6958|± |0.0237| | - high_school_biology | 1|none | 0|acc |↑ | 0.6903|± |0.0263| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.6059|± |0.0344| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.6700|± |0.0473| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.5444|± |0.0304| | - high_school_physics | 1|none | 0|acc |↑ | 0.5497|± |0.0406| | - high_school_statistics | 1|none | 0|acc |↑ | 0.6852|± |0.0317| | - machine_learning | 1|none | 0|acc |↑ | 0.4464|± |0.0472| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0199|± |0.0023| |openbookqa | 1|none | 0|acc |↑ | 0.2380|± |0.0191| | | |none | 0|acc_norm |↑ | 0.3340|± |0.0211| |piqa | 1|none | 0|acc |↑ | 0.6850|± |0.0108| | | |none | 0|acc_norm |↑ | 0.6855|± |0.0108| |qnli | 1|none | 0|acc |↑ | 0.6775|± |0.0063| |sciq | 1|none | 0|acc |↑ | 0.9110|± |0.0090| | | |none | 0|acc_norm |↑ | 0.8580|± |0.0110| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.0075|± |0.0006| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3978|± |0.0171| | | |none | 0|bleu_diff |↑ | 0.7878|± |0.5709| | | |none | 0|bleu_max |↑ |17.7316|± |0.6218| | | |none | 0|rouge1_acc |↑ | 0.4211|± |0.0173| | | |none | 0|rouge1_diff|↑ | 1.8283|± |0.8791| | | |none | 0|rouge1_max |↑ |42.6331|± |0.8380| | | |none | 0|rouge2_acc |↑ | 0.3537|± |0.0167| | | |none | 0|rouge2_diff|↑ | 1.4478|± |0.9688| | | |none | 0|rouge2_max |↑ |29.6721|± |0.9118| | | |none | 0|rougeL_acc |↑ | 0.4039|± |0.0172| | | |none | 0|rougeL_diff|↑ | 1.5558|± |0.8801| | | |none | 0|rougeL_max |↑ |40.0785|± |0.8407| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2987|± |0.0160| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4750|± |0.0160| |winogrande | 1|none | 0|acc |↑ | 0.5793|± |0.0139| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| 
|------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.6140|± |0.0051| |mmlu | 2|none | |acc |↑ |0.5372|± |0.0041| | - humanities | 2|none | |acc |↑ |0.4389|± |0.0070| | - other | 2|none | |acc |↑ |0.5340|± |0.0087| | - social sciences| 2|none | |acc |↑ |0.6233|± |0.0086| | - stem | 2|none | |acc |↑ |0.6032|± |0.0086| Qwen_Qwen2.5-Math-7B-Instruct,: 5h 37m 20s ✅ Benchmark completed for Qwen_Qwen2.5-Math-7B-Instruct, 🔥 Starting benchmark for Qwen_Qwen2.5-Math-7B, Passed argument batch_size = auto:1. Detecting largest batch size Determined largest batch size: 4 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 4 hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-Math-7B,), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto (4) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.3870|± |0.0154| |anli_r2 | 1|none | 0|acc |↑ | 0.4070|± |0.0155| |anli_r3 | 1|none | 0|acc |↑ | 0.3825|± |0.0140| |arc_challenge | 1|none | 0|acc |↑ | 0.4855|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5026|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.6724|± |0.0050| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9120|± |0.0180| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5027|± |0.0367| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.7640|± |0.0269| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5080|± |0.0317| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.1680|± |0.0237| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5520|± |0.0315| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.4440|± |0.0315| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.9080|± |0.0183| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.5640|± |0.0314| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.3800|± |0.0308| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.9080|± |0.0183| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.6160|± |0.0308| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.9480|± |0.0141| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.9600|± |0.0124| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.8680|± |0.0215| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.7808|± |0.0344| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.8840|± |0.0203| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.5040|± |0.0317| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.3480|± |0.0302| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5506|± |0.0374| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.7400|± |0.0278| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.3080|± |0.0293| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.8960|± |0.0193| | - 
bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.8520|± |0.0225| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.9600|± |0.0124| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.2960|± |0.0289| |boolq | 2|none | 0|acc |↑ | 0.7456|± |0.0076| |drop | 3|none | 0|em |↑ | 0.0012|± |0.0003| | | |none | 0|f1 |↑ | 0.0432|± |0.0011| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2222|± |0.0296| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2121|± |0.0291| | | |strict-match | 0|exact_match|↑ | 0.0051|± |0.0051| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2778|± |0.0319| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3333|± |0.0336| | | |none | 0|acc_norm |↑ | 0.3333|± |0.0336| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3182|± |0.0332| | | |none | 0|acc_norm |↑ | 0.3182|± |0.0332| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2161|± |0.0176| | | |strict-match | 0|exact_match|↑ | 0.0055|± |0.0032| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2692|± |0.0190| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3168|± |0.0199| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3040|± |0.0197| | | |none | 0|acc_norm |↑ | 0.3040|± |0.0197| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3297|± |0.0201| | | |none | 0|acc_norm |↑ | 0.3297|± |0.0201| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2188|± |0.0196| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2455|± |0.0204| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3237|± |0.0221| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3036|± |0.0217| | | |none | 0|acc_norm |↑ | 0.3036|± |0.0217| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3080|± |0.0218| | | |none | 0|acc_norm |↑ | 0.3080|± |0.0218| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.8491|± |0.0099| | | |strict-match | 5|exact_match|↑ | 0.8476|± |0.0099| |hellaswag | 1|none | 0|acc |↑ | 0.4907|± |0.0050| | | |none | 0|acc_norm |↑ | 0.6529|± |0.0048| |mmlu | 2|none | |acc |↑ | 0.5799|± |0.0040| | - humanities | 2|none | |acc |↑ | 0.4742|± |0.0069| | - formal_logic | 1|none | 0|acc |↑ | 0.5159|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.6606|± |0.0370| | - high_school_us_history | 1|none | 0|acc |↑ | 0.6225|± |0.0340| | - high_school_world_history | 1|none | 0|acc |↑ | 0.6751|± |0.0305| | - international_law | 1|none | 0|acc |↑ | 0.6942|± |0.0421| | - jurisprudence | 1|none | 0|acc |↑ | 0.6296|± |0.0467| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7055|± |0.0358| | - moral_disputes | 1|none | 0|acc |↑ | 0.6185|± |0.0262| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2425|± |0.0143| | - philosophy | 1|none | 0|acc |↑ | 0.6206|± |0.0276| | - prehistory | 1|none | 0|acc |↑ | 0.5216|± |0.0278| | - professional_law | 1|none | 0|acc |↑ | 0.3963|± |0.0125| | - world_religions | 1|none | 0|acc |↑ | 
0.5965|± |0.0376| | - other | 2|none | |acc |↑ | 0.5845|± |0.0086| | - business_ethics | 1|none | 0|acc |↑ | 0.6400|± |0.0482| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.5811|± |0.0304| | - college_medicine | 1|none | 0|acc |↑ | 0.5029|± |0.0381| | - global_facts | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - human_aging | 1|none | 0|acc |↑ | 0.6278|± |0.0324| | - management | 1|none | 0|acc |↑ | 0.6990|± |0.0454| | - marketing | 1|none | 0|acc |↑ | 0.8248|± |0.0249| | - medical_genetics | 1|none | 0|acc |↑ | 0.5700|± |0.0498| | - miscellaneous | 1|none | 0|acc |↑ | 0.6782|± |0.0167| | - nutrition | 1|none | 0|acc |↑ | 0.5654|± |0.0284| | - professional_accounting | 1|none | 0|acc |↑ | 0.4681|± |0.0298| | - professional_medicine | 1|none | 0|acc |↑ | 0.3713|± |0.0293| | - virology | 1|none | 0|acc |↑ | 0.4699|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.6724|± |0.0084| | - econometrics | 1|none | 0|acc |↑ | 0.5877|± |0.0463| | - high_school_geography | 1|none | 0|acc |↑ | 0.7020|± |0.0326| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.7254|± |0.0322| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.6744|± |0.0238| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8193|± |0.0250| | - high_school_psychology | 1|none | 0|acc |↑ | 0.7266|± |0.0191| | - human_sexuality | 1|none | 0|acc |↑ | 0.5878|± |0.0432| | - professional_psychology | 1|none | 0|acc |↑ | 0.5621|± |0.0201| | - public_relations | 1|none | 0|acc |↑ | 0.6182|± |0.0465| | - security_studies | 1|none | 0|acc |↑ | 0.6367|± |0.0308| | - sociology | 1|none | 0|acc |↑ | 0.7413|± |0.0310| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - stem | 2|none | |acc |↑ | 0.6429|± |0.0084| | - abstract_algebra | 1|none | 0|acc |↑ | 0.5300|± |0.0502| | - anatomy | 1|none | 0|acc |↑ | 0.4370|± |0.0428| | - astronomy | 1|none | 0|acc |↑ | 0.7105|± |0.0369| | - college_biology | 1|none | 0|acc |↑ | 0.6389|± |0.0402| | - college_chemistry | 1|none | 0|acc |↑ | 0.5200|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.6900|± |0.0465| | - college_mathematics | 1|none | 0|acc |↑ | 0.4800|± |0.0502| | - college_physics | 1|none | 0|acc |↑ | 0.4608|± |0.0496| | - computer_security | 1|none | 0|acc |↑ | 0.6700|± |0.0473| | - conceptual_physics | 1|none | 0|acc |↑ | 0.7447|± |0.0285| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6690|± |0.0392| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.7169|± |0.0232| | - high_school_biology | 1|none | 0|acc |↑ | 0.7581|± |0.0244| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.6207|± |0.0341| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.5630|± |0.0302| | - high_school_physics | 1|none | 0|acc |↑ | 0.6026|± |0.0400| | - high_school_statistics | 1|none | 0|acc |↑ | 0.7222|± |0.0305| | - machine_learning | 1|none | 0|acc |↑ | 0.4821|± |0.0474| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0510|± |0.0037| |openbookqa | 1|none | 0|acc |↑ | 0.2720|± |0.0199| | | |none | 0|acc_norm |↑ | 0.3920|± |0.0219| |piqa | 1|none | 0|acc |↑ | 0.7285|± |0.0104| | | |none | 0|acc_norm |↑ | 0.7454|± |0.0102| |qnli | 1|none | 0|acc |↑ | 0.4981|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9410|± |0.0075| | | |none | 0|acc_norm |↑ | 0.9290|± |0.0081| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.2183|± |0.0031| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3733|± |0.0169| | | |none | 0|bleu_diff |↑ |-1.3812|± |0.3689| | | |none | 0|bleu_max |↑ |13.2847|± |0.4547| | | |none | 
0|rouge1_acc |↑ | 0.3733|± |0.0169|
| | |none | 0|rouge1_diff|↑ |-2.1164|± |0.5749|
| | |none | 0|rouge1_max |↑ |36.6193|± |0.7424|
| | |none | 0|rouge2_acc |↑ | 0.3158|± |0.0163|
| | |none | 0|rouge2_diff|↑ |-2.6990|± |0.6541|
| | |none | 0|rouge2_max |↑ |24.5611|± |0.7453|
| | |none | 0|rougeL_acc |↑ | 0.3733|± |0.0169|
| | |none | 0|rougeL_diff|↑ |-2.1518|± |0.5742|
| | |none | 0|rougeL_max |↑ |34.7410|± |0.7352|
|truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3207|± |0.0163|
|truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4832|± |0.0150|
|winogrande | 1|none | 0|acc |↑ | 0.6480|± |0.0134|

| Groups |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------|------:|----------|------|-----------|---|-----:|---|-----:|
|bbh | 3|get-answer| |exact_match|↑ |0.6724|± |0.0050|
|mmlu | 2|none | |acc |↑ |0.5799|± |0.0040|
| - humanities | 2|none | |acc |↑ |0.4742|± |0.0069|
| - other | 2|none | |acc |↑ |0.5845|± |0.0086|
| - social sciences| 2|none | |acc |↑ |0.6724|± |0.0084|
| - stem | 2|none | |acc |↑ |0.6429|± |0.0084|

Qwen_Qwen2.5-Math-7B,: 27h 22m 7s
✅ Benchmark completed for Qwen_Qwen2.5-Math-7B,
______________________

🔥 Starting benchmark for Qwen_Qwen2.5-7B-Instruct
Passed argument batch_size = auto:5.0. Detecting largest batch size
Determined largest batch size: 1
Passed argument batch_size = auto:5.0. Detecting largest batch size
Determined largest batch size: 64
Passed argument batch_size = auto:5.0. Detecting largest batch size
Determined largest batch size: 1
Passed argument batch_size = auto:5.0. Detecting largest batch size
Determined largest batch size: 64
Passed argument batch_size = auto.
Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-7B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:5 (1,64,64,64,64,64) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.6840|± |0.0147| |anli_r2 | 1|none | 0|acc |↑ | 0.5440|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.5492|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.5265|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5529|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.4534|± |0.0051| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8120|± |0.0248| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.3529|± |0.0350| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.4560|± |0.0316| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.6160|± |0.0308| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0080|± |0.0056| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.2280|± |0.0266| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.1040|± |0.0193| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.3920|± |0.0309| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.3880|± |0.0309| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.1800|± |0.0243| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.8480|± |0.0228| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.0360|± |0.0118| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.3040|± |0.0292| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8800|± |0.0206| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.5760|± |0.0313| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.2055|± |0.0336| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.5560|± |0.0315| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.0360|± |0.0118| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.2680|± |0.0281| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5281|± |0.0375| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.7760|± |0.0264| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.7640|± |0.0269| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.7600|± |0.0271| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.5840|± |0.0312| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.8920|± |0.0197| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.3200|± |0.0296| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.2640|± |0.0279| |boolq | 2|none | 0|acc |↑ | 0.8633|± |0.0060| |drop | 3|none | 0|em |↑ | 0.0028|± |0.0005| | | |none | 0|f1 |↑ | 0.0713|± |0.0014| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1414|± |0.0248| | | |strict-match | 0|exact_match|↑ | 0.0000|± 
|0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0859|± |0.0200| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2879|± |0.0323| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3283|± |0.0335| | | |none | 0|acc_norm |↑ | 0.3283|± |0.0335| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3131|± |0.0330| | | |none | 0|acc_norm |↑ | 0.3131|± |0.0330| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1612|± |0.0158| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0989|± |0.0128| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2546|± |0.0187| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3242|± |0.0200| | | |none | 0|acc_norm |↑ | 0.3242|± |0.0200| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3388|± |0.0203| | | |none | 0|acc_norm |↑ | 0.3388|± |0.0203| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1540|± |0.0171| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1094|± |0.0148| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2746|± |0.0211| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3237|± |0.0221| | | |none | 0|acc_norm |↑ | 0.3237|± |0.0221| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3326|± |0.0223| | | |none | 0|acc_norm |↑ | 0.3326|± |0.0223| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.8256|± |0.0105| | | |strict-match | 5|exact_match|↑ | 0.7582|± |0.0118| |hellaswag | 1|none | 0|acc |↑ | 0.6198|± |0.0048| | | |none | 0|acc_norm |↑ | 0.8039|± |0.0040| |mmlu | 2|none | |acc |↑ | 0.7175|± |0.0036| | - humanities | 2|none | |acc |↑ | 0.6351|± |0.0066| | - formal_logic | 1|none | 0|acc |↑ | 0.5635|± |0.0444| | - high_school_european_history | 1|none | 0|acc |↑ | 0.8485|± |0.0280| | - high_school_us_history | 1|none | 0|acc |↑ | 0.9020|± |0.0209| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8734|± |0.0216| | - international_law | 1|none | 0|acc |↑ | 0.8099|± |0.0358| | - jurisprudence | 1|none | 0|acc |↑ | 0.7963|± |0.0389| | - logical_fallacies | 1|none | 0|acc |↑ | 0.8221|± |0.0300| | - moral_disputes | 1|none | 0|acc |↑ | 0.7659|± |0.0228| | - moral_scenarios | 1|none | 0|acc |↑ | 0.4190|± |0.0165| | - philosophy | 1|none | 0|acc |↑ | 0.7363|± |0.0250| | - prehistory | 1|none | 0|acc |↑ | 0.8241|± |0.0212| | - professional_law | 1|none | 0|acc |↑ | 0.5156|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.8246|± |0.0292| | - other | 2|none | |acc |↑ | 0.7650|± |0.0073| | - business_ethics | 1|none | 0|acc |↑ | 0.7800|± |0.0416| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7774|± |0.0256| | - college_medicine | 1|none | 0|acc |↑ | 0.6705|± |0.0358| | - global_facts | 1|none | 0|acc |↑ | 0.4600|± |0.0501| | - human_aging | 1|none | 0|acc |↑ | 0.7803|± |0.0278| | - management | 1|none | 0|acc |↑ | 0.8835|± |0.0318| | - marketing | 1|none | 0|acc |↑ | 0.9231|± |0.0175| | - medical_genetics | 1|none | 0|acc |↑ | 0.8400|± |0.0368| | - miscellaneous | 1|none | 0|acc |↑ | 0.8531|± |0.0127| | - nutrition | 1|none | 0|acc |↑ | 0.7876|± |0.0234| | - 
professional_accounting | 1|none | 0|acc |↑ | 0.5567|± |0.0296| | - professional_medicine | 1|none | 0|acc |↑ | 0.7831|± |0.0250| | - virology | 1|none | 0|acc |↑ | 0.5241|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.8274|± |0.0067| | - econometrics | 1|none | 0|acc |↑ | 0.6754|± |0.0440| | - high_school_geography | 1|none | 0|acc |↑ | 0.8788|± |0.0233| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.9430|± |0.0167| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7897|± |0.0207| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8655|± |0.0222| | - high_school_psychology | 1|none | 0|acc |↑ | 0.9064|± |0.0125| | - human_sexuality | 1|none | 0|acc |↑ | 0.8092|± |0.0345| | - professional_psychology | 1|none | 0|acc |↑ | 0.7598|± |0.0173| | - public_relations | 1|none | 0|acc |↑ | 0.7182|± |0.0431| | - security_studies | 1|none | 0|acc |↑ | 0.7714|± |0.0269| | - sociology | 1|none | 0|acc |↑ | 0.8955|± |0.0216| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8600|± |0.0349| | - stem | 2|none | |acc |↑ | 0.6863|± |0.0080| | - abstract_algebra | 1|none | 0|acc |↑ | 0.5800|± |0.0496| | - anatomy | 1|none | 0|acc |↑ | 0.7333|± |0.0382| | - astronomy | 1|none | 0|acc |↑ | 0.8553|± |0.0286| | - college_biology | 1|none | 0|acc |↑ | 0.8681|± |0.0283| | - college_chemistry | 1|none | 0|acc |↑ | 0.5300|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.6700|± |0.0473| | - college_mathematics | 1|none | 0|acc |↑ | 0.4300|± |0.0498| | - college_physics | 1|none | 0|acc |↑ | 0.5098|± |0.0497| | - computer_security | 1|none | 0|acc |↑ | 0.7800|± |0.0416| | - conceptual_physics | 1|none | 0|acc |↑ | 0.7702|± |0.0275| | - electrical_engineering | 1|none | 0|acc |↑ | 0.7103|± |0.0378| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.6720|± |0.0242| | - high_school_biology | 1|none | 0|acc |↑ | 0.8677|± |0.0193| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.6404|± |0.0338| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.8100|± |0.0394| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.5481|± |0.0303| | - high_school_physics | 1|none | 0|acc |↑ | 0.5762|± |0.0403| | - high_school_statistics | 1|none | 0|acc |↑ | 0.6806|± |0.0318| | - machine_learning | 1|none | 0|acc |↑ | 0.5268|± |0.0474| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0457|± |0.0035| |openbookqa | 1|none | 0|acc |↑ | 0.3440|± |0.0213| | | |none | 0|acc_norm |↑ | 0.4860|± |0.0224| |piqa | 1|none | 0|acc |↑ | 0.7943|± |0.0094| | | |none | 0|acc_norm |↑ | 0.8009|± |0.0093| |qnli | 1|none | 0|acc |↑ | 0.8047|± |0.0054| |sciq | 1|none | 0|acc |↑ | 0.9560|± |0.0065| | | |none | 0|acc_norm |↑ | 0.9360|± |0.0077| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.3251|± |0.0035| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.5129|± |0.0175| | | |none | 0|bleu_diff |↑ | 0.3578|± |0.2345| | | |none | 0|bleu_max |↑ | 8.0558|± |0.4296| | | |none | 0|rouge1_acc |↑ | 0.5373|± |0.0175| | | |none | 0|rouge1_diff|↑ | 1.0510|± |0.3423| | | |none | 0|rouge1_max |↑ |25.7407|± |0.6485| | | |none | 0|rouge2_acc |↑ | 0.4455|± |0.0174| | | |none | 0|rouge2_diff|↑ | 0.4063|± |0.3684| | | |none | 0|rouge2_max |↑ |15.2839|± |0.6032| | | |none | 0|rougeL_acc |↑ | 0.4884|± |0.0175| | | |none | 0|rougeL_diff|↑ | 0.4719|± |0.3267| | | |none | 0|rougeL_max |↑ |22.5918|± |0.6236| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.4774|± |0.0175| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.6485|± |0.0155| |winogrande | 1|none | 0|acc |↑ | 0.7080|± |0.0128| Passed argument batch_size = auto. 
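For reference, a run header like `hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-7B-Instruct), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:5` corresponds roughly to the lm-evaluation-harness call sketched below. This is only a sketch assuming the v0.4.x Python API; the exact task list behind these logs is not shown, so the tasks named here are illustrative. With `auto:5`, the harness re-runs its batch-size search up to five times during the run, which is what appears to produce the repeated "Determined largest batch size: ..." lines and the `auto:5 (1,64,64,64,64,64)` summary in the header.

```python
# Minimal sketch (assumed lm-evaluation-harness v0.4.x API); the task list is
# illustrative, not the exact set used for the log above.
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-7B-Instruct",
    tasks=["mmlu", "bbh", "gsm8k", "hellaswag", "winogrande"],  # illustrative subset
    batch_size="auto:5",  # auto-detect the largest batch size, re-detecting up to 5 times
)

print(make_table(results))                # the per-task table, as above
if "groups" in results:
    print(make_table(results, "groups"))  # the "Groups" summary table, as below
```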
| Groups |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------|------:|----------|------|-----------|---|-----:|---|-----:|
|bbh | 3|get-answer| |exact_match|↑ |0.4534|± |0.0051|
|mmlu | 2|none | |acc |↑ |0.7175|± |0.0036|
| - humanities | 2|none | |acc |↑ |0.6351|± |0.0066|
| - other | 2|none | |acc |↑ |0.7650|± |0.0073|
| - social sciences| 2|none | |acc |↑ |0.8274|± |0.0067|
| - stem | 2|none | |acc |↑ |0.6863|± |0.0080|

Qwen_Qwen2.5-7B-Instruct: 11h 6m 29s
✅ Benchmark completed for Qwen_Qwen2.5-7B-Instruct

🔥 Starting benchmark for deepseek-ai_DeepSeek-R1-Distill-Llama-8B
Passed argument batch_size = auto:5.0. Detecting largest batch size
Determined largest batch size: 1
Passed argument batch_size = auto:5.0. Detecting largest batch size
Determined largest batch size: 64
hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:5 (1,64,64,64,64,64)

| Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr|
|----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:|
|anli_r1 | 1|none | 0|acc |↑ | 0.4040|± |0.0155|
|anli_r2 | 1|none | 0|acc |↑ | 0.4100|± |0.0156|
|anli_r3 | 1|none | 0|acc |↑ | 0.3883|± |0.0141|
|arc_challenge | 1|none | 0|acc |↑ | 0.4061|± |0.0144|
| | |none | 0|acc_norm |↑ | 0.4232|± |0.0144|
|bbh | 3|get-answer | |exact_match|↑ | 0.6037|± |0.0050|
| - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9400|± |0.0151|
| - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5508|± |0.0365|
| - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.7440|± |0.0277|
| - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.6120|± |0.0309|
| - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.2360|± |0.0269|
| - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.2520|± |0.0275|
| - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3200|± |0.0296|
| - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.7920|± |0.0257|
| - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2760|± |0.0283|
| - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.0600|± |0.0151|
| - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.8520|± |0.0225|
| - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.6840|± |0.0295|
| - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.8840|± |0.0203|
| - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8560|± |0.0222|
| - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.8320|± |0.0237|
| - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.3973|± |0.0406|
| - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.6360|± |0.0305|
| - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.6200|± |0.0308|
| - 
bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5680|± |0.0314| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5506|± |0.0374| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.8200|± |0.0243| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.8640|± |0.0217| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.3920|± |0.0309| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1040|± |0.0193| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.9480|± |0.0141| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.3960|± |0.0310| |boolq | 2|none | 0|acc |↑ | 0.8287|± |0.0066| |drop | 3|none | 0|em |↑ | 0.0031|± |0.0006| | | |none | 0|f1 |↑ | 0.0712|± |0.0014| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0909|± |0.0205| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0707|± |0.0183| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2626|± |0.0314| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3283|± |0.0335| | | |none | 0|acc_norm |↑ | 0.3283|± |0.0335| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2727|± |0.0317| | | |none | 0|acc_norm |↑ | 0.2727|± |0.0317| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1172|± |0.0138| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0952|± |0.0126| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3059|± |0.0197| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2912|± |0.0195| | | |none | 0|acc_norm |↑ | 0.2912|± |0.0195| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2967|± |0.0196| | | |none | 0|acc_norm |↑ | 0.2967|± |0.0196| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0893|± |0.0135| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1138|± |0.0150| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2746|± |0.0211| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3125|± |0.0219| | | |none | 0|acc_norm |↑ | 0.3125|± |0.0219| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2746|± |0.0211| | | |none | 0|acc_norm |↑ | 0.2746|± |0.0211| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.6452|± |0.0132| | | |strict-match | 5|exact_match|↑ | 0.6247|± |0.0133| |hellaswag | 1|none | 0|acc |↑ | 0.5562|± |0.0050| | | |none | 0|acc_norm |↑ | 0.7430|± |0.0044| |mmlu | 2|none | |acc |↑ | 0.5327|± |0.0040| | - humanities | 2|none | |acc |↑ | 0.4767|± |0.0069| | - formal_logic | 1|none | 0|acc |↑ | 0.4048|± |0.0439| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7030|± |0.0357| | - high_school_us_history | 1|none | 0|acc |↑ | 0.6765|± |0.0328| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7384|± |0.0286| | - international_law | 1|none | 
0|acc |↑ | 0.7025|± |0.0417| | - jurisprudence | 1|none | 0|acc |↑ | 0.6019|± |0.0473| | - logical_fallacies | 1|none | 0|acc |↑ | 0.5706|± |0.0389| | - moral_disputes | 1|none | 0|acc |↑ | 0.5289|± |0.0269| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2425|± |0.0143| | - philosophy | 1|none | 0|acc |↑ | 0.5466|± |0.0283| | - prehistory | 1|none | 0|acc |↑ | 0.6327|± |0.0268| | - professional_law | 1|none | 0|acc |↑ | 0.4068|± |0.0125| | - world_religions | 1|none | 0|acc |↑ | 0.7076|± |0.0349| | - other | 2|none | |acc |↑ | 0.6041|± |0.0085| | - business_ethics | 1|none | 0|acc |↑ | 0.6000|± |0.0492| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.5811|± |0.0304| | - college_medicine | 1|none | 0|acc |↑ | 0.5318|± |0.0380| | - global_facts | 1|none | 0|acc |↑ | 0.3800|± |0.0488| | - human_aging | 1|none | 0|acc |↑ | 0.5291|± |0.0335| | - management | 1|none | 0|acc |↑ | 0.7282|± |0.0441| | - marketing | 1|none | 0|acc |↑ | 0.7949|± |0.0265| | - medical_genetics | 1|none | 0|acc |↑ | 0.6300|± |0.0485| | - miscellaneous | 1|none | 0|acc |↑ | 0.7152|± |0.0161| | - nutrition | 1|none | 0|acc |↑ | 0.6144|± |0.0279| | - professional_accounting | 1|none | 0|acc |↑ | 0.4007|± |0.0292| | - professional_medicine | 1|none | 0|acc |↑ | 0.5625|± |0.0301| | - virology | 1|none | 0|acc |↑ | 0.4639|± |0.0388| | - social sciences | 2|none | |acc |↑ | 0.6074|± |0.0086| | - econometrics | 1|none | 0|acc |↑ | 0.3158|± |0.0437| | - high_school_geography | 1|none | 0|acc |↑ | 0.6465|± |0.0341| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.6425|± |0.0346| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.5154|± |0.0253| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.4874|± |0.0325| | - high_school_psychology | 1|none | 0|acc |↑ | 0.7009|± |0.0196| | - human_sexuality | 1|none | 0|acc |↑ | 0.6336|± |0.0423| | - professional_psychology | 1|none | 0|acc |↑ | 0.5539|± |0.0201| | - public_relations | 1|none | 0|acc |↑ | 0.6273|± |0.0463| | - security_studies | 1|none | 0|acc |↑ | 0.6367|± |0.0308| | - sociology | 1|none | 0|acc |↑ | 0.7363|± |0.0312| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8700|± |0.0338| | - stem | 2|none | |acc |↑ | 0.4729|± |0.0088| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3900|± |0.0490| | - anatomy | 1|none | 0|acc |↑ | 0.5111|± |0.0432| | - astronomy | 1|none | 0|acc |↑ | 0.5461|± |0.0405| | - college_biology | 1|none | 0|acc |↑ | 0.5556|± |0.0416| | - college_chemistry | 1|none | 0|acc |↑ | 0.3800|± |0.0488| | - college_computer_science | 1|none | 0|acc |↑ | 0.4800|± |0.0502| | - college_mathematics | 1|none | 0|acc |↑ | 0.4100|± |0.0494| | - college_physics | 1|none | 0|acc |↑ | 0.3725|± |0.0481| | - computer_security | 1|none | 0|acc |↑ | 0.6000|± |0.0492| | - conceptual_physics | 1|none | 0|acc |↑ | 0.4681|± |0.0326| | - electrical_engineering | 1|none | 0|acc |↑ | 0.5034|± |0.0417| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.4286|± |0.0255| | - high_school_biology | 1|none | 0|acc |↑ | 0.6452|± |0.0272| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.4483|± |0.0350| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.5600|± |0.0499| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.3815|± |0.0296| | - high_school_physics | 1|none | 0|acc |↑ | 0.3907|± |0.0398| | - high_school_statistics | 1|none | 0|acc |↑ | 0.4444|± |0.0339| | - machine_learning | 1|none | 0|acc |↑ | 0.4018|± |0.0465| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0584|± |0.0039| |openbookqa | 1|none | 0|acc |↑ | 0.3160|± |0.0208| | | |none | 0|acc_norm |↑ | 
0.4100|± |0.0220|
|piqa | 1|none | 0|acc |↑ | 0.7595|± |0.0100|
| | |none | 0|acc_norm |↑ | 0.7758|± |0.0097|
|qnli | 1|none | 0|acc |↑ | 0.5147|± |0.0068|
|sciq | 1|none | 0|acc |↑ | 0.9290|± |0.0081|
| | |none | 0|acc_norm |↑ | 0.8990|± |0.0095|
|triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.1940|± |0.0030|
|truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4455|± |0.0174|
| | |none | 0|bleu_diff |↑ |-0.9197|± |0.5624|
| | |none | 0|bleu_max |↑ |15.5776|± |0.6556|
| | |none | 0|rouge1_acc |↑ | 0.4517|± |0.0174|
| | |none | 0|rouge1_diff|↑ |-1.1087|± |0.7899|
| | |none | 0|rouge1_max |↑ |37.4816|± |0.7944|
| | |none | 0|rouge2_acc |↑ | 0.2521|± |0.0152|
| | |none | 0|rouge2_diff|↑ |-4.0247|± |0.7868|
| | |none | 0|rouge2_max |↑ |19.9717|± |0.8482|
| | |none | 0|rougeL_acc |↑ | 0.4565|± |0.0174|
| | |none | 0|rougeL_diff|↑ |-1.0779|± |0.7931|
| | |none | 0|rougeL_max |↑ |34.9414|± |0.7851|
|truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3219|± |0.0164|
|truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5045|± |0.0154|
|winogrande | 1|none | 0|acc |↑ | 0.6780|± |0.0131|

Passed argument batch_size = auto. Detecting largest batch size
Determined Largest batch size: 1
hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/deepseek-ai_DeepSeek-R1-Distill-Llama-8B), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:5 (1,64,64,64,64,64)

| Groups |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------|------:|----------|------|-----------|---|-----:|---|-----:|
|bbh | 3|get-answer| |exact_match|↑ |0.6037|± |0.0050|
|mmlu | 2|none | |acc |↑ |0.5327|± |0.0040|
| - humanities | 2|none | |acc |↑ |0.4767|± |0.0069|
| - other | 2|none | |acc |↑ |0.6041|± |0.0085|
| - social sciences| 2|none | |acc |↑ |0.6074|± |0.0086|
| - stem | 2|none | |acc |↑ |0.4729|± |0.0088|

deepseek-ai_DeepSeek-R1-Distill-Llama-8B: 11h 46m 55s
✅ Benchmark completed for deepseek-ai_DeepSeek-R1-Distill-Llama-8B

🔥 Starting benchmark for Qwen_Qwen2.5-7B-Instruct
🔥 Starting benchmark for Qwen_Qwen-7B-Chat
Qwen_Qwen-7B-Chat: 0h 0m 3s
✅ Benchmark completed for Qwen_Qwen-7B-Chat
🔥 Starting benchmark for Qwen_Qwen-7B
🔥 Starting benchmark for Qwen_Qwen-7B-Chat
Qwen_Qwen-7B-Chat: 0h 0m 4s
✅ Benchmark completed for Qwen_Qwen-7B-Chat
🔥 Starting benchmark for Qwen_Qwen-7B
🔥 Starting benchmark for Qwen_Qwen-7B-Chat
🔥 Starting benchmark for Qwen_Qwen-7B
🔥 Starting benchmark for meta-llama_Meta-Llama-3-8B-Instruct
🔥 Starting benchmark for mistralai_Mistral-7B-Instruct-v0.3
🔥 Starting benchmark for Qwen_Qwen2.5-7B-Instruct
hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-7B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 3

| Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr|
|----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:|
|anli_r1 | 1|none | 0|acc |↑ | 0.6850|± |0.0147|
|anli_r2 | 1|none | 0|acc |↑ | 0.5490|± |0.0157|
|anli_r3 | 1|none | 0|acc |↑ | 0.5525|± |0.0144|
|arc_challenge | 1|none | 0|acc |↑ | 0.5239|± |0.0146|
| | |none | 0|acc_norm |↑ | 0.5529|± |0.0145|
|bbh | 3|get-answer | |exact_match|↑ | 0.4488|± |0.0051|
| - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8160|± |0.0246|
| - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ 
| 0.3209|± |0.0342| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.4520|± |0.0315| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.6160|± |0.0308| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0080|± |0.0056| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.1960|± |0.0252| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.0960|± |0.0187| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.4120|± |0.0312| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4000|± |0.0310| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.1600|± |0.0232| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.8400|± |0.0232| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.0440|± |0.0130| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.2600|± |0.0278| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8880|± |0.0200| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.5840|± |0.0312| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.1918|± |0.0327| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.5520|± |0.0315| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.0480|± |0.0135| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.2080|± |0.0257| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5281|± |0.0375| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.7840|± |0.0261| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.7080|± |0.0288| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.7480|± |0.0275| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.5480|± |0.0315| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.8920|± |0.0197| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.4280|± |0.0314| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.2720|± |0.0282| |boolq | 2|none | 0|acc |↑ | 0.8633|± |0.0060| |drop | 3|none | 0|em |↑ | 0.0025|± |0.0005| | | |none | 0|f1 |↑ | 0.0711|± |0.0014| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1465|± |0.0252| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0859|± |0.0200| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2879|± |0.0323| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3283|± |0.0335| | | |none | 0|acc_norm |↑ | 0.3283|± |0.0335| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3081|± |0.0329| | | |none | 0|acc_norm |↑ | 0.3081|± |0.0329| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1410|± |0.0149| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1007|± |0.0129| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2564|± |0.0187| | | 
|strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3242|± |0.0200| | | |none | 0|acc_norm |↑ | 0.3242|± |0.0200| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3388|± |0.0203| | | |none | 0|acc_norm |↑ | 0.3388|± |0.0203| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1429|± |0.0166| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1027|± |0.0144| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2946|± |0.0216| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3214|± |0.0221| | | |none | 0|acc_norm |↑ | 0.3214|± |0.0221| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3281|± |0.0222| | | |none | 0|acc_norm |↑ | 0.3281|± |0.0222| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.8256|± |0.0105| | | |strict-match | 5|exact_match|↑ | 0.7627|± |0.0117| |hellaswag | 1|none | 0|acc |↑ | 0.6194|± |0.0048| | | |none | 0|acc_norm |↑ | 0.8049|± |0.0040| |mmlu | 2|none | |acc |↑ | 0.7181|± |0.0036| | - humanities | 2|none | |acc |↑ | 0.6372|± |0.0066| | - formal_logic | 1|none | 0|acc |↑ | 0.5714|± |0.0443| | - high_school_european_history | 1|none | 0|acc |↑ | 0.8485|± |0.0280| | - high_school_us_history | 1|none | 0|acc |↑ | 0.9020|± |0.0209| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8734|± |0.0216| | - international_law | 1|none | 0|acc |↑ | 0.8182|± |0.0352| | - jurisprudence | 1|none | 0|acc |↑ | 0.8056|± |0.0383| | - logical_fallacies | 1|none | 0|acc |↑ | 0.8221|± |0.0300| | - moral_disputes | 1|none | 0|acc |↑ | 0.7659|± |0.0228| | - moral_scenarios | 1|none | 0|acc |↑ | 0.4235|± |0.0165| | - philosophy | 1|none | 0|acc |↑ | 0.7395|± |0.0249| | - prehistory | 1|none | 0|acc |↑ | 0.8272|± |0.0210| | - professional_law | 1|none | 0|acc |↑ | 0.5156|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.8304|± |0.0288| | - other | 2|none | |acc |↑ | 0.7647|± |0.0073| | - business_ethics | 1|none | 0|acc |↑ | 0.7900|± |0.0409| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7774|± |0.0256| | - college_medicine | 1|none | 0|acc |↑ | 0.6705|± |0.0358| | - global_facts | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - human_aging | 1|none | 0|acc |↑ | 0.7803|± |0.0278| | - management | 1|none | 0|acc |↑ | 0.8932|± |0.0306| | - marketing | 1|none | 0|acc |↑ | 0.9188|± |0.0179| | - medical_genetics | 1|none | 0|acc |↑ | 0.8400|± |0.0368| | - miscellaneous | 1|none | 0|acc |↑ | 0.8519|± |0.0127| | - nutrition | 1|none | 0|acc |↑ | 0.7843|± |0.0236| | - professional_accounting | 1|none | 0|acc |↑ | 0.5567|± |0.0296| | - professional_medicine | 1|none | 0|acc |↑ | 0.7831|± |0.0250| | - virology | 1|none | 0|acc |↑ | 0.5181|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.8284|± |0.0067| | - econometrics | 1|none | 0|acc |↑ | 0.6579|± |0.0446| | - high_school_geography | 1|none | 0|acc |↑ | 0.8788|± |0.0233| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.9378|± |0.0174| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7949|± |0.0205| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8697|± |0.0219| | - high_school_psychology | 1|none | 0|acc |↑ | 0.9064|± |0.0125| | - human_sexuality | 1|none | 0|acc |↑ | 0.8092|± |0.0345| | - professional_psychology | 1|none | 0|acc |↑ | 0.7647|± |0.0172| | - public_relations | 1|none | 0|acc |↑ | 0.7182|± |0.0431| | - security_studies | 1|none | 0|acc |↑ | 0.7714|± |0.0269| | 
- sociology | 1|none | 0|acc |↑ | 0.8955|± |0.0216|
| - us_foreign_policy | 1|none | 0|acc |↑ | 0.8600|± |0.0349|
| - stem | 2|none | |acc |↑ | 0.6851|± |0.0080|
| - abstract_algebra | 1|none | 0|acc |↑ | 0.5800|± |0.0496|
| - anatomy | 1|none | 0|acc |↑ | 0.7407|± |0.0379|
| - astronomy | 1|none | 0|acc |↑ | 0.8553|± |0.0286|
| - college_biology | 1|none | 0|acc |↑ | 0.8681|± |0.0283|
| - college_chemistry | 1|none | 0|acc |↑ | 0.5300|± |0.0502|
| - college_computer_science | 1|none | 0|acc |↑ | 0.6700|± |0.0473|
| - college_mathematics | 1|none | 0|acc |↑ | 0.4300|± |0.0498|
| - college_physics | 1|none | 0|acc |↑ | 0.5098|± |0.0497|
| - computer_security | 1|none | 0|acc |↑ | 0.7700|± |0.0423|
| - conceptual_physics | 1|none | 0|acc |↑ | 0.7660|± |0.0277|
| - electrical_engineering | 1|none | 0|acc |↑ | 0.7172|± |0.0375|
| - elementary_mathematics | 1|none | 0|acc |↑ | 0.6746|± |0.0241|
| - high_school_biology | 1|none | 0|acc |↑ | 0.8645|± |0.0195|
| - high_school_chemistry | 1|none | 0|acc |↑ | 0.6404|± |0.0338|
| - high_school_computer_science | 1|none | 0|acc |↑ | 0.8100|± |0.0394|
| - high_school_mathematics | 1|none | 0|acc |↑ | 0.5407|± |0.0304|
| - high_school_physics | 1|none | 0|acc |↑ | 0.5695|± |0.0404|
| - high_school_statistics | 1|none | 0|acc |↑ | 0.6759|± |0.0319|
| - machine_learning | 1|none | 0|acc |↑ | 0.5268|± |0.0474|
|nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0457|± |0.0035|
|openbookqa | 1|none | 0|acc |↑ | 0.3420|± |0.0212|
| | |none | 0|acc_norm |↑ | 0.4860|± |0.0224|
|piqa | 1|none | 0|acc |↑ | 0.7960|± |0.0094|
| | |none | 0|acc_norm |↑ | 0.8030|± |0.0093|
|qnli | 1|none | 0|acc |↑ | 0.8045|± |0.0054|
|sciq | 1|none | 0|acc |↑ | 0.9560|± |0.0065|
| | |none | 0|acc_norm |↑ | 0.9370|± |0.0077|
|triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.3254|± |0.0035|
|truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.5006|± |0.0175|
| | |none | 0|bleu_diff |↑ | 0.3019|± |0.2337|
| | |none | 0|bleu_max |↑ | 7.9281|± |0.4300|
| | |none | 0|rouge1_acc |↑ | 0.5386|± |0.0175|
| | |none | 0|rouge1_diff|↑ | 0.9126|± |0.3421|
| | |none | 0|rouge1_max |↑ |25.3456|± |0.6494|
| | |none | 0|rouge2_acc |↑ | 0.4455|± |0.0174|
| | |none | 0|rouge2_diff|↑ | 0.2550|± |0.3623|
| | |none | 0|rouge2_max |↑ |15.0210|± |0.6025|
| | |none | 0|rougeL_acc |↑ | 0.4847|± |0.0175|
| | |none | 0|rougeL_diff|↑ | 0.3002|± |0.3241|
| | |none | 0|rougeL_max |↑ |22.2349|± |0.6237|
|truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.4774|± |0.0175|
|truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.6485|± |0.0155|
|winogrande | 1|none | 0|acc |↑ | 0.7119|± |0.0127|

| Groups |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------|------:|----------|------|-----------|---|-----:|---|-----:|
|bbh | 3|get-answer| |exact_match|↑ |0.4488|± |0.0051|
|mmlu | 2|none | |acc |↑ |0.7181|± |0.0036|
| - humanities | 2|none | |acc |↑ |0.6372|± |0.0066|
| - other | 2|none | |acc |↑ |0.7647|± |0.0073|
| - social sciences| 2|none | |acc |↑ |0.8284|± |0.0067|
| - stem | 2|none | |acc |↑ |0.6851|± |0.0080|

Qwen_Qwen2.5-7B-Instruct: 9h 37m 6s
✅ Benchmark completed for Qwen_Qwen2.5-7B-Instruct

🔥 Starting benchmark for Qwen_Qwen-7B-Chat
Qwen_Qwen-7B-Chat: 0h 5m 2s
✅ Benchmark completed for Qwen_Qwen-7B-Chat
🔥 Starting benchmark for Qwen_Qwen-7B
Qwen_Qwen-7B: 0h 5m 1s
✅ Benchmark completed for Qwen_Qwen-7B

🔥 Starting benchmark for meta-llama_Meta-Llama-3-8B-Instruct
hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/meta-llama_Meta-Llama-3-8B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 
None, batch_size: 3 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4840|± |0.0158| |anli_r2 | 1|none | 0|acc |↑ | 0.4580|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.4483|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.5316|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5640|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.6790|± |0.0053| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8960|± |0.0193| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5668|± |0.0363| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.8400|± |0.0232| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.6800|± |0.0296| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0880|± |0.0180| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5400|± |0.0316| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.5160|± |0.0317| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.7880|± |0.0259| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.5720|± |0.0314| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.4920|± |0.0317| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.9080|± |0.0183| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.6640|± |0.0299| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.4040|± |0.0311| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8480|± |0.0228| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.8880|± |0.0200| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.7877|± |0.0340| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.7320|± |0.0281| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.7400|± |0.0278| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5680|± |0.0314| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.7135|± |0.0340| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9160|± |0.0176| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.6720|± |0.0298| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.7080|± |0.0288| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.5280|± |0.0316| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.7680|± |0.0268| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9960|± |0.0040| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.5400|± |0.0316| |boolq | 2|none | 0|acc |↑ | 0.8312|± |0.0066| |drop | 3|none | 0|em |↑ | 0.0290|± |0.0017| | | |none | 0|f1 |↑ | 0.1640|± |0.0024| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1515|± |0.0255| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1414|± |0.0248| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 
0|exact_match|↑ | 0.2222|± |0.0296| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3283|± |0.0335| | | |none | 0|acc_norm |↑ | 0.3283|± |0.0335| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2778|± |0.0319| | | |none | 0|acc_norm |↑ | 0.2778|± |0.0319| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1923|± |0.0169| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1575|± |0.0156| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2619|± |0.0188| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2967|± |0.0196| | | |none | 0|acc_norm |↑ | 0.2967|± |0.0196| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3260|± |0.0201| | | |none | 0|acc_norm |↑ | 0.3260|± |0.0201| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1830|± |0.0183| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1429|± |0.0166| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2790|± |0.0212| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3058|± |0.0218| | | |none | 0|acc_norm |↑ | 0.3058|± |0.0218| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3103|± |0.0219| | | |none | 0|acc_norm |↑ | 0.3103|± |0.0219| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.7544|± |0.0119| | | |strict-match | 5|exact_match|↑ | 0.7566|± |0.0118| |hellaswag | 1|none | 0|acc |↑ | 0.5764|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7592|± |0.0043| |mmlu | 2|none | |acc |↑ | 0.6387|± |0.0038| | - humanities | 2|none | |acc |↑ | 0.5824|± |0.0068| | - formal_logic | 1|none | 0|acc |↑ | 0.5000|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7455|± |0.0340| | - high_school_us_history | 1|none | 0|acc |↑ | 0.8186|± |0.0270| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8354|± |0.0241| | - international_law | 1|none | 0|acc |↑ | 0.7603|± |0.0390| | - jurisprudence | 1|none | 0|acc |↑ | 0.7593|± |0.0413| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7607|± |0.0335| | - moral_disputes | 1|none | 0|acc |↑ | 0.6994|± |0.0247| | - moral_scenarios | 1|none | 0|acc |↑ | 0.3441|± |0.0159| | - philosophy | 1|none | 0|acc |↑ | 0.7170|± |0.0256| | - prehistory | 1|none | 0|acc |↑ | 0.7253|± |0.0248| | - professional_law | 1|none | 0|acc |↑ | 0.4896|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.7719|± |0.0322| | - other | 2|none | |acc |↑ | 0.7187|± |0.0078| | - business_ethics | 1|none | 0|acc |↑ | 0.6600|± |0.0476| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7208|± |0.0276| | - college_medicine | 1|none | 0|acc |↑ | 0.6243|± |0.0369| | - global_facts | 1|none | 0|acc |↑ | 0.4200|± |0.0496| | - human_aging | 1|none | 0|acc |↑ | 0.6726|± |0.0315| | - management | 1|none | 0|acc |↑ | 0.8350|± |0.0368| | - marketing | 1|none | 0|acc |↑ | 0.8932|± |0.0202| | - medical_genetics | 1|none | 0|acc |↑ | 0.8100|± |0.0394| | - miscellaneous | 1|none | 0|acc |↑ | 0.8059|± |0.0141| | - nutrition | 1|none | 0|acc |↑ | 0.7320|± |0.0254| | - professional_accounting | 1|none | 0|acc |↑ | 0.5532|± |0.0297| | - professional_medicine | 1|none | 0|acc |↑ | 0.7500|± |0.0263| | - virology | 1|none | 0|acc |↑ | 0.5120|± |0.0389| | - social sciences | 2|none | 
|acc |↑ | 0.7413|± |0.0078| | - econometrics | 1|none | 0|acc |↑ | 0.5526|± |0.0468| | - high_school_geography | 1|none | 0|acc |↑ | 0.7778|± |0.0296| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8756|± |0.0238| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.6513|± |0.0242| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.7143|± |0.0293| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8239|± |0.0163| | - human_sexuality | 1|none | 0|acc |↑ | 0.7557|± |0.0377| | - professional_psychology | 1|none | 0|acc |↑ | 0.6699|± |0.0190| | - public_relations | 1|none | 0|acc |↑ | 0.6909|± |0.0443| | - security_studies | 1|none | 0|acc |↑ | 0.7388|± |0.0281| | - sociology | 1|none | 0|acc |↑ | 0.8557|± |0.0248| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8400|± |0.0368| | - stem | 2|none | |acc |↑ | 0.5439|± |0.0086| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3100|± |0.0465| | - anatomy | 1|none | 0|acc |↑ | 0.6370|± |0.0415| | - astronomy | 1|none | 0|acc |↑ | 0.6908|± |0.0376| | - college_biology | 1|none | 0|acc |↑ | 0.7431|± |0.0365| | - college_chemistry | 1|none | 0|acc |↑ | 0.4100|± |0.0494| | - college_computer_science | 1|none | 0|acc |↑ | 0.5100|± |0.0502| | - college_mathematics | 1|none | 0|acc |↑ | 0.3300|± |0.0473| | - college_physics | 1|none | 0|acc |↑ | 0.4902|± |0.0497| | - computer_security | 1|none | 0|acc |↑ | 0.7700|± |0.0423| | - conceptual_physics | 1|none | 0|acc |↑ | 0.5489|± |0.0325| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6345|± |0.0401| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.4577|± |0.0257| | - high_school_biology | 1|none | 0|acc |↑ | 0.7677|± |0.0240| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.4828|± |0.0352| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.6800|± |0.0469| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.3778|± |0.0296| | - high_school_physics | 1|none | 0|acc |↑ | 0.4437|± |0.0406| | - high_school_statistics | 1|none | 0|acc |↑ | 0.5185|± |0.0341| | - machine_learning | 1|none | 0|acc |↑ | 0.4911|± |0.0475| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1590|± |0.0061| |openbookqa | 1|none | 0|acc |↑ | 0.3400|± |0.0212| | | |none | 0|acc_norm |↑ | 0.4300|± |0.0222| |piqa | 1|none | 0|acc |↑ | 0.7824|± |0.0096| | | |none | 0|acc_norm |↑ | 0.7873|± |0.0095| |qnli | 1|none | 0|acc |↑ | 0.5464|± |0.0067| |sciq | 1|none | 0|acc |↑ | 0.9630|± |0.0060| | | |none | 0|acc_norm |↑ | 0.9320|± |0.0080| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.5112|± |0.0037| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4761|± |0.0175| | | |none | 0|bleu_diff |↑ |-0.1939|± |0.6341| | | |none | 0|bleu_max |↑ |20.2147|± |0.7257| | | |none | 0|rouge1_acc |↑ | 0.4957|± |0.0175| | | |none | 0|rouge1_diff|↑ |-0.2355|± |0.8648| | | |none | 0|rouge1_max |↑ |43.2820|± |0.8713| | | |none | 0|rouge2_acc |↑ | 0.3684|± |0.0169| | | |none | 0|rouge2_diff|↑ |-1.5024|± |0.9176| | | |none | 0|rouge2_max |↑ |27.2640|± |0.9552| | | |none | 0|rougeL_acc |↑ | 0.4798|± |0.0175| | | |none | 0|rougeL_diff|↑ |-0.6690|± |0.8701| | | |none | 0|rougeL_max |↑ |40.4168|± |0.8713| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3635|± |0.0168| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5171|± |0.0152| |winogrande | 1|none | 0|acc |↑ | 0.7167|± |0.0127| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.6790|± |0.0053| |mmlu | 2|none | |acc |↑ |0.6387|± |0.0038| | - humanities | 2|none | |acc |↑ 
|0.5824|± |0.0068|
| - other | 2|none | |acc |↑ |0.7187|± |0.0078|
| - social sciences| 2|none | |acc |↑ |0.7413|± |0.0078|
| - stem | 2|none | |acc |↑ |0.5439|± |0.0086|
meta-llama_Meta-Llama-3-8B-Instruct: 6h 30m 49s
✅ Benchmark completed for meta-llama_Meta-Llama-3-8B-Instruct
🔥 Starting benchmark for mistralai_Mistral-7B-Instruct-v0.3
hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/mistralai_Mistral-7B-Instruct-v0.3,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 3
| Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr|
|----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:|
|anli_r1 | 1|none | 0|acc |↑ | 0.4760|± |0.0158|
|anli_r2 | 1|none | 0|acc |↑ | 0.4430|± |0.0157|
|anli_r3 | 1|none | 0|acc |↑ | 0.4483|± |0.0144|
|arc_challenge | 1|none | 0|acc |↑ | 0.5742|± |0.0144|
| | |none | 0|acc_norm |↑ | 0.5896|± |0.0144|
|bbh | 3|get-answer | |exact_match|↑ | 0.5626|± |0.0056|
| - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8520|± |0.0225|
| - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5508|± |0.0365|
| - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.6400|± |0.0304|
| - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5960|± |0.0311|
| - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0720|± |0.0164|
| - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5400|± |0.0316|
| - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.4760|± |0.0316|
| - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.6160|± |0.0308|
| - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4800|± |0.0317|
| - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.4680|± |0.0316|
| - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.8080|± |0.0250|
| - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7240|± |0.0283|
| - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.2960|± |0.0289|
| - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.6680|± |0.0298|
| - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.6160|± |0.0308|
| - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.5959|± |0.0408|
| - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.6880|± |0.0294|
| - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.4400|± |0.0315|
| - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.4280|± |0.0314|
| - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.7191|± |0.0338|
| - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9200|± |0.0172|
| - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.5240|± |0.0316|
| - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4240|± |0.0313|
| - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.3000|± |0.0290|
| - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.5840|± |0.0312|
| - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000|
| - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑
| 0.2200|± |0.0263| |boolq | 2|none | 0|acc |↑ | 0.8584|± |0.0061| |drop | 3|none | 0|em |↑ | 0.0094|± |0.0010| | | |none | 0|f1 |↑ | 0.0900|± |0.0018| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1162|± |0.0228| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1616|± |0.0262| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2172|± |0.0294| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3283|± |0.0335| | | |none | 0|acc_norm |↑ | 0.3283|± |0.0335| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2677|± |0.0315| | | |none | 0|acc_norm |↑ | 0.2677|± |0.0315| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1777|± |0.0164| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1447|± |0.0151| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2582|± |0.0187| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2894|± |0.0194| | | |none | 0|acc_norm |↑ | 0.2894|± |0.0194| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3480|± |0.0204| | | |none | 0|acc_norm |↑ | 0.3480|± |0.0204| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1585|± |0.0173| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1518|± |0.0170| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2790|± |0.0212| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3371|± |0.0224| | | |none | 0|acc_norm |↑ | 0.3371|± |0.0224| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2835|± |0.0213| | | |none | 0|acc_norm |↑ | 0.2835|± |0.0213| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.4913|± |0.0138| | | |strict-match | 5|exact_match|↑ | 0.4898|± |0.0138| |hellaswag | 1|none | 0|acc |↑ | 0.6484|± |0.0048| | | |none | 0|acc_norm |↑ | 0.8289|± |0.0038| |mmlu | 2|none | |acc |↑ | 0.5971|± |0.0039| | - humanities | 2|none | |acc |↑ | 0.5420|± |0.0067| | - formal_logic | 1|none | 0|acc |↑ | 0.4365|± |0.0444| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7394|± |0.0343| | - high_school_us_history | 1|none | 0|acc |↑ | 0.8039|± |0.0279| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7722|± |0.0273| | - international_law | 1|none | 0|acc |↑ | 0.7686|± |0.0385| | - jurisprudence | 1|none | 0|acc |↑ | 0.7593|± |0.0413| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7607|± |0.0335| | - moral_disputes | 1|none | 0|acc |↑ | 0.6763|± |0.0252| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2670|± |0.0148| | - philosophy | 1|none | 0|acc |↑ | 0.6624|± |0.0269| | - prehistory | 1|none | 0|acc |↑ | 0.6883|± |0.0258| | - professional_law | 1|none | 0|acc |↑ | 0.4492|± |0.0127| | - world_religions | 1|none | 0|acc |↑ | 0.7953|± |0.0309| | - other | 2|none | |acc |↑ | 0.6720|± |0.0081| | - business_ethics | 1|none | 0|acc |↑ | 0.6000|± |0.0492| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.6906|± |0.0285| | - college_medicine | 1|none | 0|acc |↑ | 0.5607|± |0.0378| | - global_facts | 1|none | 0|acc |↑ | 0.3900|± |0.0490| | - human_aging | 1|none | 0|acc |↑ | 0.6368|± |0.0323| | - management 
| 1|none | 0|acc |↑ | 0.7961|± |0.0399| | - marketing | 1|none | 0|acc |↑ | 0.8718|± |0.0219| | - medical_genetics | 1|none | 0|acc |↑ | 0.7000|± |0.0461| | - miscellaneous | 1|none | 0|acc |↑ | 0.7816|± |0.0148| | - nutrition | 1|none | 0|acc |↑ | 0.6634|± |0.0271| | - professional_accounting | 1|none | 0|acc |↑ | 0.4645|± |0.0298| | - professional_medicine | 1|none | 0|acc |↑ | 0.6654|± |0.0287| | - virology | 1|none | 0|acc |↑ | 0.5060|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.7000|± |0.0080| | - econometrics | 1|none | 0|acc |↑ | 0.4737|± |0.0470| | - high_school_geography | 1|none | 0|acc |↑ | 0.7525|± |0.0307| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8705|± |0.0242| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.5795|± |0.0250| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.6050|± |0.0318| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8092|± |0.0168| | - human_sexuality | 1|none | 0|acc |↑ | 0.6947|± |0.0404| | - professional_psychology | 1|none | 0|acc |↑ | 0.6242|± |0.0196| | - public_relations | 1|none | 0|acc |↑ | 0.6455|± |0.0458| | - security_studies | 1|none | 0|acc |↑ | 0.7061|± |0.0292| | - sociology | 1|none | 0|acc |↑ | 0.8458|± |0.0255| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8500|± |0.0359| | - stem | 2|none | |acc |↑ | 0.5052|± |0.0086| | - abstract_algebra | 1|none | 0|acc |↑ | 0.2700|± |0.0446| | - anatomy | 1|none | 0|acc |↑ | 0.5926|± |0.0424| | - astronomy | 1|none | 0|acc |↑ | 0.6382|± |0.0391| | - college_biology | 1|none | 0|acc |↑ | 0.7292|± |0.0372| | - college_chemistry | 1|none | 0|acc |↑ | 0.4800|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.5000|± |0.0503| | - college_mathematics | 1|none | 0|acc |↑ | 0.3500|± |0.0479| | - college_physics | 1|none | 0|acc |↑ | 0.4608|± |0.0496| | - computer_security | 1|none | 0|acc |↑ | 0.6600|± |0.0476| | - conceptual_physics | 1|none | 0|acc |↑ | 0.5234|± |0.0327| | - electrical_engineering | 1|none | 0|acc |↑ | 0.5655|± |0.0413| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.3704|± |0.0249| | - high_school_biology | 1|none | 0|acc |↑ | 0.7323|± |0.0252| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.5123|± |0.0352| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.6400|± |0.0482| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.3370|± |0.0288| | - high_school_physics | 1|none | 0|acc |↑ | 0.2980|± |0.0373| | - high_school_statistics | 1|none | 0|acc |↑ | 0.4676|± |0.0340| | - machine_learning | 1|none | 0|acc |↑ | 0.5446|± |0.0473| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1537|± |0.0060| |openbookqa | 1|none | 0|acc |↑ | 0.3540|± |0.0214| | | |none | 0|acc_norm |↑ | 0.4700|± |0.0223| |piqa | 1|none | 0|acc |↑ | 0.8156|± |0.0090| | | |none | 0|acc_norm |↑ | 0.8270|± |0.0088| |qnli | 1|none | 0|acc |↑ | 0.5146|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9600|± |0.0062| | | |none | 0|acc_norm |↑ | 0.9430|± |0.0073| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.5683|± |0.0037| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.5643|± |0.0174| | | |none | 0|bleu_diff |↑ | 8.1688|± |0.8460| | | |none | 0|bleu_max |↑ |27.6629|± |0.8109| | | |none | 0|rouge1_acc |↑ | 0.5716|± |0.0173| | | |none | 0|rouge1_diff|↑ |12.0899|± |1.1860| | | |none | 0|rouge1_max |↑ |54.8010|± |0.8641| | | |none | 0|rouge2_acc |↑ | 0.5202|± |0.0175| | | |none | 0|rouge2_diff|↑ |12.2282|± |1.2604| | | |none | 0|rouge2_max |↑ |41.2658|± |1.0220| | | |none | 0|rougeL_acc |↑ | 0.5692|± |0.0173| | | |none | 0|rougeL_diff|↑ |11.8949|± |1.1929| | | 
|none | 0|rougeL_max |↑ |51.6072|± |0.8958|
|truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.4211|± |0.0173|
|truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5968|± |0.0155|
|winogrande | 1|none | 0|acc |↑ | 0.7403|± |0.0123|
| Groups |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------|------:|----------|------|-----------|---|-----:|---|-----:|
|bbh | 3|get-answer| |exact_match|↑ |0.5626|± |0.0056|
|mmlu | 2|none | |acc |↑ |0.5971|± |0.0039|
| - humanities | 2|none | |acc |↑ |0.5420|± |0.0067|
| - other | 2|none | |acc |↑ |0.6720|± |0.0081|
| - social sciences| 2|none | |acc |↑ |0.7000|± |0.0080|
| - stem | 2|none | |acc |↑ |0.5052|± |0.0086|
mistralai_Mistral-7B-Instruct-v0.3: 8h 38m 15s
✅ Benchmark completed for mistralai_Mistral-7B-Instruct-v0.3
🔥 Starting benchmark for openchat_openchat-3.6-8b-20240522
🔥 Starting benchmark for internlm_internlm2_5-7b-chat
🔥 Starting benchmark for THUDM_chatglm3-6b
🔥 Starting benchmark for NousResearch_Hermes-2-Pro-Mistral-7B
🔥 Starting benchmark for Qwen_Qwen-1_8B-Chat
Qwen_Qwen-1_8B-Chat: 0h 18m 23s
✅ Benchmark completed for Qwen_Qwen-1_8B-Chat
🔥 Starting benchmark for Qwen_Qwen-1_8B
Qwen_Qwen-1_8B: 0h 18m 19s
✅ Benchmark completed for Qwen_Qwen-1_8B
🔥 Starting benchmark for deepseek-ai_DeepSeek-R1-Distill-Qwen-7B
hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/deepseek-ai_DeepSeek-R1-Distill-Qwen-7B,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 3
| Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr|
|----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:|
|anli_r1 | 1|none | 0|acc |↑ | 0.4450|± |0.0157|
|anli_r2 | 1|none | 0|acc |↑ | 0.4180|± |0.0156|
|anli_r3 | 1|none | 0|acc |↑ | 0.4100|± |0.0142|
|arc_challenge | 1|none | 0|acc |↑ | 0.4215|± |0.0144|
| | |none | 0|acc_norm |↑ | 0.4377|± |0.0145|
|bbh | 3|get-answer | |exact_match|↑ | 0.5569|± |0.0050|
| - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9600|± |0.0124|
| - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5187|± |0.0366|
| - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.6760|± |0.0297|
| - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5320|± |0.0316|
| - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.2640|± |0.0279|
| - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.2720|± |0.0282|
| - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3480|± |0.0302|
| - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.8400|± |0.0232|
| - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1800|± |0.0243|
| - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.0200|± |0.0089|
| - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.8040|± |0.0252|
| - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.5000|± |0.0317|
| - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.9520|± |0.0135|
| - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.9040|± |0.0187|
| - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.8320|± |0.0237|
| - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.7603|± |0.0355|
| - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.6960|± |0.0292|
| -
bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.5120|± |0.0317| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.3760|± |0.0307| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5506|± |0.0374| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.7480|± |0.0275| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.3760|± |0.0307| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4360|± |0.0314| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.0040|± |0.0040| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.8080|± |0.0250| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.2400|± |0.0271| |boolq | 2|none | 0|acc |↑ | 0.7783|± |0.0073| |drop | 3|none | 0|em |↑ | 0.0023|± |0.0005| | | |none | 0|f1 |↑ | 0.0412|± |0.0011| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0758|± |0.0189| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0404|± |0.0140| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1667|± |0.0266| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2677|± |0.0315| | | |none | 0|acc_norm |↑ | 0.2677|± |0.0315| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2778|± |0.0319| | | |none | 0|acc_norm |↑ | 0.2778|± |0.0319| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1007|± |0.0129| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0934|± |0.0125| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1758|± |0.0163| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2839|± |0.0193| | | |none | 0|acc_norm |↑ | 0.2839|± |0.0193| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3480|± |0.0204| | | |none | 0|acc_norm |↑ | 0.3480|± |0.0204| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0848|± |0.0132| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0759|± |0.0125| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1473|± |0.0168| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2634|± |0.0208| | | |none | 0|acc_norm |↑ | 0.2634|± |0.0208| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3348|± |0.0223| | | |none | 0|acc_norm |↑ | 0.3348|± |0.0223| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.7998|± |0.0110| | | |strict-match | 5|exact_match|↑ | 0.7862|± |0.0113| |hellaswag | 1|none | 0|acc |↑ | 0.4627|± |0.0050| | | |none | 0|acc_norm |↑ | 0.6026|± |0.0049| |mmlu | 2|none | |acc |↑ | 0.5263|± |0.0041| | - humanities | 2|none | |acc |↑ | 0.4406|± |0.0070| | - formal_logic | 1|none | 0|acc |↑ | 0.4921|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.6121|± |0.0380| | - high_school_us_history | 1|none | 0|acc |↑ | 0.6176|± |0.0341| | - 
high_school_world_history | 1|none | 0|acc |↑ | 0.6203|± |0.0316| | - international_law | 1|none | 0|acc |↑ | 0.6116|± |0.0445| | - jurisprudence | 1|none | 0|acc |↑ | 0.6481|± |0.0462| | - logical_fallacies | 1|none | 0|acc |↑ | 0.6319|± |0.0379| | - moral_disputes | 1|none | 0|acc |↑ | 0.5578|± |0.0267| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2581|± |0.0146| | - philosophy | 1|none | 0|acc |↑ | 0.5498|± |0.0283| | - prehistory | 1|none | 0|acc |↑ | 0.4599|± |0.0277| | - professional_law | 1|none | 0|acc |↑ | 0.3585|± |0.0122| | - world_religions | 1|none | 0|acc |↑ | 0.5614|± |0.0381| | - other | 2|none | |acc |↑ | 0.5391|± |0.0087| | - business_ethics | 1|none | 0|acc |↑ | 0.5700|± |0.0498| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.5358|± |0.0307| | - college_medicine | 1|none | 0|acc |↑ | 0.4971|± |0.0381| | - global_facts | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - human_aging | 1|none | 0|acc |↑ | 0.5426|± |0.0334| | - management | 1|none | 0|acc |↑ | 0.6796|± |0.0462| | - marketing | 1|none | 0|acc |↑ | 0.7479|± |0.0284| | - medical_genetics | 1|none | 0|acc |↑ | 0.5300|± |0.0502| | - miscellaneous | 1|none | 0|acc |↑ | 0.6309|± |0.0173| | - nutrition | 1|none | 0|acc |↑ | 0.5359|± |0.0286| | - professional_accounting | 1|none | 0|acc |↑ | 0.3936|± |0.0291| | - professional_medicine | 1|none | 0|acc |↑ | 0.3713|± |0.0293| | - virology | 1|none | 0|acc |↑ | 0.4036|± |0.0382| | - social sciences | 2|none | |acc |↑ | 0.6123|± |0.0087| | - econometrics | 1|none | 0|acc |↑ | 0.5702|± |0.0466| | - high_school_geography | 1|none | 0|acc |↑ | 0.6212|± |0.0346| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.6839|± |0.0336| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.5974|± |0.0249| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.7185|± |0.0292| | - high_school_psychology | 1|none | 0|acc |↑ | 0.6550|± |0.0204| | - human_sexuality | 1|none | 0|acc |↑ | 0.5420|± |0.0437| | - professional_psychology | 1|none | 0|acc |↑ | 0.5033|± |0.0202| | - public_relations | 1|none | 0|acc |↑ | 0.5818|± |0.0472| | - security_studies | 1|none | 0|acc |↑ | 0.5918|± |0.0315| | - sociology | 1|none | 0|acc |↑ | 0.7214|± |0.0317| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.7000|± |0.0461| | - stem | 2|none | |acc |↑ | 0.5579|± |0.0087| | - abstract_algebra | 1|none | 0|acc |↑ | 0.4600|± |0.0501| | - anatomy | 1|none | 0|acc |↑ | 0.4444|± |0.0429| | - astronomy | 1|none | 0|acc |↑ | 0.5855|± |0.0401| | - college_biology | 1|none | 0|acc |↑ | 0.5208|± |0.0418| | - college_chemistry | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.5800|± |0.0496| | - college_mathematics | 1|none | 0|acc |↑ | 0.4300|± |0.0498| | - college_physics | 1|none | 0|acc |↑ | 0.4020|± |0.0488| | - computer_security | 1|none | 0|acc |↑ | 0.6900|± |0.0465| | - conceptual_physics | 1|none | 0|acc |↑ | 0.6936|± |0.0301| | - electrical_engineering | 1|none | 0|acc |↑ | 0.5931|± |0.0409| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.6058|± |0.0252| | - high_school_biology | 1|none | 0|acc |↑ | 0.6323|± |0.0274| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.4975|± |0.0352| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.7300|± |0.0446| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4556|± |0.0304| | - high_school_physics | 1|none | 0|acc |↑ | 0.4901|± |0.0408| | - high_school_statistics | 1|none | 0|acc |↑ | 0.6389|± |0.0328| | - machine_learning | 1|none | 0|acc |↑ | 0.4286|± |0.0470| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 
0.0321|± |0.0029| |openbookqa | 1|none | 0|acc |↑ | 0.2620|± |0.0197| | | |none | 0|acc_norm |↑ | 0.3600|± |0.0215| |piqa | 1|none | 0|acc |↑ | 0.7067|± |0.0106| | | |none | 0|acc_norm |↑ | 0.7165|± |0.0105| |qnli | 1|none | 0|acc |↑ | 0.5210|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9360|± |0.0077| | | |none | 0|acc_norm |↑ | 0.9180|± |0.0087| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.0592|± |0.0018| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3843|± |0.0170| | | |none | 0|bleu_diff |↑ |-0.3986|± |0.4145| | | |none | 0|bleu_max |↑ |12.1556|± |0.4381| | | |none | 0|rouge1_acc |↑ | 0.4076|± |0.0172| | | |none | 0|rouge1_diff|↑ |-0.5348|± |0.7118| | | |none | 0|rouge1_max |↑ |34.2455|± |0.7048| | | |none | 0|rouge2_acc |↑ | 0.2925|± |0.0159| | | |none | 0|rouge2_diff|↑ |-1.4234|± |0.7703| | | |none | 0|rouge2_max |↑ |20.4502|± |0.7267| | | |none | 0|rougeL_acc |↑ | 0.3978|± |0.0171| | | |none | 0|rougeL_diff|↑ |-0.5833|± |0.7113| | | |none | 0|rougeL_max |↑ |32.3124|± |0.7036| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2889|± |0.0159| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4563|± |0.0154| |winogrande | 1|none | 0|acc |↑ | 0.5991|± |0.0138| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.5569|± |0.0050| |mmlu | 2|none | |acc |↑ |0.5263|± |0.0041| | - humanities | 2|none | |acc |↑ |0.4406|± |0.0070| | - other | 2|none | |acc |↑ |0.5391|± |0.0087| | - social sciences| 2|none | |acc |↑ |0.6123|± |0.0087| | - stem | 2|none | |acc |↑ |0.5579|± |0.0087| deepseek-ai_DeepSeek-R1-Distill-Qwen-7B: 6h 28m 41s ✅ Benchmark completed for deepseek-ai_DeepSeek-R1-Distill-Qwen-7B 🔥 Starting benchmark for deepseek-ai_deepseek-math-7b-rl hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/deepseek-ai_deepseek-math-7b-rl,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 3 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.3680|± |0.0153| |anli_r2 | 1|none | 0|acc |↑ | 0.3890|± |0.0154| |anli_r3 | 1|none | 0|acc |↑ | 0.4050|± |0.0142| |arc_challenge | 1|none | 0|acc |↑ | 0.4795|± |0.0146| | | |none | 0|acc_norm |↑ | 0.4898|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.5247|± |0.0054| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9040|± |0.0187| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.4439|± |0.0364| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.5240|± |0.0316| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5200|± |0.0317| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.1840|± |0.0246| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.1680|± |0.0237| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3200|± |0.0296| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.7040|± |0.0289| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1200|± |0.0206| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.0600|± |0.0151| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.8080|± |0.0250| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 
3|exact_match|↑ | 0.8040|± |0.0252| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.7880|± |0.0259| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8960|± |0.0193| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.6720|± |0.0298| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.3219|± |0.0388| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.5000|± |0.0317| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.5360|± |0.0316| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.4760|± |0.0316| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.4888|± |0.0376| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.7000|± |0.0290| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.3040|± |0.0292| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.5360|± |0.0316| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.4480|± |0.0315| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.7640|± |0.0269| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.7440|± |0.0277| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.3160|± |0.0295| |boolq | 2|none | 0|acc |↑ | 0.7560|± |0.0075| |drop | 3|none | 0|em |↑ | 0.0166|± |0.0013| | | |none | 0|f1 |↑ | 0.1190|± |0.0021| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2222|± |0.0296| | | |strict-match | 0|exact_match|↑ | 0.0051|± |0.0051| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2222|± |0.0296| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1667|± |0.0266| | | |strict-match | 0|exact_match|↑ | 0.0101|± |0.0071| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3131|± |0.0330| | | |none | 0|acc_norm |↑ | 0.3131|± |0.0330| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2879|± |0.0323| | | |none | 0|acc_norm |↑ | 0.2879|± |0.0323| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2234|± |0.0178| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1960|± |0.0170| | | |strict-match | 0|exact_match|↑ | 0.0110|± |0.0045| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1502|± |0.0153| | | |strict-match | 0|exact_match|↑ | 0.0037|± |0.0026| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2857|± |0.0194| | | |none | 0|acc_norm |↑ | 0.2857|± |0.0194| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2949|± |0.0195| | | |none | 0|acc_norm |↑ | 0.2949|± |0.0195| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1786|± |0.0181| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1964|± |0.0188| | | |strict-match | 0|exact_match|↑ | 0.0045|± |0.0032| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1897|± |0.0185| | | |strict-match | 0|exact_match|↑ | 0.0089|± |0.0044| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3036|± |0.0217| | | |none | 0|acc_norm |↑ | 0.3036|± |0.0217| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2723|± |0.0211| | | |none | 0|acc_norm |↑ | 0.2723|± |0.0211| |gsm8k | 3|flexible-extract 
| 5|exact_match|↑ | 0.1865|± |0.0107| | | |strict-match | 5|exact_match|↑ | 0.1425|± |0.0096| |hellaswag | 1|none | 0|acc |↑ | 0.5293|± |0.0050| | | |none | 0|acc_norm |↑ | 0.6896|± |0.0046| |mmlu | 2|none | |acc |↑ | 0.5250|± |0.0041| | - humanities | 2|none | |acc |↑ | 0.4417|± |0.0070| | - formal_logic | 1|none | 0|acc |↑ | 0.5000|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.6424|± |0.0374| | - high_school_us_history | 1|none | 0|acc |↑ | 0.5294|± |0.0350| | - high_school_world_history | 1|none | 0|acc |↑ | 0.6667|± |0.0307| | - international_law | 1|none | 0|acc |↑ | 0.6116|± |0.0445| | - jurisprudence | 1|none | 0|acc |↑ | 0.5463|± |0.0481| | - logical_fallacies | 1|none | 0|acc |↑ | 0.6442|± |0.0376| | - moral_disputes | 1|none | 0|acc |↑ | 0.5376|± |0.0268| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2492|± |0.0145| | - philosophy | 1|none | 0|acc |↑ | 0.5466|± |0.0283| | - prehistory | 1|none | 0|acc |↑ | 0.4877|± |0.0278| | - professional_law | 1|none | 0|acc |↑ | 0.3722|± |0.0123| | - world_religions | 1|none | 0|acc |↑ | 0.5673|± |0.0380| | - other | 2|none | |acc |↑ | 0.5552|± |0.0087| | - business_ethics | 1|none | 0|acc |↑ | 0.5200|± |0.0502| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.5396|± |0.0307| | - college_medicine | 1|none | 0|acc |↑ | 0.5549|± |0.0379| | - global_facts | 1|none | 0|acc |↑ | 0.2900|± |0.0456| | - human_aging | 1|none | 0|acc |↑ | 0.5471|± |0.0334| | - management | 1|none | 0|acc |↑ | 0.6990|± |0.0454| | - marketing | 1|none | 0|acc |↑ | 0.7735|± |0.0274| | - medical_genetics | 1|none | 0|acc |↑ | 0.5900|± |0.0494| | - miscellaneous | 1|none | 0|acc |↑ | 0.6424|± |0.0171| | - nutrition | 1|none | 0|acc |↑ | 0.5490|± |0.0285| | - professional_accounting | 1|none | 0|acc |↑ | 0.3901|± |0.0291| | - professional_medicine | 1|none | 0|acc |↑ | 0.4669|± |0.0303| | - virology | 1|none | 0|acc |↑ | 0.3795|± |0.0378| | - social sciences | 2|none | |acc |↑ | 0.6107|± |0.0087| | - econometrics | 1|none | 0|acc |↑ | 0.4825|± |0.0470| | - high_school_geography | 1|none | 0|acc |↑ | 0.6616|± |0.0337| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.6632|± |0.0341| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.5718|± |0.0251| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.6639|± |0.0307| | - high_school_psychology | 1|none | 0|acc |↑ | 0.6899|± |0.0198| | - human_sexuality | 1|none | 0|acc |↑ | 0.5878|± |0.0432| | - professional_psychology | 1|none | 0|acc |↑ | 0.4902|± |0.0202| | - public_relations | 1|none | 0|acc |↑ | 0.5909|± |0.0471| | - security_studies | 1|none | 0|acc |↑ | 0.5918|± |0.0315| | - sociology | 1|none | 0|acc |↑ | 0.7512|± |0.0306| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.7000|± |0.0461| | - stem | 2|none | |acc |↑ | 0.5360|± |0.0088| | - abstract_algebra | 1|none | 0|acc |↑ | 0.4000|± |0.0492| | - anatomy | 1|none | 0|acc |↑ | 0.4667|± |0.0431| | - astronomy | 1|none | 0|acc |↑ | 0.6382|± |0.0391| | - college_biology | 1|none | 0|acc |↑ | 0.6042|± |0.0409| | - college_chemistry | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.5100|± |0.0502| | - college_mathematics | 1|none | 0|acc |↑ | 0.4200|± |0.0496| | - college_physics | 1|none | 0|acc |↑ | 0.4020|± |0.0488| | - computer_security | 1|none | 0|acc |↑ | 0.6500|± |0.0479| | - conceptual_physics | 1|none | 0|acc |↑ | 0.5702|± |0.0324| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6069|± |0.0407| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.5608|± |0.0256| | - high_school_biology 
| 1|none | 0|acc |↑ | 0.6387|± |0.0273| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.5172|± |0.0352| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.6900|± |0.0465| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4185|± |0.0301| | - high_school_physics | 1|none | 0|acc |↑ | 0.3709|± |0.0394| | - high_school_statistics | 1|none | 0|acc |↑ | 0.5741|± |0.0337| | - machine_learning | 1|none | 0|acc |↑ | 0.5179|± |0.0474| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0393|± |0.0032| |openbookqa | 1|none | 0|acc |↑ | 0.3280|± |0.0210| | | |none | 0|acc_norm |↑ | 0.4240|± |0.0221| |piqa | 1|none | 0|acc |↑ | 0.7410|± |0.0102| | | |none | 0|acc_norm |↑ | 0.7503|± |0.0101| |qnli | 1|none | 0|acc |↑ | 0.4990|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9540|± |0.0066| | | |none | 0|acc_norm |↑ | 0.9280|± |0.0082| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.1747|± |0.0028| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3452|± |0.0166| | | |none | 0|bleu_diff |↑ |-4.1463|± |0.6373| | | |none | 0|bleu_max |↑ |21.0959|± |0.7171| | | |none | 0|rouge1_acc |↑ | 0.3256|± |0.0164| | | |none | 0|rouge1_diff|↑ |-6.5427|± |0.7175| | | |none | 0|rouge1_max |↑ |44.3011|± |0.8363| | | |none | 0|rouge2_acc |↑ | 0.2815|± |0.0157| | | |none | 0|rouge2_diff|↑ |-6.6740|± |0.8391| | | |none | 0|rouge2_max |↑ |29.9891|± |0.9108| | | |none | 0|rougeL_acc |↑ | 0.3097|± |0.0162| | | |none | 0|rougeL_diff|↑ |-6.8840|± |0.7254| | | |none | 0|rougeL_max |↑ |41.2479|± |0.8408| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2876|± |0.0158| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4029|± |0.0153| |winogrande | 1|none | 0|acc |↑ | 0.6511|± |0.0134| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.5247|± |0.0054| |mmlu | 2|none | |acc |↑ |0.5250|± |0.0041| | - humanities | 2|none | |acc |↑ |0.4417|± |0.0070| | - other | 2|none | |acc |↑ |0.5552|± |0.0087| | - social sciences| 2|none | |acc |↑ |0.6107|± |0.0087| | - stem | 2|none | |acc |↑ |0.5360|± |0.0088| deepseek-ai_deepseek-math-7b-rl: 8h 2m 14s ✅ Benchmark completed for deepseek-ai_deepseek-math-7b-rl 🔥 Starting benchmark for deepseek-ai_deepseek-llm-7b-chat hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/deepseek-ai_deepseek-llm-7b-chat,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 3 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4230|± |0.0156| |anli_r2 | 1|none | 0|acc |↑ | 0.4190|± |0.0156| |anli_r3 | 1|none | 0|acc |↑ | 0.4208|± |0.0143| |arc_challenge | 1|none | 0|acc |↑ | 0.4812|± |0.0146| | | |none | 0|acc_norm |↑ | 0.4966|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.4548|± |0.0054| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.7320|± |0.0281| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5615|± |0.0364| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.5760|± |0.0313| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5800|± |0.0313| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0160|± |0.0080| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5160|± |0.0317| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 
0.3520|± |0.0303| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.7080|± |0.0288| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2840|± |0.0286| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.2160|± |0.0261| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.5920|± |0.0311| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7680|± |0.0268| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.1000|± |0.0190| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.5920|± |0.0311| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.5960|± |0.0311| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.4795|± |0.0415| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.4840|± |0.0317| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.4560|± |0.0316| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.3880|± |0.0309| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5225|± |0.0375| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9160|± |0.0176| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.2960|± |0.0289| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1520|± |0.0228| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1360|± |0.0217| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.3200|± |0.0296| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9320|± |0.0160| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0640|± |0.0155| |boolq | 2|none | 0|acc |↑ | 0.8330|± |0.0065| |drop | 3|none | 0|em |↑ | 0.0113|± |0.0011| | | |none | 0|f1 |↑ | 0.1030|± |0.0019| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0808|± |0.0194| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1515|± |0.0255| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2626|± |0.0314| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3030|± |0.0327| | | |none | 0|acc_norm |↑ | 0.3030|± |0.0327| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3485|± |0.0339| | | |none | 0|acc_norm |↑ | 0.3485|± |0.0339| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1612|± |0.0158| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1612|± |0.0158| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2564|± |0.0187| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2711|± |0.0190| | | |none | 0|acc_norm |↑ | 0.2711|± |0.0190| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2766|± |0.0192| | | |none | 0|acc_norm |↑ | 0.2766|± |0.0192| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1607|± |0.0174| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 
0|exact_match|↑ | 0.1496|± |0.0169| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2679|± |0.0209| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2746|± |0.0211| | | |none | 0|acc_norm |↑ | 0.2746|± |0.0211| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2924|± |0.0215| | | |none | 0|acc_norm |↑ | 0.2924|± |0.0215| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.5087|± |0.0138| | | |strict-match | 5|exact_match|↑ | 0.4640|± |0.0137| |hellaswag | 1|none | 0|acc |↑ | 0.5914|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7772|± |0.0042| |mmlu | 2|none | |acc |↑ | 0.4988|± |0.0040| | - humanities | 2|none | |acc |↑ | 0.4627|± |0.0069| | - formal_logic | 1|none | 0|acc |↑ | 0.3254|± |0.0419| | - high_school_european_history | 1|none | 0|acc |↑ | 0.6485|± |0.0373| | - high_school_us_history | 1|none | 0|acc |↑ | 0.6912|± |0.0324| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7215|± |0.0292| | - international_law | 1|none | 0|acc |↑ | 0.6364|± |0.0439| | - jurisprudence | 1|none | 0|acc |↑ | 0.6019|± |0.0473| | - logical_fallacies | 1|none | 0|acc |↑ | 0.6626|± |0.0371| | - moral_disputes | 1|none | 0|acc |↑ | 0.5607|± |0.0267| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2391|± |0.0143| | - philosophy | 1|none | 0|acc |↑ | 0.5627|± |0.0282| | - prehistory | 1|none | 0|acc |↑ | 0.5586|± |0.0276| | - professional_law | 1|none | 0|acc |↑ | 0.3761|± |0.0124| | - world_religions | 1|none | 0|acc |↑ | 0.7368|± |0.0338| | - other | 2|none | |acc |↑ | 0.5768|± |0.0086| | - business_ethics | 1|none | 0|acc |↑ | 0.5600|± |0.0499| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.5585|± |0.0306| | - college_medicine | 1|none | 0|acc |↑ | 0.4740|± |0.0381| | - global_facts | 1|none | 0|acc |↑ | 0.4100|± |0.0494| | - human_aging | 1|none | 0|acc |↑ | 0.5291|± |0.0335| | - management | 1|none | 0|acc |↑ | 0.6893|± |0.0458| | - marketing | 1|none | 0|acc |↑ | 0.7821|± |0.0270| | - medical_genetics | 1|none | 0|acc |↑ | 0.5900|± |0.0494| | - miscellaneous | 1|none | 0|acc |↑ | 0.7241|± |0.0160| | - nutrition | 1|none | 0|acc |↑ | 0.5490|± |0.0285| | - professional_accounting | 1|none | 0|acc |↑ | 0.3546|± |0.0285| | - professional_medicine | 1|none | 0|acc |↑ | 0.4522|± |0.0302| | - virology | 1|none | 0|acc |↑ | 0.4578|± |0.0388| | - social sciences | 2|none | |acc |↑ | 0.5635|± |0.0087| | - econometrics | 1|none | 0|acc |↑ | 0.3246|± |0.0440| | - high_school_geography | 1|none | 0|acc |↑ | 0.6364|± |0.0343| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.6839|± |0.0336| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.4333|± |0.0251| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.4328|± |0.0322| | - high_school_psychology | 1|none | 0|acc |↑ | 0.6991|± |0.0197| | - human_sexuality | 1|none | 0|acc |↑ | 0.5420|± |0.0437| | - professional_psychology | 1|none | 0|acc |↑ | 0.4886|± |0.0202| | - public_relations | 1|none | 0|acc |↑ | 0.5727|± |0.0474| | - security_studies | 1|none | 0|acc |↑ | 0.5796|± |0.0316| | - sociology | 1|none | 0|acc |↑ | 0.7015|± |0.0324| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.7000|± |0.0461| | - stem | 2|none | |acc |↑ | 0.4126|± |0.0086| | - abstract_algebra | 1|none | 0|acc |↑ | 0.2900|± |0.0456| | - anatomy | 1|none | 0|acc |↑ | 0.4963|± |0.0432| | - astronomy | 1|none | 0|acc |↑ | 0.5329|± |0.0406| | - college_biology | 1|none | 0|acc |↑ | 0.5208|± |0.0418| | - college_chemistry | 1|none | 0|acc |↑ | 0.3700|± |0.0485| | 
- college_computer_science | 1|none | 0|acc |↑ | 0.4100|± |0.0494| | - college_mathematics | 1|none | 0|acc |↑ | 0.3900|± |0.0490| | - college_physics | 1|none | 0|acc |↑ | 0.3235|± |0.0466| | - computer_security | 1|none | 0|acc |↑ | 0.6200|± |0.0488| | - conceptual_physics | 1|none | 0|acc |↑ | 0.3915|± |0.0319| | - electrical_engineering | 1|none | 0|acc |↑ | 0.4069|± |0.0409| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.3307|± |0.0242| | - high_school_biology | 1|none | 0|acc |↑ | 0.5903|± |0.0280| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.3793|± |0.0341| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.4600|± |0.0501| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.2741|± |0.0272| | - high_school_physics | 1|none | 0|acc |↑ | 0.2914|± |0.0371| | - high_school_statistics | 1|none | 0|acc |↑ | 0.3750|± |0.0330| | - machine_learning | 1|none | 0|acc |↑ | 0.5000|± |0.0475| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0634|± |0.0041| |openbookqa | 1|none | 0|acc |↑ | 0.3500|± |0.0214| | | |none | 0|acc_norm |↑ | 0.4600|± |0.0223| |piqa | 1|none | 0|acc |↑ | 0.7949|± |0.0094| | | |none | 0|acc_norm |↑ | 0.8014|± |0.0093| |qnli | 1|none | 0|acc |↑ | 0.4970|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9250|± |0.0083| | | |none | 0|acc_norm |↑ | 0.8930|± |0.0098| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.3112|± |0.0035| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4321|± |0.0173| | | |none | 0|bleu_diff |↑ |-1.9106|± |0.5822| | | |none | 0|bleu_max |↑ |20.8129|± |0.7241| | | |none | 0|rouge1_acc |↑ | 0.4541|± |0.0174| | | |none | 0|rouge1_diff|↑ |-2.6308|± |0.6640| | | |none | 0|rouge1_max |↑ |45.5368|± |0.8107| | | |none | 0|rouge2_acc |↑ | 0.3415|± |0.0166| | | |none | 0|rouge2_diff|↑ |-3.7356|± |0.7920| | | |none | 0|rouge2_max |↑ |31.2711|± |0.9083| | | |none | 0|rougeL_acc |↑ | 0.4272|± |0.0173| | | |none | 0|rougeL_diff|↑ |-2.8527|± |0.6674| | | |none | 0|rougeL_max |↑ |42.3939|± |0.8172| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3488|± |0.0167| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4789|± |0.0154| |winogrande | 1|none | 0|acc |↑ | 0.7017|± |0.0129| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.4548|± |0.0054| |mmlu | 2|none | |acc |↑ |0.4988|± |0.0040| | - humanities | 2|none | |acc |↑ |0.4627|± |0.0069| | - other | 2|none | |acc |↑ |0.5768|± |0.0086| | - social sciences| 2|none | |acc |↑ |0.5635|± |0.0087| | - stem | 2|none | |acc |↑ |0.4126|± |0.0086| deepseek-ai_deepseek-llm-7b-chat: 10h 7m 2s ✅ Benchmark completed for deepseek-ai_deepseek-llm-7b-chat 🔥 Starting benchmark for deepseek-ai_deepseek-llm-7b-base hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/deepseek-ai_deepseek-llm-7b-base,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 3 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|-------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.3400|± |0.0150| |anli_r2 | 1|none | 0|acc |↑ | 0.3630|± |0.0152| |anli_r3 | 1|none | 0|acc |↑ | 0.3775|± |0.0140| |arc_challenge | 1|none | 0|acc |↑ | 0.4352|± |0.0145| | | |none | 0|acc_norm |↑ | 0.4454|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.4237|± |0.0054| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.7320|± |0.0281| | - bbh_cot_fewshot_causal_judgement | 4|get-answer 
| 3|exact_match|↑ | 0.4759|± |0.0366| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.6640|± |0.0299| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.4120|± |0.0312| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0160|± |0.0080| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5160|± |0.0317| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3720|± |0.0306| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.6720|± |0.0298| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2960|± |0.0289| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.1920|± |0.0250| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.4880|± |0.0317| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7920|± |0.0257| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.0160|± |0.0080| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.5600|± |0.0315| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.4880|± |0.0317| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.3973|± |0.0406| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.4680|± |0.0316| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.4560|± |0.0316| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.3160|± |0.0295| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.5056|± |0.0376| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9040|± |0.0187| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.2800|± |0.0285| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1920|± |0.0250| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1040|± |0.0193| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.3240|± |0.0297| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.7640|± |0.0269| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0640|± |0.0155| |boolq | 2|none | 0|acc |↑ | 0.7235|± |0.0078| |drop | 3|none | 0|em |↑ | 0.0168|± |0.0013| | | |none | 0|f1 |↑ | 0.0422|± |0.0016| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1061|± |0.0219| | | |strict-match | 0|exact_match|↑ | 0.0303|± |0.0122| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0960|± |0.0210| | | |strict-match | 0|exact_match|↑ | 0.0152|± |0.0087| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2020|± |0.0286| | | |strict-match | 0|exact_match|↑ | 0.0101|± |0.0071| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2677|± |0.0315| | | |none | 0|acc_norm |↑ | 0.2677|± |0.0315| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2677|± |0.0315| | | |none | 0|acc_norm |↑ | 0.2677|± |0.0315| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1026|± |0.0130| | | |strict-match | 0|exact_match|↑ | 0.0165|± |0.0055| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1062|± |0.0132| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 
0.1960|± |0.0170| | | |strict-match | 0|exact_match|↑ | 0.0055|± |0.0032| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2711|± |0.0190| | | |none | 0|acc_norm |↑ | 0.2711|± |0.0190| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2747|± |0.0191| | | |none | 0|acc_norm |↑ | 0.2747|± |0.0191| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1250|± |0.0156| | | |strict-match | 0|exact_match|↑ | 0.0223|± |0.0070| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0871|± |0.0133| | | |strict-match | 0|exact_match|↑ | 0.0045|± |0.0032| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2344|± |0.0200| | | |strict-match | 0|exact_match|↑ | 0.0045|± |0.0032| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2567|± |0.0207| | | |none | 0|acc_norm |↑ | 0.2567|± |0.0207| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2522|± |0.0205| | | |none | 0|acc_norm |↑ | 0.2522|± |0.0205| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.1638|± |0.0102| | | |strict-match | 5|exact_match|↑ | 0.1622|± |0.0102| |hellaswag | 1|none | 0|acc |↑ | 0.5706|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7606|± |0.0043| |mmlu | 2|none | |acc |↑ | 0.4428|± |0.0041| | - humanities | 2|none | |acc |↑ | 0.4106|± |0.0069| | - formal_logic | 1|none | 0|acc |↑ | 0.2540|± |0.0389| | - high_school_european_history | 1|none | 0|acc |↑ | 0.5576|± |0.0388| | - high_school_us_history | 1|none | 0|acc |↑ | 0.5490|± |0.0349| | - high_school_world_history | 1|none | 0|acc |↑ | 0.5992|± |0.0319| | - international_law | 1|none | 0|acc |↑ | 0.5868|± |0.0450| | - jurisprudence | 1|none | 0|acc |↑ | 0.5926|± |0.0475| | - logical_fallacies | 1|none | 0|acc |↑ | 0.5951|± |0.0386| | - moral_disputes | 1|none | 0|acc |↑ | 0.4827|± |0.0269| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2425|± |0.0143| | - philosophy | 1|none | 0|acc |↑ | 0.5498|± |0.0283| | - prehistory | 1|none | 0|acc |↑ | 0.5062|± |0.0278| | - professional_law | 1|none | 0|acc |↑ | 0.3214|± |0.0119| | - world_religions | 1|none | 0|acc |↑ | 0.6433|± |0.0367| | - other | 2|none | |acc |↑ | 0.4982|± |0.0088| | - business_ethics | 1|none | 0|acc |↑ | 0.4200|± |0.0496| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.4792|± |0.0307| | - college_medicine | 1|none | 0|acc |↑ | 0.4104|± |0.0375| | - global_facts | 1|none | 0|acc |↑ | 0.3300|± |0.0473| | - human_aging | 1|none | 0|acc |↑ | 0.4798|± |0.0335| | - management | 1|none | 0|acc |↑ | 0.5437|± |0.0493| | - marketing | 1|none | 0|acc |↑ | 0.6453|± |0.0313| | - medical_genetics | 1|none | 0|acc |↑ | 0.4600|± |0.0501| | - miscellaneous | 1|none | 0|acc |↑ | 0.6245|± |0.0173| | - nutrition | 1|none | 0|acc |↑ | 0.4771|± |0.0286| | - professional_accounting | 1|none | 0|acc |↑ | 0.3794|± |0.0289| | - professional_medicine | 1|none | 0|acc |↑ | 0.3897|± |0.0296| | - virology | 1|none | 0|acc |↑ | 0.4036|± |0.0382| | - social sciences | 2|none | |acc |↑ | 0.5005|± |0.0089| | - econometrics | 1|none | 0|acc |↑ | 0.2719|± |0.0419| | - high_school_geography | 1|none | 0|acc |↑ | 0.4697|± |0.0356| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.5492|± |0.0359| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.4179|± |0.0250| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.3992|± |0.0318| | - high_school_psychology | 1|none | 0|acc |↑ | 0.5817|± |0.0211| | - human_sexuality | 1|none | 0|acc |↑ | 0.5725|± |0.0434| | - professional_psychology | 1|none | 0|acc |↑ | 0.4641|± |0.0202| | - public_relations | 1|none | 0|acc |↑ | 0.5091|± |0.0479| | - security_studies | 1|none | 0|acc |↑ 
| 0.5020|± |0.0320| | - sociology | 1|none | 0|acc |↑ | 0.6368|± |0.0340| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.6900|± |0.0465| | - stem | 2|none | |acc |↑ | 0.3800|± |0.0085| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3000|± |0.0461| | - anatomy | 1|none | 0|acc |↑ | 0.4815|± |0.0432| | - astronomy | 1|none | 0|acc |↑ | 0.4934|± |0.0407| | - college_biology | 1|none | 0|acc |↑ | 0.4653|± |0.0417| | - college_chemistry | 1|none | 0|acc |↑ | 0.3300|± |0.0473| | - college_computer_science | 1|none | 0|acc |↑ | 0.3500|± |0.0479| | - college_mathematics | 1|none | 0|acc |↑ | 0.3200|± |0.0469| | - college_physics | 1|none | 0|acc |↑ | 0.2745|± |0.0444| | - computer_security | 1|none | 0|acc |↑ | 0.5200|± |0.0502| | - conceptual_physics | 1|none | 0|acc |↑ | 0.3957|± |0.0320| | - electrical_engineering | 1|none | 0|acc |↑ | 0.4621|± |0.0415| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.2857|± |0.0233| | - high_school_biology | 1|none | 0|acc |↑ | 0.5032|± |0.0284| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.3744|± |0.0341| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.4200|± |0.0496| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.2815|± |0.0274| | - high_school_physics | 1|none | 0|acc |↑ | 0.3245|± |0.0382| | - high_school_statistics | 1|none | 0|acc |↑ | 0.3796|± |0.0331| | - machine_learning | 1|none | 0|acc |↑ | 0.2857|± |0.0429| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1510|± |0.0060| |openbookqa | 1|none | 0|acc |↑ | 0.3260|± |0.0210| | | |none | 0|acc_norm |↑ | 0.4340|± |0.0222| |piqa | 1|none | 0|acc |↑ | 0.7894|± |0.0095| | | |none | 0|acc_norm |↑ | 0.7976|± |0.0094| |qnli | 1|none | 0|acc |↑ | 0.4959|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9400|± |0.0075| | | |none | 0|acc_norm |↑ | 0.9150|± |0.0088| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.5004|± |0.0037| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3097|± |0.0162| | | |none | 0|bleu_diff |↑ | -8.7271|± |0.7712| | | |none | 0|bleu_max |↑ | 24.9259|± |0.7566| | | |none | 0|rouge1_acc |↑ | 0.2974|± |0.0160| | | |none | 0|rouge1_diff|↑ |-11.0783|± |0.8128| | | |none | 0|rouge1_max |↑ | 50.8642|± |0.8265| | | |none | 0|rouge2_acc |↑ | 0.2436|± |0.0150| | | |none | 0|rouge2_diff|↑ |-13.5478|± |0.9872| | | |none | 0|rouge2_max |↑ | 34.4263|± |0.9544| | | |none | 0|rougeL_acc |↑ | 0.2876|± |0.0158| | | |none | 0|rougeL_diff|↑ |-11.6501|± |0.8204| | | |none | 0|rougeL_max |↑ | 47.8267|± |0.8414| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2326|± |0.0148| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.3492|± |0.0137| |winogrande | 1|none | 0|acc |↑ | 0.6938|± |0.0130| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.4237|± |0.0054| |mmlu | 2|none | |acc |↑ |0.4428|± |0.0041| | - humanities | 2|none | |acc |↑ |0.4106|± |0.0069| | - other | 2|none | |acc |↑ |0.4982|± |0.0088| | - social sciences| 2|none | |acc |↑ |0.5005|± |0.0089| | - stem | 2|none | |acc |↑ |0.3800|± |0.0085| deepseek-ai_deepseek-llm-7b-base: 7h 11m 27s ✅ Benchmark completed for deepseek-ai_deepseek-llm-7b-base 🔥 Starting benchmark for openchat_openchat-3.6-8b-20240522 hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/openchat_openchat-3.6-8b-20240522,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 3 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| 
|----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.5560|± |0.0157| |anli_r2 | 1|none | 0|acc |↑ | 0.5130|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.4800|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.5640|± |0.0145| | | |none | 0|acc_norm |↑ | 0.6032|± |0.0143| |bbh | 3|get-answer | |exact_match|↑ | 0.6179|± |0.0054| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8480|± |0.0228| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5294|± |0.0366| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.5840|± |0.0312| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.7800|± |0.0263| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0320|± |0.0112| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5480|± |0.0315| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.4600|± |0.0316| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.9440|± |0.0146| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4200|± |0.0313| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.3600|± |0.0304| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.7480|± |0.0275| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.8720|± |0.0212| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.5200|± |0.0317| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.7320|± |0.0281| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.8600|± |0.0220| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.6233|± |0.0402| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.6200|± |0.0308| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.6200|± |0.0308| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5720|± |0.0314| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.4551|± |0.0374| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9080|± |0.0183| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.8760|± |0.0209| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4720|± |0.0316| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.4360|± |0.0314| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.5120|± |0.0317| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.8920|± |0.0197| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.3920|± |0.0309| |boolq | 2|none | 0|acc |↑ | 0.8728|± |0.0058| |drop | 3|none | 0|em |↑ | 0.0547|± |0.0023| | | |none | 0|f1 |↑ | 0.2516|± |0.0032| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2172|± |0.0294| | | |strict-match | 0|exact_match|↑ | 0.0202|± |0.0100| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2020|± |0.0286| | | |strict-match | 0|exact_match|↑ | 0.0556|± |0.0163| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2980|± |0.0326| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| 
|gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3283|± |0.0335| | | |none | 0|acc_norm |↑ | 0.3283|± |0.0335| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3333|± |0.0336| | | |none | 0|acc_norm |↑ | 0.3333|± |0.0336| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2216|± |0.0178| | | |strict-match | 0|exact_match|↑ | 0.0147|± |0.0051| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2546|± |0.0187| | | |strict-match | 0|exact_match|↑ | 0.0403|± |0.0084| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3095|± |0.0198| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3242|± |0.0200| | | |none | 0|acc_norm |↑ | 0.3242|± |0.0200| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3425|± |0.0203| | | |none | 0|acc_norm |↑ | 0.3425|± |0.0203| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2098|± |0.0193| | | |strict-match | 0|exact_match|↑ | 0.0134|± |0.0054| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.2567|± |0.0207| | | |strict-match | 0|exact_match|↑ | 0.0268|± |0.0076| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3237|± |0.0221| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3438|± |0.0225| | | |none | 0|acc_norm |↑ | 0.3438|± |0.0225| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3326|± |0.0223| | | |none | 0|acc_norm |↑ | 0.3326|± |0.0223| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.7521|± |0.0119| | | |strict-match | 5|exact_match|↑ | 0.7506|± |0.0119| |hellaswag | 1|none | 0|acc |↑ | 0.6116|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7978|± |0.0040| |mmlu | 2|none | |acc |↑ | 0.6431|± |0.0038| | - humanities | 2|none | |acc |↑ | 0.5966|± |0.0068| | - formal_logic | 1|none | 0|acc |↑ | 0.5000|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7515|± |0.0337| | - high_school_us_history | 1|none | 0|acc |↑ | 0.8333|± |0.0262| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8439|± |0.0236| | - international_law | 1|none | 0|acc |↑ | 0.7438|± |0.0398| | - jurisprudence | 1|none | 0|acc |↑ | 0.7870|± |0.0396| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7423|± |0.0344| | - moral_disputes | 1|none | 0|acc |↑ | 0.7110|± |0.0244| | - moral_scenarios | 1|none | 0|acc |↑ | 0.4313|± |0.0166| | - philosophy | 1|none | 0|acc |↑ | 0.6849|± |0.0264| | - prehistory | 1|none | 0|acc |↑ | 0.7191|± |0.0250| | - professional_law | 1|none | 0|acc |↑ | 0.4831|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.7895|± |0.0313| | - other | 2|none | |acc |↑ | 0.7071|± |0.0079| | - business_ethics | 1|none | 0|acc |↑ | 0.6500|± |0.0479| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7245|± |0.0275| | - college_medicine | 1|none | 0|acc |↑ | 0.6358|± |0.0367| | - global_facts | 1|none | 0|acc |↑ | 0.4200|± |0.0496| | - human_aging | 1|none | 0|acc |↑ | 0.7175|± |0.0302| | - management | 1|none | 0|acc |↑ | 0.7961|± |0.0399| | - marketing | 1|none | 0|acc |↑ | 0.8803|± |0.0213| | - medical_genetics | 1|none | 0|acc |↑ | 0.7200|± |0.0451| | - miscellaneous | 1|none | 0|acc |↑ | 0.8238|± |0.0136| | - nutrition | 1|none | 0|acc |↑ | 0.7124|± |0.0259| | - professional_accounting | 1|none | 0|acc |↑ | 0.5142|± |0.0298| | - professional_medicine | 1|none | 0|acc |↑ | 0.6434|± |0.0291| | - virology | 1|none | 0|acc |↑ | 0.5120|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.7452|± |0.0077| | - econometrics | 1|none | 0|acc |↑ | 0.4561|± |0.0469| | - 
high_school_geography | 1|none | 0|acc |↑ | 0.7828|± |0.0294| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.9067|± |0.0210| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.6590|± |0.0240| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.7353|± |0.0287| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8349|± |0.0159| | - human_sexuality | 1|none | 0|acc |↑ | 0.7710|± |0.0369| | - professional_psychology | 1|none | 0|acc |↑ | 0.6765|± |0.0189| | - public_relations | 1|none | 0|acc |↑ | 0.6364|± |0.0461| | - security_studies | 1|none | 0|acc |↑ | 0.7347|± |0.0283| | - sociology | 1|none | 0|acc |↑ | 0.8458|± |0.0255| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8900|± |0.0314| | - stem | 2|none | |acc |↑ | 0.5496|± |0.0086| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3600|± |0.0482| | - anatomy | 1|none | 0|acc |↑ | 0.6815|± |0.0402| | - astronomy | 1|none | 0|acc |↑ | 0.7368|± |0.0358| | - college_biology | 1|none | 0|acc |↑ | 0.7153|± |0.0377| | - college_chemistry | 1|none | 0|acc |↑ | 0.4100|± |0.0494| | - college_computer_science | 1|none | 0|acc |↑ | 0.5200|± |0.0502| | - college_mathematics | 1|none | 0|acc |↑ | 0.3300|± |0.0473| | - college_physics | 1|none | 0|acc |↑ | 0.4804|± |0.0497| | - computer_security | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - conceptual_physics | 1|none | 0|acc |↑ | 0.5787|± |0.0323| | - electrical_engineering | 1|none | 0|acc |↑ | 0.5655|± |0.0413| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.4656|± |0.0257| | - high_school_biology | 1|none | 0|acc |↑ | 0.7484|± |0.0247| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.5320|± |0.0351| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.6600|± |0.0476| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4000|± |0.0299| | - high_school_physics | 1|none | 0|acc |↑ | 0.3907|± |0.0398| | - high_school_statistics | 1|none | 0|acc |↑ | 0.5093|± |0.0341| | - machine_learning | 1|none | 0|acc |↑ | 0.5625|± |0.0471| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.1706|± |0.0063| |openbookqa | 1|none | 0|acc |↑ | 0.3700|± |0.0216| | | |none | 0|acc_norm |↑ | 0.4620|± |0.0223| |piqa | 1|none | 0|acc |↑ | 0.8041|± |0.0093| | | |none | 0|acc_norm |↑ | 0.8183|± |0.0090| |qnli | 1|none | 0|acc |↑ | 0.7300|± |0.0060| |sciq | 1|none | 0|acc |↑ | 0.9730|± |0.0051| | | |none | 0|acc_norm |↑ | 0.9640|± |0.0059| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.5659|± |0.0037| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4162|± |0.0173| | | |none | 0|bleu_diff |↑ |-2.4558|± |0.6495| | | |none | 0|bleu_max |↑ |22.9231|± |0.7496| | | |none | 0|rouge1_acc |↑ | 0.4088|± |0.0172| | | |none | 0|rouge1_diff|↑ |-3.9076|± |0.7660| | | |none | 0|rouge1_max |↑ |47.4547|± |0.8751| | | |none | 0|rouge2_acc |↑ | 0.3550|± |0.0168| | | |none | 0|rouge2_diff|↑ |-4.4347|± |0.8978| | | |none | 0|rouge2_max |↑ |33.1938|± |0.9499| | | |none | 0|rougeL_acc |↑ | 0.4051|± |0.0172| | | |none | 0|rougeL_diff|↑ |-3.9650|± |0.7656| | | |none | 0|rougeL_max |↑ |44.6201|± |0.8785| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3525|± |0.0167| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4976|± |0.0152| |winogrande | 1|none | 0|acc |↑ | 0.7632|± |0.0119| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.6179|± |0.0054| |mmlu | 2|none | |acc |↑ |0.6431|± |0.0038| | - humanities | 2|none | |acc |↑ |0.5966|± |0.0068| | - other | 2|none | |acc |↑ |0.7071|± |0.0079| | - social sciences| 2|none 
| |acc |↑ |0.7452|± |0.0077| | - stem | 2|none | |acc |↑ |0.5496|± |0.0086| openchat_openchat-3.6-8b-20240522: 7h 51m 28s ✅ Benchmark completed for openchat_openchat-3.6-8b-20240522 🔥 Starting benchmark for internlm_internlm2_5-7b-chat internlm_internlm2_5-7b-chat: 0h 4m 58s ✅ Benchmark completed for internlm_internlm2_5-7b-chat 🔥 Starting benchmark for THUDM_chatglm3-6b THUDM_chatglm3-6b: 0h 32m 25s ✅ Benchmark completed for THUDM_chatglm3-6b 🔥 Starting benchmark for NousResearch_Hermes-2-Pro-Mistral-7B hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/NousResearch_Hermes-2-Pro-Mistral-7B,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 3 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.5310|± |0.0158| |anli_r2 | 1|none | 0|acc |↑ | 0.4960|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.5000|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.5444|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5657|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.5738|± |0.0055| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8080|± |0.0250| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5615|± |0.0364| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.6320|± |0.0306| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.6200|± |0.0308| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0720|± |0.0164| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.5360|± |0.0316| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3920|± |0.0309| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.8000|± |0.0253| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4640|± |0.0316| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.2240|± |0.0264| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.8680|± |0.0215| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7880|± |0.0259| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.3800|± |0.0308| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.6440|± |0.0303| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.6840|± |0.0295| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.5959|± |0.0408| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.6360|± |0.0305| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.6480|± |0.0303| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5600|± |0.0315| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.7022|± |0.0344| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.9400|± |0.0151| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.6480|± |0.0303| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.3800|± |0.0308| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.2720|± |0.0282| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ 
| 0.4840|± |0.0317| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9640|± |0.0118| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.2320|± |0.0268| |boolq | 2|none | 0|acc |↑ | 0.8682|± |0.0059| |drop | 3|none | 0|em |↑ | 0.0167|± |0.0013| | | |none | 0|f1 |↑ | 0.1098|± |0.0022| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1162|± |0.0228| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0960|± |0.0210| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2525|± |0.0310| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3232|± |0.0333| | | |none | 0|acc_norm |↑ | 0.3232|± |0.0333| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2677|± |0.0315| | | |none | 0|acc_norm |↑ | 0.2677|± |0.0315| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1685|± |0.0160| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1520|± |0.0154| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2637|± |0.0189| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2875|± |0.0194| | | |none | 0|acc_norm |↑ | 0.2875|± |0.0194| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2949|± |0.0195| | | |none | 0|acc_norm |↑ | 0.2949|± |0.0195| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1317|± |0.0160| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1228|± |0.0155| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2366|± |0.0201| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2991|± |0.0217| | | |none | 0|acc_norm |↑ | 0.2991|± |0.0217| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2768|± |0.0212| | | |none | 0|acc_norm |↑ | 0.2768|± |0.0212| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.6861|± |0.0128| | | |strict-match | 5|exact_match|↑ | 0.6854|± |0.0128| |hellaswag | 1|none | 0|acc |↑ | 0.6270|± |0.0048| | | |none | 0|acc_norm |↑ | 0.8049|± |0.0040| |mmlu | 2|none | |acc |↑ | 0.6051|± |0.0039| | - humanities | 2|none | |acc |↑ | 0.5484|± |0.0067| | - formal_logic | 1|none | 0|acc |↑ | 0.4127|± |0.0440| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7697|± |0.0329| | - high_school_us_history | 1|none | 0|acc |↑ | 0.7990|± |0.0281| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8354|± |0.0241| | - international_law | 1|none | 0|acc |↑ | 0.7769|± |0.0380| | - jurisprudence | 1|none | 0|acc |↑ | 0.7130|± |0.0437| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7178|± |0.0354| | - moral_disputes | 1|none | 0|acc |↑ | 0.6821|± |0.0251| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2492|± |0.0145| | - philosophy | 1|none | 0|acc |↑ | 0.6752|± |0.0266| | - prehistory | 1|none | 0|acc |↑ | 0.7253|± |0.0248| | - professional_law | 1|none | 0|acc |↑ | 0.4622|± |0.0127| | - world_religions | 1|none | 0|acc |↑ | 0.8129|± |0.0299| | - other | 2|none | |acc |↑ | 0.6807|± |0.0081| | - business_ethics | 1|none | 0|acc |↑ | 0.5600|± |0.0499| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.6755|± |0.0288| | - 
college_medicine | 1|none | 0|acc |↑ | 0.6301|± |0.0368| | - global_facts | 1|none | 0|acc |↑ | 0.4000|± |0.0492| | - human_aging | 1|none | 0|acc |↑ | 0.6771|± |0.0314| | - management | 1|none | 0|acc |↑ | 0.7864|± |0.0406| | - marketing | 1|none | 0|acc |↑ | 0.8632|± |0.0225| | - medical_genetics | 1|none | 0|acc |↑ | 0.6500|± |0.0479| | - miscellaneous | 1|none | 0|acc |↑ | 0.8020|± |0.0142| | - nutrition | 1|none | 0|acc |↑ | 0.6928|± |0.0264| | - professional_accounting | 1|none | 0|acc |↑ | 0.4362|± |0.0296| | - professional_medicine | 1|none | 0|acc |↑ | 0.6765|± |0.0284| | - virology | 1|none | 0|acc |↑ | 0.5120|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.7082|± |0.0080| | - econometrics | 1|none | 0|acc |↑ | 0.3860|± |0.0458| | - high_school_geography | 1|none | 0|acc |↑ | 0.7424|± |0.0312| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8446|± |0.0261| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.6256|± |0.0245| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.6681|± |0.0306| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8239|± |0.0163| | - human_sexuality | 1|none | 0|acc |↑ | 0.7328|± |0.0388| | - professional_psychology | 1|none | 0|acc |↑ | 0.6373|± |0.0195| | - public_relations | 1|none | 0|acc |↑ | 0.5909|± |0.0471| | - security_studies | 1|none | 0|acc |↑ | 0.7020|± |0.0293| | - sociology | 1|none | 0|acc |↑ | 0.8159|± |0.0274| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8600|± |0.0349| | - stem | 2|none | |acc |↑ | 0.5147|± |0.0086| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3200|± |0.0469| | - anatomy | 1|none | 0|acc |↑ | 0.5704|± |0.0428| | - astronomy | 1|none | 0|acc |↑ | 0.6382|± |0.0391| | - college_biology | 1|none | 0|acc |↑ | 0.7083|± |0.0380| | - college_chemistry | 1|none | 0|acc |↑ | 0.4300|± |0.0498| | - college_computer_science | 1|none | 0|acc |↑ | 0.5100|± |0.0502| | - college_mathematics | 1|none | 0|acc |↑ | 0.2800|± |0.0451| | - college_physics | 1|none | 0|acc |↑ | 0.4216|± |0.0491| | - computer_security | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - conceptual_physics | 1|none | 0|acc |↑ | 0.5617|± |0.0324| | - electrical_engineering | 1|none | 0|acc |↑ | 0.4828|± |0.0416| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.4471|± |0.0256| | - high_school_biology | 1|none | 0|acc |↑ | 0.7387|± |0.0250| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.4828|± |0.0352| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.6400|± |0.0482| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.3704|± |0.0294| | - high_school_physics | 1|none | 0|acc |↑ | 0.3377|± |0.0386| | - high_school_statistics | 1|none | 0|acc |↑ | 0.4954|± |0.0341| | - machine_learning | 1|none | 0|acc |↑ | 0.4911|± |0.0475| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0404|± |0.0033| |openbookqa | 1|none | 0|acc |↑ | 0.3380|± |0.0212| | | |none | 0|acc_norm |↑ | 0.4340|± |0.0222| |piqa | 1|none | 0|acc |↑ | 0.7938|± |0.0094| | | |none | 0|acc_norm |↑ | 0.7987|± |0.0094| |qnli | 1|none | 0|acc |↑ | 0.5565|± |0.0067| |sciq | 1|none | 0|acc |↑ | 0.9500|± |0.0069| | | |none | 0|acc_norm |↑ | 0.9170|± |0.0087| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.4711|± |0.0037| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.5606|± |0.0174| | | |none | 0|bleu_diff |↑ | 8.7131|± |0.8710| | | |none | 0|bleu_max |↑ |27.5391|± |0.8368| | | |none | 0|rouge1_acc |↑ | 0.5887|± |0.0172| | | |none | 0|rouge1_diff|↑ |12.4445|± |1.2143| | | |none | 0|rouge1_max |↑ |54.1696|± |0.8927| | | |none | 0|rouge2_acc |↑ | 0.4994|± |0.0175| | | |none | 
0|rouge2_diff|↑ |12.4661|± |1.2976|
| | |none | 0|rouge2_max |↑ |40.9144|± |1.0592|
| | |none | 0|rougeL_acc |↑ | 0.5569|± |0.0174|
| | |none | 0|rougeL_diff|↑ |11.9317|± |1.2333|
| | |none | 0|rougeL_max |↑ |50.9560|± |0.9357|
|truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.4137|± |0.0172|
|truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5912|± |0.0158|
|winogrande | 1|none | 0|acc |↑ | 0.7198|± |0.0126|

| Groups |Version| Filter |n-shot| Metric | |Value | |Stderr|
|------------------|------:|----------|------|-----------|---|-----:|---|-----:|
|bbh | 3|get-answer| |exact_match|↑ |0.5738|± |0.0055|
|mmlu | 2|none | |acc |↑ |0.6051|± |0.0039|
| - humanities | 2|none | |acc |↑ |0.5484|± |0.0067|
| - other | 2|none | |acc |↑ |0.6807|± |0.0081|
| - social sciences| 2|none | |acc |↑ |0.7082|± |0.0080|
| - stem | 2|none | |acc |↑ |0.5147|± |0.0086|

NousResearch_Hermes-2-Pro-Mistral-7B: 8h 27m 23s ✅ Benchmark completed for NousResearch_Hermes-2-Pro-Mistral-7B
🔥 Starting benchmark for deepseek-ai_deepseek-moe-16b-base
deepseek-ai_deepseek-moe-16b-base: 0h 0m 13s ✅ Benchmark completed for deepseek-ai_deepseek-moe-16b-base
🔥 Starting benchmark for deepseek-ai_deepseek-moe-16b-chat
🔥 Starting benchmark for baichuan-inc_Baichuan-M1-14B-Instruct
baichuan-inc_Baichuan-M1-14B-Instruct: 0h 0m 4s ✅ Benchmark completed for baichuan-inc_Baichuan-M1-14B-Instruct
🔥 Starting benchmark for baichuan-inc_Baichuan2-13B-Chat
🔥 Starting benchmark for baichuan-inc_Baichuan2-13B-Chat
🔥 Starting benchmark for baichuan-inc_Baichuan-M1-14B-Instruct
baichuan-inc_Baichuan-M1-14B-Instruct: 0h 0m 4s ✅ Benchmark completed for baichuan-inc_Baichuan-M1-14B-Instruct
🔥 Starting benchmark for moonshotai_Moonlight-16B-A3B-Instruct
🔥 Starting benchmark for moonshotai_Moonlight-16B-A3B-Instruct
moonshotai_Moonlight-16B-A3B-Instruct: 0h 0m 3s ✅ Benchmark completed for moonshotai_Moonlight-16B-A3B-Instruct
🔥 Starting benchmark for moonshotai_Moonlight-16B-A3B
🔥 Starting benchmark for moonshotai_Moonlight-16B-A3B-Instruct
🔥 Starting benchmark for Qwen_Qwen3-14B
🔥 Starting benchmark for Qwen_Qwen2.5-14B-Instruct
🔥 Starting benchmark for Qwen_Qwen-7B-Chat
Qwen_Qwen-7B-Chat: 0h 4m 49s ✅ Benchmark completed for Qwen_Qwen-7B-Chat
🔥 Starting benchmark for Qwen_Qwen-7B
Qwen_Qwen-7B: 0h 4m 53s ✅ Benchmark completed for Qwen_Qwen-7B
🔥 Starting benchmark for baichuan-inc_Baichuan2-13B-Chat
baichuan-inc_Baichuan2-13B-Chat: 0h 5m 31s ✅ Benchmark completed for baichuan-inc_Baichuan2-13B-Chat
🔥 Starting benchmark for moonshotai_Moonlight-16B-A3B-Instruct
moonshotai_Moonlight-16B-A3B-Instruct: 0h 5m 27s ✅ Benchmark completed for moonshotai_Moonlight-16B-A3B-Instruct
🔥 Starting benchmark for moonshotai_Moonlight-16B-A3B
moonshotai_Moonlight-16B-A3B: 0h 5m 26s ✅ Benchmark completed for moonshotai_Moonlight-16B-A3B
🔥 Starting benchmark for Qwen_Qwen3-14B
Qwen_Qwen3-14B: 0h 5m 11s ✅ Benchmark completed for Qwen_Qwen3-14B
🔥 Starting benchmark for Qwen_Qwen2.5-14B-Instruct
🔥 Starting benchmark for openai-community_gpt2
🔥 Starting benchmark for deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B
🔥 Starting benchmark for openai-community_gpt2
🔥 Starting benchmark for openai-community_gpt2
Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 32
Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 64
Passed argument batch_size = auto.
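The per-model wall-clock lines above ("<model>: 7h 51m 28s ✅ Benchmark completed ...") can be tallied with a short script. A throwaway sketch, assuming the log has been captured to a file; "benchmark.log" is an assumed name, not something the run produced:

```python
# Throwaway sketch (not part of the runs above): tally the per-model wall-clock
# times from a saved copy of this log. "benchmark.log" is an assumed file name.
import re

DURATION = re.compile(r"(?P<model>\S+): (?P<h>\d+)h (?P<m>\d+)m (?P<s>\d+)s")

with open("benchmark.log", encoding="utf-8") as f:
    log = f.read()

for m in DURATION.finditer(log):
    hours = int(m["h"]) + int(m["m"]) / 60 + int(m["s"]) / 3600
    print(f"{m['model']:<45} {hours:6.2f} h")
```

The runs that report only a few seconds (the Baichuan-M1, Moonlight and deepseek-moe attempts above) most likely exited before evaluating anything, since no result table follows them in the log.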
Detecting largest batch size Determined Largest batch size: 32
openai-community_gpt2: 0h 28m 26s ✅ Benchmark completed for openai-community_gpt2
🔥 Starting benchmark for openai-community_gpt2-medium
Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 28
Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 64
Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 28
openai-community_gpt2-medium: 0h 51m 53s ✅ Benchmark completed for openai-community_gpt2-medium
🔥 Starting benchmark for openai-community_gpt2-large
Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 19
Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 64
Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 19
openai-community_gpt2-large: 1h 27m 10s ✅ Benchmark completed for openai-community_gpt2-large
🔥 Starting benchmark for openai-community_gpt2-xl
Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 13
Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 64
Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 13
openai-community_gpt2-xl: 2h 30m 7s ✅ Benchmark completed for openai-community_gpt2-xl
🔥 Starting benchmark for Qwen_Qwen2.5-Math-1.5B-Instruct
Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 6
Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 64
Passed argument batch_size = auto.
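Each results block that follows opens with an `hf (pretrained=..., ...)` header recording the model arguments and the resolved batch size (e.g. `batch_size: auto:4 (6,64,64,64,64)`). A minimal sketch of how a comparable run could be launched through lm-evaluation-harness's Python API, assuming a v0.4-style `simple_evaluate`; the task list and output file name are illustrative, not taken from the log:

```python
# Sketch of reproducing one of these runs via lm-evaluation-harness's Python API
# (v0.4-style simple_evaluate). The task list and output file name are illustrative;
# the log does not record the exact task flags that were passed.
import json

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-Math-1.5B-Instruct,"
        "trust_remote_code=True"
    ),
    tasks=["gsm8k", "hellaswag", "mmlu"],  # subset for brevity
    batch_size="auto:4",  # matches the auto:4 runs in this log
)

with open("Qwen_Qwen2.5-Math-1.5B-Instruct_results.json", "w") as f:
    json.dump(results["results"], f, indent=2, default=str)
```

With `batch_size="auto:4"` the harness re-detects the largest usable batch size several times over the course of the run, which is what the repeated "Detecting largest batch size" lines and the `auto:4 (6,64,64,64,64)` annotation in the header appear to reflect.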
Detecting largest batch size Determined Largest batch size: 6 hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-Math-1.5B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (6,64,64,64,64) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.3420|± |0.0150| |anli_r2 | 1|none | 0|acc |↑ | 0.3410|± |0.0150| |anli_r3 | 1|none | 0|acc |↑ | 0.3533|± |0.0138| |arc_challenge | 1|none | 0|acc |↑ | 0.3336|± |0.0138| | | |none | 0|acc_norm |↑ | 0.3652|± |0.0141| |bbh | 3|get-answer | |exact_match|↑ | 0.4373|± |0.0052| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8760|± |0.0209| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.4706|± |0.0366| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.5680|± |0.0314| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.5280|± |0.0316| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.4960|± |0.0317| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.1440|± |0.0222| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.5360|± |0.0316| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2960|± |0.0289| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.2280|± |0.0266| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.7040|± |0.0289| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.2840|± |0.0286| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.8080|± |0.0250| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8680|± |0.0215| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.6800|± |0.0296| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.6027|± |0.0406| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.5760|± |0.0313| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.0480|± |0.0135| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.2120|± |0.0259| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.4775|± |0.0375| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.4800|± |0.0317| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.1840|± |0.0246| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2680|± |0.0281| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.2040|± |0.0255| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.5200|± |0.0317| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.8120|± |0.0248| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0240|± |0.0097| |boolq | 2|none | 0|acc |↑ | 0.5694|± |0.0087| |drop | 3|none | 0|em |↑ | 0.0002|± |0.0001| | | |none | 0|f1 |↑ | 0.0231|± |0.0007| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1162|± |0.0228| | | |strict-match | 
0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0960|± |0.0210| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2172|± |0.0294| | | |strict-match | 0|exact_match|↑ | 0.0051|± |0.0051| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2828|± |0.0321| | | |none | 0|acc_norm |↑ | 0.2828|± |0.0321| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3030|± |0.0327| | | |none | 0|acc_norm |↑ | 0.3030|± |0.0327| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0971|± |0.0127| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1282|± |0.0143| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2527|± |0.0186| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3168|± |0.0199| | | |none | 0|acc_norm |↑ | 0.3168|± |0.0199| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2821|± |0.0193| | | |none | 0|acc_norm |↑ | 0.2821|± |0.0193| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1004|± |0.0142| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1094|± |0.0148| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2433|± |0.0203| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3013|± |0.0217| | | |none | 0|acc_norm |↑ | 0.3013|± |0.0217| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2835|± |0.0213| | | |none | 0|acc_norm |↑ | 0.2835|± |0.0213| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.7415|± |0.0121| | | |strict-match | 5|exact_match|↑ | 0.7369|± |0.0121| |hellaswag | 1|none | 0|acc |↑ | 0.3530|± |0.0048| | | |none | 0|acc_norm |↑ | 0.4166|± |0.0049| |mmlu | 2|none | |acc |↑ | 0.3788|± |0.0041| | - humanities | 2|none | |acc |↑ | 0.3271|± |0.0068| | - formal_logic | 1|none | 0|acc |↑ | 0.4762|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.3758|± |0.0378| | - high_school_us_history | 1|none | 0|acc |↑ | 0.3873|± |0.0342| | - high_school_world_history | 1|none | 0|acc |↑ | 0.3755|± |0.0315| | - international_law | 1|none | 0|acc |↑ | 0.4959|± |0.0456| | - jurisprudence | 1|none | 0|acc |↑ | 0.3241|± |0.0452| | - logical_fallacies | 1|none | 0|acc |↑ | 0.3988|± |0.0385| | - moral_disputes | 1|none | 0|acc |↑ | 0.3699|± |0.0260| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2469|± |0.0144| | - philosophy | 1|none | 0|acc |↑ | 0.3826|± |0.0276| | - prehistory | 1|none | 0|acc |↑ | 0.3457|± |0.0265| | - professional_law | 1|none | 0|acc |↑ | 0.3025|± |0.0117| | - world_religions | 1|none | 0|acc |↑ | 0.2632|± |0.0338| | - other | 2|none | |acc |↑ | 0.3746|± |0.0086| | - business_ethics | 1|none | 0|acc |↑ | 0.2900|± |0.0456| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.3736|± |0.0298| | - college_medicine | 1|none | 0|acc |↑ | 0.4046|± |0.0374| | - global_facts | 1|none | 0|acc |↑ | 0.2600|± |0.0441| | - human_aging | 1|none | 0|acc |↑ | 0.3677|± |0.0324| | - management | 1|none | 0|acc |↑ | 0.4951|± |0.0495| | - marketing | 1|none | 0|acc |↑ | 0.5513|± |0.0326| | - medical_genetics | 1|none | 0|acc |↑ | 0.4700|± |0.0502| | - miscellaneous | 1|none | 0|acc |↑ | 0.3729|± |0.0173| | - nutrition | 1|none | 0|acc |↑ | 
0.4281|± |0.0283| | - professional_accounting | 1|none | 0|acc |↑ | 0.3014|± |0.0274| | - professional_medicine | 1|none | 0|acc |↑ | 0.2426|± |0.0260| | - virology | 1|none | 0|acc |↑ | 0.3434|± |0.0370| | - social sciences | 2|none | |acc |↑ | 0.4127|± |0.0088| | - econometrics | 1|none | 0|acc |↑ | 0.3596|± |0.0451| | - high_school_geography | 1|none | 0|acc |↑ | 0.3434|± |0.0338| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.3627|± |0.0347| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.4282|± |0.0251| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.4874|± |0.0325| | - high_school_psychology | 1|none | 0|acc |↑ | 0.4532|± |0.0213| | - human_sexuality | 1|none | 0|acc |↑ | 0.3893|± |0.0428| | - professional_psychology | 1|none | 0|acc |↑ | 0.3578|± |0.0194| | - public_relations | 1|none | 0|acc |↑ | 0.3727|± |0.0463| | - security_studies | 1|none | 0|acc |↑ | 0.4163|± |0.0316| | - sociology | 1|none | 0|acc |↑ | 0.4925|± |0.0354| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.4900|± |0.0502| | - stem | 2|none | |acc |↑ | 0.4269|± |0.0088| | - abstract_algebra | 1|none | 0|acc |↑ | 0.4100|± |0.0494| | - anatomy | 1|none | 0|acc |↑ | 0.3185|± |0.0402| | - astronomy | 1|none | 0|acc |↑ | 0.3816|± |0.0395| | - college_biology | 1|none | 0|acc |↑ | 0.3125|± |0.0388| | - college_chemistry | 1|none | 0|acc |↑ | 0.3500|± |0.0479| | - college_computer_science | 1|none | 0|acc |↑ | 0.4900|± |0.0502| | - college_mathematics | 1|none | 0|acc |↑ | 0.4400|± |0.0499| | - college_physics | 1|none | 0|acc |↑ | 0.3431|± |0.0472| | - computer_security | 1|none | 0|acc |↑ | 0.4200|± |0.0496| | - conceptual_physics | 1|none | 0|acc |↑ | 0.4638|± |0.0326| | - electrical_engineering | 1|none | 0|acc |↑ | 0.4759|± |0.0416| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.5344|± |0.0257| | - high_school_biology | 1|none | 0|acc |↑ | 0.4387|± |0.0282| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.4187|± |0.0347| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.5500|± |0.0500| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.3778|± |0.0296| | - high_school_physics | 1|none | 0|acc |↑ | 0.3510|± |0.0390| | - high_school_statistics | 1|none | 0|acc |↑ | 0.4583|± |0.0340| | - machine_learning | 1|none | 0|acc |↑ | 0.3929|± |0.0464| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0039|± |0.0010| |openbookqa | 1|none | 0|acc |↑ | 0.1980|± |0.0178| | | |none | 0|acc_norm |↑ | 0.2860|± |0.0202| |piqa | 1|none | 0|acc |↑ | 0.6115|± |0.0114| | | |none | 0|acc_norm |↑ | 0.6137|± |0.0114| |qnli | 1|none | 0|acc |↑ | 0.4973|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.7550|± |0.0136| | | |none | 0|acc_norm |↑ | 0.7180|± |0.0142| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.0043|± |0.0005| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3672|± |0.0169| | | |none | 0|bleu_diff |↑ |-0.3104|± |0.3272| | | |none | 0|bleu_max |↑ |10.0271|± |0.4450| | | |none | 0|rouge1_acc |↑ | 0.4162|± |0.0173| | | |none | 0|rouge1_diff|↑ | 0.4141|± |0.5503| | | |none | 0|rouge1_max |↑ |30.4103|± |0.7236| | | |none | 0|rouge2_acc |↑ | 0.2632|± |0.0154| | | |none | 0|rouge2_diff|↑ |-0.8895|± |0.6038| | | |none | 0|rouge2_max |↑ |17.4804|± |0.7257| | | |none | 0|rougeL_acc |↑ | 0.4002|± |0.0172| | | |none | 0|rougeL_diff|↑ | 0.0440|± |0.5522| | | |none | 0|rougeL_max |↑ |28.4832|± |0.7201| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2901|± |0.0159| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4895|± |0.0159| |winogrande | 1|none | 0|acc |↑ | 0.5257|± |0.0140| | Groups |Version| Filter |n-shot| Metric | 
|Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.4373|± |0.0052| |mmlu | 2|none | |acc |↑ |0.3788|± |0.0041| | - humanities | 2|none | |acc |↑ |0.3271|± |0.0068| | - other | 2|none | |acc |↑ |0.3746|± |0.0086| | - social sciences| 2|none | |acc |↑ |0.4127|± |0.0088| | - stem | 2|none | |acc |↑ |0.4269|± |0.0088| Qwen_Qwen2.5-Math-1.5B-Instruct: 3h 25m 33s ✅ Benchmark completed for Qwen_Qwen2.5-Math-1.5B-Instruct 🔥 Starting benchmark for Qwen_Qwen2.5-3B-Instruct Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 2 Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 64 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-3B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (2,64,64,64,64) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.5620|± |0.0157| |anli_r2 | 1|none | 0|acc |↑ | 0.4660|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.4942|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.4590|± |0.0146| | | |none | 0|acc_norm |↑ | 0.4821|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.2491|± |0.0041| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.2320|± |0.0268| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.0053|± |0.0053| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.4120|± |0.0312| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.0320|± |0.0112| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.2520|± |0.0275| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1200|± |0.0206| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.0920|± |0.0183| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.0520|± |0.0141| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.0440|± |0.0130| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8480|± |0.0228| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.5280|± |0.0316| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.2329|± |0.0351| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.4040|± |0.0311| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.0160|± |0.0080| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.6800|± |0.0296| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer 
| 3|exact_match|↑ | 0.3720|± |0.0306| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.3880|± |0.0309| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1160|± |0.0203| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.6400|± |0.0304| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.1200|± |0.0206| |boolq | 2|none | 0|acc |↑ | 0.8012|± |0.0070| |drop | 3|none | 0|em |↑ | 0.0016|± |0.0004| | | |none | 0|f1 |↑ | 0.0773|± |0.0014| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1010|± |0.0215| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0909|± |0.0205| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1667|± |0.0266| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3232|± |0.0333| | | |none | 0|acc_norm |↑ | 0.3232|± |0.0333| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3081|± |0.0329| | | |none | 0|acc_norm |↑ | 0.3081|± |0.0329| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1538|± |0.0155| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1081|± |0.0133| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1996|± |0.0171| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2985|± |0.0196| | | |none | 0|acc_norm |↑ | 0.2985|± |0.0196| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3278|± |0.0201| | | |none | 0|acc_norm |↑ | 0.3278|± |0.0201| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1183|± |0.0153| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1161|± |0.0152| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1763|± |0.0180| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2812|± |0.0213| | | |none | 0|acc_norm |↑ | 0.2812|± |0.0213| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3214|± |0.0221| | | |none | 0|acc_norm |↑ | 0.3214|± |0.0221| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.6384|± |0.0132| | | |strict-match | 5|exact_match|↑ | 0.1016|± |0.0083| |hellaswag | 1|none | 0|acc |↑ | 0.5633|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7491|± |0.0043| |mmlu | 2|none | |acc |↑ | 0.6550|± |0.0038| | - humanities | 2|none | |acc |↑ | 0.5858|± |0.0067| | - formal_logic | 1|none | 0|acc |↑ | 0.4603|± |0.0446| | - high_school_european_history | 1|none | 0|acc |↑ | 0.8061|± |0.0309| | - high_school_us_history | 1|none | 0|acc |↑ | 0.8333|± |0.0262| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8523|± |0.0231| | - international_law | 1|none | 0|acc |↑ | 0.7851|± |0.0375| | - jurisprudence | 1|none | 0|acc |↑ | 0.7778|± |0.0402| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7975|± |0.0316| | - moral_disputes | 1|none | 0|acc |↑ | 0.6763|± |0.0252| | - moral_scenarios | 1|none | 0|acc |↑ | 0.3374|± |0.0158| | - philosophy | 1|none | 0|acc |↑ | 0.7074|± |0.0258| | - 
prehistory | 1|none | 0|acc |↑ | 0.7315|± |0.0247| | - professional_law | 1|none | 0|acc |↑ | 0.4896|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.8187|± |0.0295| | - other | 2|none | |acc |↑ | 0.7023|± |0.0079| | - business_ethics | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7094|± |0.0279| | - college_medicine | 1|none | 0|acc |↑ | 0.6474|± |0.0364| | - global_facts | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - human_aging | 1|none | 0|acc |↑ | 0.7040|± |0.0306| | - management | 1|none | 0|acc |↑ | 0.7864|± |0.0406| | - marketing | 1|none | 0|acc |↑ | 0.8846|± |0.0209| | - medical_genetics | 1|none | 0|acc |↑ | 0.7900|± |0.0409| | - miscellaneous | 1|none | 0|acc |↑ | 0.7982|± |0.0144| | - nutrition | 1|none | 0|acc |↑ | 0.7190|± |0.0257| | - professional_accounting | 1|none | 0|acc |↑ | 0.5355|± |0.0298| | - professional_medicine | 1|none | 0|acc |↑ | 0.6360|± |0.0292| | - virology | 1|none | 0|acc |↑ | 0.4819|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.7602|± |0.0076| | - econometrics | 1|none | 0|acc |↑ | 0.4912|± |0.0470| | - high_school_geography | 1|none | 0|acc |↑ | 0.7828|± |0.0294| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8653|± |0.0246| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.6846|± |0.0236| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.7815|± |0.0268| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8459|± |0.0155| | - human_sexuality | 1|none | 0|acc |↑ | 0.7481|± |0.0381| | - professional_psychology | 1|none | 0|acc |↑ | 0.7190|± |0.0182| | - public_relations | 1|none | 0|acc |↑ | 0.6818|± |0.0446| | - security_studies | 1|none | 0|acc |↑ | 0.7510|± |0.0277| | - sociology | 1|none | 0|acc |↑ | 0.8308|± |0.0265| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8300|± |0.0378| | - stem | 2|none | |acc |↑ | 0.6089|± |0.0085| | - abstract_algebra | 1|none | 0|acc |↑ | 0.4900|± |0.0502| | - anatomy | 1|none | 0|acc |↑ | 0.6667|± |0.0407| | - astronomy | 1|none | 0|acc |↑ | 0.7368|± |0.0358| | - college_biology | 1|none | 0|acc |↑ | 0.7292|± |0.0372| | - college_chemistry | 1|none | 0|acc |↑ | 0.5000|± |0.0503| | - college_computer_science | 1|none | 0|acc |↑ | 0.5900|± |0.0494| | - college_mathematics | 1|none | 0|acc |↑ | 0.3100|± |0.0465| | - college_physics | 1|none | 0|acc |↑ | 0.4804|± |0.0497| | - computer_security | 1|none | 0|acc |↑ | 0.7200|± |0.0451| | - conceptual_physics | 1|none | 0|acc |↑ | 0.6383|± |0.0314| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6483|± |0.0398| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.6032|± |0.0252| | - high_school_biology | 1|none | 0|acc |↑ | 0.8161|± |0.0220| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.5813|± |0.0347| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.5000|± |0.0305| | - high_school_physics | 1|none | 0|acc |↑ | 0.4503|± |0.0406| | - high_school_statistics | 1|none | 0|acc |↑ | 0.5880|± |0.0336| | - machine_learning | 1|none | 0|acc |↑ | 0.4911|± |0.0475| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0083|± |0.0015| |openbookqa | 1|none | 0|acc |↑ | 0.3320|± |0.0211| | | |none | 0|acc_norm |↑ | 0.4220|± |0.0221| |piqa | 1|none | 0|acc |↑ | 0.7786|± |0.0097| | | |none | 0|acc_norm |↑ | 0.7807|± |0.0097| |qnli | 1|none | 0|acc |↑ | 0.7979|± |0.0054| |sciq | 1|none | 0|acc |↑ | 0.9460|± |0.0072| | | |none | 0|acc_norm |↑ | 0.9130|± |0.0089| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.3010|± |0.0034| 
|truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4835|± |0.0175| | | |none | 0|bleu_diff |↑ |-0.2750|± |0.2661| | | |none | 0|bleu_max |↑ | 7.7071|± |0.3537| | | |none | 0|rouge1_acc |↑ | 0.4847|± |0.0175| | | |none | 0|rouge1_diff|↑ |-0.1394|± |0.4206| | | |none | 0|rouge1_max |↑ |26.4054|± |0.5770| | | |none | 0|rouge2_acc |↑ | 0.3880|± |0.0171| | | |none | 0|rouge2_diff|↑ |-0.7135|± |0.4378| | | |none | 0|rouge2_max |↑ |15.0737|± |0.5493| | | |none | 0|rougeL_acc |↑ | 0.4651|± |0.0175| | | |none | 0|rougeL_diff|↑ |-0.4360|± |0.4149| | | |none | 0|rougeL_max |↑ |23.4471|± |0.5573| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.4162|± |0.0173| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5861|± |0.0157| |winogrande | 1|none | 0|acc |↑ | 0.6930|± |0.0130| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.2491|± |0.0041| |mmlu | 2|none | |acc |↑ |0.6550|± |0.0038| | - humanities | 2|none | |acc |↑ |0.5858|± |0.0067| | - other | 2|none | |acc |↑ |0.7023|± |0.0079| | - social sciences| 2|none | |acc |↑ |0.7602|± |0.0076| | - stem | 2|none | |acc |↑ |0.6089|± |0.0085| Qwen_Qwen2.5-3B-Instruct: 7h 48m 19s ✅ Benchmark completed for Qwen_Qwen2.5-3B-Instruct 🔥 Starting benchmark for Qwen_Qwen2.5-1.5B-Instruct Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 1 Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 64 Passed argument batch_size = auto. Detecting largest batch size Determined Largest batch size: 1 hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-1.5B-Instruct,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: auto:4 (1,64,64,64,64) | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4500|± |0.0157| |anli_r2 | 1|none | 0|acc |↑ | 0.3940|± |0.0155| |anli_r3 | 1|none | 0|acc |↑ | 0.4325|± |0.0143| |arc_challenge | 1|none | 0|acc |↑ | 0.4394|± |0.0145| | | |none | 0|acc_norm |↑ | 0.4659|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.3861|± |0.0054| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8480|± |0.0228| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.5080|± |0.0367| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.4040|± |0.0311| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.3880|± |0.0309| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0200|± |0.0089| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.2280|± |0.0266| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3120|± |0.0294| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.5720|± |0.0314| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.3040|± |0.0292| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.2280|± |0.0266| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.5880|± |0.0312| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.5320|± |0.0316| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.5920|± |0.0311| | - 
bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.6640|± |0.0299| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.5640|± |0.0314| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.4795|± |0.0415| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.5200|± |0.0317| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.3480|± |0.0302| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.3440|± |0.0301| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.4831|± |0.0376| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.6600|± |0.0300| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.1640|± |0.0235| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1440|± |0.0222| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1200|± |0.0206| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.4080|± |0.0311| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.0560|± |0.0146| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0440|± |0.0130| |boolq | 2|none | 0|acc |↑ | 0.7810|± |0.0072| |drop | 3|none | 0|em |↑ | 0.0018|± |0.0004| | | |none | 0|f1 |↑ | 0.0391|± |0.0011| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1869|± |0.0278| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1919|± |0.0281| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2222|± |0.0296| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2929|± |0.0324| | | |none | 0|acc_norm |↑ | 0.2929|± |0.0324| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2273|± |0.0299| | | |none | 0|acc_norm |↑ | 0.2273|± |0.0299| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1941|± |0.0169| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1465|± |0.0151| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2179|± |0.0177| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3077|± |0.0198| | | |none | 0|acc_norm |↑ | 0.3077|± |0.0198| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2985|± |0.0196| | | |none | 0|acc_norm |↑ | 0.2985|± |0.0196| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1674|± |0.0177| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1496|± |0.0169| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2232|± |0.0197| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3237|± |0.0221| | | |none | 0|acc_norm |↑ | 0.3237|± |0.0221| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2857|± |0.0214| | | |none | 0|acc_norm |↑ | 0.2857|± |0.0214| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.5201|± |0.0138| | | |strict-match | 5|exact_match|↑ | 0.3025|± |0.0127| |hellaswag | 1|none | 0|acc |↑ | 0.5087|± 
|0.0050| | | |none | 0|acc_norm |↑ | 0.6827|± |0.0046| |mmlu | 2|none | |acc |↑ | 0.6003|± |0.0039| | - humanities | 2|none | |acc |↑ | 0.5409|± |0.0068| | - formal_logic | 1|none | 0|acc |↑ | 0.5000|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7576|± |0.0335| | - high_school_us_history | 1|none | 0|acc |↑ | 0.7255|± |0.0313| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7722|± |0.0273| | - international_law | 1|none | 0|acc |↑ | 0.7355|± |0.0403| | - jurisprudence | 1|none | 0|acc |↑ | 0.7963|± |0.0389| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7669|± |0.0332| | - moral_disputes | 1|none | 0|acc |↑ | 0.6474|± |0.0257| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2972|± |0.0153| | - philosophy | 1|none | 0|acc |↑ | 0.6656|± |0.0268| | - prehistory | 1|none | 0|acc |↑ | 0.6728|± |0.0261| | - professional_law | 1|none | 0|acc |↑ | 0.4394|± |0.0127| | - world_religions | 1|none | 0|acc |↑ | 0.8012|± |0.0306| | - other | 2|none | |acc |↑ | 0.6453|± |0.0083| | - business_ethics | 1|none | 0|acc |↑ | 0.6600|± |0.0476| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.6642|± |0.0291| | - college_medicine | 1|none | 0|acc |↑ | 0.6705|± |0.0358| | - global_facts | 1|none | 0|acc |↑ | 0.2100|± |0.0409| | - human_aging | 1|none | 0|acc |↑ | 0.6278|± |0.0324| | - management | 1|none | 0|acc |↑ | 0.8155|± |0.0384| | - marketing | 1|none | 0|acc |↑ | 0.8376|± |0.0242| | - medical_genetics | 1|none | 0|acc |↑ | 0.6600|± |0.0476| | - miscellaneous | 1|none | 0|acc |↑ | 0.7241|± |0.0160| | - nutrition | 1|none | 0|acc |↑ | 0.6732|± |0.0269| | - professional_accounting | 1|none | 0|acc |↑ | 0.4610|± |0.0297| | - professional_medicine | 1|none | 0|acc |↑ | 0.6029|± |0.0297| | - virology | 1|none | 0|acc |↑ | 0.4398|± |0.0386| | - social sciences | 2|none | |acc |↑ | 0.7085|± |0.0080| | - econometrics | 1|none | 0|acc |↑ | 0.4474|± |0.0468| | - high_school_geography | 1|none | 0|acc |↑ | 0.7677|± |0.0301| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.7979|± |0.0290| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.6769|± |0.0237| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.7185|± |0.0292| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8275|± |0.0162| | - human_sexuality | 1|none | 0|acc |↑ | 0.7328|± |0.0388| | - professional_psychology | 1|none | 0|acc |↑ | 0.6013|± |0.0198| | - public_relations | 1|none | 0|acc |↑ | 0.5909|± |0.0471| | - security_studies | 1|none | 0|acc |↑ | 0.6980|± |0.0294| | - sociology | 1|none | 0|acc |↑ | 0.8060|± |0.0280| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - stem | 2|none | |acc |↑ | 0.5392|± |0.0086| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3700|± |0.0485| | - anatomy | 1|none | 0|acc |↑ | 0.5333|± |0.0431| | - astronomy | 1|none | 0|acc |↑ | 0.7105|± |0.0369| | - college_biology | 1|none | 0|acc |↑ | 0.6528|± |0.0398| | - college_chemistry | 1|none | 0|acc |↑ | 0.3300|± |0.0473| | - college_computer_science | 1|none | 0|acc |↑ | 0.5000|± |0.0503| | - college_mathematics | 1|none | 0|acc |↑ | 0.4100|± |0.0494| | - college_physics | 1|none | 0|acc |↑ | 0.4608|± |0.0496| | - computer_security | 1|none | 0|acc |↑ | 0.7300|± |0.0446| | - conceptual_physics | 1|none | 0|acc |↑ | 0.5787|± |0.0323| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6069|± |0.0407| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.4947|± |0.0257| | - high_school_biology | 1|none | 0|acc |↑ | 0.7613|± |0.0243| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.4926|± |0.0352| | - 
high_school_computer_science | 1|none | 0|acc |↑ | 0.6700|± |0.0473| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4148|± |0.0300| | - high_school_physics | 1|none | 0|acc |↑ | 0.3709|± |0.0394| | - high_school_statistics | 1|none | 0|acc |↑ | 0.5463|± |0.0340| | - machine_learning | 1|none | 0|acc |↑ | 0.4018|± |0.0465| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0421|± |0.0033| |openbookqa | 1|none | 0|acc |↑ | 0.3160|± |0.0208| | | |none | 0|acc_norm |↑ | 0.4040|± |0.0220| |piqa | 1|none | 0|acc |↑ | 0.7633|± |0.0099| | | |none | 0|acc_norm |↑ | 0.7595|± |0.0100| |qnli | 1|none | 0|acc |↑ | 0.5660|± |0.0067| |sciq | 1|none | 0|acc |↑ | 0.9490|± |0.0070| | | |none | 0|acc_norm |↑ | 0.9400|± |0.0075| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.2801|± |0.0034| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3966|± |0.0171| | | |none | 0|bleu_diff |↑ |-1.7922|± |0.4018| | | |none | 0|bleu_max |↑ |12.2414|± |0.5454| | | |none | 0|rouge1_acc |↑ | 0.4272|± |0.0173| | | |none | 0|rouge1_diff|↑ |-3.2519|± |0.6342| | | |none | 0|rouge1_max |↑ |32.3672|± |0.7715| | | |none | 0|rouge2_acc |↑ | 0.2742|± |0.0156| | | |none | 0|rouge2_diff|↑ |-3.5921|± |0.6461| | | |none | 0|rouge2_max |↑ |17.5177|± |0.7544| | | |none | 0|rougeL_acc |↑ | 0.4235|± |0.0173| | | |none | 0|rougeL_diff|↑ |-3.5600|± |0.6371| | | |none | 0|rougeL_max |↑ |29.5881|± |0.7560| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3121|± |0.0162| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4657|± |0.0150| |winogrande | 1|none | 0|acc |↑ | 0.6290|± |0.0136| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.3861|± |0.0054| |mmlu | 2|none | |acc |↑ |0.6003|± |0.0039| | - humanities | 2|none | |acc |↑ |0.5409|± |0.0068| | - other | 2|none | |acc |↑ |0.6453|± |0.0083| | - social sciences| 2|none | |acc |↑ |0.7085|± |0.0080| | - stem | 2|none | |acc |↑ |0.5392|± |0.0086| Qwen_Qwen2.5-1.5B-Instruct: 5h 38m 25s ✅ Benchmark completed for Qwen_Qwen2.5-1.5B-Instruct 🔥 Starting benchmark for Qwen_Qwen2.5-0.5B-Instruct Passed argument batch_size = auto:4.0. Detecting largest batch size Determined largest batch size: 2 Passed argument batch_size = auto:4.0. 
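The `Passed argument batch_size = auto:...` / `Determined largest batch size` messages and the `hf (pretrained=..., ...), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: ...` run headers that follow are the standard console output of EleutherAI's lm-evaluation-harness. As a rough guide to how such a run is launched, here is a minimal Python sketch using the harness's `simple_evaluate` API (lm-eval ≥ 0.4 assumed); the model path, task list, and batch-size setting are illustrative placeholders drawn from the headers in this log, not the exact script that produced it.

```python
# Sketch only: launching one lm-evaluation-harness run (lm-eval >= 0.4 assumed).
# Path, tasks, and batch_size are illustrative, taken from the run headers below.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",  # HuggingFace backend, matching the "hf (pretrained=...)" headers
    model_args=(
        "pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-0.5B-Instruct,"
        "trust_remote_code=True,device_map=auto"
    ),
    tasks=["mmlu", "bbh", "gsm8k", "hellaswag"],  # subset of the suite shown in these tables
    num_fewshot=None,   # per-task defaults, as in the headers (num_fewshot: None)
    batch_size="auto",  # auto-detect the largest batch; the log's auto:4.0 form asks for periodic re-detection
    limit=None,
)

# results["results"] is a dict keyed by task name; the pipe tables in this log
# are the harness's printed rendering of the same per-task metric dicts.
for task, metrics in results["results"].items():
    print(task, metrics)
```

The CLI equivalent uses the same names as flags (`--model hf --model_args ... --tasks ... --batch_size auto`), which is presumably how the wrapper script emitting the 🔥/✅ lines invokes each model.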
Detecting largest batch size Determined largest batch size: 64 🔥 Starting benchmark for deepseek-ai_deepseek-moe-16b-base 🔥 Starting benchmark for deepseek-ai_deepseek-moe-16b-base deepseek-ai_deepseek-moe-16b-base: 0h 5m 31s ✅ Benchmark completed for deepseek-ai_deepseek-moe-16b-base 🔥 Starting benchmark for Deepseek-ai_deepseek-moe-16b-chat Deepseek-ai_deepseek-moe-16b-chat: 0h 0m 4s ✅ Benchmark completed for Deepseek-ai_deepseek-moe-16b-chat 🔥 Starting benchmark for Qwen_Qwen-7B-Chat Qwen_Qwen-7B-Chat: 0h 5m 5s ✅ Benchmark completed for Qwen_Qwen-7B-Chat 🔥 Starting benchmark for Qwen_Qwen-7B Qwen_Qwen-7B: 0h 5m 8s ✅ Benchmark completed for Qwen_Qwen-7B 🔥 Starting benchmark for baichuan-inc_Baichuan2-13B-Chat baichuan-inc_Baichuan2-13B-Chat: 1h 19m 49s ✅ Benchmark completed for baichuan-inc_Baichuan2-13B-Chat 🔥 Starting benchmark for Qwen_Qwen2.5-14B-Instruct hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-14B-Instruct,trust_remote_code=True,load_in_8bit=True,device_map=auto), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.7210|± |0.0142| |anli_r2 | 1|none | 0|acc |↑ | 0.6340|± |0.0152| |anli_r3 | 1|none | 0|acc |↑ | 0.6175|± |0.0140| |arc_challenge | 1|none | 0|acc |↑ | 0.6067|± |0.0143| | | |none | 0|acc_norm |↑ | 0.6152|± |0.0142| |bbh | 3|get-answer | |exact_match|↑ | 0.1069|± |0.0032| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.2995|± |0.0336| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.3280|± |0.0298| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.0560|± |0.0146| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0680|± |0.0160| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.0120|± |0.0069| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.1400|± |0.0220| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.0120|± |0.0069| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.0280|± |0.0105| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.1480|± |0.0225| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.5400|± |0.0316| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.1027|± |0.0252| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.4600|± |0.0316| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.0120|± |0.0069| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.6000|± |0.0310| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.0056|± |0.0056| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.0320|± |0.0112| | - 
bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.0120|± |0.0069| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.0040|± |0.0040| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0440|± |0.0130| |boolq | 2|none | 0|acc |↑ | 0.8862|± |0.0056| |drop | 3|none | 0|em |↑ | 0.0002|± |0.0001| | | |none | 0|f1 |↑ | 0.0713|± |0.0012| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1414|± |0.0248| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1111|± |0.0224| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2424|± |0.0305| | | |strict-match | 0|exact_match|↑ | 0.0101|± |0.0071| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.4091|± |0.0350| | | |none | 0|acc_norm |↑ | 0.4091|± |0.0350| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3434|± |0.0338| | | |none | 0|acc_norm |↑ | 0.3434|± |0.0338| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1593|± |0.0157| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1337|± |0.0146| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2930|± |0.0195| | | |strict-match | 0|exact_match|↑ | 0.0037|± |0.0026| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3755|± |0.0207| | | |none | 0|acc_norm |↑ | 0.3755|± |0.0207| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3810|± |0.0208| | | |none | 0|acc_norm |↑ | 0.3810|± |0.0208| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1295|± |0.0159| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1161|± |0.0152| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2812|± |0.0213| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.4107|± |0.0233| | | |none | 0|acc_norm |↑ | 0.4107|± |0.0233| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3549|± |0.0226| | | |none | 0|acc_norm |↑ | 0.3549|± |0.0226| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.4390|± |0.0137| | | |strict-match | 5|exact_match|↑ | 0.7923|± |0.0112| |hellaswag | 1|none | 0|acc |↑ | 0.6527|± |0.0048| | | |none | 0|acc_norm |↑ | 0.8420|± |0.0036| |mmlu | 2|none | |acc |↑ | 0.7831|± |0.0033| | - humanities | 2|none | |acc |↑ | 0.7214|± |0.0062| | - formal_logic | 1|none | 0|acc |↑ | 0.6349|± |0.0431| | - high_school_european_history | 1|none | 0|acc |↑ | 0.8606|± |0.0270| | - high_school_us_history | 1|none | 0|acc |↑ | 0.9118|± |0.0199| | - high_school_world_history | 1|none | 0|acc |↑ | 0.9072|± |0.0189| | - international_law | 1|none | 0|acc |↑ | 0.9008|± |0.0273| | - jurisprudence | 1|none | 0|acc |↑ | 0.8519|± |0.0343| | - logical_fallacies | 1|none | 0|acc |↑ | 0.8834|± |0.0252| | - moral_disputes | 1|none | 0|acc |↑ | 0.7977|± |0.0216| | - moral_scenarios | 1|none | 0|acc |↑ | 0.6525|± |0.0159| | - philosophy | 
1|none | 0|acc |↑ | 0.8199|± |0.0218| | - prehistory | 1|none | 0|acc |↑ | 0.8858|± |0.0177| | - professional_law | 1|none | 0|acc |↑ | 0.5678|± |0.0127| | - world_religions | 1|none | 0|acc |↑ | 0.8947|± |0.0235| | - other | 2|none | |acc |↑ | 0.8104|± |0.0068| | - business_ethics | 1|none | 0|acc |↑ | 0.7900|± |0.0409| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.8340|± |0.0229| | - college_medicine | 1|none | 0|acc |↑ | 0.7399|± |0.0335| | - global_facts | 1|none | 0|acc |↑ | 0.5700|± |0.0498| | - human_aging | 1|none | 0|acc |↑ | 0.7803|± |0.0278| | - management | 1|none | 0|acc |↑ | 0.8835|± |0.0318| | - marketing | 1|none | 0|acc |↑ | 0.9274|± |0.0170| | - medical_genetics | 1|none | 0|acc |↑ | 0.9000|± |0.0302| | - miscellaneous | 1|none | 0|acc |↑ | 0.9055|± |0.0105| | - nutrition | 1|none | 0|acc |↑ | 0.8235|± |0.0218| | - professional_accounting | 1|none | 0|acc |↑ | 0.6348|± |0.0287| | - professional_medicine | 1|none | 0|acc |↑ | 0.8456|± |0.0220| | - virology | 1|none | 0|acc |↑ | 0.5482|± |0.0387| | - social sciences | 2|none | |acc |↑ | 0.8635|± |0.0061| | - econometrics | 1|none | 0|acc |↑ | 0.6842|± |0.0437| | - high_school_geography | 1|none | 0|acc |↑ | 0.9141|± |0.0200| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.9534|± |0.0152| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.8590|± |0.0176| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.9034|± |0.0192| | - high_school_psychology | 1|none | 0|acc |↑ | 0.9046|± |0.0126| | - human_sexuality | 1|none | 0|acc |↑ | 0.8626|± |0.0302| | - professional_psychology | 1|none | 0|acc |↑ | 0.8137|± |0.0158| | - public_relations | 1|none | 0|acc |↑ | 0.7727|± |0.0401| | - security_studies | 1|none | 0|acc |↑ | 0.8327|± |0.0239| | - sociology | 1|none | 0|acc |↑ | 0.8955|± |0.0216| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.9100|± |0.0288| | - stem | 2|none | |acc |↑ | 0.7697|± |0.0073| | - abstract_algebra | 1|none | 0|acc |↑ | 0.6300|± |0.0485| | - anatomy | 1|none | 0|acc |↑ | 0.7630|± |0.0367| | - astronomy | 1|none | 0|acc |↑ | 0.9079|± |0.0235| | - college_biology | 1|none | 0|acc |↑ | 0.8958|± |0.0255| | - college_chemistry | 1|none | 0|acc |↑ | 0.5300|± |0.0502| | - college_computer_science | 1|none | 0|acc |↑ | 0.6900|± |0.0465| | - college_mathematics | 1|none | 0|acc |↑ | 0.5900|± |0.0494| | - college_physics | 1|none | 0|acc |↑ | 0.6373|± |0.0478| | - computer_security | 1|none | 0|acc |↑ | 0.8000|± |0.0402| | - conceptual_physics | 1|none | 0|acc |↑ | 0.8340|± |0.0243| | - electrical_engineering | 1|none | 0|acc |↑ | 0.7448|± |0.0363| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.8624|± |0.0177| | - high_school_biology | 1|none | 0|acc |↑ | 0.9065|± |0.0166| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.6897|± |0.0326| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.9000|± |0.0302| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.6444|± |0.0292| | - high_school_physics | 1|none | 0|acc |↑ | 0.7351|± |0.0360| | - high_school_statistics | 1|none | 0|acc |↑ | 0.7824|± |0.0281| | - machine_learning | 1|none | 0|acc |↑ | 0.6518|± |0.0452| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0615|± |0.0040| |openbookqa | 1|none | 0|acc |↑ | 0.3700|± |0.0216| | | |none | 0|acc_norm |↑ | 0.4760|± |0.0224| |piqa | 1|none | 0|acc |↑ | 0.8058|± |0.0092| | | |none | 0|acc_norm |↑ | 0.8172|± |0.0090| |qnli | 1|none | 0|acc |↑ | 0.8539|± |0.0048| |sciq | 1|none | 0|acc |↑ | 0.9640|± |0.0059| | | |none | 0|acc_norm |↑ | 0.9290|± |0.0081| |triviaqa | 3|remove_whitespace| 
0|exact_match|↑ | 0.0393|± |0.0015| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.5141|± |0.0175| | | |none | 0|bleu_diff |↑ | 0.9069|± |0.2407| | | |none | 0|bleu_max |↑ | 7.7993|± |0.4042| | | |none | 0|rouge1_acc |↑ | 0.5386|± |0.0175| | | |none | 0|rouge1_diff|↑ | 1.7284|± |0.4077| | | |none | 0|rouge1_max |↑ |25.5442|± |0.6360| | | |none | 0|rouge2_acc |↑ | 0.4541|± |0.0174| | | |none | 0|rouge2_diff|↑ | 1.1688|± |0.4042| | | |none | 0|rouge2_max |↑ |14.7681|± |0.5874| | | |none | 0|rougeL_acc |↑ | 0.5239|± |0.0175| | | |none | 0|rougeL_diff|↑ | 1.4904|± |0.3881| | | |none | 0|rougeL_max |↑ |22.4469|± |0.6199| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.5104|± |0.0175| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.6830|± |0.0150| |winogrande | 1|none | 0|acc |↑ | 0.7545|± |0.0121| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.1069|± |0.0032| |mmlu | 2|none | |acc |↑ |0.7831|± |0.0033| | - humanities | 2|none | |acc |↑ |0.7214|± |0.0062| | - other | 2|none | |acc |↑ |0.8104|± |0.0068| | - social sciences| 2|none | |acc |↑ |0.8635|± |0.0061| | - stem | 2|none | |acc |↑ |0.7697|± |0.0073| Qwen_Qwen2.5-14B-Instruct: 52h 44m 39s ✅ Benchmark completed for Qwen_Qwen2.5-14B-Instruct 🔥 Starting benchmark for Qwen_Qwen3-14B hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen3-14B,trust_remote_code=True,load_in_8bit=True,device_map=auto), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 1 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.6460|± |0.0151| |anli_r2 | 1|none | 0|acc |↑ | 0.5700|± |0.0157| |anli_r3 | 1|none | 0|acc |↑ | 0.5567|± |0.0143| |arc_challenge | 1|none | 0|acc |↑ | 0.5870|± |0.0144| | | |none | 0|acc_norm |↑ | 0.6007|± |0.0143| |bbh | 3|get-answer | |exact_match|↑ | 0.4330|± |0.0048| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9760|± |0.0097| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.0588|± |0.0173| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.4760|± |0.0316| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.3720|± |0.0306| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.7880|± |0.0259| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.2000|± |0.0253| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.4000|± |0.0310| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.5120|± |0.0317| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.0920|± |0.0183| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.9240|± |0.0168| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7840|± |0.0261| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.6760|± |0.0297| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.7200|± |0.0285| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.5068|± |0.0415| | - 
bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.2000|± |0.0253| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.3560|± |0.0303| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.0080|± |0.0056| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.2809|± |0.0338| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.8640|± |0.0217| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.8640|± |0.0217| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.6720|± |0.0298| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1120|± |0.0200| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.1080|± |0.0197| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.3760|± |0.0307| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.2560|± |0.0277| |boolq | 2|none | 0|acc |↑ | 0.8917|± |0.0054| |drop | 3|none | 0|em |↑ | 0.0035|± |0.0006| | | |none | 0|f1 |↑ | 0.0904|± |0.0018| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0758|± |0.0189| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0606|± |0.0170| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3939|± |0.0348| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3586|± |0.0342| | | |none | 0|acc_norm |↑ | 0.3586|± |0.0342| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3838|± |0.0346| | | |none | 0|acc_norm |↑ | 0.3838|± |0.0346| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1062|± |0.0132| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0586|± |0.0101| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3590|± |0.0205| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3974|± |0.0210| | | |none | 0|acc_norm |↑ | 0.3974|± |0.0210| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3828|± |0.0208| | | |none | 0|acc_norm |↑ | 0.3828|± |0.0208| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1205|± |0.0154| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0781|± |0.0127| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.3415|± |0.0224| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3750|± |0.0229| | | |none | 0|acc_norm |↑ | 0.3750|± |0.0229| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3973|± |0.0231| | | |none | 0|acc_norm |↑ | 0.3973|± |0.0231| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.8135|± |0.0107| | | |strict-match | 5|exact_match|↑ | 0.8984|± |0.0083| |hellaswag | 1|none | 0|acc |↑ | 0.6084|± |0.0049| | | |none | 0|acc_norm |↑ | 0.7877|± |0.0041| |mmlu | 2|none | |acc |↑ | 0.7695|± |0.0034| | - humanities | 2|none | |acc |↑ | 0.6778|± |0.0065| | - formal_logic | 1|none | 0|acc |↑ | 0.6667|± |0.0422| | - high_school_european_history | 1|none | 0|acc |↑ | 
0.8303|± |0.0293| | - high_school_us_history | 1|none | 0|acc |↑ | 0.9069|± |0.0204| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8565|± |0.0228| | - international_law | 1|none | 0|acc |↑ | 0.8430|± |0.0332| | - jurisprudence | 1|none | 0|acc |↑ | 0.8611|± |0.0334| | - logical_fallacies | 1|none | 0|acc |↑ | 0.8773|± |0.0258| | - moral_disputes | 1|none | 0|acc |↑ | 0.7775|± |0.0224| | - moral_scenarios | 1|none | 0|acc |↑ | 0.5665|± |0.0166| | - philosophy | 1|none | 0|acc |↑ | 0.7556|± |0.0244| | - prehistory | 1|none | 0|acc |↑ | 0.8333|± |0.0207| | - professional_law | 1|none | 0|acc |↑ | 0.5293|± |0.0127| | - world_religions | 1|none | 0|acc |↑ | 0.8713|± |0.0257| | - other | 2|none | |acc |↑ | 0.8011|± |0.0069| | - business_ethics | 1|none | 0|acc |↑ | 0.8100|± |0.0394| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.8151|± |0.0239| | - college_medicine | 1|none | 0|acc |↑ | 0.7919|± |0.0310| | - global_facts | 1|none | 0|acc |↑ | 0.5200|± |0.0502| | - human_aging | 1|none | 0|acc |↑ | 0.7444|± |0.0293| | - management | 1|none | 0|acc |↑ | 0.8350|± |0.0368| | - marketing | 1|none | 0|acc |↑ | 0.9231|± |0.0175| | - medical_genetics | 1|none | 0|acc |↑ | 0.8200|± |0.0386| | - miscellaneous | 1|none | 0|acc |↑ | 0.8838|± |0.0115| | - nutrition | 1|none | 0|acc |↑ | 0.8170|± |0.0221| | - professional_accounting | 1|none | 0|acc |↑ | 0.6454|± |0.0285| | - professional_medicine | 1|none | 0|acc |↑ | 0.8640|± |0.0208| | - virology | 1|none | 0|acc |↑ | 0.5663|± |0.0386| | - social sciences | 2|none | |acc |↑ | 0.8586|± |0.0062| | - econometrics | 1|none | 0|acc |↑ | 0.6842|± |0.0437| | - high_school_geography | 1|none | 0|acc |↑ | 0.8889|± |0.0224| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.9275|± |0.0187| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.8615|± |0.0175| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.9412|± |0.0153| | - high_school_psychology | 1|none | 0|acc |↑ | 0.9266|± |0.0112| | - human_sexuality | 1|none | 0|acc |↑ | 0.8397|± |0.0322| | - professional_psychology | 1|none | 0|acc |↑ | 0.8072|± |0.0160| | - public_relations | 1|none | 0|acc |↑ | 0.7455|± |0.0417| | - security_studies | 1|none | 0|acc |↑ | 0.7918|± |0.0260| | - sociology | 1|none | 0|acc |↑ | 0.8706|± |0.0237| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8900|± |0.0314| | - stem | 2|none | |acc |↑ | 0.7881|± |0.0070| | - abstract_algebra | 1|none | 0|acc |↑ | 0.6300|± |0.0485| | - anatomy | 1|none | 0|acc |↑ | 0.8000|± |0.0346| | - astronomy | 1|none | 0|acc |↑ | 0.8750|± |0.0269| | - college_biology | 1|none | 0|acc |↑ | 0.9028|± |0.0248| | - college_chemistry | 1|none | 0|acc |↑ | 0.5800|± |0.0496| | - college_computer_science | 1|none | 0|acc |↑ | 0.6700|± |0.0473| | - college_mathematics | 1|none | 0|acc |↑ | 0.6700|± |0.0473| | - college_physics | 1|none | 0|acc |↑ | 0.6569|± |0.0472| | - computer_security | 1|none | 0|acc |↑ | 0.8500|± |0.0359| | - conceptual_physics | 1|none | 0|acc |↑ | 0.8723|± |0.0218| | - electrical_engineering | 1|none | 0|acc |↑ | 0.8276|± |0.0315| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.8545|± |0.0182| | - high_school_biology | 1|none | 0|acc |↑ | 0.9387|± |0.0136| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.7833|± |0.0290| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.8900|± |0.0314| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.5963|± |0.0299| | - high_school_physics | 1|none | 0|acc |↑ | 0.7550|± |0.0351| | - high_school_statistics | 1|none | 0|acc |↑ | 0.7917|± |0.0277| | - machine_learning | 
1|none | 0|acc |↑ | 0.6607|± |0.0449| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0922|± |0.0048| |openbookqa | 1|none | 0|acc |↑ | 0.3420|± |0.0212| | | |none | 0|acc_norm |↑ | 0.4600|± |0.0223| |piqa | 1|none | 0|acc |↑ | 0.7916|± |0.0095| | | |none | 0|acc_norm |↑ | 0.7949|± |0.0094| |qnli | 1|none | 0|acc |↑ | 0.8442|± |0.0049| |sciq | 1|none | 0|acc |↑ | 0.9770|± |0.0047| | | |none | 0|acc_norm |↑ | 0.9660|± |0.0057| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.4075|± |0.0037| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.6744|± |0.0164| | | |none | 0|bleu_diff |↑ |20.9239|± |1.1232| | | |none | 0|bleu_max |↑ |38.9157|± |0.8760| | | |none | 0|rouge1_acc |↑ | 0.6818|± |0.0163| | | |none | 0|rouge1_diff|↑ |29.7410|± |1.5930| | | |none | 0|rouge1_max |↑ |65.2284|± |0.9593| | | |none | 0|rouge2_acc |↑ | 0.6267|± |0.0169| | | |none | 0|rouge2_diff|↑ |30.9999|± |1.7090| | | |none | 0|rouge2_max |↑ |54.1666|± |1.1923| | | |none | 0|rougeL_acc |↑ | 0.6756|± |0.0164| | | |none | 0|rougeL_diff|↑ |29.8027|± |1.6052| | | |none | 0|rougeL_max |↑ |63.4625|± |0.9960| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.4064|± |0.0172| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5894|± |0.0154| |winogrande | 1|none | 0|acc |↑ | 0.7206|± |0.0126| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.4330|± |0.0048| |mmlu | 2|none | |acc |↑ |0.7695|± |0.0034| | - humanities | 2|none | |acc |↑ |0.6778|± |0.0065| | - other | 2|none | |acc |↑ |0.8011|± |0.0069| | - social sciences| 2|none | |acc |↑ |0.8586|± |0.0062| | - stem | 2|none | |acc |↑ |0.7881|± |0.0070| Qwen_Qwen3-14B: 29h 46m 1s ✅ Benchmark completed for Qwen_Qwen3-14B 🔥 Starting benchmark for Qwen_Qwen2.5-1.5B-Instruct 🔥 Starting benchmark for Qwen_Qwen2.5-1.5B-Instruct Qwen_Qwen2.5-1.5B-Instruct: 0h 5m 18s ✅ Benchmark completed for Qwen_Qwen2.5-1.5B-Instruct 🔥 Starting benchmark for Qwen_Qwen2.5-0.5B-Instruct Qwen_Qwen2.5-0.5B-Instruct: 0h 5m 0s ✅ Benchmark completed for Qwen_Qwen2.5-0.5B-Instruct 🔥 Starting benchmark for deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B: 0h 5m 19s ✅ Benchmark completed for deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B 🔥 Starting benchmark for Qwen_Qwen3-1.7B Qwen_Qwen3-1.7B: 0h 5m 18s ✅ Benchmark completed for Qwen_Qwen3-1.7B 🔥 Starting benchmark for Qwen_Qwen3-0.6B Qwen_Qwen3-0.6B: 0h 5m 0s ✅ Benchmark completed for Qwen_Qwen3-0.6B 🔥 Starting benchmark for Qwen_Qwen3-4B Qwen_Qwen3-4B: 0h 5m 7s ✅ Benchmark completed for Qwen_Qwen3-4B 🔥 Starting benchmark for Qwen_Qwen2.5-1.5B-Instruct Qwen_Qwen2.5-1.5B-Instruct: 0h 0m 3s ✅ Benchmark completed for Qwen_Qwen2.5-1.5B-Instruct 🔥 Starting benchmark for Qwen_Qwen2.5-0.5B-Instruct Qwen_Qwen2.5-0.5B-Instruct: 0h 0m 3s ✅ Benchmark completed for Qwen_Qwen2.5-0.5B-Instruct 🔥 Starting benchmark for deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B 🔥 Starting benchmark for Qwen_Qwen2.5-1.5B-Instruct Passed argument batch_size = auto:1. 
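With the two 14B runs above complete (Qwen_Qwen2.5-14B-Instruct at 52h 44m 39s, Qwen_Qwen3-14B at 29h 46m 1s), a small self-contained sketch for lining up their headline numbers from this log; the values are copied from the Groups tables and gsm8k rows above, and the summary code itself is plain illustrative Python, not part of the benchmark script.

```python
# Headline scores copied from the two 14B runs logged above:
# mmlu acc and bbh exact_match from the Groups tables, gsm8k strict-match from the task tables.
scores = {
    "Qwen_Qwen2.5-14B-Instruct": {"mmlu": 0.7831, "bbh": 0.1069, "gsm8k_strict": 0.7923},
    "Qwen_Qwen3-14B":            {"mmlu": 0.7695, "bbh": 0.4330, "gsm8k_strict": 0.8984},
}

print(f"{'model':<28}{'mmlu':>8}{'bbh':>8}{'gsm8k':>8}")
for model, s in scores.items():
    print(f"{model:<28}{s['mmlu']:>8.4f}{s['bbh']:>8.4f}{s['gsm8k_strict']:>8.4f}")
```

Note that the bbh group scores for these instruct models include many subtasks at 0.0000 exact_match in the tables above, which suggests the get-answer extraction filter rather than raw capability is dominating that metric, so cross-model bbh comparisons from this log should be read with caution.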
Detecting largest batch size Determined largest batch size: 1 🔥 Starting benchmark for Qwen_Qwen2.5-1.5B-Instruct hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-1.5B-Instruct,trust_remote_code=True,device_map=auto), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 6 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4480|± |0.0157| |anli_r2 | 1|none | 0|acc |↑ | 0.3920|± |0.0154| |anli_r3 | 1|none | 0|acc |↑ | 0.4317|± |0.0143| |arc_challenge | 1|none | 0|acc |↑ | 0.4352|± |0.0145| | | |none | 0|acc_norm |↑ | 0.4684|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.3692|± |0.0054| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8600|± |0.0220| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.4813|± |0.0366| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.2920|± |0.0288| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.4000|± |0.0310| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0160|± |0.0080| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.1920|± |0.0250| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3160|± |0.0295| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.5720|± |0.0314| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2560|± |0.0277| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.1800|± |0.0243| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.5240|± |0.0316| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.4720|± |0.0316| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.6040|± |0.0310| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.6520|± |0.0302| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.4760|± |0.0316| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.5068|± |0.0415| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.4600|± |0.0316| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.3320|± |0.0298| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.3440|± |0.0301| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.4438|± |0.0373| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.6360|± |0.0305| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.1120|± |0.0200| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1520|± |0.0228| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1040|± |0.0193| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.3920|± |0.0309| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.2560|± |0.0277| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0440|± |0.0130| |boolq | 2|none | 0|acc |↑ | 0.7813|± |0.0072| |drop | 3|none | 0|em |↑ | 0.0017|± |0.0004| | | |none | 0|f1 |↑ | 0.0391|± |0.0011| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 
0.1566|± |0.0259| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1768|± |0.0272| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2222|± |0.0296| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2929|± |0.0324| | | |none | 0|acc_norm |↑ | 0.2929|± |0.0324| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2323|± |0.0301| | | |none | 0|acc_norm |↑ | 0.2323|± |0.0301| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1630|± |0.0158| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1520|± |0.0154| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2271|± |0.0179| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3114|± |0.0198| | | |none | 0|acc_norm |↑ | 0.3114|± |0.0198| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2985|± |0.0196| | | |none | 0|acc_norm |↑ | 0.2985|± |0.0196| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1987|± |0.0189| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1228|± |0.0155| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2165|± |0.0195| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3259|± |0.0222| | | |none | 0|acc_norm |↑ | 0.3259|± |0.0222| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2835|± |0.0213| | | |none | 0|acc_norm |↑ | 0.2835|± |0.0213| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.5095|± |0.0138| | | |strict-match | 5|exact_match|↑ | 0.3192|± |0.0128| |hellaswag | 1|none | 0|acc |↑ | 0.5080|± |0.0050| | | |none | 0|acc_norm |↑ | 0.6829|± |0.0046| |mmlu | 2|none | |acc |↑ | 0.6006|± |0.0039| | - humanities | 2|none | |acc |↑ | 0.5422|± |0.0068| | - formal_logic | 1|none | 0|acc |↑ | 0.5238|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7576|± |0.0335| | - high_school_us_history | 1|none | 0|acc |↑ | 0.7255|± |0.0313| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7722|± |0.0273| | - international_law | 1|none | 0|acc |↑ | 0.7438|± |0.0398| | - jurisprudence | 1|none | 0|acc |↑ | 0.7870|± |0.0396| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7669|± |0.0332| | - moral_disputes | 1|none | 0|acc |↑ | 0.6532|± |0.0256| | - moral_scenarios | 1|none | 0|acc |↑ | 0.3006|± |0.0153| | - philosophy | 1|none | 0|acc |↑ | 0.6656|± |0.0268| | - prehistory | 1|none | 0|acc |↑ | 0.6728|± |0.0261| | - professional_law | 1|none | 0|acc |↑ | 0.4394|± |0.0127| | - world_religions | 1|none | 0|acc |↑ | 0.7895|± |0.0313| | - other | 2|none | |acc |↑ | 0.6460|± |0.0083| | - business_ethics | 1|none | 0|acc |↑ | 0.6600|± |0.0476| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.6679|± |0.0290| | - college_medicine | 1|none | 0|acc |↑ | 0.6705|± |0.0358| | - global_facts | 1|none | 0|acc |↑ | 0.2100|± |0.0409| | - human_aging | 1|none | 0|acc |↑ | 0.6278|± |0.0324| | - management | 1|none | 0|acc |↑ | 0.8058|± |0.0392| | - marketing | 1|none | 0|acc |↑ | 0.8376|± |0.0242| | - medical_genetics | 1|none | 0|acc |↑ | 0.6700|± |0.0473| | - miscellaneous | 1|none | 0|acc |↑ | 0.7241|± |0.0160| | - 
nutrition | 1|none | 0|acc |↑ | 0.6732|± |0.0269| | - professional_accounting | 1|none | 0|acc |↑ | 0.4645|± |0.0298| | - professional_medicine | 1|none | 0|acc |↑ | 0.6029|± |0.0297| | - virology | 1|none | 0|acc |↑ | 0.4398|± |0.0386| | - social sciences | 2|none | |acc |↑ | 0.7065|± |0.0080| | - econometrics | 1|none | 0|acc |↑ | 0.4474|± |0.0468| | - high_school_geography | 1|none | 0|acc |↑ | 0.7576|± |0.0305| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8083|± |0.0284| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.6718|± |0.0238| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.7143|± |0.0293| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8220|± |0.0164| | - human_sexuality | 1|none | 0|acc |↑ | 0.7328|± |0.0388| | - professional_psychology | 1|none | 0|acc |↑ | 0.5997|± |0.0198| | - public_relations | 1|none | 0|acc |↑ | 0.5909|± |0.0471| | - security_studies | 1|none | 0|acc |↑ | 0.6980|± |0.0294| | - sociology | 1|none | 0|acc |↑ | 0.8109|± |0.0277| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.7500|± |0.0435| | - stem | 2|none | |acc |↑ | 0.5395|± |0.0086| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3600|± |0.0482| | - anatomy | 1|none | 0|acc |↑ | 0.5333|± |0.0431| | - astronomy | 1|none | 0|acc |↑ | 0.7105|± |0.0369| | - college_biology | 1|none | 0|acc |↑ | 0.6389|± |0.0402| | - college_chemistry | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - college_computer_science | 1|none | 0|acc |↑ | 0.5000|± |0.0503| | - college_mathematics | 1|none | 0|acc |↑ | 0.4100|± |0.0494| | - college_physics | 1|none | 0|acc |↑ | 0.4608|± |0.0496| | - computer_security | 1|none | 0|acc |↑ | 0.7300|± |0.0446| | - conceptual_physics | 1|none | 0|acc |↑ | 0.5787|± |0.0323| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6069|± |0.0407| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.4947|± |0.0257| | - high_school_biology | 1|none | 0|acc |↑ | 0.7613|± |0.0243| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.4975|± |0.0352| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.6800|± |0.0469| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4222|± |0.0301| | - high_school_physics | 1|none | 0|acc |↑ | 0.3709|± |0.0394| | - high_school_statistics | 1|none | 0|acc |↑ | 0.5417|± |0.0340| | - machine_learning | 1|none | 0|acc |↑ | 0.4018|± |0.0465| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0416|± |0.0033| |openbookqa | 1|none | 0|acc |↑ | 0.3200|± |0.0209| | | |none | 0|acc_norm |↑ | 0.4060|± |0.0220| |piqa | 1|none | 0|acc |↑ | 0.7628|± |0.0099| | | |none | 0|acc_norm |↑ | 0.7584|± |0.0100| |qnli | 1|none | 0|acc |↑ | 0.5667|± |0.0067| |sciq | 1|none | 0|acc |↑ | 0.9490|± |0.0070| | | |none | 0|acc_norm |↑ | 0.9390|± |0.0076| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.2826|± |0.0034| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4027|± |0.0172| | | |none | 0|bleu_diff |↑ |-1.9015|± |0.4040| | | |none | 0|bleu_max |↑ |11.9071|± |0.5278| | | |none | 0|rouge1_acc |↑ | 0.4406|± |0.0174| | | |none | 0|rouge1_diff|↑ |-3.1237|± |0.6369| | | |none | 0|rouge1_max |↑ |32.1186|± |0.7534| | | |none | 0|rouge2_acc |↑ | 0.2815|± |0.0157| | | |none | 0|rouge2_diff|↑ |-3.7671|± |0.6452| | | |none | 0|rouge2_max |↑ |16.9153|± |0.7371| | | |none | 0|rougeL_acc |↑ | 0.4321|± |0.0173| | | |none | 0|rougeL_diff|↑ |-3.3951|± |0.6399| | | |none | 0|rougeL_max |↑ |29.3596|± |0.7370| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3121|± |0.0162| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4657|± |0.0150| |winogrande | 1|none | 0|acc |↑ | 0.6275|± |0.0136| | Groups 
|Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.3692|± |0.0054| |mmlu | 2|none | |acc |↑ |0.6006|± |0.0039| | - humanities | 2|none | |acc |↑ |0.5422|± |0.0068| | - other | 2|none | |acc |↑ |0.6460|± |0.0083| | - social sciences| 2|none | |acc |↑ |0.7065|± |0.0080| | - stem | 2|none | |acc |↑ |0.5395|± |0.0086| Qwen_Qwen2.5-1.5B-Instruct: 3h 20m 46s ✅ Benchmark completed for Qwen_Qwen2.5-1.5B-Instruct 🔥 Starting benchmark for Qwen_Qwen2.5-0.5B-Instruct hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen2.5-0.5B-Instruct,trust_remote_code=True,device_map=auto), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 6 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.3240|± |0.0148| |anli_r2 | 1|none | 0|acc |↑ | 0.3420|± |0.0150| |anli_r3 | 1|none | 0|acc |↑ | 0.3475|± |0.0138| |arc_challenge | 1|none | 0|acc |↑ | 0.3020|± |0.0134| | | |none | 0|acc_norm |↑ | 0.3370|± |0.0138| |bbh | 3|get-answer | |exact_match|↑ | 0.2138|± |0.0046| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.6440|± |0.0303| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.0160|± |0.0092| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.2240|± |0.0264| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.1280|± |0.0212| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.1960|± |0.0252| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.0680|± |0.0160| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.3920|± |0.0309| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1600|± |0.0232| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.0720|± |0.0164| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.3120|± |0.0294| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.1800|± |0.0243| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.2320|± |0.0268| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.5480|± |0.0315| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.4760|± |0.0316| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.1233|± |0.0273| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.1000|± |0.0190| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.1720|± |0.0239| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.0240|± |0.0097| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.2528|± |0.0327| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.4880|± |0.0317| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.2640|± |0.0279| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1360|± |0.0217| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1160|± |0.0203| | - 
bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.3280|± |0.0298| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.0200|± |0.0089| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0240|± |0.0097| |boolq | 2|none | 0|acc |↑ | 0.6768|± |0.0082| |drop | 3|none | 0|em |↑ | 0.0003|± |0.0002| | | |none | 0|f1 |↑ | 0.0286|± |0.0008| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1263|± |0.0237| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1768|± |0.0272| | | |strict-match | 0|exact_match|↑ | 0.0051|± |0.0051| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1970|± |0.0283| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2323|± |0.0301| | | |none | 0|acc_norm |↑ | 0.2323|± |0.0301| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2626|± |0.0314| | | |none | 0|acc_norm |↑ | 0.2626|± |0.0314| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1447|± |0.0151| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1868|± |0.0167| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2015|± |0.0172| | | |strict-match | 0|exact_match|↑ | 0.0037|± |0.0026| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2601|± |0.0188| | | |none | 0|acc_norm |↑ | 0.2601|± |0.0188| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2802|± |0.0192| | | |none | 0|acc_norm |↑ | 0.2802|± |0.0192| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1585|± |0.0173| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1585|± |0.0173| | | |strict-match | 0|exact_match|↑ | 0.0022|± |0.0022| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1808|± |0.0182| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2701|± |0.0210| | | |none | 0|acc_norm |↑ | 0.2701|± |0.0210| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2679|± |0.0209| | | |none | 0|acc_norm |↑ | 0.2679|± |0.0209| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.3169|± |0.0128| | | |strict-match | 5|exact_match|↑ | 0.2077|± |0.0112| |hellaswag | 1|none | 0|acc |↑ | 0.4049|± |0.0049| | | |none | 0|acc_norm |↑ | 0.5241|± |0.0050| |mmlu | 2|none | |acc |↑ | 0.4576|± |0.0041| | - humanities | 2|none | |acc |↑ | 0.4219|± |0.0069| | - formal_logic | 1|none | 0|acc |↑ | 0.3175|± |0.0416| | - high_school_european_history | 1|none | 0|acc |↑ | 0.6000|± |0.0383| | - high_school_us_history | 1|none | 0|acc |↑ | 0.5294|± |0.0350| | - high_school_world_history | 1|none | 0|acc |↑ | 0.6076|± |0.0318| | - international_law | 1|none | 0|acc |↑ | 0.7438|± |0.0398| | - jurisprudence | 1|none | 0|acc |↑ | 0.6019|± |0.0473| | - logical_fallacies | 1|none | 0|acc |↑ | 0.4724|± |0.0392| | - moral_disputes | 1|none | 0|acc |↑ | 0.5318|± |0.0269| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2380|± |0.0142| | - philosophy | 1|none | 0|acc |↑ | 0.4759|± |0.0284| | - prehistory | 1|none | 0|acc |↑ | 0.5432|± |0.0277| | - professional_law | 1|none | 0|acc |↑ | 0.3520|± |0.0122| | - world_religions | 1|none | 0|acc |↑ | 0.5906|± |0.0377| | - other | 2|none | |acc |↑ | 0.5082|± |0.0088| | - business_ethics | 1|none | 0|acc |↑ | 0.5300|± 
|0.0502| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.5094|± |0.0308| | - college_medicine | 1|none | 0|acc |↑ | 0.4509|± |0.0379| | - global_facts | 1|none | 0|acc |↑ | 0.3100|± |0.0465| | - human_aging | 1|none | 0|acc |↑ | 0.5426|± |0.0334| | - management | 1|none | 0|acc |↑ | 0.5728|± |0.0490| | - marketing | 1|none | 0|acc |↑ | 0.7393|± |0.0288| | - medical_genetics | 1|none | 0|acc |↑ | 0.5200|± |0.0502| | - miscellaneous | 1|none | 0|acc |↑ | 0.5556|± |0.0178| | - nutrition | 1|none | 0|acc |↑ | 0.5882|± |0.0282| | - professional_accounting | 1|none | 0|acc |↑ | 0.3191|± |0.0278| | - professional_medicine | 1|none | 0|acc |↑ | 0.3676|± |0.0293| | - virology | 1|none | 0|acc |↑ | 0.4337|± |0.0386| | - social sciences | 2|none | |acc |↑ | 0.5301|± |0.0089| | - econometrics | 1|none | 0|acc |↑ | 0.2895|± |0.0427| | - high_school_geography | 1|none | 0|acc |↑ | 0.5657|± |0.0353| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.5544|± |0.0359| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.4410|± |0.0252| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.4748|± |0.0324| | - high_school_psychology | 1|none | 0|acc |↑ | 0.6183|± |0.0208| | - human_sexuality | 1|none | 0|acc |↑ | 0.5573|± |0.0436| | - professional_psychology | 1|none | 0|acc |↑ | 0.4608|± |0.0202| | - public_relations | 1|none | 0|acc |↑ | 0.5273|± |0.0478| | - security_studies | 1|none | 0|acc |↑ | 0.5633|± |0.0318| | - sociology | 1|none | 0|acc |↑ | 0.6667|± |0.0333| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.7200|± |0.0451| | - stem | 2|none | |acc |↑ | 0.3901|± |0.0085| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3300|± |0.0473| | - anatomy | 1|none | 0|acc |↑ | 0.4000|± |0.0423| | - astronomy | 1|none | 0|acc |↑ | 0.4737|± |0.0406| | - college_biology | 1|none | 0|acc |↑ | 0.4444|± |0.0416| | - college_chemistry | 1|none | 0|acc |↑ | 0.2900|± |0.0456| | - college_computer_science | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - college_mathematics | 1|none | 0|acc |↑ | 0.2700|± |0.0446| | - college_physics | 1|none | 0|acc |↑ | 0.2745|± |0.0444| | - computer_security | 1|none | 0|acc |↑ | 0.6900|± |0.0465| | - conceptual_physics | 1|none | 0|acc |↑ | 0.3830|± |0.0318| | - electrical_engineering | 1|none | 0|acc |↑ | 0.5103|± |0.0417| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.3307|± |0.0242| | - high_school_biology | 1|none | 0|acc |↑ | 0.5355|± |0.0284| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.4138|± |0.0347| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.4400|± |0.0499| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.3000|± |0.0279| | - high_school_physics | 1|none | 0|acc |↑ | 0.2517|± |0.0354| | - high_school_statistics | 1|none | 0|acc |↑ | 0.3333|± |0.0321| | - machine_learning | 1|none | 0|acc |↑ | 0.4107|± |0.0467| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0205|± |0.0024| |openbookqa | 1|none | 0|acc |↑ | 0.2440|± |0.0192| | | |none | 0|acc_norm |↑ | 0.3460|± |0.0213| |piqa | 1|none | 0|acc |↑ | 0.7062|± |0.0106| | | |none | 0|acc_norm |↑ | 0.7040|± |0.0107| |qnli | 1|none | 0|acc |↑ | 0.5369|± |0.0067| |sciq | 1|none | 0|acc |↑ | 0.9190|± |0.0086| | | |none | 0|acc_norm |↑ | 0.8830|± |0.0102| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.1342|± |0.0025| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3219|± |0.0164| | | |none | 0|bleu_diff |↑ |-0.2967|± |0.0886| | | |none | 0|bleu_max |↑ | 2.9561|± |0.1554| | | |none | 0|rouge1_acc |↑ | 0.3770|± |0.0170| | | |none | 0|rouge1_diff|↑ |-0.1350|± |0.2257| | | |none | 0|rouge1_max |↑ 
|12.8150|± |0.3731| | | |none | 0|rouge2_acc |↑ | 0.2154|± |0.0144| | | |none | 0|rouge2_diff|↑ |-0.5713|± |0.1837| | | |none | 0|rouge2_max |↑ | 5.3013|± |0.2882| | | |none | 0|rougeL_acc |↑ | 0.3537|± |0.0167| | | |none | 0|rougeL_diff|↑ |-0.4059|± |0.2150| | | |none | 0|rougeL_max |↑ |11.6973|± |0.3485| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2717|± |0.0156| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4184|± |0.0146| |winogrande | 1|none | 0|acc |↑ | 0.5564|± |0.0140| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.2138|± |0.0046| |mmlu | 2|none | |acc |↑ |0.4576|± |0.0041| | - humanities | 2|none | |acc |↑ |0.4219|± |0.0069| | - other | 2|none | |acc |↑ |0.5082|± |0.0088| | - social sciences| 2|none | |acc |↑ |0.5301|± |0.0089| | - stem | 2|none | |acc |↑ |0.3901|± |0.0085| Qwen_Qwen2.5-0.5B-Instruct: 2h 34m 22s ✅ Benchmark completed for Qwen_Qwen2.5-0.5B-Instruct 🔥 Starting benchmark for deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B,trust_remote_code=True,device_map=auto), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 6 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.3560|± |0.0151| |anli_r2 | 1|none | 0|acc |↑ | 0.3620|± |0.0152| |anli_r3 | 1|none | 0|acc |↑ | 0.3625|± |0.0139| |arc_challenge | 1|none | 0|acc |↑ | 0.3422|± |0.0139| | | |none | 0|acc_norm |↑ | 0.3464|± |0.0139| |bbh | 3|get-answer | |exact_match|↑ | 0.4059|± |0.0051| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.8600|± |0.0220| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.3850|± |0.0357| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.4520|± |0.0315| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.2960|± |0.0289| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0160|± |0.0080| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.1600|± |0.0232| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.2560|± |0.0277| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.3600|± |0.0304| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1520|± |0.0228| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.0080|± |0.0056| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.6280|± |0.0306| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.5080|± |0.0317| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.9280|± |0.0164| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.7840|± |0.0261| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.6320|± |0.0306| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.4932|± |0.0415| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.5560|± |0.0315| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.2440|± |0.0272| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.1280|± 
|0.0212| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.1798|± |0.0289| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.4760|± |0.0316| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.1760|± |0.0241| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.4240|± |0.0313| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1800|± |0.0243| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.6280|± |0.0306| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9200|± |0.0172| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0960|± |0.0187| |boolq | 2|none | 0|acc |↑ | 0.6801|± |0.0082| |drop | 3|none | 0|em |↑ | 0.0008|± |0.0003| | | |none | 0|f1 |↑ | 0.0507|± |0.0013| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0606|± |0.0170| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0606|± |0.0170| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1364|± |0.0245| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2727|± |0.0317| | | |none | 0|acc_norm |↑ | 0.2727|± |0.0317| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2727|± |0.0317| | | |none | 0|acc_norm |↑ | 0.2727|± |0.0317| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0788|± |0.0115| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0824|± |0.0118| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1740|± |0.0162| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2418|± |0.0183| | | |none | 0|acc_norm |↑ | 0.2418|± |0.0183| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3205|± |0.0200| | | |none | 0|acc_norm |↑ | 0.3205|± |0.0200| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0714|± |0.0122| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0759|± |0.0125| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1451|± |0.0167| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2612|± |0.0208| | | |none | 0|acc_norm |↑ | 0.2612|± |0.0208| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2723|± |0.0211| | | |none | 0|acc_norm |↑ | 0.2723|± |0.0211| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.7074|± |0.0125| | | |strict-match | 5|exact_match|↑ | 0.7013|± |0.0126| |hellaswag | 1|none | 0|acc |↑ | 0.3633|± |0.0048| | | |none | 0|acc_norm |↑ | 0.4467|± |0.0050| |mmlu | 2|none | |acc |↑ | 0.3606|± |0.0040| | - humanities | 2|none | |acc |↑ | 0.3135|± |0.0067| | - formal_logic | 1|none | 0|acc |↑ | 0.3968|± |0.0438| | - high_school_european_history | 1|none | 0|acc |↑ | 0.3515|± |0.0373| | - high_school_us_history | 1|none | 0|acc |↑ | 0.3235|± |0.0328| | - high_school_world_history | 1|none | 0|acc |↑ | 0.4135|± |0.0321| | - international_law | 1|none | 0|acc |↑ | 0.4463|± |0.0454| | - jurisprudence | 1|none | 0|acc |↑ | 0.4259|± |0.0478| | - 
logical_fallacies | 1|none | 0|acc |↑ | 0.4110|± |0.0387| | - moral_disputes | 1|none | 0|acc |↑ | 0.3555|± |0.0258| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2447|± |0.0144| | - philosophy | 1|none | 0|acc |↑ | 0.4051|± |0.0279| | - prehistory | 1|none | 0|acc |↑ | 0.3765|± |0.0270| | - professional_law | 1|none | 0|acc |↑ | 0.2595|± |0.0112| | - world_religions | 1|none | 0|acc |↑ | 0.2807|± |0.0345| | - other | 2|none | |acc |↑ | 0.3859|± |0.0087| | - business_ethics | 1|none | 0|acc |↑ | 0.4300|± |0.0498| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.3811|± |0.0299| | - college_medicine | 1|none | 0|acc |↑ | 0.3468|± |0.0363| | - global_facts | 1|none | 0|acc |↑ | 0.3300|± |0.0473| | - human_aging | 1|none | 0|acc |↑ | 0.3767|± |0.0325| | - management | 1|none | 0|acc |↑ | 0.5146|± |0.0495| | - marketing | 1|none | 0|acc |↑ | 0.5726|± |0.0324| | - medical_genetics | 1|none | 0|acc |↑ | 0.3900|± |0.0490| | - miscellaneous | 1|none | 0|acc |↑ | 0.3921|± |0.0175| | - nutrition | 1|none | 0|acc |↑ | 0.4118|± |0.0282| | - professional_accounting | 1|none | 0|acc |↑ | 0.2943|± |0.0272| | - professional_medicine | 1|none | 0|acc |↑ | 0.2721|± |0.0270| | - virology | 1|none | 0|acc |↑ | 0.3735|± |0.0377| | - social sciences | 2|none | |acc |↑ | 0.4027|± |0.0088| | - econometrics | 1|none | 0|acc |↑ | 0.3070|± |0.0434| | - high_school_geography | 1|none | 0|acc |↑ | 0.3535|± |0.0341| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.3938|± |0.0353| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.3821|± |0.0246| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.4580|± |0.0324| | - high_school_psychology | 1|none | 0|acc |↑ | 0.4624|± |0.0214| | - human_sexuality | 1|none | 0|acc |↑ | 0.4351|± |0.0435| | - professional_psychology | 1|none | 0|acc |↑ | 0.3317|± |0.0190| | - public_relations | 1|none | 0|acc |↑ | 0.3818|± |0.0465| | - security_studies | 1|none | 0|acc |↑ | 0.4367|± |0.0318| | - sociology | 1|none | 0|acc |↑ | 0.4527|± |0.0352| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.4800|± |0.0502| | - stem | 2|none | |acc |↑ | 0.3650|± |0.0085| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3000|± |0.0461| | - anatomy | 1|none | 0|acc |↑ | 0.3037|± |0.0397| | - astronomy | 1|none | 0|acc |↑ | 0.4013|± |0.0399| | - college_biology | 1|none | 0|acc |↑ | 0.3056|± |0.0385| | - college_chemistry | 1|none | 0|acc |↑ | 0.3100|± |0.0465| | - college_computer_science | 1|none | 0|acc |↑ | 0.3900|± |0.0490| | - college_mathematics | 1|none | 0|acc |↑ | 0.3800|± |0.0488| | - college_physics | 1|none | 0|acc |↑ | 0.2941|± |0.0453| | - computer_security | 1|none | 0|acc |↑ | 0.3600|± |0.0482| | - conceptual_physics | 1|none | 0|acc |↑ | 0.4043|± |0.0321| | - electrical_engineering | 1|none | 0|acc |↑ | 0.4345|± |0.0413| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.4471|± |0.0256| | - high_school_biology | 1|none | 0|acc |↑ | 0.4258|± |0.0281| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.3202|± |0.0328| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.3900|± |0.0490| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.3519|± |0.0291| | - high_school_physics | 1|none | 0|acc |↑ | 0.1987|± |0.0326| | - high_school_statistics | 1|none | 0|acc |↑ | 0.3472|± |0.0325| | - machine_learning | 1|none | 0|acc |↑ | 0.3393|± |0.0449| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0064|± |0.0013| |openbookqa | 1|none | 0|acc |↑ | 0.1980|± |0.0178| | | |none | 0|acc_norm |↑ | 0.3080|± |0.0207| |piqa | 1|none | 0|acc |↑ | 0.6513|± |0.0111| | | |none | 0|acc_norm |↑ | 
0.6578|± |0.0111| |qnli | 1|none | 0|acc |↑ | 0.5054|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.8990|± |0.0095| | | |none | 0|acc_norm |↑ | 0.8450|± |0.0115| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.0090|± |0.0007| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3268|± |0.0164| | | |none | 0|bleu_diff |↑ |-2.4557|± |0.4017| | | |none | 0|bleu_max |↑ |12.6086|± |0.4908| | | |none | 0|rouge1_acc |↑ | 0.3476|± |0.0167| | | |none | 0|rouge1_diff|↑ |-3.9220|± |0.5617| | | |none | 0|rouge1_max |↑ |34.8664|± |0.7473| | | |none | 0|rouge2_acc |↑ | 0.2289|± |0.0147| | | |none | 0|rouge2_diff|↑ |-5.0782|± |0.6278| | | |none | 0|rouge2_max |↑ |20.7313|± |0.7617| | | |none | 0|rougeL_acc |↑ | 0.3329|± |0.0165| | | |none | 0|rougeL_diff|↑ |-4.0907|± |0.5646| | | |none | 0|rougeL_max |↑ |32.3187|± |0.7355| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2938|± |0.0159| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4517|± |0.0155| |winogrande | 1|none | 0|acc |↑ | 0.5493|± |0.0140| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.4059|± |0.0051| |mmlu | 2|none | |acc |↑ |0.3606|± |0.0040| | - humanities | 2|none | |acc |↑ |0.3135|± |0.0067| | - other | 2|none | |acc |↑ |0.3859|± |0.0087| | - social sciences| 2|none | |acc |↑ |0.4027|± |0.0088| | - stem | 2|none | |acc |↑ |0.3650|± |0.0085| deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B: 3h 41m 4s ✅ Benchmark completed for deepseek-ai_DeepSeek-R1-Distill-Qwen-1.5B 🔥 Starting benchmark for Qwen_Qwen3-1.7B hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen3-1.7B,trust_remote_code=True,device_map=auto), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 6 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.4100|± |0.0156| |anli_r2 | 1|none | 0|acc |↑ | 0.4040|± |0.0155| |anli_r3 | 1|none | 0|acc |↑ | 0.4342|± |0.0143| |arc_challenge | 1|none | 0|acc |↑ | 0.3985|± |0.0143| | | |none | 0|acc_norm |↑ | 0.4343|± |0.0145| |bbh | 3|get-answer | |exact_match|↑ | 0.4826|± |0.0048| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9120|± |0.0180| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.4920|± |0.0367| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.6280|± |0.0306| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.2840|± |0.0286| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0520|± |0.0141| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.1800|± |0.0243| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3040|± |0.0292| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.6840|± |0.0295| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.1200|± |0.0206| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.7880|± |0.0259| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.5720|± |0.0314| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.9640|± |0.0118| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.8880|± 
|0.0200| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.7560|± |0.0272| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.5890|± |0.0409| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.5960|± |0.0311| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.5000|± |0.0317| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.0840|± |0.0176| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.4101|± |0.0370| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.7600|± |0.0271| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.2920|± |0.0288| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2080|± |0.0257| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.8120|± |0.0248| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.1800|± |0.0243| |boolq | 2|none | 0|acc |↑ | 0.7765|± |0.0073| |drop | 3|none | 0|em |↑ | 0.0031|± |0.0006| | | |none | 0|f1 |↑ | 0.0753|± |0.0018| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0758|± |0.0189| | | |strict-match | 0|exact_match|↑ | 0.0051|± |0.0051| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0253|± |0.0112| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1263|± |0.0237| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2475|± |0.0307| | | |none | 0|acc_norm |↑ | 0.2475|± |0.0307| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3232|± |0.0333| | | |none | 0|acc_norm |↑ | 0.3232|± |0.0333| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0824|± |0.0118| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0934|± |0.0125| | | |strict-match | 0|exact_match|↑ | 0.0055|± |0.0032| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1722|± |0.0162| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2949|± |0.0195| | | |none | 0|acc_norm |↑ | 0.2949|± |0.0195| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.2967|± |0.0196| | | |none | 0|acc_norm |↑ | 0.2967|± |0.0196| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.0893|± |0.0135| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0804|± |0.0129| | | |strict-match | 0|exact_match|↑ | 0.0045|± |0.0032| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1473|± |0.0168| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.2612|± |0.0208| | | |none | 0|acc_norm |↑ | 0.2612|± |0.0208| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2902|± |0.0215| | | |none | 0|acc_norm |↑ | 0.2902|± |0.0215| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.6922|± |0.0127| | | |strict-match | 5|exact_match|↑ | 0.6899|± |0.0127| |hellaswag | 1|none | 0|acc |↑ | 0.4606|± |0.0050| | | |none | 0|acc_norm |↑ | 0.6038|± |0.0049| |mmlu | 2|none | 
|acc |↑ | 0.5538|± |0.0040| | - humanities | 2|none | |acc |↑ | 0.4854|± |0.0069| | - formal_logic | 1|none | 0|acc |↑ | 0.4841|± |0.0447| | - high_school_european_history | 1|none | 0|acc |↑ | 0.6727|± |0.0366| | - high_school_us_history | 1|none | 0|acc |↑ | 0.6618|± |0.0332| | - high_school_world_history | 1|none | 0|acc |↑ | 0.7046|± |0.0297| | - international_law | 1|none | 0|acc |↑ | 0.6364|± |0.0439| | - jurisprudence | 1|none | 0|acc |↑ | 0.6852|± |0.0449| | - logical_fallacies | 1|none | 0|acc |↑ | 0.7055|± |0.0358| | - moral_disputes | 1|none | 0|acc |↑ | 0.6069|± |0.0263| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2425|± |0.0143| | - philosophy | 1|none | 0|acc |↑ | 0.6141|± |0.0276| | - prehistory | 1|none | 0|acc |↑ | 0.6080|± |0.0272| | - professional_law | 1|none | 0|acc |↑ | 0.3924|± |0.0125| | - world_religions | 1|none | 0|acc |↑ | 0.7427|± |0.0335| | - other | 2|none | |acc |↑ | 0.5993|± |0.0085| | - business_ethics | 1|none | 0|acc |↑ | 0.5900|± |0.0494| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.6000|± |0.0302| | - college_medicine | 1|none | 0|acc |↑ | 0.5896|± |0.0375| | - global_facts | 1|none | 0|acc |↑ | 0.2300|± |0.0423| | - human_aging | 1|none | 0|acc |↑ | 0.5695|± |0.0332| | - management | 1|none | 0|acc |↑ | 0.6699|± |0.0466| | - marketing | 1|none | 0|acc |↑ | 0.8248|± |0.0249| | - medical_genetics | 1|none | 0|acc |↑ | 0.6400|± |0.0482| | - miscellaneous | 1|none | 0|acc |↑ | 0.6871|± |0.0166| | - nutrition | 1|none | 0|acc |↑ | 0.5850|± |0.0282| | - professional_accounting | 1|none | 0|acc |↑ | 0.4149|± |0.0294| | - professional_medicine | 1|none | 0|acc |↑ | 0.5515|± |0.0302| | - virology | 1|none | 0|acc |↑ | 0.4940|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.6341|± |0.0085| | - econometrics | 1|none | 0|acc |↑ | 0.4825|± |0.0470| | - high_school_geography | 1|none | 0|acc |↑ | 0.6919|± |0.0329| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.6995|± |0.0331| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.5410|± |0.0253| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.6345|± |0.0313| | - high_school_psychology | 1|none | 0|acc |↑ | 0.7725|± |0.0180| | - human_sexuality | 1|none | 0|acc |↑ | 0.6641|± |0.0414| | - professional_psychology | 1|none | 0|acc |↑ | 0.5523|± |0.0201| | - public_relations | 1|none | 0|acc |↑ | 0.5455|± |0.0477| | - security_studies | 1|none | 0|acc |↑ | 0.5918|± |0.0315| | - sociology | 1|none | 0|acc |↑ | 0.6816|± |0.0329| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.7400|± |0.0441| | - stem | 2|none | |acc |↑ | 0.5325|± |0.0087| | - abstract_algebra | 1|none | 0|acc |↑ | 0.3900|± |0.0490| | - anatomy | 1|none | 0|acc |↑ | 0.5259|± |0.0431| | - astronomy | 1|none | 0|acc |↑ | 0.6382|± |0.0391| | - college_biology | 1|none | 0|acc |↑ | 0.6806|± |0.0390| | - college_chemistry | 1|none | 0|acc |↑ | 0.4200|± |0.0496| | - college_computer_science | 1|none | 0|acc |↑ | 0.4300|± |0.0498| | - college_mathematics | 1|none | 0|acc |↑ | 0.3900|± |0.0490| | - college_physics | 1|none | 0|acc |↑ | 0.3333|± |0.0469| | - computer_security | 1|none | 0|acc |↑ | 0.7100|± |0.0456| | - conceptual_physics | 1|none | 0|acc |↑ | 0.6511|± |0.0312| | - electrical_engineering | 1|none | 0|acc |↑ | 0.6000|± |0.0408| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.5238|± |0.0257| | - high_school_biology | 1|none | 0|acc |↑ | 0.6806|± |0.0265| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.5320|± |0.0351| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.6700|± |0.0473| | - high_school_mathematics | 
1|none | 0|acc |↑ | 0.3481|± |0.0290| | - high_school_physics | 1|none | 0|acc |↑ | 0.4106|± |0.0402| | - high_school_statistics | 1|none | 0|acc |↑ | 0.5463|± |0.0340| | - machine_learning | 1|none | 0|acc |↑ | 0.4196|± |0.0468| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0222|± |0.0025| |openbookqa | 1|none | 0|acc |↑ | 0.2820|± |0.0201| | | |none | 0|acc_norm |↑ | 0.3760|± |0.0217| |piqa | 1|none | 0|acc |↑ | 0.7242|± |0.0104| | | |none | 0|acc_norm |↑ | 0.7203|± |0.0105| |qnli | 1|none | 0|acc |↑ | 0.5105|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.9310|± |0.0080| | | |none | 0|acc_norm |↑ | 0.9140|± |0.0089| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.1350|± |0.0026| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.4688|± |0.0175| | | |none | 0|bleu_diff |↑ | 2.7742|± |0.9353| | | |none | 0|bleu_max |↑ |26.4045|± |0.8442| | | |none | 0|rouge1_acc |↑ | 0.4627|± |0.0175| | | |none | 0|rouge1_diff|↑ | 4.5361|± |1.2612| | | |none | 0|rouge1_max |↑ |50.0513|± |0.9614| | | |none | 0|rouge2_acc |↑ | 0.3354|± |0.0165| | | |none | 0|rouge2_diff|↑ | 2.8338|± |1.3839| | | |none | 0|rouge2_max |↑ |34.8104|± |1.1528| | | |none | 0|rougeL_acc |↑ | 0.4529|± |0.0174| | | |none | 0|rougeL_diff|↑ | 4.3615|± |1.2613| | | |none | 0|rougeL_max |↑ |47.8592|± |0.9740| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2950|± |0.0160| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4588|± |0.0155| |winogrande | 1|none | 0|acc |↑ | 0.6085|± |0.0137| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.4826|± |0.0048| |mmlu | 2|none | |acc |↑ |0.5538|± |0.0040| | - humanities | 2|none | |acc |↑ |0.4854|± |0.0069| | - other | 2|none | |acc |↑ |0.5993|± |0.0085| | - social sciences| 2|none | |acc |↑ |0.6341|± |0.0085| | - stem | 2|none | |acc |↑ |0.5325|± |0.0087| Qwen_Qwen3-1.7B: 4h 25m 25s ✅ Benchmark completed for Qwen_Qwen3-1.7B 🔥 Starting benchmark for Qwen_Qwen3-0.6B hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen3-0.6B,trust_remote_code=True,device_map=auto), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 6 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.3430|± |0.0150| |anli_r2 | 1|none | 0|acc |↑ | 0.3190|± |0.0147| |anli_r3 | 1|none | 0|acc |↑ | 0.3442|± |0.0137| |arc_challenge | 1|none | 0|acc |↑ | 0.3123|± |0.0135| | | |none | 0|acc_norm |↑ | 0.3422|± |0.0139| |bbh | 3|get-answer | |exact_match|↑ | 0.4148|± |0.0053| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.7560|± |0.0272| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.3529|± |0.0350| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.5800|± |0.0313| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.3160|± |0.0295| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.0000|± |0.0000| | - bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.3280|± |0.0298| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.3960|± |0.0310| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.6160|± |0.0308| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2280|± |0.0266| | - bbh_cot_fewshot_logical_deduction_seven_objects | 
4|get-answer | 3|exact_match|↑ | 0.0920|± |0.0183| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.5520|± |0.0315| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.4520|± |0.0315| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.8240|± |0.0241| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.5800|± |0.0313| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.6640|± |0.0299| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.4726|± |0.0415| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.6000|± |0.0310| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.2120|± |0.0259| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.2360|± |0.0269| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.2921|± |0.0342| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.4880|± |0.0317| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.1280|± |0.0212| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.2840|± |0.0286| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.1240|± |0.0209| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.5640|± |0.0314| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 0.9840|± |0.0080| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.0520|± |0.0141| |boolq | 2|none | 0|acc |↑ | 0.6391|± |0.0084| |drop | 3|none | 0|em |↑ | 0.0007|± |0.0003| | | |none | 0|f1 |↑ | 0.0605|± |0.0013| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1768|± |0.0272| | | |strict-match | 0|exact_match|↑ | 0.0101|± |0.0071| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1010|± |0.0215| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2222|± |0.0296| | | |strict-match | 0|exact_match|↑ | 0.0101|± |0.0071| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.2626|± |0.0314| | | |none | 0|acc_norm |↑ | 0.2626|± |0.0314| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.2677|± |0.0315| | | |none | 0|acc_norm |↑ | 0.2677|± |0.0315| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1996|± |0.0171| | | |strict-match | 0|exact_match|↑ | 0.0110|± |0.0045| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1245|± |0.0141| | | |strict-match | 0|exact_match|↑ | 0.0018|± |0.0018| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2674|± |0.0190| | | |strict-match | 0|exact_match|↑ | 0.0311|± |0.0074| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.2674|± |0.0190| | | |none | 0|acc_norm |↑ | 0.2674|± |0.0190| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3022|± |0.0197| | | |none | 0|acc_norm |↑ | 0.3022|± |0.0197| |gpqa_main_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1897|± |0.0185| | | |strict-match | 0|exact_match|↑ | 0.0067|± |0.0039| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.1027|± |0.0144| | | |strict-match | 0|exact_match|↑ | 0.0045|± |0.0032| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2277|± |0.0198| | | |strict-match | 0|exact_match|↑ | 0.0134|± |0.0054| |gpqa_main_n_shot | 
2|none | 0|acc |↑ | 0.2723|± |0.0211| | | |none | 0|acc_norm |↑ | 0.2723|± |0.0211| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.2701|± |0.0210| | | |none | 0|acc_norm |↑ | 0.2701|± |0.0210| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.4109|± |0.0136| | | |strict-match | 5|exact_match|↑ | 0.4124|± |0.0136| |hellaswag | 1|none | 0|acc |↑ | 0.3763|± |0.0048| | | |none | 0|acc_norm |↑ | 0.4719|± |0.0050| |mmlu | 2|none | |acc |↑ | 0.4013|± |0.0040| | - humanities | 2|none | |acc |↑ | 0.3654|± |0.0068| | - formal_logic | 1|none | 0|acc |↑ | 0.4206|± |0.0442| | - high_school_european_history | 1|none | 0|acc |↑ | 0.5455|± |0.0389| | - high_school_us_history | 1|none | 0|acc |↑ | 0.5000|± |0.0351| | - high_school_world_history | 1|none | 0|acc |↑ | 0.5907|± |0.0320| | - international_law | 1|none | 0|acc |↑ | 0.5620|± |0.0453| | - jurisprudence | 1|none | 0|acc |↑ | 0.4167|± |0.0477| | - logical_fallacies | 1|none | 0|acc |↑ | 0.4724|± |0.0392| | - moral_disputes | 1|none | 0|acc |↑ | 0.3208|± |0.0251| | - moral_scenarios | 1|none | 0|acc |↑ | 0.2425|± |0.0143| | - philosophy | 1|none | 0|acc |↑ | 0.4116|± |0.0280| | - prehistory | 1|none | 0|acc |↑ | 0.4321|± |0.0276| | - professional_law | 1|none | 0|acc |↑ | 0.2986|± |0.0117| | - world_religions | 1|none | 0|acc |↑ | 0.5263|± |0.0383| | - other | 2|none | |acc |↑ | 0.4245|± |0.0087| | - business_ethics | 1|none | 0|acc |↑ | 0.4300|± |0.0498| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.3283|± |0.0289| | - college_medicine | 1|none | 0|acc |↑ | 0.2890|± |0.0346| | - global_facts | 1|none | 0|acc |↑ | 0.2400|± |0.0429| | - human_aging | 1|none | 0|acc |↑ | 0.4664|± |0.0335| | - management | 1|none | 0|acc |↑ | 0.5340|± |0.0494| | - marketing | 1|none | 0|acc |↑ | 0.6325|± |0.0316| | - medical_genetics | 1|none | 0|acc |↑ | 0.3900|± |0.0490| | - miscellaneous | 1|none | 0|acc |↑ | 0.4891|± |0.0179| | - nutrition | 1|none | 0|acc |↑ | 0.4641|± |0.0286| | - professional_accounting | 1|none | 0|acc |↑ | 0.2908|± |0.0271| | - professional_medicine | 1|none | 0|acc |↑ | 0.3235|± |0.0284| | - virology | 1|none | 0|acc |↑ | 0.4458|± |0.0387| | - social sciences | 2|none | |acc |↑ | 0.4777|± |0.0089| | - econometrics | 1|none | 0|acc |↑ | 0.2895|± |0.0427| | - high_school_geography | 1|none | 0|acc |↑ | 0.4697|± |0.0356| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.5233|± |0.0360| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.4128|± |0.0250| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.4160|± |0.0320| | - high_school_psychology | 1|none | 0|acc |↑ | 0.5615|± |0.0213| | - human_sexuality | 1|none | 0|acc |↑ | 0.5038|± |0.0439| | - professional_psychology | 1|none | 0|acc |↑ | 0.4085|± |0.0199| | - public_relations | 1|none | 0|acc |↑ | 0.4545|± |0.0477| | - security_studies | 1|none | 0|acc |↑ | 0.5102|± |0.0320| | - sociology | 1|none | 0|acc |↑ | 0.6418|± |0.0339| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.5700|± |0.0498| | - stem | 2|none | |acc |↑ | 0.3574|± |0.0084| | - abstract_algebra | 1|none | 0|acc |↑ | 0.2900|± |0.0456| | - anatomy | 1|none | 0|acc |↑ | 0.3704|± |0.0417| | - astronomy | 1|none | 0|acc |↑ | 0.4474|± |0.0405| | - college_biology | 1|none | 0|acc |↑ | 0.4583|± |0.0417| | - college_chemistry | 1|none | 0|acc |↑ | 0.3300|± |0.0473| | - college_computer_science | 1|none | 0|acc |↑ | 0.2600|± |0.0441| | - college_mathematics | 1|none | 0|acc |↑ | 0.3300|± |0.0473| | - college_physics | 1|none | 0|acc |↑ | 0.2647|± |0.0439| | - computer_security | 1|none | 0|acc |↑ | 0.6000|± |0.0492| | - 
conceptual_physics | 1|none | 0|acc |↑ | 0.3745|± |0.0316| | - electrical_engineering | 1|none | 0|acc |↑ | 0.4345|± |0.0413| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.3571|± |0.0247| | - high_school_biology | 1|none | 0|acc |↑ | 0.4419|± |0.0283| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.3300|± |0.0331| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.4300|± |0.0498| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.2815|± |0.0274| | - high_school_physics | 1|none | 0|acc |↑ | 0.2252|± |0.0341| | - high_school_statistics | 1|none | 0|acc |↑ | 0.2361|± |0.0290| | - machine_learning | 1|none | 0|acc |↑ | 0.3661|± |0.0457| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0205|± |0.0024| |openbookqa | 1|none | 0|acc |↑ | 0.2160|± |0.0184| | | |none | 0|acc_norm |↑ | 0.3200|± |0.0209| |piqa | 1|none | 0|acc |↑ | 0.6752|± |0.0109| | | |none | 0|acc_norm |↑ | 0.6752|± |0.0109| |qnli | 1|none | 0|acc |↑ | 0.4961|± |0.0068| |sciq | 1|none | 0|acc |↑ | 0.8720|± |0.0106| | | |none | 0|acc_norm |↑ | 0.8330|± |0.0118| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.0193|± |0.0010| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.3305|± |0.0165| | | |none | 0|bleu_diff |↑ |-4.2316|± |0.5836| | | |none | 0|bleu_max |↑ |17.1879|± |0.6583| | | |none | 0|rouge1_acc |↑ | 0.2987|± |0.0160| | | |none | 0|rouge1_diff|↑ |-7.1428|± |0.6809| | | |none | 0|rouge1_max |↑ |37.0632|± |0.8923| | | |none | 0|rouge2_acc |↑ | 0.2166|± |0.0144| | | |none | 0|rouge2_diff|↑ |-7.9206|± |0.7858| | | |none | 0|rouge2_max |↑ |21.7683|± |0.8976| | | |none | 0|rougeL_acc |↑ | 0.2938|± |0.0159| | | |none | 0|rougeL_diff|↑ |-7.4867|± |0.6710| | | |none | 0|rougeL_max |↑ |34.4220|± |0.8733| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.2705|± |0.0156| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.4277|± |0.0145| |winogrande | 1|none | 0|acc |↑ | 0.5517|± |0.0140| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.4148|± |0.0053| |mmlu | 2|none | |acc |↑ |0.4013|± |0.0040| | - humanities | 2|none | |acc |↑ |0.3654|± |0.0068| | - other | 2|none | |acc |↑ |0.4245|± |0.0087| | - social sciences| 2|none | |acc |↑ |0.4777|± |0.0089| | - stem | 2|none | |acc |↑ |0.3574|± |0.0084| Qwen_Qwen3-0.6B: 3h 45m 57s ✅ Benchmark completed for Qwen_Qwen3-0.6B 🔥 Starting benchmark for Qwen_Qwen3-4B hf (pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen3-4B,trust_remote_code=True,device_map=auto), gen_kwargs: (None), limit: None, num_fewshot: None, batch_size: 6 | Tasks |Version| Filter |n-shot| Metric | | Value | |Stderr| |----------------------------------------------------------|------:|-----------------|-----:|-----------|---|------:|---|-----:| |anli_r1 | 1|none | 0|acc |↑ | 0.5500|± |0.0157| |anli_r2 | 1|none | 0|acc |↑ | 0.4610|± |0.0158| |anli_r3 | 1|none | 0|acc |↑ | 0.5133|± |0.0144| |arc_challenge | 1|none | 0|acc |↑ | 0.5043|± |0.0146| | | |none | 0|acc_norm |↑ | 0.5392|± |0.0146| |bbh | 3|get-answer | |exact_match|↑ | 0.7523|± |0.0047| | - bbh_cot_fewshot_boolean_expressions | 4|get-answer | 3|exact_match|↑ | 0.9640|± |0.0118| | - bbh_cot_fewshot_causal_judgement | 4|get-answer | 3|exact_match|↑ | 0.3636|± |0.0353| | - bbh_cot_fewshot_date_understanding | 4|get-answer | 3|exact_match|↑ | 0.7800|± |0.0263| | - bbh_cot_fewshot_disambiguation_qa | 4|get-answer | 3|exact_match|↑ | 0.6120|± |0.0309| | - bbh_cot_fewshot_dyck_languages | 4|get-answer | 3|exact_match|↑ | 0.3800|± |0.0308| | - 
bbh_cot_fewshot_formal_fallacies | 4|get-answer | 3|exact_match|↑ | 0.6360|± |0.0305| | - bbh_cot_fewshot_geometric_shapes | 4|get-answer | 3|exact_match|↑ | 0.5040|± |0.0317| | - bbh_cot_fewshot_hyperbaton | 4|get-answer | 3|exact_match|↑ | 0.9560|± |0.0130| | - bbh_cot_fewshot_logical_deduction_five_objects | 4|get-answer | 3|exact_match|↑ | 0.5800|± |0.0313| | - bbh_cot_fewshot_logical_deduction_seven_objects | 4|get-answer | 3|exact_match|↑ | 0.2920|± |0.0288| | - bbh_cot_fewshot_logical_deduction_three_objects | 4|get-answer | 3|exact_match|↑ | 0.9080|± |0.0183| | - bbh_cot_fewshot_movie_recommendation | 4|get-answer | 3|exact_match|↑ | 0.7040|± |0.0289| | - bbh_cot_fewshot_multistep_arithmetic_two | 4|get-answer | 3|exact_match|↑ | 0.9920|± |0.0056| | - bbh_cot_fewshot_navigate | 4|get-answer | 3|exact_match|↑ | 0.9200|± |0.0172| | - bbh_cot_fewshot_object_counting | 4|get-answer | 3|exact_match|↑ | 0.8480|± |0.0228| | - bbh_cot_fewshot_penguins_in_a_table | 4|get-answer | 3|exact_match|↑ | 0.7740|± |0.0347| | - bbh_cot_fewshot_reasoning_about_colored_objects | 4|get-answer | 3|exact_match|↑ | 0.8600|± |0.0220| | - bbh_cot_fewshot_ruin_names | 4|get-answer | 3|exact_match|↑ | 0.7600|± |0.0271| | - bbh_cot_fewshot_salient_translation_error_detection | 4|get-answer | 3|exact_match|↑ | 0.5880|± |0.0312| | - bbh_cot_fewshot_snarks | 4|get-answer | 3|exact_match|↑ | 0.6966|± |0.0346| | - bbh_cot_fewshot_sports_understanding | 4|get-answer | 3|exact_match|↑ | 0.8280|± |0.0239| | - bbh_cot_fewshot_temporal_sequences | 4|get-answer | 3|exact_match|↑ | 0.8840|± |0.0203| | - bbh_cot_fewshot_tracking_shuffled_objects_five_objects | 4|get-answer | 3|exact_match|↑ | 0.9800|± |0.0089| | - bbh_cot_fewshot_tracking_shuffled_objects_seven_objects| 4|get-answer | 3|exact_match|↑ | 0.9080|± |0.0183| | - bbh_cot_fewshot_tracking_shuffled_objects_three_objects| 4|get-answer | 3|exact_match|↑ | 0.9960|± |0.0040| | - bbh_cot_fewshot_web_of_lies | 4|get-answer | 3|exact_match|↑ | 1.0000|± |0.0000| | - bbh_cot_fewshot_word_sorting | 4|get-answer | 3|exact_match|↑ | 0.4920|± |0.0317| |boolq | 2|none | 0|acc |↑ | 0.8505|± |0.0062| |drop | 3|none | 0|em |↑ | 0.0060|± |0.0008| | | |none | 0|f1 |↑ | 0.0977|± |0.0020| |gpqa_diamond_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1111|± |0.0224| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0859|± |0.0200| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1818|± |0.0275| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_diamond_n_shot | 2|none | 0|acc |↑ | 0.3939|± |0.0348| | | |none | 0|acc_norm |↑ | 0.3939|± |0.0348| |gpqa_diamond_zeroshot | 1|none | 0|acc |↑ | 0.3636|± |0.0343| | | |none | 0|acc_norm |↑ | 0.3636|± |0.0343| |gpqa_extended_cot_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.1136|± |0.0136| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0879|± |0.0121| | | |strict-match | 0|exact_match|↑ | 0.0055|± |0.0032| |gpqa_extended_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2436|± |0.0184| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_extended_n_shot | 2|none | 0|acc |↑ | 0.3407|± |0.0203| | | |none | 0|acc_norm |↑ | 0.3407|± |0.0203| |gpqa_extended_zeroshot | 1|none | 0|acc |↑ | 0.3388|± |0.0203| | | |none | 0|acc_norm |↑ | 0.3388|± |0.0203| |gpqa_main_cot_n_shot | 
2|flexible-extract | 0|exact_match|↑ | 0.0893|± |0.0135| | | |strict-match | 0|exact_match|↑ | 0.0045|± |0.0032| |gpqa_main_cot_zeroshot | 1|flexible-extract | 0|exact_match|↑ | 0.0647|± |0.0116| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_generative_n_shot | 2|flexible-extract | 0|exact_match|↑ | 0.2455|± |0.0204| | | |strict-match | 0|exact_match|↑ | 0.0000|± |0.0000| |gpqa_main_n_shot | 2|none | 0|acc |↑ | 0.3438|± |0.0225| | | |none | 0|acc_norm |↑ | 0.3438|± |0.0225| |gpqa_main_zeroshot | 1|none | 0|acc |↑ | 0.3259|± |0.0222| | | |none | 0|acc_norm |↑ | 0.3259|± |0.0222| |gsm8k | 3|flexible-extract | 5|exact_match|↑ | 0.8484|± |0.0099| | | |strict-match | 5|exact_match|↑ | 0.8567|± |0.0097| |hellaswag | 1|none | 0|acc |↑ | 0.5223|± |0.0050| | | |none | 0|acc_norm |↑ | 0.6833|± |0.0046| |mmlu | 2|none | |acc |↑ | 0.6836|± |0.0037| | - humanities | 2|none | |acc |↑ | 0.5957|± |0.0067| | - formal_logic | 1|none | 0|acc |↑ | 0.6429|± |0.0429| | - high_school_european_history | 1|none | 0|acc |↑ | 0.7939|± |0.0316| | - high_school_us_history | 1|none | 0|acc |↑ | 0.8431|± |0.0255| | - high_school_world_history | 1|none | 0|acc |↑ | 0.8397|± |0.0239| | - international_law | 1|none | 0|acc |↑ | 0.7355|± |0.0403| | - jurisprudence | 1|none | 0|acc |↑ | 0.7407|± |0.0424| | - logical_fallacies | 1|none | 0|acc |↑ | 0.8098|± |0.0308| | - moral_disputes | 1|none | 0|acc |↑ | 0.6965|± |0.0248| | - moral_scenarios | 1|none | 0|acc |↑ | 0.3799|± |0.0162| | - philosophy | 1|none | 0|acc |↑ | 0.7235|± |0.0254| | - prehistory | 1|none | 0|acc |↑ | 0.7438|± |0.0243| | - professional_law | 1|none | 0|acc |↑ | 0.4811|± |0.0128| | - world_religions | 1|none | 0|acc |↑ | 0.7836|± |0.0316| | - other | 2|none | |acc |↑ | 0.7126|± |0.0079| | - business_ethics | 1|none | 0|acc |↑ | 0.7100|± |0.0456| | - clinical_knowledge | 1|none | 0|acc |↑ | 0.7396|± |0.0270| | - college_medicine | 1|none | 0|acc |↑ | 0.7052|± |0.0348| | - global_facts | 1|none | 0|acc |↑ | 0.3400|± |0.0476| | - human_aging | 1|none | 0|acc |↑ | 0.6771|± |0.0314| | - management | 1|none | 0|acc |↑ | 0.8155|± |0.0384| | - marketing | 1|none | 0|acc |↑ | 0.8675|± |0.0222| | - medical_genetics | 1|none | 0|acc |↑ | 0.7600|± |0.0429| | - miscellaneous | 1|none | 0|acc |↑ | 0.7969|± |0.0144| | - nutrition | 1|none | 0|acc |↑ | 0.7255|± |0.0256| | - professional_accounting | 1|none | 0|acc |↑ | 0.5319|± |0.0298| | - professional_medicine | 1|none | 0|acc |↑ | 0.7243|± |0.0271| | - virology | 1|none | 0|acc |↑ | 0.5060|± |0.0389| | - social sciences | 2|none | |acc |↑ | 0.7803|± |0.0074| | - econometrics | 1|none | 0|acc |↑ | 0.6316|± |0.0454| | - high_school_geography | 1|none | 0|acc |↑ | 0.8283|± |0.0269| | - high_school_government_and_politics | 1|none | 0|acc |↑ | 0.8756|± |0.0238| | - high_school_macroeconomics | 1|none | 0|acc |↑ | 0.7462|± |0.0221| | - high_school_microeconomics | 1|none | 0|acc |↑ | 0.8151|± |0.0252| | - high_school_psychology | 1|none | 0|acc |↑ | 0.8716|± |0.0143| | - human_sexuality | 1|none | 0|acc |↑ | 0.7634|± |0.0373| | - professional_psychology | 1|none | 0|acc |↑ | 0.7206|± |0.0182| | - public_relations | 1|none | 0|acc |↑ | 0.6727|± |0.0449| | - security_studies | 1|none | 0|acc |↑ | 0.7061|± |0.0292| | - sociology | 1|none | 0|acc |↑ | 0.8308|± |0.0265| | - us_foreign_policy | 1|none | 0|acc |↑ | 0.8100|± |0.0394| | - stem | 2|none | |acc |↑ | 0.6917|± |0.0080| | - abstract_algebra | 1|none | 0|acc |↑ | 0.6000|± |0.0492| | - anatomy | 1|none | 0|acc |↑ | 0.6148|± |0.0420| | - astronomy | 
1|none | 0|acc |↑ | 0.8026|± |0.0324| | - college_biology | 1|none | 0|acc |↑ | 0.8194|± |0.0322| | - college_chemistry | 1|none | 0|acc |↑ | 0.5400|± |0.0501| | - college_computer_science | 1|none | 0|acc |↑ | 0.6700|± |0.0473| | - college_mathematics | 1|none | 0|acc |↑ | 0.5400|± |0.0501| | - college_physics | 1|none | 0|acc |↑ | 0.5882|± |0.0490| | - computer_security | 1|none | 0|acc |↑ | 0.7900|± |0.0409| | - conceptual_physics | 1|none | 0|acc |↑ | 0.7830|± |0.0269| | - electrical_engineering | 1|none | 0|acc |↑ | 0.7310|± |0.0370| | - elementary_mathematics | 1|none | 0|acc |↑ | 0.6799|± |0.0240| | - high_school_biology | 1|none | 0|acc |↑ | 0.8645|± |0.0195| | - high_school_chemistry | 1|none | 0|acc |↑ | 0.7094|± |0.0319| | - high_school_computer_science | 1|none | 0|acc |↑ | 0.8500|± |0.0359| | - high_school_mathematics | 1|none | 0|acc |↑ | 0.4815|± |0.0305| | - high_school_physics | 1|none | 0|acc |↑ | 0.6093|± |0.0398| | - high_school_statistics | 1|none | 0|acc |↑ | 0.6944|± |0.0314| | - machine_learning | 1|none | 0|acc |↑ | 0.6071|± |0.0464| |nq_open | 4|remove_whitespace| 0|exact_match|↑ | 0.0147|± |0.0020| |openbookqa | 1|none | 0|acc |↑ | 0.2960|± |0.0204| | | |none | 0|acc_norm |↑ | 0.4020|± |0.0219| |piqa | 1|none | 0|acc |↑ | 0.7514|± |0.0101| | | |none | 0|acc_norm |↑ | 0.7514|± |0.0101| |qnli | 1|none | 0|acc |↑ | 0.8087|± |0.0053| |sciq | 1|none | 0|acc |↑ | 0.9550|± |0.0066| | | |none | 0|acc_norm |↑ | 0.9320|± |0.0080| |triviaqa | 3|remove_whitespace| 0|exact_match|↑ | 0.2250|± |0.0031| |truthfulqa_gen | 3|none | 0|bleu_acc |↑ | 0.5838|± |0.0173| | | |none | 0|bleu_diff |↑ |12.2904|± |0.9730| | | |none | 0|bleu_max |↑ |29.1140|± |0.8421| | | |none | 0|rouge1_acc |↑ | 0.6095|± |0.0171| | | |none | 0|rouge1_diff|↑ |17.9082|± |1.3731| | | |none | 0|rouge1_max |↑ |54.7069|± |0.9372| | | |none | 0|rouge2_acc |↑ | 0.5520|± |0.0174| | | |none | 0|rouge2_diff|↑ |18.5593|± |1.4928| | | |none | 0|rouge2_max |↑ |42.6485|± |1.1203| | | |none | 0|rougeL_acc |↑ | 0.5961|± |0.0172| | | |none | 0|rougeL_diff|↑ |17.8681|± |1.3823| | | |none | 0|rougeL_max |↑ |52.3619|± |0.9738| |truthfulqa_mc1 | 2|none | 0|acc |↑ | 0.3672|± |0.0169| |truthfulqa_mc2 | 3|none | 0|acc |↑ | 0.5476|± |0.0158| |winogrande | 1|none | 0|acc |↑ | 0.6582|± |0.0133| | Groups |Version| Filter |n-shot| Metric | |Value | |Stderr| |------------------|------:|----------|------|-----------|---|-----:|---|-----:| |bbh | 3|get-answer| |exact_match|↑ |0.7523|± |0.0047| |mmlu | 2|none | |acc |↑ |0.6836|± |0.0037| | - humanities | 2|none | |acc |↑ |0.5957|± |0.0067| | - other | 2|none | |acc |↑ |0.7126|± |0.0079| | - social sciences| 2|none | |acc |↑ |0.7803|± |0.0074| | - stem | 2|none | |acc |↑ |0.6917|± |0.0080| Qwen_Qwen3-4B: 5h 51m 27s ✅ Benchmark completed for Qwen_Qwen3-4B 🔥 Starting benchmark for openai_gpt-oss-20b openai_gpt-oss-20b: 0h 0m 6s ✅ Benchmark completed for openai_gpt-oss-20b 🔥 Starting benchmark for openai_gpt-oss-20b openai_gpt-oss-20b: 0h 0m 4s ✅ Benchmark completed for openai_gpt-oss-20b 🔥 Starting benchmark for openai_gpt-oss-20b openai_gpt-oss-20b: 0h 0m 4s ✅ Benchmark completed for openai_gpt-oss-20b 🔥 Starting benchmark for openai_gpt-oss-20b openai_gpt-oss-20b: 0h 0m 4s ✅ Benchmark completed for openai_gpt-oss-20b 🔥 Starting benchmark for openai_gpt-oss-20b openai_gpt-oss-20b: 0h 0m 0s ✅ Benchmark completed for openai_gpt-oss-20b 🔥 Starting benchmark for openai_gpt-oss-20b openai_gpt-oss-20b: 0h 0m 4s ✅ Benchmark completed for openai_gpt-oss-20b 🔥 Starting benchmark for openai_gpt-oss-20b 
openai_gpt-oss-20b: 0h 0m 17s ✅ Benchmark completed for openai_gpt-oss-20b
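For reference, every table block above is raw lm-evaluation-harness output (per-task tables followed by a "Groups" summary). The sketch below shows how one of these runs maps onto the harness's Python API, mirroring the logged arguments for the Qwen_Qwen3-4B run (local pretrained path, trust_remote_code, device_map=auto, batch_size 6, no limit, per-task default few-shot). It is a minimal sketch, assuming a recent lm-evaluation-harness release that exposes `lm_eval.simple_evaluate` and `lm_eval.utils.make_table`; the task list here is an illustrative subset, not the full suite used in the runs above.

```python
# Minimal reproduction sketch for one of the benchmark runs logged above.
# Assumptions: a recent lm-evaluation-harness release providing
# lm_eval.simple_evaluate and lm_eval.utils.make_table, and a local HF
# checkpoint at the path used in the log.
import lm_eval
from lm_eval.utils import make_table

results = lm_eval.simple_evaluate(
    model="hf",
    # Mirrors the logged model_args string.
    model_args=(
        "pretrained=/mnt/data8tb/Documents/llm/llm_models/Qwen_Qwen3-4B,"
        "trust_remote_code=True,device_map=auto"
    ),
    # Illustrative subset; the full runs above also cover anli_r1-r3,
    # arc_challenge, bbh, boolq, drop, the gpqa variants, and more.
    tasks=["mmlu", "gsm8k", "hellaswag", "winogrande", "truthfulqa_mc2"],
    num_fewshot=None,   # keep each task's default (e.g. gsm8k stays 5-shot)
    batch_size=6,       # or "auto" to let the harness probe the largest size
    limit=None,
)

# Print the same per-task and group tables that appear in the log above.
print(make_table(results))
if "groups" in results:
    print(make_table(results, "groups"))
```

Running this for each model directory in turn would regenerate tables in the same format as the log, so results can be compared side by side across checkpoints.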