hanhainebula committed on Sep 23

Commit 4b39f29 (verified) · 1 parent: d50d230

Upload folder using huggingface_hub

Files changed (28)

README.md +2 -21
imgs/bright-performance.png +2 -2
search_results/examples/EVAL/eval_results.json +134 -74
search_results/examples/aops-examples.json +1 -1
search_results/examples/biology-examples.json +2 -2
search_results/examples/earth_science-examples.json +2 -2
search_results/examples/economics-examples.json +2 -2
search_results/examples/leetcode-examples.json +2 -2
search_results/examples/pony-examples.json +2 -2
search_results/examples/psychology-examples.json +2 -2
search_results/examples/robotics-examples.json +2 -2
search_results/examples/stackoverflow-examples.json +2 -2
search_results/examples/sustainable_living-examples.json +2 -2
search_results/examples/theoremqa_questions-examples.json +2 -2
search_results/examples/theoremqa_theorems-examples.json +1 -1
search_results/gpt4_reason/EVAL/eval_results.json +130 -70
search_results/gpt4_reason/aops-gpt4_reason.json +1 -1
search_results/gpt4_reason/biology-gpt4_reason.json +2 -2
search_results/gpt4_reason/earth_science-gpt4_reason.json +2 -2
search_results/gpt4_reason/economics-gpt4_reason.json +2 -2
search_results/gpt4_reason/leetcode-gpt4_reason.json +2 -2
search_results/gpt4_reason/pony-gpt4_reason.json +2 -2
search_results/gpt4_reason/psychology-gpt4_reason.json +2 -2
search_results/gpt4_reason/robotics-gpt4_reason.json +2 -2
search_results/gpt4_reason/stackoverflow-gpt4_reason.json +2 -2
search_results/gpt4_reason/sustainable_living-gpt4_reason.json +2 -2
search_results/gpt4_reason/theoremqa_questions-gpt4_reason.json +2 -2
search_results/gpt4_reason/theoremqa_theorems-gpt4_reason.json +1 -1

README.md CHANGED

@@ -13,7 +13,7 @@ license: apache-2.0
 For more details please refer to our Github: [BGE-Reasoner](https://github.com/FlagOpen/FlagEmbedding/tree/master/research/BGE_Reasoner).
-**BGE-Reasoner-Embed-Qwen3-8B-0923** is an embedding model trained for reasoning-intensive retrieval tasks, based on [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B). It achieves an nDCG@10 of 37.2 on the [BRIGHT](https://brightbenchmark.github.io/) benchmark with original query, demonstrating its strong capability in reasoning-intensive retrieval tasks.
 The search results on BRIGHT are available [here](https://huggingface.co/BAAI/bge-reasoner-embed-qwen3-8b-0923/tree/main/search_results).
@@ -130,29 +130,10 @@ print(scores.cpu().tolist())
 ## Evaluation
-BGE-Reasoner-Embed-Qwen3-8B-0923 exhibits strong performance in reasoning-intensive retrieval tasks, as demonstrated by its results (nDCG@10 = 37.2 using original query) on the BRIGHT benchmark.
 <img src="./imgs/bright-performance.png" alt="BRIGHT Performance" style="zoom:200%;" />
-Note:
-- "**Avg - ALL**" refers to the average performance across **all 12 datasets** in the BRIGHT benchmark.
-- "**Avg - SE**" refers to the average performance across the **7 datasets in the StackExchange subset** of the BRIGHT benchmark.
-- "**Avg - CD**" refers to the average performance across the **2 datasets in the Coding subset** of the BRIGHT benchmark.
-- "**Avg - MT**" refers to the average performance across the **3 datasets in the Theorem-based subset** of the BRIGHT benchmark.
-> Sources of Results:
->
-> [1] https://arxiv.org/pdf/2407.12883
->
-> [2] https://arxiv.org/pdf/2504.20595
->
-> [3] https://github.com/Debrup-61/RaDeR
->
-> [4] https://seed1-5-embedding.github.io
->
-> [5] https://arxiv.org/pdf/2508.07995
->
-> *: results evaluated with our script
 ## Citation

 For more details please refer to our Github: [BGE-Reasoner](https://github.com/FlagOpen/FlagEmbedding/tree/master/research/BGE_Reasoner).
+**BGE-Reasoner-Embed-Qwen3-8B-0923** is an embedding model trained for reasoning-intensive retrieval tasks, based on [Qwen/Qwen3-8B](https://huggingface.co/Qwen/Qwen3-8B). It achieves an nDCG@10 of 37.1 on the [BRIGHT](https://brightbenchmark.github.io/) benchmark with original query, demonstrating its strong capability in reasoning-intensive retrieval tasks.
 The search results on BRIGHT are available [here](https://huggingface.co/BAAI/bge-reasoner-embed-qwen3-8b-0923/tree/main/search_results).
 ## Evaluation
+BGE-Reasoner-Embed-Qwen3-8B-0923 exhibits strong performance in reasoning-intensive retrieval tasks, as demonstrated by its results (nDCG@10 = 37.1 using original query) on the BRIGHT benchmark.
 <img src="./imgs/bright-performance.png" alt="BRIGHT Performance" style="zoom:200%;" />
 ## Citation

imgs/bright-performance.png CHANGED

Git LFS Details

SHA256: a99db7e24959989ecdcb8b0f00c8a2e8517ec5fe51d860e4880950682a9af566
Pointer size: 131 Bytes
Size of remote file: 126 kB

Git LFS Details

SHA256: 7c53cefc13bd0ffef36357bddddd34a26d29ba7e289c0e3076aebf16fad0f575
Pointer size: 131 Bytes
Size of remote file: 126 kB

search_results/examples/EVAL/eval_results.json CHANGED

@@ -1,146 +1,206 @@
 {
-    "biology-examples": {
-        "ndcg_at_1": 0.50485,
-        "ndcg_at_10": 0.54407,
-        "map_at_1": 0.16222,
-        "map_at_10": 0.43293,
-        "recall_at_1": 0.16222,
-        "recall_at_10": 0.63582,
-        "precision_at_1": 0.50485,
-        "precision_at_10": 0.22524,
-        "mrr_at_1": 0.49515,
-        "mrr_at_10": 0.61066
     },
-    "theoremqa_theorems-examples": {
-        "ndcg_at_1": 0.34211,
-        "ndcg_at_10": 0.47592,
-        "map_at_1": 0.18499,
-        "map_at_10": 0.39117,
-        "recall_at_1": 0.18499,
-        "recall_at_10": 0.65116,
-        "precision_at_1": 0.34211,
-        "precision_at_10": 0.11974,
-        "mrr_at_1": 0.34211,
-        "mrr_at_10": 0.45854
     },
-    "psychology-examples": {
-        "ndcg_at_1": 0.36634,
-        "ndcg_at_10": 0.45155,
-        "map_at_1": 0.16264,
-        "map_at_10": 0.33572,
-        "recall_at_1": 0.16264,
-        "recall_at_10": 0.52951,
-        "precision_at_1": 0.36634,
-        "precision_at_10": 0.19406,
-        "mrr_at_1": 0.36634,
-        "mrr_at_10": 0.47711
     },
     "robotics-examples": {
         "ndcg_at_1": 0.28713,
         "ndcg_at_10": 0.31993,
         "map_at_1": 0.14029,
         "map_at_10": 0.23822,
         "recall_at_1": 0.14029,
         "recall_at_10": 0.37368,
         "precision_at_1": 0.28713,
         "precision_at_10": 0.11386,
         "mrr_at_1": 0.28713,
-        "mrr_at_10": 0.37628
     },
     "aops-examples": {
         "ndcg_at_1": 0.13514,
         "ndcg_at_10": 0.13305,
         "map_at_1": 0.03062,
         "map_at_10": 0.07937,
         "recall_at_1": 0.03062,
         "recall_at_10": 0.1598,
         "precision_at_1": 0.13514,
         "precision_at_10": 0.07207,
         "mrr_at_1": 0.13514,
-        "mrr_at_10": 0.20256
     },
     "sustainable_living-examples": {
         "ndcg_at_1": 0.35185,
         "ndcg_at_10": 0.37341,
         "map_at_1": 0.13027,
         "map_at_10": 0.27505,
         "recall_at_1": 0.13027,
         "recall_at_10": 0.45011,
         "precision_at_1": 0.35185,
         "precision_at_10": 0.16019,
         "mrr_at_1": 0.35185,
-        "mrr_at_10": 0.43959
     },
     "leetcode-examples": {
         "ndcg_at_1": 0.28169,
         "ndcg_at_10": 0.32309,
         "map_at_1": 0.17535,
         "map_at_10": 0.25267,
         "recall_at_1": 0.17535,
         "recall_at_10": 0.41808,
         "precision_at_1": 0.28169,
         "precision_at_10": 0.07254,
         "mrr_at_1": 0.28169,
-        "mrr_at_10": 0.37478
-    },
-    "earth_science-examples": {
-        "ndcg_at_1": 0.57759,
-        "ndcg_at_10": 0.55426,
-        "map_at_1": 0.23269,
-        "map_at_10": 0.44959,
-        "recall_at_1": 0.23269,
-        "recall_at_10": 0.58342,
-        "precision_at_1": 0.57759,
-        "precision_at_10": 0.2181,
-        "mrr_at_1": 0.57759,
-        "mrr_at_10": 0.67135
     },
     "economics-examples": {
         "ndcg_at_1": 0.29126,
         "ndcg_at_10": 0.33832,
         "map_at_1": 0.13804,
         "map_at_10": 0.23934,
         "recall_at_1": 0.13804,
         "recall_at_10": 0.35798,
         "precision_at_1": 0.29126,
         "precision_at_10": 0.15049,
         "mrr_at_1": 0.29126,
-        "mrr_at_10": 0.37666
-    },
-    "theoremqa_questions-examples": {
-        "ndcg_at_1": 0.39691,
-        "ndcg_at_10": 0.4124,
-        "map_at_1": 0.22809,
-        "map_at_10": 0.37578,
-        "recall_at_1": 0.22809,
-        "recall_at_10": 0.45814,
-        "precision_at_1": 0.39691,
-        "precision_at_10": 0.09639,
-        "mrr_at_1": 0.39691,
-        "mrr_at_10": 0.43798
     },
     "stackoverflow-examples": {
         "ndcg_at_1": 0.30769,
         "ndcg_at_10": 0.34329,
         "map_at_1": 0.12812,
         "map_at_10": 0.26281,
         "recall_at_1": 0.12812,
         "recall_at_10": 0.43316,
         "precision_at_1": 0.30769,
         "precision_at_10": 0.12222,
         "mrr_at_1": 0.2906,
-        "mrr_at_10": 0.3824
     },
-    "pony-examples": {
-        "ndcg_at_1": 0.24107,
-        "ndcg_at_10": 0.1903,
-        "map_at_1": 0.0161,
-        "map_at_10": 0.05666,
-        "recall_at_1": 0.0161,
-        "recall_at_10": 0.09687,
-        "precision_at_1": 0.24107,
-        "precision_at_10": 0.17054,
-        "mrr_at_1": 0.24107,
-        "mrr_at_10": 0.37504
     }
 }

 {
+    "earth_science-examples": {
+        "ndcg_at_1": 0.57759,
+        "ndcg_at_10": 0.55426,
+        "ndcg_at_100": 0.64815,
+        "map_at_1": 0.23269,
+        "map_at_10": 0.44959,
+        "map_at_100": 0.49319,
+        "recall_at_1": 0.23269,
+        "recall_at_10": 0.58342,
+        "recall_at_100": 0.87067,
+        "precision_at_1": 0.57759,
+        "precision_at_10": 0.2181,
+        "precision_at_100": 0.03966,
+        "mrr_at_1": 0.57759,
+        "mrr_at_10": 0.67135,
+        "mrr_at_100": 0.67713
     },
+    "theoremqa_questions-examples": {
+        "ndcg_at_1": 0.39691,
+        "ndcg_at_10": 0.4124,
+        "ndcg_at_100": 0.45757,
+        "map_at_1": 0.22809,
+        "map_at_10": 0.37578,
+        "map_at_100": 0.38535,
+        "recall_at_1": 0.22809,
+        "recall_at_10": 0.45814,
+        "recall_at_100": 0.64633,
+        "precision_at_1": 0.39691,
+        "precision_at_10": 0.09639,
+        "precision_at_100": 0.01309,
+        "mrr_at_1": 0.39691,
+        "mrr_at_10": 0.43798,
+        "mrr_at_100": 0.44688
     },
+    "pony-examples": {
+        "ndcg_at_1": 0.24107,
+        "ndcg_at_10": 0.18695,
+        "ndcg_at_100": 0.27913,
+        "map_at_1": 0.0161,
+        "map_at_10": 0.05544,
+        "map_at_100": 0.09322,
+        "recall_at_1": 0.0161,
+        "recall_at_10": 0.09518,
+        "recall_at_100": 0.38397,
+        "precision_at_1": 0.24107,
+        "precision_at_10": 0.16696,
+        "precision_at_100": 0.07187,
+        "mrr_at_1": 0.24107,
+        "mrr_at_10": 0.37267,
+        "mrr_at_100": 0.38838
     },
     "robotics-examples": {
         "ndcg_at_1": 0.28713,
         "ndcg_at_10": 0.31993,
+        "ndcg_at_100": 0.40945,
         "map_at_1": 0.14029,
         "map_at_10": 0.23822,
+        "map_at_100": 0.26881,
         "recall_at_1": 0.14029,
         "recall_at_10": 0.37368,
+        "recall_at_100": 0.708,
         "precision_at_1": 0.28713,
         "precision_at_10": 0.11386,
+        "precision_at_100": 0.0298,
         "mrr_at_1": 0.28713,
+        "mrr_at_10": 0.37628,
+        "mrr_at_100": 0.38883
     },
     "aops-examples": {
         "ndcg_at_1": 0.13514,
         "ndcg_at_10": 0.13305,
+        "ndcg_at_100": 0.21311,
         "map_at_1": 0.03062,
         "map_at_10": 0.07937,
+        "map_at_100": 0.10056,
         "recall_at_1": 0.03062,
         "recall_at_10": 0.1598,
+        "recall_at_100": 0.39924,
         "precision_at_1": 0.13514,
         "precision_at_10": 0.07207,
+        "precision_at_100": 0.01946,
         "mrr_at_1": 0.13514,
+        "mrr_at_10": 0.20256,
+        "mrr_at_100": 0.21432
     },
     "sustainable_living-examples": {
         "ndcg_at_1": 0.35185,
         "ndcg_at_10": 0.37341,
+        "ndcg_at_100": 0.48528,
         "map_at_1": 0.13027,
         "map_at_10": 0.27505,
+        "map_at_100": 0.32648,
         "recall_at_1": 0.13027,
         "recall_at_10": 0.45011,
+        "recall_at_100": 0.80612,
         "precision_at_1": 0.35185,
         "precision_at_10": 0.16019,
+        "precision_at_100": 0.0375,
         "mrr_at_1": 0.35185,
+        "mrr_at_10": 0.43959,
+        "mrr_at_100": 0.4535
     },
     "leetcode-examples": {
         "ndcg_at_1": 0.28169,
         "ndcg_at_10": 0.32309,
+        "ndcg_at_100": 0.38938,
         "map_at_1": 0.17535,
         "map_at_10": 0.25267,
+        "map_at_100": 0.26771,
         "recall_at_1": 0.17535,
         "recall_at_10": 0.41808,
+        "recall_at_100": 0.69519,
         "precision_at_1": 0.28169,
         "precision_at_10": 0.07254,
+        "precision_at_100": 0.01254,
         "mrr_at_1": 0.28169,
+        "mrr_at_10": 0.37478,
+        "mrr_at_100": 0.38255
     },
     "economics-examples": {
         "ndcg_at_1": 0.29126,
         "ndcg_at_10": 0.33832,
+        "ndcg_at_100": 0.43577,
         "map_at_1": 0.13804,
         "map_at_10": 0.23934,
+        "map_at_100": 0.29841,
         "recall_at_1": 0.13804,
         "recall_at_10": 0.35798,
+        "recall_at_100": 0.72009,
         "precision_at_1": 0.29126,
         "precision_at_10": 0.15049,
+        "precision_at_100": 0.04738,
         "mrr_at_1": 0.29126,
+        "mrr_at_10": 0.37666,
+        "mrr_at_100": 0.38917
     },
     "stackoverflow-examples": {
         "ndcg_at_1": 0.30769,
         "ndcg_at_10": 0.34329,
+        "ndcg_at_100": 0.44943,
         "map_at_1": 0.12812,
         "map_at_10": 0.26281,
+        "map_at_100": 0.30158,
         "recall_at_1": 0.12812,
         "recall_at_10": 0.43316,
+        "recall_at_100": 0.77889,
         "precision_at_1": 0.30769,
         "precision_at_10": 0.12222,
+        "precision_at_100": 0.03265,
         "mrr_at_1": 0.2906,
+        "mrr_at_10": 0.3824,
+        "mrr_at_100": 0.39268
     },
+    "biology-examples": {
+        "ndcg_at_1": 0.50485,
+        "ndcg_at_10": 0.54407,
+        "ndcg_at_100": 0.62536,
+        "map_at_1": 0.16222,
+        "map_at_10": 0.43293,
+        "map_at_100": 0.46687,
+        "recall_at_1": 0.16222,
+        "recall_at_10": 0.63582,
+        "recall_at_100": 0.91484,
+        "precision_at_1": 0.50485,
+        "precision_at_10": 0.22524,
+        "precision_at_100": 0.03301,
+        "mrr_at_1": 0.49515,
+        "mrr_at_10": 0.61066,
+        "mrr_at_100": 0.61562
+    },
+    "theoremqa_theorems-examples": {
+        "ndcg_at_1": 0.34211,
+        "ndcg_at_10": 0.47592,
+        "ndcg_at_100": 0.54161,
+        "map_at_1": 0.18499,
+        "map_at_10": 0.39117,
+        "map_at_100": 0.41019,
+        "recall_at_1": 0.18499,
+        "recall_at_10": 0.65116,
+        "recall_at_100": 0.89583,
+        "precision_at_1": 0.34211,
+        "precision_at_10": 0.11974,
+        "precision_at_100": 0.01763,
+        "mrr_at_1": 0.34211,
+        "mrr_at_10": 0.45854,
+        "mrr_at_100": 0.46713
+    },
+    "psychology-examples": {
+        "ndcg_at_1": 0.36634,
+        "ndcg_at_10": 0.45155,
+        "ndcg_at_100": 0.52282,
+        "map_at_1": 0.16264,
+        "map_at_10": 0.33572,
+        "map_at_100": 0.38264,
+        "recall_at_1": 0.16264,
+        "recall_at_10": 0.52951,
+        "recall_at_100": 0.8142,
+        "precision_at_1": 0.36634,
+        "precision_at_10": 0.19406,
+        "precision_at_100": 0.04168,
+        "mrr_at_1": 0.36634,
+        "mrr_at_10": 0.47711,
+        "mrr_at_100": 0.48678
     }
 }
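The README change from 37.2 to 37.1 can be cross-checked against this file. The sketch below assumes the headline score is the unweighted mean of `ndcg_at_10` over the 12 datasets of the `examples` split (the commit itself does not say how the number is derived). The values are copied from the diff above, and `mean_ndcg` is an illustrative helper name, not part of any evaluation script.

```python
# ndcg_at_10 per dataset, copied from the updated eval_results.json (examples split).
NDCG_AT_10_NEW = {
    "earth_science": 0.55426,
    "theoremqa_questions": 0.4124,
    "pony": 0.18695,
    "robotics": 0.31993,
    "aops": 0.13305,
    "sustainable_living": 0.37341,
    "leetcode": 0.32309,
    "economics": 0.33832,
    "stackoverflow": 0.34329,
    "biology": 0.54407,
    "theoremqa_theorems": 0.47592,
    "psychology": 0.45155,
}

# The only ndcg_at_10 value that differs on the old side of the diff.
PONY_NDCG_AT_10_OLD = 0.1903


def mean_ndcg(scores: dict) -> float:
    """Unweighted mean over datasets (assumed aggregation, see lead-in)."""
    return sum(scores.values()) / len(scores)


new_avg = mean_ndcg(NDCG_AT_10_NEW)
old_avg = mean_ndcg({**NDCG_AT_10_NEW, "pony": PONY_NDCG_AT_10_OLD})

print(f"old: {old_avg * 100:.1f}  new: {new_avg * 100:.1f}")
# prints "old: 37.2  new: 37.1"
```

Under that assumption, the drop of 0.1 is explained entirely by the `pony` score moving from 0.1903 to 0.18695; every other `ndcg_at_10` value is unchanged in this commit.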

search_results/examples/aops-examples.json CHANGED

@@ -1,6 +1,6 @@
 {
     "eval_name": "bright_short",
-    "model_name": "model_name",
     "reranker_name": "NoReranker",
     "split": "examples",
     "dataset_name": "aops",

 {
     "eval_name": "bright_short",
+    "model_name": "bge-reasoner-embed-qwen3-8b-0923",
     "reranker_name": "NoReranker",
     "split": "examples",
     "dataset_name": "aops",

search_results/examples/biology-examples.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:73ff1cf7ff2e2aaaf8c4589982ae31198a58744ee5aba0393a10a4c4cb040f1c
-size 16553902

 version https://git-lfs.github.com/spec/v1
+oid sha256:928f1da50e99b17ac672c010c980b311507b0b8e93428514d56ad0a10892cefd
+size 16553924
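The three-line bodies in this and the following diffs are Git LFS pointer files, not the JSON payloads themselves. The sketch below reads the two pointers shown above and compares their sizes; `parse_lfs_pointer` is a hypothetical helper written for this illustration, not a Git LFS API.

```python
OLD_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:73ff1cf7ff2e2aaaf8c4589982ae31198a58744ee5aba0393a10a4c4cb040f1c
size 16553902
"""

NEW_POINTER = """\
version https://git-lfs.github.com/spec/v1
oid sha256:928f1da50e99b17ac672c010c980b311507b0b8e93428514d56ad0a10892cefd
size 16553924
"""


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer file into a dict."""
    fields = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(" ")
        fields[key] = value
    fields["size"] = int(fields["size"])  # size is the payload length in bytes
    return fields


old = parse_lfs_pointer(OLD_POINTER)
new = parse_lfs_pointer(NEW_POINTER)
size_delta = new["size"] - old["size"]

# Length difference between the new model name and the old placeholder.
name_delta = len("bge-reasoner-embed-qwen3-8b-0923") - len("model_name")

print(size_delta, name_delta)
```

Both deltas come out to 22 bytes. That is consistent with only the `model_name` field changing in this file, as in the small JSON files shown uncompressed elsewhere in the commit, though the pointer alone cannot confirm it.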

search_results/examples/earth_science-examples.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a3c02bb3fd382207d49a63897dd34ec396c9e1175e35921fd3c1b95fcf5620f2
-size 18149199

 version https://git-lfs.github.com/spec/v1
+oid sha256:a00edb97fd961347afda6259846fe9b684174f275b915adea3eec6eea93033b0
+size 18149221

search_results/examples/economics-examples.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ca54dce3b507334fd5fb1752bc7979ce1ffc19529339f04e3b02bc3aeea77e65
-size 16602998

 version https://git-lfs.github.com/spec/v1
+oid sha256:d6a9cd4068cbcbba2532bd3fe2092907dbc82ca9fb3c114d70189cc0950df6d5
+size 16603020

search_results/examples/leetcode-examples.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:5605b0a3e4529214669438e02af8060bdc07cd168872742e0b13b0f16422b0c0
-size 18343889

 version https://git-lfs.github.com/spec/v1
+oid sha256:fd47cf17fcc232b564d218738592d804194428b91433b0f6a5893561f48e04d2
+size 18343911

search_results/examples/pony-examples.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:2df5caf0f85432924ec0848713162cce488c6f63f63c2da36907c4aab1981c2c
-size 14633844

 version https://git-lfs.github.com/spec/v1
+oid sha256:ff7599e7f23965bf0ed6a5fbc08e6150a93cd1591860882bd4be60a9f02dc147
+size 14638419

search_results/examples/psychology-examples.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:27abb7eeb6ef1fd323c8d0edd4693711ccb2f3ab475ad7cb852d79486ec64c15
-size 15426536

 version https://git-lfs.github.com/spec/v1
+oid sha256:0f434872fcdbb50f7dc83d31ba9764da9d86bd6007301a9495912b185175d737
+size 15426558

search_results/examples/robotics-examples.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ccaa49fe930eef8a75a4f95775041d720fd98f590ee4a2e43639d85060f0f841
-size 14420936

 version https://git-lfs.github.com/spec/v1
+oid sha256:de79b06df718828c14911206d08913ee23982cbb366506adc948b996cd47fcc4
+size 14420958

search_results/examples/stackoverflow-examples.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e8583c0c7ad3583159cdbfe1af10f6d7d02867d1d966a649517158809a0cdcdf
-size 19083216

 version https://git-lfs.github.com/spec/v1
+oid sha256:90839303d3929463e570827fa26b2d96d51d38bc69327bc2a1ee9f61b61299f4
+size 19083238

search_results/examples/sustainable_living-examples.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:424f15ba7d04f4e36a09940fe5ca4c8e1186904cb8cf828341121911aa4a2f74
-size 17535702

 version https://git-lfs.github.com/spec/v1
+oid sha256:5f9da842cc39fdfa47065b4fd5378ca6beed0fe51293d11cb2390e2edcbf2fbb
+size 17535724

search_results/examples/theoremqa_questions-examples.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:11ef6db56d097e389508bece9f359d2e5cac0a5d9c5966343fa0d20475366e84
-size 14691873

 version https://git-lfs.github.com/spec/v1
+oid sha256:2af96c492ee0666553820a5057d2ca20c36a3125794897099e5c6e13f40c3575
+size 14691895

search_results/examples/theoremqa_theorems-examples.json CHANGED

@@ -1,6 +1,6 @@
 {
     "eval_name": "bright_short",
-    "model_name": "model_name",
     "reranker_name": "NoReranker",
     "split": "examples",
     "dataset_name": "theoremqa_theorems",

 {
     "eval_name": "bright_short",
+    "model_name": "bge-reasoner-embed-qwen3-8b-0923",
     "reranker_name": "NoReranker",
     "split": "examples",
     "dataset_name": "theoremqa_theorems",

search_results/gpt4_reason/EVAL/eval_results.json CHANGED

@@ -1,146 +1,206 @@
 {
     "biology-gpt4_reason": {
         "ndcg_at_1": 0.58252,
         "ndcg_at_10": 0.6238,
         "map_at_1": 0.20161,
         "map_at_10": 0.52321,
         "recall_at_1": 0.20161,
         "recall_at_10": 0.70649,
         "precision_at_1": 0.58252,
         "precision_at_10": 0.24757,
         "mrr_at_1": 0.57282,
-        "mrr_at_10": 0.67721
-    },
-    "stackoverflow-gpt4_reason": {
-        "ndcg_at_1": 0.33333,
-        "ndcg_at_10": 0.39248,
-        "map_at_1": 0.15138,
-        "map_at_10": 0.3128,
-        "recall_at_1": 0.15138,
-        "recall_at_10": 0.48351,
-        "precision_at_1": 0.33333,
-        "precision_at_10": 0.13761,
-        "mrr_at_1": 0.34188,
-        "mrr_at_10": 0.43379
     },
     "sustainable_living-gpt4_reason": {
         "ndcg_at_1": 0.36111,
         "ndcg_at_10": 0.40345,
         "map_at_1": 0.15859,
         "map_at_10": 0.3077,
         "recall_at_1": 0.15859,
         "recall_at_10": 0.47684,
         "precision_at_1": 0.36111,
         "precision_at_10": 0.16944,
         "mrr_at_1": 0.36111,
-        "mrr_at_10": 0.46219
     },
     "leetcode-gpt4_reason": {
         "ndcg_at_1": 0.23239,
         "ndcg_at_10": 0.28348,
         "map_at_1": 0.14894,
         "map_at_10": 0.21761,
         "recall_at_1": 0.14894,
         "recall_at_10": 0.37946,
         "precision_at_1": 0.23239,
         "precision_at_10": 0.0662,
         "mrr_at_1": 0.23239,
-        "mrr_at_10": 0.31958
     },
     "pony-gpt4_reason": {
         "ndcg_at_1": 0.4375,
-        "ndcg_at_10": 0.31629,
         "map_at_1": 0.02577,
-        "map_at_10": 0.09796,
         "recall_at_1": 0.02577,
-        "recall_at_10": 0.1572,
         "precision_at_1": 0.4375,
-        "precision_at_10": 0.27321,
         "mrr_at_1": 0.4375,
-        "mrr_at_10": 0.58011
     },
     "aops-gpt4_reason": {
         "ndcg_at_1": 0.0991,
         "ndcg_at_10": 0.12337,
         "map_at_1": 0.02301,
         "map_at_10": 0.07287,
         "recall_at_1": 0.02301,
         "recall_at_10": 0.15546,
         "precision_at_1": 0.0991,
         "precision_at_10": 0.07748,
         "mrr_at_1": 0.0991,
-        "mrr_at_10": 0.17159
     },
     "theoremqa_questions-gpt4_reason": {
         "ndcg_at_1": 0.39175,
         "ndcg_at_10": 0.39407,
         "map_at_1": 0.2268,
         "map_at_10": 0.36037,
         "recall_at_1": 0.2268,
         "recall_at_10": 0.42649,
         "precision_at_1": 0.39175,
         "precision_at_10": 0.08918,
         "mrr_at_1": 0.39175,
-        "mrr_at_10": 0.42989
-    },
-    "theoremqa_theorems-gpt4_reason": {
-        "ndcg_at_1": 0.31579,
-        "ndcg_at_10": 0.41518,
-        "map_at_1": 0.18061,
-        "map_at_10": 0.34361,
-        "recall_at_1": 0.18061,
-        "recall_at_10": 0.55138,
-        "precision_at_1": 0.31579,
-        "precision_at_10": 0.10526,
-        "mrr_at_1": 0.31579,
-        "mrr_at_10": 0.41206
-    },
-    "earth_science-gpt4_reason": {
-        "ndcg_at_1": 0.68966,
-        "ndcg_at_10": 0.62277,
-        "map_at_1": 0.27438,
-        "map_at_10": 0.51581,
-        "recall_at_1": 0.27438,
-        "recall_at_10": 0.62818,
-        "precision_at_1": 0.68966,
-        "precision_at_10": 0.23966,
-        "mrr_at_1": 0.68966,
-        "mrr_at_10": 0.75591
-    },
-    "economics-gpt4_reason": {
-        "ndcg_at_1": 0.25243,
-        "ndcg_at_10": 0.35251,
-        "map_at_1": 0.1225,
-        "map_at_10": 0.24774,
-        "recall_at_1": 0.1225,
-        "recall_at_10": 0.39583,
-        "precision_at_1": 0.25243,
-        "precision_at_10": 0.16214,
-        "mrr_at_1": 0.25243,
-        "mrr_at_10": 0.36646
     },
     "psychology-gpt4_reason": {
         "ndcg_at_1": 0.43564,
         "ndcg_at_10": 0.49823,
         "map_at_1": 0.19875,
         "map_at_10": 0.37395,
         "recall_at_1": 0.19875,
         "recall_at_10": 0.56675,
         "precision_at_1": 0.43564,
         "precision_at_10": 0.20396,
         "mrr_at_1": 0.43564,
-        "mrr_at_10": 0.55344
     },
-    "robotics-gpt4_reason": {
-        "ndcg_at_1": 0.30693,
-        "ndcg_at_10": 0.3438,
-        "map_at_1": 0.13674,
-        "map_at_10": 0.25764,
-        "recall_at_1": 0.13674,
-        "recall_at_10": 0.39831,
-        "precision_at_1": 0.30693,
-        "precision_at_10": 0.12277,
-        "mrr_at_1": 0.30693,
-        "mrr_at_10": 0.40114
     }
 }

 {
+    "earth_science-gpt4_reason": {
+        "ndcg_at_1": 0.68966,
+        "ndcg_at_10": 0.62277,
+        "ndcg_at_100": 0.70079,
+        "map_at_1": 0.27438,
+        "map_at_10": 0.51581,
+        "map_at_100": 0.55688,
+        "recall_at_1": 0.27438,
+        "recall_at_10": 0.62818,
+        "recall_at_100": 0.87189,
+        "precision_at_1": 0.68966,
+        "precision_at_10": 0.23966,
+        "precision_at_100": 0.04017,
+        "mrr_at_1": 0.68966,
+        "mrr_at_10": 0.75591,
+        "mrr_at_100": 0.76114
+    },
+    "economics-gpt4_reason": {
+        "ndcg_at_1": 0.25243,
+        "ndcg_at_10": 0.35251,
+        "ndcg_at_100": 0.4404,
+        "map_at_1": 0.1225,
+        "map_at_10": 0.24774,
+        "map_at_100": 0.30356,
+        "recall_at_1": 0.1225,
+        "recall_at_10": 0.39583,
+        "recall_at_100": 0.7242,
+        "precision_at_1": 0.25243,
+        "precision_at_10": 0.16214,
+        "precision_at_100": 0.0467,
+        "mrr_at_1": 0.25243,
+        "mrr_at_10": 0.36646,
+        "mrr_at_100": 0.37617
+    },
+    "robotics-gpt4_reason": {
+        "ndcg_at_1": 0.30693,
+        "ndcg_at_10": 0.3438,
+        "ndcg_at_100": 0.42992,
+        "map_at_1": 0.13674,
+        "map_at_10": 0.25764,
+        "map_at_100": 0.29106,
+        "recall_at_1": 0.13674,
+        "recall_at_10": 0.39831,
+        "recall_at_100": 0.71796,
+        "precision_at_1": 0.30693,
+        "precision_at_10": 0.12277,
+        "precision_at_100": 0.02941,
+        "mrr_at_1": 0.30693,
+        "mrr_at_10": 0.40114,
+        "mrr_at_100": 0.41358
+    },
     "biology-gpt4_reason": {
         "ndcg_at_1": 0.58252,
         "ndcg_at_10": 0.6238,
+        "ndcg_at_100": 0.69335,
         "map_at_1": 0.20161,
         "map_at_10": 0.52321,
+        "map_at_100": 0.55406,
         "recall_at_1": 0.20161,
         "recall_at_10": 0.70649,
+        "recall_at_100": 0.94461,
         "precision_at_1": 0.58252,
         "precision_at_10": 0.24757,
+        "precision_at_100": 0.03408,
         "mrr_at_1": 0.57282,
+        "mrr_at_10": 0.67721,
+        "mrr_at_100": 0.68104
     },
     "sustainable_living-gpt4_reason": {
         "ndcg_at_1": 0.36111,
         "ndcg_at_10": 0.40345,
+        "ndcg_at_100": 0.50126,
         "map_at_1": 0.15859,
         "map_at_10": 0.3077,
+        "map_at_100": 0.35186,
         "recall_at_1": 0.15859,
         "recall_at_10": 0.47684,
+        "recall_at_100": 0.80474,
         "precision_at_1": 0.36111,
         "precision_at_10": 0.16944,
+        "precision_at_100": 0.03602,
         "mrr_at_1": 0.36111,
+        "mrr_at_10": 0.46219,
+        "mrr_at_100": 0.47221
     },
     "leetcode-gpt4_reason": {
         "ndcg_at_1": 0.23239,
         "ndcg_at_10": 0.28348,
+        "ndcg_at_100": 0.35507,
         "map_at_1": 0.14894,
         "map_at_10": 0.21761,
+        "map_at_100": 0.23384,
         "recall_at_1": 0.14894,
         "recall_at_10": 0.37946,
+        "recall_at_100": 0.67911,
         "precision_at_1": 0.23239,
         "precision_at_10": 0.0662,
+        "precision_at_100": 0.01218,
         "mrr_at_1": 0.23239,
+        "mrr_at_10": 0.31958,
+        "mrr_at_100": 0.32962
     },
     "pony-gpt4_reason": {
         "ndcg_at_1": 0.4375,
+        "ndcg_at_10": 0.31554,
+        "ndcg_at_100": 0.40062,
         "map_at_1": 0.02577,
+        "map_at_10": 0.0977,
+        "map_at_100": 0.16595,
         "recall_at_1": 0.02577,
+        "recall_at_10": 0.15626,
+        "recall_at_100": 0.51074,
         "precision_at_1": 0.4375,
+        "precision_at_10": 0.27143,
+        "precision_at_100": 0.09634,
         "mrr_at_1": 0.4375,
+        "mrr_at_10": 0.58209,
+        "mrr_at_100": 0.58798
     },
     "aops-gpt4_reason": {
         "ndcg_at_1": 0.0991,
         "ndcg_at_10": 0.12337,
+        "ndcg_at_100": 0.19981,
         "map_at_1": 0.02301,
         "map_at_10": 0.07287,
+        "map_at_100": 0.09278,
         "recall_at_1": 0.02301,
         "recall_at_10": 0.15546,
+        "recall_at_100": 0.38814,
         "precision_at_1": 0.0991,
         "precision_at_10": 0.07748,
+        "precision_at_100": 0.01865,
         "mrr_at_1": 0.0991,
+        "mrr_at_10": 0.17159,
+        "mrr_at_100": 0.18433
     },
     "theoremqa_questions-gpt4_reason": {
         "ndcg_at_1": 0.39175,
         "ndcg_at_10": 0.39407,
+        "ndcg_at_100": 0.44046,
         "map_at_1": 0.2268,
         "map_at_10": 0.36037,
+        "map_at_100": 0.37029,
         "recall_at_1": 0.2268,
         "recall_at_10": 0.42649,
+        "recall_at_100": 0.61392,
         "precision_at_1": 0.39175,
         "precision_at_10": 0.08918,
+        "precision_at_100": 0.01273,
         "mrr_at_1": 0.39175,
+        "mrr_at_10": 0.42989,
+        "mrr_at_100": 0.43888
     },
     "psychology-gpt4_reason": {
         "ndcg_at_1": 0.43564,
         "ndcg_at_10": 0.49823,
+        "ndcg_at_100": 0.56403,
         "map_at_1": 0.19875,
         "map_at_10": 0.37395,
+        "map_at_100": 0.41725,
         "recall_at_1": 0.19875,
         "recall_at_10": 0.56675,
+        "recall_at_100": 0.83745,
         "precision_at_1": 0.43564,
         "precision_at_10": 0.20396,
+        "precision_at_100": 0.04386,
         "mrr_at_1": 0.43564,
+        "mrr_at_10": 0.55344,
+        "mrr_at_100": 0.56089
     },
+    "stackoverflow-gpt4_reason": {
+        "ndcg_at_1": 0.33333,
+        "ndcg_at_10": 0.39248,
+        "ndcg_at_100": 0.49454,
+        "map_at_1": 0.15138,
+        "map_at_10": 0.3128,
+        "map_at_100": 0.3534,
+        "recall_at_1": 0.15138,
+        "recall_at_10": 0.48351,
+        "recall_at_100": 0.81653,
+        "precision_at_1": 0.33333,
+        "precision_at_10": 0.13761,
+        "precision_at_100": 0.03299,
+        "mrr_at_1": 0.34188,
+        "mrr_at_10": 0.43379,
+        "mrr_at_100": 0.44454
+    },
+    "theoremqa_theorems-gpt4_reason": {
+        "ndcg_at_1": 0.31579,
+        "ndcg_at_10": 0.41518,
+        "ndcg_at_100": 0.50262,
+        "map_at_1": 0.18061,
+        "map_at_10": 0.34361,
+        "map_at_100": 0.37053,
+        "recall_at_1": 0.18061,
+        "recall_at_10": 0.55138,
+        "recall_at_100": 0.86952,
+        "precision_at_1": 0.31579,
+        "precision_at_10": 0.10526,
+        "precision_at_100": 0.01724,
+        "mrr_at_1": 0.31579,
+        "mrr_at_10": 0.41206,
+        "mrr_at_100": 0.42468
     }
 }
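The README notes removed in this commit describe four averages (all 12 datasets, 7 StackExchange, 2 Coding, 3 Theorem-based). The sketch below computes them for the `gpt4_reason` split from the `ndcg_at_10` values above. The assignment of dataset names to groups is an assumption based on BRIGHT's task categories and is not stated in this commit; `group_average` is an illustrative helper name.

```python
# ndcg_at_10 per dataset, copied from the updated gpt4_reason eval_results.json.
NDCG_AT_10 = {
    "earth_science": 0.62277,
    "economics": 0.35251,
    "robotics": 0.3438,
    "biology": 0.6238,
    "sustainable_living": 0.40345,
    "leetcode": 0.28348,
    "pony": 0.31554,
    "aops": 0.12337,
    "theoremqa_questions": 0.39407,
    "psychology": 0.49823,
    "stackoverflow": 0.39248,
    "theoremqa_theorems": 0.41518,
}

# Assumed grouping (7 / 2 / 3 datasets, matching the counts in the README note).
GROUPS = {
    "Avg - SE": ["biology", "earth_science", "economics", "psychology",
                 "robotics", "stackoverflow", "sustainable_living"],
    "Avg - CD": ["leetcode", "pony"],
    "Avg - MT": ["aops", "theoremqa_questions", "theoremqa_theorems"],
}


def group_average(scores: dict, names: list) -> float:
    """Unweighted mean of the scores for the given dataset names."""
    return sum(scores[name] for name in names) / len(names)


averages = {"Avg - ALL": group_average(NDCG_AT_10, list(NDCG_AT_10))}
for label, names in GROUPS.items():
    averages[label] = group_average(NDCG_AT_10, names)

for label, value in averages.items():
    print(f"{label}: {value * 100:.1f}")
```

The same function can be pointed at the `examples` split by swapping in that file's `ndcg_at_10` values.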

search_results/gpt4_reason/aops-gpt4_reason.json CHANGED

@@ -1,6 +1,6 @@
 {
     "eval_name": "bright_short",
-    "model_name": "model_name",
     "reranker_name": "NoReranker",
     "split": "gpt4_reason",
     "dataset_name": "aops",

 {
     "eval_name": "bright_short",
+    "model_name": "bge-reasoner-embed-qwen3-8b-0923",
     "reranker_name": "NoReranker",
     "split": "gpt4_reason",
     "dataset_name": "aops",

search_results/gpt4_reason/biology-gpt4_reason.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:a314b2948bd2a48f44d4d949645051600cd1cba4e600dc9b89c31753570e4c8a
-size 16518223

 version https://git-lfs.github.com/spec/v1
+oid sha256:ca5bc5385b40a26e4dae800a1e3ef5da67292abc64e25e1cd5f93087609b6551
+size 16518245

search_results/gpt4_reason/earth_science-gpt4_reason.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1f2fa207b74ef2808c22bbed1d086941cd177cd442c26b7ea8cdd4991bce3411
-size 18093660

 version https://git-lfs.github.com/spec/v1
+oid sha256:e327e4d348e68ba0f82c378b47201730922e77f6ab801a7edbb7f35a2162b39c
+size 18093682

search_results/gpt4_reason/economics-gpt4_reason.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:41293661ab3cec60d7620fe33a4689ccbd0f5b67ee5f8dbf5ba589fb0f70fb0f
-size 16558285

 version https://git-lfs.github.com/spec/v1
+oid sha256:a04232d66f4b8ebdbe04605a31e58220cf882687ac175b709b53d7e68dc852c4
+size 16558307

search_results/gpt4_reason/leetcode-gpt4_reason.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:4520f31f78b5d0543af77eee8c7f29ddfe7d3c17960778ca654893490fac36d9
-size 18454399

 version https://git-lfs.github.com/spec/v1
+oid sha256:4173c80bc28c902b3185acb767ca736293420b414d0d635c6c4356df0f40f41e
+size 18454421

search_results/gpt4_reason/pony-gpt4_reason.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:f9f79f2d325a64c1362fcf7cc7707b6ddb3adfbd164c246bb70dcac003d7765f
-size 14625227

 version https://git-lfs.github.com/spec/v1
+oid sha256:535d52285a03f186f6a2cec113d561f159dc31f33a718624c677b7766c324489
+size 14626574

search_results/gpt4_reason/psychology-gpt4_reason.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:1347e9db837913ab175f6b5daee4b8e1e3babb69ba3a43396a5b49cec2655d0d
-size 15322577

 version https://git-lfs.github.com/spec/v1
+oid sha256:6934acd47feaf0d4660f27dd6698974b1e39bc3b33aa366e642bbdaa0e98ef80
+size 15322599

search_results/gpt4_reason/robotics-gpt4_reason.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:6fd310df6a895d2d566674614b1976af7a85248ceca02e583d7c2b61bcde49d2
-size 14451325

 version https://git-lfs.github.com/spec/v1
+oid sha256:da4db05bc8a4a6373f0c55ba14e9eeece3183309a5686a7c21d2c4524f28754a
+size 14451347

search_results/gpt4_reason/stackoverflow-gpt4_reason.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:9f98cedc37011e2e1900a716e3330342562bf028dfd6ea6fd056a7c1b44fb293
-size 19155033

 version https://git-lfs.github.com/spec/v1
+oid sha256:597ef4f451d4a8fed29f02e19a3ba9f48a5bbeb831e2b730542c006998dbd3ed
+size 19155055

search_results/gpt4_reason/sustainable_living-gpt4_reason.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:ef85268f068dccb8f806d7f6d5deeb748e9d72f7ca82cd092ee3c93b4b9d52fd
-size 17382433

 version https://git-lfs.github.com/spec/v1
+oid sha256:6834741e9fb06e733e8709f48df1f4765a6cd7a3f17908f4a9b3505f5373066d
+size 17382455

search_results/gpt4_reason/theoremqa_questions-gpt4_reason.json CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:96c0abc7644d10c4a13532f2420a6a3ef9d952917ef422e7cb0658b199f783c2
-size 14333959

 version https://git-lfs.github.com/spec/v1
+oid sha256:ff71cae336466a1d4fbedac67a9f278e6b45c1f322d5c13e0d692ec57af8742c
+size 14333981

search_results/gpt4_reason/theoremqa_theorems-gpt4_reason.json CHANGED

@@ -1,6 +1,6 @@
 {
     "eval_name": "bright_short",
-    "model_name": "model_name",
     "reranker_name": "NoReranker",
     "split": "gpt4_reason",
     "dataset_name": "theoremqa_theorems",

 {
     "eval_name": "bright_short",
+    "model_name": "bge-reasoner-embed-qwen3-8b-0923",
     "reranker_name": "NoReranker",
     "split": "gpt4_reason",
     "dataset_name": "theoremqa_theorems",