pangjh3 committed
Commit 9b1c42f · 1 Parent: a0e98c9

modified: .gitignore
modified: .pre-commit-config.yaml
modified: Makefile
modified: pyproject.toml
modified: src/about.py
- .gitignore +3 -0
- .pre-commit-config.yaml +3 -0
- Makefile +3 -0
- pyproject.toml +3 -0
- src/about.py +17 -3
.gitignore
CHANGED

@@ -13,3 +13,6 @@ eval-results-bk/
 logs/
 
 
+
+
+
.pre-commit-config.yaml
CHANGED

@@ -53,3 +53,6 @@ repos:
 - id: ruff
 
 
+
+
+
Makefile
CHANGED

@@ -13,3 +13,6 @@ quality:
 ruff check .
 
 
+
+
+
pyproject.toml
CHANGED

@@ -13,3 +13,6 @@ line_length = 119
 line-length = 119
 
 
+
+
+
src/about.py
CHANGED

@@ -26,7 +26,19 @@ NUM_FEWSHOT = 0 # Change with your few shot
 
 
 # Your leaderboard name
-TITLE = """<h1 align="center" id="space-title">ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning</h1>"""
+# TITLE = """<h1 align="center" id="space-title">ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning</h1>"""
+TITLE = """<h1 align="center" id="space-title">ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning</h1>
+<div align="center">
+    <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">
+        <img src="https://img.shields.io/badge/Dataset%20License-CC%20BY--NC--SA%204.0-blue.svg" alt="Dataset License: CC BY-NC-SA 4.0">
+    </a>
+    <a href="https://arxiv.org/abs/2511.14366">
+        <img src="https://img.shields.io/badge/Paper-arXiv-red.svg" alt="Paper">
+    </a>
+    <a href="https://huggingface.co/datasets/opencompass/ATLAS">
+        <img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Dataset-orange" alt="Hugging Face Dataset">
+    </a>
+</div>"""
 
 # What does your leaderboard evaluate?
 INTRODUCTION_TEXT = """

@@ -43,10 +55,12 @@ INTRODUCTION_TEXT = """
 - **Materials Science** - Composite materials, metal materials, organic polymer materials, and material synthesis
 
 ## 📊 Evaluation Metrics
-- **Accuracy (%)**: Overall correctness of predictions across all domains, judged by LLM-as-Judge (OpenAI o4-mini /
+- **Accuracy (%)**: Overall correctness of predictions across all domains, judged by LLM-as-Judge (OpenAI o4-mini / GPT-OSS-120B)
 - **mG-Pass@2**: Multi-generation Pass rate for 2 predictions (measures consistency of model outputs)
 - **mG-Pass@4**: Multi-generation Pass rate for 4 predictions (measures stability of reasoning capabilities)
 The leaderboard displays model performance sorted by average accuracy, with domain-specific scores reflecting strengths in different scientific fields. All metrics are derived from the ATLAS validation/test set (≈800 expert-created original problems).
+
+#### 📧 If you have any questions about submissions or leaderboards, please contact: [email protected]
 """
 
 # Which evaluations are you running? how can people reproduce what you have?

@@ -86,7 +100,7 @@ To reproduce our evaluation results:
 """
 
 EVALUATION_QUEUE_TEXT = """
-## Submit Your ATLAS Results
+## Submit Your ATLAS Test Set Results
 
 Results can be submitted as evaluation outputs in JSON format. Each submission should include predictions and reasoning content for all test questions.
 
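The diff above describes mG-Pass@2 and mG-Pass@4 as multi-generation pass rates, but the exact mG-Pass@k formula is defined in the ATLAS paper, not on this page. As a rough illustration of the underlying idea of estimating pass rates from multiple generations, here is the standard unbiased pass@k estimator (from the Codex paper); it is a related estimator, not the ATLAS metric itself:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k: the probability that at least one of k
    samples drawn without replacement from n generations is correct, given
    that c of the n generations are correct."""
    if n - c < k:
        # Fewer incorrect generations than k: every k-subset contains a hit.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 4 generations per question, 2 judged correct by the LLM judge.
# pass_at_k(4, 2, 2) = 1 - C(2,2)/C(4,2) = 1 - 1/6 ≈ 0.833
```

Averaging this quantity over all questions gives a per-benchmark pass rate; mG-Pass@k additionally conditions on generation consistency as specified in the paper.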
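The submission text says results are uploaded as JSON containing predictions and reasoning content for all test questions, but no schema is shown on this page. A hypothetical sketch of such a file (all field names here are illustrative assumptions, not the official ATLAS format) might look like:

```python
import json

# Hypothetical submission records: field names ("question_id", "prediction",
# "reasoning") are illustrative only; consult the leaderboard for the real schema.
submission = [
    {
        "question_id": "atlas-0001",       # illustrative question identifier
        "prediction": "42",                # model's final answer
        "reasoning": "Step-by-step derivation ...",  # reasoning content
    },
]

with open("submission.json", "w", encoding="utf-8") as f:
    json.dump(submission, f, ensure_ascii=False, indent=2)
```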