Update README.md
README.md
CHANGED
@@ -100,19 +100,36 @@ Apertus by default supports a context length of up to 65,536 tokens.
Apertus supports tool use.
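
A minimal sketch of what tool calling can look like through the Transformers chat template. The checkpoint name and the `get_weather` tool below are illustrative assumptions, not part of this README; see the model card for the exact tool-use format.

```python
# Sketch of tool calling via the Transformers chat template.
# NOTE: checkpoint name and get_weather are illustrative assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "swiss-ai/Apertus-8B-Instruct-2509"  # assumed checkpoint name


def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return f"Sunny in {city}"


tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "What is the weather in Bern?"}]

# apply_chat_template accepts plain Python functions as tools: it derives a
# JSON schema from each signature and docstring and renders it into the prompt.
input_ids = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True))
```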
### Deployment
Deployment of the models is directly supported by the newest versions of [Transformers](https://github.com/huggingface/transformers), [vLLM](https://github.com/vllm-project/vllm), and [SGLang](https://github.com/sgl-project/sglang), as well as on-device with [MLX](https://github.com/ml-explore/mlx-lm).
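
As a quick start, here is a minimal local-inference sketch with the Transformers `pipeline` API; the checkpoint name is an assumed placeholder, not taken from this README.

```python
# Minimal local-inference sketch with Transformers.
# NOTE: the checkpoint name is an assumed placeholder.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="swiss-ai/Apertus-8B-Instruct-2509",  # assumed checkpoint name
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize the Alps in one sentence."}]
result = generator(messages, max_new_tokens=64)
# Chat-style input returns the full conversation; the last turn is the reply.
print(result[0]["generated_text"][-1]["content"])
```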
## Evaluation
**Pretraining Evaluation:** Performance (%) of Apertus models on *general language understanding* tasks (higher is better), compared to other pretrained models.

| **Model** | **Avg** | **ARC** | **HellaSwag** | **WinoGrande** | **XNLI** | **XCOPA** | **PIQA** |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| **Fully Open Models** | | | | | | | |
| **Apertus-8B** | 65.8 | 72.7 | 59.8 | 70.6 | 45.2 | 66.5 | 79.8 |
| **Apertus-70B** | 67.5 | 70.6 | 64.0 | 73.3 | 45.3 | 69.8 | 81.9 |
| OLMo2-7B | 64.0 | 72.9 | 60.4 | 74.5 | 40.4 | 55.2 | 80.9 |
| OLMo2-32B | 67.7 | 76.2 | 66.7 | 78.6 | 42.9 | 60.1 | 82.1 |
| EuroLLM-1.7B | 54.8 | 57.2 | 44.9 | 58.1 | 40.7 | 55.7 | 72.4 |
| EuroLLM-9B | 62.8 | 67.9 | 57.9 | 68.8 | 41.5 | 61.1 | 79.6 |
| SmolLM2-1.7B | 58.5 | 66.1 | 52.4 | 65.6 | 37.6 | 52.3 | 77.0 |
| SmolLM3-3B | 61.6 | 68.6 | 56.4 | 68.1 | 40.5 | 58.2 | 77.7 |
| Poro-34B | 61.7 | 65.7 | 57.9 | 70.6 | 41.6 | 56.0 | 78.5 |
| **Open-Weight Models** | | | | | | | |
| Llama3.1-8B | 65.4 | 71.6 | 60.0 | 73.4 | 45.3 | 61.8 | 80.1 |
| Llama3.1-70B | 67.3 | 74.4 | 56.5 | 79.4 | 44.3 | 66.7 | 82.3 |
| Qwen2.5-7B | 64.4 | 69.6 | 60.1 | 72.8 | 43.3 | 61.7 | 78.7 |
| Qwen2.5-72B | 69.8 | 76.2 | 67.5 | 78.0 | 46.9 | 68.2 | 82.0 |
| Qwen3-32B | 67.8 | 75.6 | 64.0 | 73.8 | 44.4 | 67.9 | 80.9 |
| Llama4-Scout-16x17B | 67.9 | 74.7 | 66.8 | 73.2 | 43.5 | 67.7 | 81.2 |
| GPT-OSS-20B | 58.1 | 67.0 | 41.5 | 66.5 | 37.4 | 60.4 | 75.6 |

Many additional benchmark evaluations, covering both the pretraining and post-training phases, multilingual evaluations in around one hundred languages, and long-context evaluations, are provided in Section 5 of the [Apertus_Tech_Report.pdf](https://github.com/swiss-ai/apertus-tech-report/blob/main/Apertus_Tech_Report.pdf).
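
For readers who want to run comparable benchmarks themselves, here is a hedged sketch using EleutherAI's lm-evaluation-harness (`pip install lm-eval`). This is a common evaluation setup, not necessarily the exact configuration behind the numbers above, and the checkpoint name is an assumed placeholder.

```python
# Hedged sketch: common benchmark setup with lm-evaluation-harness.
# NOTE: not necessarily the configuration used for the table above;
# the checkpoint name is an assumed placeholder.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=swiss-ai/Apertus-8B-2509",  # assumed checkpoint name
    tasks=["arc_challenge", "hellaswag", "winogrande", "xnli", "xcopa", "piqa"],
    batch_size=8,
)
print(results["results"])
```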
## Training