Update README.md
README.md CHANGED

@@ -2,6 +2,8 @@
 license: mit
 language:
 - en
+context_length:
+- 4k
 base_model:
 - microsoft/GRIN-MoE
 - microsoft/Phi-3.5-MoE-instruct
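The only metadata change in the hunk above is the new `context_length` field in the model card front matter. For orientation, that YAML block is standard Hugging Face model card metadata and can be read programmatically; here is a minimal sketch using `huggingface_hub`, with the repo id `microsoft/Phi-mini-MoE-instruct` assumed rather than taken from this diff.

```python
# Minimal sketch: inspect the model card front matter edited in the hunk above.
# Assumption: the card lives at "microsoft/Phi-mini-MoE-instruct" (repo id is
# inferred, not stated in this diff).
from huggingface_hub import ModelCard

card = ModelCard.load("microsoft/Phi-mini-MoE-instruct")
metadata = card.data.to_dict()  # the YAML front matter as a plain dict

print(metadata.get("license"))         # "mit"
print(metadata.get("base_model"))      # the two base model ids
print(metadata.get("context_length"))  # after this change, expected to be ["4k"]
```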
@@ -9,9 +11,11 @@ pipeline_tag: text-generation
 ---
 ## Model Summary
 
-
+Phi-mini-MoE is a lightweight Mixture of Experts (MoE) model with 7.6B total parameters and 2.4B activated parameters. It is compressed and distilled from the base model shared by [Phi-3.5-MoE](https://huggingface.co/microsoft/Phi-3.5-MoE-instruct) and [GRIN-MoE](https://huggingface.co/microsoft/GRIN-MoE) using the [SlimMoE](http://link.to.slimmoe) approach, then post-trained via supervised fine-tuning and direct preference optimization for instruction following and safety. The model is trained on Phi-3 synthetic data and filtered public documents, with a focus on high-quality, reasoning-dense content. It is part of the SlimMoE series, which includes a smaller variant, [Phi-tiny-MoE](https://huggingface.co/microsoft/Phi-tiny-MoE-instruct), with 3.8B total and 1.1B activated parameters.
 
-
+
+References: <br>
+[SlimMoE](http://link.to.slimmoe) <br>
 [Phi-3 Technical Report](https://arxiv.org/abs/2404.14219) <br>
 [GRIN-MoE](https://arxiv.org/abs/2409.12136) <br>
 
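The updated Model Summary above describes an instruction-tuned MoE checkpoint, and the card's own usage snippet sits in an unchanged part of the README (the later hunk header shows its closing `print(output[0]['generated_text'])`). As quick orientation only, here is a minimal text-generation sketch; the repo id `microsoft/Phi-mini-MoE-instruct` and the use of the stock `transformers` pipeline are assumptions, not details confirmed by this diff.

```python
# Minimal sketch, not the card's official example.
# Assumptions: repo id "microsoft/Phi-mini-MoE-instruct"; the checkpoint loads
# through the stock transformers text-generation pipeline (trust_remote_code
# may be needed, as with Phi-3.5-MoE); device_map="auto" requires accelerate.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-mini-MoE-instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "If 3x + 5 = 20, what is x?"}]
output = pipe(messages, max_new_tokens=128)
print(output[0]["generated_text"])  # same access pattern as the README's own snippet
```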
@@ -19,13 +23,7 @@ The Phi-mini-MoE is a 7.6B total parameters with 2.4B activated parameters, ligh
 
 ### Primary Use Cases
 
-The model is intended for commercial and research use in English. The model provides uses for general purpose AI systems and applications which require
-
-1) Memory/compute constrained environments
-2) Latency bound scenarios
-3) Strong reasoning (especially math and logic)
-
-Our model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.
+The model is intended for commercial and research use in English. It is suited to general purpose AI systems and applications that require memory/compute constrained environments and latency bound scenarios.
 
 ### Use Case Considerations
 
@@ -90,12 +88,12 @@ print(output[0]['generated_text'])
 
 ## Benchmarks
 
-To understand the capabilities, we compare Phi-mini-MoE with a set of models over a variety of benchmarks using lm eval.
+To understand its capabilities, we compare Phi-mini-MoE with a set of models over a variety of benchmarks using lm-eval. Detailed evaluation settings can be found in the SlimMoE paper.
 
 | Model | # Total param | # Act. param | MMLU | MMLU pro | BBH | Arc-C (chat) | Human-eval | GSM8K | MT-bench |
 |----------------------|---------------|--------------|-------|----------|-------|---------------|-------------|--------|----------|
 | **MoE Models** |||||||||||
-| Phi
+| Phi-3.5-MoE | 42B | 6.6B | 78.36 | 59.38 | 63.93 | 91.38 | 81.70 | 87.87 | 8.34 |
 | Qwen 1.5 MoE | 14B | 2.7B | 60.73 | 26.49 | 42.65 | 67.24 | 46.30 | 53.07 | 6.55 |
 | DeepSeek V2 Lite | 16B | 2.4B | 56.69 | 17.89 | 36.30 | 61.09 | 54.40 | 63.23 | 6.82 |
 | OL-MoE | 7B | 1.3B | 54.27 | 20.87 | 38.00 | 55.63 | 37.80 | 71.49 | 6.60 |
@@ -110,7 +108,7 @@ To understand the capabilities, we compare Phi-mini-MoE with a set of models ove
 | Qwen 2.5 3B | 3B | 3B | 65.06 | 41.00 | 46.61 | 80.20 | 73.80 | 76.57 | 7.60 |
 | Gemma 3 1B | 1B | 1B | 40.80 | 14.70 | 34.80 | 37.46 | 41.50 | 41.77 | 6.67 |
 | LLaMA 3.2 1B | 1B | 1B | 46.30 | 18.67 | 35.18 | 49.91 | 35.40 | 44.96 | 5.23 |
-| **
+| **SlimMoE Models** |||||||||||
 | Phi-mini-MoE | 7.6B | 2.4B | 70.68 | 49.68 | 55.27 | 84.91 | 73.80 | 84.89 | 7.59 |
 | Phi-tiny-MoE | 3.8B | 1.1B | 60.83 | 36.34 | 45.58 | 76.37 | 58.50 | 78.47 | 7.05 |
 
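The table above reports lm-eval style results, and the new Benchmarks sentence defers the exact evaluation settings to the SlimMoE paper, so they are not reproduced here. As a rough illustration only, here is a minimal sketch of scoring two of the listed tasks with the lm-evaluation-harness Python API; the repo id, dtype, few-shot count, and batch size are placeholders, not the paper's settings.

```python
# Rough sketch of an lm-evaluation-harness run; all settings below are
# placeholders, not the configuration used for the table above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=microsoft/Phi-mini-MoE-instruct,dtype=bfloat16",  # assumed repo id
    tasks=["gsm8k", "mmlu"],  # two of the benchmarks listed in the table
    num_fewshot=5,            # placeholder few-shot setting
    batch_size=8,
)

# results["results"] maps each task name to its metric dict.
for task, metrics in results["results"].items():
    print(task, metrics)
```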