swj0419 committed
Commit 69693bf · 1 Parent(s): 9f1953b
Files changed (1)
  1. README.md +21 -26
README.md CHANGED
@@ -5,10 +5,8 @@ language:
  tags:
  - moe
  - olmo
- - olmoe
+ - flexolmo
  co2_eq_emissions: 1
- datasets:
- - allenai/OLMoE-mix-0924
  library_name: transformers
  ---

@@ -17,18 +15,28 @@ library_name: transformers

  # Model Summary

- > FlexOlmo-7x7B-1T is a Mixture-of-Experts LLM with 1B active and 7B total parameters released in September 2024 (0924). It yields state-of-the-art performance among models with a similar cost (1B) and is competitive with much larger models like Llama2-13B.
+ > FlexOlmo is a new kind of language model that unlocks a new paradigm of data collaboration. With FlexOlmo, data owners can contribute to the development of open language models without giving up control of their data. There is no need to share raw data directly, and data contributors can decide when their data is active in the model (i.e., who can make use of it), deactivate it at any time, and receive attribution whenever it is used for inference.

- This information and more can also be found on the [**OLMoE GitHub repository**](https://github.com/allenai/OLMoE).
+ > FlexOlmo-7x7B-1T (without router training) is a Mixture-of-Experts LLM with 33B total parameters, combining independently trained experts on public-mix, news, books, code, academic texts, creative writing, and Reddit data.
+
+ This information and more can also be found in the resources below:
  - **Paper**: https://arxiv.org/abs/2409.02060
- - **Pretraining** [Checkpoints](https://hf.co/allenai/OLMoE-1B-7B-0924), [Code](https://github.com/allenai/OLMo/tree/Muennighoff/MoE), [Data](https://huggingface.co/datasets/allenai/OLMoE-mix-0924) and [Logs](https://wandb.ai/ai2-llm/olmoe/reports/OLMoE-1B-7B-0924--Vmlldzo4OTcyMjU3).
- - **SFT (Supervised Fine-Tuning)** [Checkpoints](https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT), [Code](https://github.com/allenai/open-instruct/tree/olmoe-sft), [Data](https://hf.co/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE) and [Logs](https://github.com/allenai/OLMoE/blob/main/logs/olmoe-sft-logs.txt).
- - **DPO/KTO (Direct Preference Optimization/Kahneman-Tversky Optimization)**, [Checkpoints](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct), [Preference Data](https://hf.co/datasets/allenai/ultrafeedback_binarized_cleaned), [DPO code](https://github.com/allenai/open-instruct/tree/olmoe-sft), [KTO code](https://github.com/Muennighoff/kto/blob/master/kto.py) and [Logs](https://github.com/allenai/OLMoE/blob/main/logs/olmoe-dpo-logs.txt).
+ - **Code**: https://github.com/allenai/OLMoE
+ - **Data and corresponding models**:
+ Corpora and their corresponding expert models are listed in the table below:
+ | Corpus | Model |
+ |--------|-------|
+ | [News]() | [Flex-news-2x7B-1T](https://huggingface.co/allenai/Flex-news-2x7B-1T) |
+ | [Books]() | [Flex-pes2o-2x7B-1T](https://huggingface.co/allenai/Flex-pes2o-2x7B-1T) |
+ | [Code]() | [Flex-code-2x7B-1T](https://huggingface.co/allenai/Flex-code-2x7B-1T) |
+ | [Academic]() | [Flex-academic-2x7B-1T](https://huggingface.co/allenai/Flex-academic-2x7B-1T) |
+ | [Creative Writing]() | [Flex-creative-2x7B-1T](https://huggingface.co/allenai/Flex-creative-2x7B-1T) |
+ | [Reddit]() | [Flex-reddit-2x7B-1T](https://huggingface.co/allenai/Flex-reddit-2x7B-1T) |
+

  # Use

  Install `transformers` **from [this source](https://github.com/swj0419/transformers_flexolmo)** and run:
-
  ```python
  from transformers import Olmoe2ForCausalLM, AutoTokenizer
  import torch
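
For reference, a complete, self-contained version of this usage snippet might look as follows. This is a sketch only: the repository id `allenai/FlexOlmo-7x7B-1T`, the `bfloat16` dtype, and the prompt are assumptions rather than values taken from this card; only the `Olmoe2ForCausalLM` import, the `generate(**inputs, max_length=64)` call, and the final `decode` call come from the card itself.

```python
# Illustrative sketch only: repo id, dtype, and prompt below are assumptions.
import torch
from transformers import AutoTokenizer, Olmoe2ForCausalLM  # Olmoe2ForCausalLM is provided by the linked fork

MODEL_ID = "allenai/FlexOlmo-7x7B-1T"  # assumed checkpoint id; adjust to the actual repository
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the MoE checkpoint and its tokenizer from the Hugging Face Hub.
model = Olmoe2ForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16).to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Tokenize a prompt, generate up to 64 tokens, and print the decoded continuation.
inputs = tokenizer("Mixture-of-Experts language models are", return_tensors="pt").to(DEVICE)
out = model.generate(**inputs, max_length=64)
print(tokenizer.decode(out[0]))
```

Note that `Olmoe2ForCausalLM` is presumably only available in the linked fork, which is why the card asks you to install `transformers` from that source rather than from PyPI.
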
@@ -44,24 +52,11 @@ out = model.generate(**inputs, max_length=64)
  print(tokenizer.decode(out[0]))
  ```

- You can list all revisions/branches by installing `huggingface-hub` & running:
- ```python
- from huggingface_hub import list_repo_refs
- out = list_repo_refs("allenai/OLMoE-1B-7B-0924")
- branches = [b.name for b in out.branches]
- ```
-
- Important branches:
- - `step1200000-tokens5033B`: Pretraining checkpoint used for annealing. There are a few more checkpoints after this one but we did not use them.
- - `main`: Checkpoint annealed from `step1200000-tokens5033B` for an additional 100B tokens (23,842 steps). We use this checkpoint for our adaptation (https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT & https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct).
- - `fp32`: FP32 version of `main`. The model weights were stored in FP32 during training but we did not observe any performance drop from casting them to BF16 after training so we upload all weights in BF16. If you want the original FP32 checkpoint for `main` you can use this one. You will find that it yields slightly different results but should perform around the same on benchmarks.
-
-
  # Evaluation Snapshot
  | Model | **MC9** | **Gen5** | **MMLU** | **MMLU Pro** | **AGIEval** | **BBH** | **Math2** | **NewsG** | **PoemG** | **SciRIFF5** | **Code4** | **Avg.** |
  |-------|---------|----------|----------|--------------|-------------|---------|-----------|-----------|-----------|--------------|-----------|----------|
  | Prev. Public model | 68.7 | 58.8 | 55.9 | 26.2 | 39.9 | 35.7 | 8.2 | 76.0 | 47.8 | 48.1 | 1.1 | **42.4** |
- | **Individual experts** | | | | | | | | | | | | |
+ | **Individual experts** |
  | [Math](https://huggingface.co/allenai/Flex-math-2x7B-1T) | 62.5 | 44.3 | 50.6 | 24.1 | 42.0 | 45.6 | **53.1** | 42.6 | 28.0 | 50.7 | 15.8 | **41.8** |
  | [Code](https://huggingface.co/allenai/Flex-code-2x7B-1T) | 40.5 | 39.4 | 29.5 | 14.5 | 27.4 | 38.1 | 6.0 | 45.1 | 28.2 | 48.0 | 21.0 | **30.7** |
  | Textbook | 64.3 | 52.1 | 56.5 | 27.0 | 39.7 | 40.3 | 13.6 | 57.6 | 51.8 | 51.7 | 3.0 | **41.6** |
@@ -69,16 +64,16 @@ Important branches:
  | [Creative Writing](https://huggingface.co/allenai/Flex-creative-2x7B-1T) | 42.7 | 43.9 | 31.5 | 11.6 | 23.3 | 27.6 | 1.7 | 56.9 | **67.5** | 42.4 | 0.0 | **31.7** |
  | [Academic](https://huggingface.co/allenai/Flex-pes2o-2x7B-1T) | 41.0 | 45.2 | 33.8 | 14.8 | 24.1 | 32.4 | 6.5 | 51.8 | 23.0 | 52.0 | 0.0 | **29.5** |
  | [Reddit](https://huggingface.co/allenai/Flex-reddit-2x7B-1T) | 64.7 | 36.5 | 56.1 | 25.5 | 35.5 | 19.7 | 2.5 | 54.1 | 8.6 | 32.7 | 1.7 | **30.7** |
- | **Combined model** | | | | | | | | | | | | |
+ | **Combined model** |
  | BTM (top-2) | 68.7 | 57.7 | 59.4 | 28.3 | 43.2 | 44.3 | 23.1 | 73.6 | 54.4 | 46.3 | **24.0** | **47.6** |
  | 🔥 **FlexOlmo-7x7B-1T (no router training)** | 70.4 | 60.1 | 60.2 | 30.5 | 47.3 | 47.9 | 79.6 | 66.3 | 60.1 | **53.9** | 14.6 | **53.7** |
  | ⏳ [FlexOlmo-7x7B-1T-RT](https://huggingface.co/allenai/FlexOlmo-7x7B-1T-RT) | **70.8** | **59.8** | **60.4** | 30.9 | **45.1** | **46.4** | 48.5 | **80.7** | 62.2 | 54.3 | 17.2 | **52.4** |


+
  # Citation
-
  ```bibtex
- @misc{muennighoff2024olmoeopenmixtureofexpertslanguage,
+ @misc{muennighoff2024olmoeopenmixtureofexpertslanguage,
  title={OLMoE: Open Mixture-of-Experts Language Models},
  author={Niklas Muennighoff and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Jacob Morrison and Sewon Min and Weijia Shi and Pete Walsh and Oyvind Tafjord and Nathan Lambert and Yuling Gu and Shane Arora and Akshita Bhagia and Dustin Schwenk and David Wadden and Alexander Wettig and Binyuan Hui and Tim Dettmers and Douwe Kiela and Ali Farhadi and Noah A. Smith and Pang Wei Koh and Amanpreet Singh and Hannaneh Hajishirzi},
  year={2024},
 