swj0419 committed
Commit 69693bf · 1 Parent(s): 9f1953b
Files changed (1)
  1. README.md +21 -26
README.md CHANGED
@@ -5,10 +5,8 @@ language:
  tags:
  - moe
  - olmo
- - olmoe
+ - flexolmo
  co2_eq_emissions: 1
- datasets:
- - allenai/OLMoE-mix-0924
  library_name: transformers
  ---

@@ -17,18 +15,28 @@ library_name: transformers

  # Model Summary

- > FlexOlmo-7x7B-1T is a Mixture-of-Experts LLM with 1B active and 7B total parameters released in September 2024 (0924). It yields state-of-the-art performance among models with a similar cost (1B) and is competitive with much larger models like Llama2-13B.
+ > FlexOlmo is a new kind of language model that unlocks a new paradigm of data collaboration. With FlexOlmo, data owners can contribute to the development of open language models without giving up control of their data. There is no need to share raw data directly, and data contributors can decide when their data is active in the model (i.e., who can make use of it), deactivate it at any time, and receive attribution whenever it is used for inference.

- This information and more can also be found on the [**OLMoE GitHub repository**](https://github.com/allenai/OLMoE).
+ > FlexOlmo-7x7B-1T (without router training) is a Mixture-of-Experts LLM with 33B total parameters, combining independently trained experts on public-mix, news, books, code, academic texts, creative writing, and Reddit data.
+
+ This information and more can also be found in the resources below:
  - **Paper**: https://arxiv.org/abs/2409.02060
- - **Pretraining** [Checkpoints](https://hf.co/allenai/OLMoE-1B-7B-0924), [Code](https://github.com/allenai/OLMo/tree/Muennighoff/MoE), [Data](https://huggingface.co/datasets/allenai/OLMoE-mix-0924) and [Logs](https://wandb.ai/ai2-llm/olmoe/reports/OLMoE-1B-7B-0924--Vmlldzo4OTcyMjU3).
- - **SFT (Supervised Fine-Tuning)** [Checkpoints](https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT), [Code](https://github.com/allenai/open-instruct/tree/olmoe-sft), [Data](https://hf.co/datasets/allenai/tulu-v3.1-mix-preview-4096-OLMoE) and [Logs](https://github.com/allenai/OLMoE/blob/main/logs/olmoe-sft-logs.txt).
- - **DPO/KTO (Direct Preference Optimization/Kahneman-Tversky Optimization)**, [Checkpoints](https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct), [Preference Data](https://hf.co/datasets/allenai/ultrafeedback_binarized_cleaned), [DPO code](https://github.com/allenai/open-instruct/tree/olmoe-sft), [KTO code](https://github.com/Muennighoff/kto/blob/master/kto.py) and [Logs](https://github.com/allenai/OLMoE/blob/main/logs/olmoe-dpo-logs.txt).
+ - **Code**: https://github.com/allenai/OLMoE
+ - **Data and corresponding models**:
+ Corpora and their corresponding expert models are listed in the table below:
+ | Corpus | Model |
+ |--------|-------|
+ | [News]() | [Flex-news-2x7B-1T](https://huggingface.co/allenai/Flex-news-2x7B-1T) |
+ | [Books]() | [Flex-pes2o-2x7B-1T](https://huggingface.co/allenai/Flex-pes2o-2x7B-1T) |
+ | [Code]() | [Flex-code-2x7B-1T](https://huggingface.co/allenai/Flex-code-2x7B-1T) |
+ | [Academic]() | [Flex-academic-2x7B-1T](https://huggingface.co/allenai/Flex-academic-2x7B-1T) |
+ | [Creative Writing]() | [Flex-creative-2x7B-1T](https://huggingface.co/allenai/Flex-creative-2x7B-1T) |
+ | [Reddit]() | [Flex-reddit-2x7B-1T](https://huggingface.co/allenai/Flex-reddit-2x7B-1T) |
+

  # Use

  Install `transformers` **from [this source](https://github.com/swj0419/transformers_flexolmo)** and run:
-
  ```python
  from transformers import Olmoe2ForCausalLM, AutoTokenizer
  import torch
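
For reference, a complete, self-contained version of this usage snippet might look as follows. This is a sketch only: the repository id `allenai/FlexOlmo-7x7B-1T`, the `bfloat16` dtype, and the prompt are assumptions rather than values taken from this card; only the `Olmoe2ForCausalLM` import, the `generate(**inputs, max_length=64)` call, and the final `decode` call come from the card itself.

```python
# Illustrative sketch only: repo id, dtype, and prompt below are assumptions.
import torch
from transformers import AutoTokenizer, Olmoe2ForCausalLM  # Olmoe2ForCausalLM is provided by the linked fork

MODEL_ID = "allenai/FlexOlmo-7x7B-1T"  # assumed checkpoint id; adjust to the actual repository
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load the MoE checkpoint and its tokenizer from the Hugging Face Hub.
model = Olmoe2ForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16).to(DEVICE)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Tokenize a prompt, generate up to 64 tokens, and print the decoded continuation.
inputs = tokenizer("Mixture-of-Experts language models are", return_tensors="pt").to(DEVICE)
out = model.generate(**inputs, max_length=64)
print(tokenizer.decode(out[0]))
```

Note that `Olmoe2ForCausalLM` is presumably only available in the linked fork, which is why the card asks you to install `transformers` from that source rather than from PyPI.
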
@@ -44,24 +52,11 @@ out = model.generate(**inputs, max_length=64)
  print(tokenizer.decode(out[0]))
  ```

- You can list all revisions/branches by installing `huggingface-hub` & running:
- ```python
- from huggingface_hub import list_repo_refs
- out = list_repo_refs("allenai/OLMoE-1B-7B-0924")
- branches = [b.name for b in out.branches]
- ```
-
- Important branches:
- - `step1200000-tokens5033B`: Pretraining checkpoint used for annealing. There are a few more checkpoints after this one but we did not use them.
- - `main`: Checkpoint annealed from `step1200000-tokens5033B` for an additional 100B tokens (23,842 steps). We use this checkpoint for our adaptation (https://huggingface.co/allenai/OLMoE-1B-7B-0924-SFT & https://huggingface.co/allenai/OLMoE-1B-7B-0924-Instruct).
- - `fp32`: FP32 version of `main`. The model weights were stored in FP32 during training but we did not observe any performance drop from casting them to BF16 after training so we upload all weights in BF16. If you want the original FP32 checkpoint for `main` you can use this one. You will find that it yields slightly different results but should perform around the same on benchmarks.
-
-
  # Evaluation Snapshot
  | Model | **MC9** | **Gen5** | **MMLU** | **MMLU Pro** | **AGIEval** | **BBH** | **Math2** | **NewsG** | **PoemG** | **SciRIFF5** | **Code4** | **Avg.** |
  |-------|---------|----------|----------|--------------|-------------|---------|-----------|-----------|-----------|--------------|-----------|----------|
  | Prev. Public model | 68.7 | 58.8 | 55.9 | 26.2 | 39.9 | 35.7 | 8.2 | 76.0 | 47.8 | 48.1 | 1.1 | **42.4** |
- | **Individual experts** | | | | | | | | | | | | |
+ | **Individual experts** |
  | [Math](https://huggingface.co/allenai/Flex-math-2x7B-1T) | 62.5 | 44.3 | 50.6 | 24.1 | 42.0 | 45.6 | **53.1** | 42.6 | 28.0 | 50.7 | 15.8 | **41.8** |
  | [Code](https://huggingface.co/allenai/Flex-code-2x7B-1T) | 40.5 | 39.4 | 29.5 | 14.5 | 27.4 | 38.1 | 6.0 | 45.1 | 28.2 | 48.0 | 21.0 | **30.7** |
  | Textbook | 64.3 | 52.1 | 56.5 | 27.0 | 39.7 | 40.3 | 13.6 | 57.6 | 51.8 | 51.7 | 3.0 | **41.6** |
@@ -69,16 +64,16 @@ Important branches:
  | [Creative Writing](https://huggingface.co/allenai/Flex-creative-2x7B-1T) | 42.7 | 43.9 | 31.5 | 11.6 | 23.3 | 27.6 | 1.7 | 56.9 | **67.5** | 42.4 | 0.0 | **31.7** |
  | [Academic](https://huggingface.co/allenai/Flex-pes2o-2x7B-1T) | 41.0 | 45.2 | 33.8 | 14.8 | 24.1 | 32.4 | 6.5 | 51.8 | 23.0 | 52.0 | 0.0 | **29.5** |
  | [Reddit](https://huggingface.co/allenai/Flex-reddit-2x7B-1T) | 64.7 | 36.5 | 56.1 | 25.5 | 35.5 | 19.7 | 2.5 | 54.1 | 8.6 | 32.7 | 1.7 | **30.7** |
- | **Combined model** | | | | | | | | | | | | |
+ | **Combined model** |
  | BTM (top-2) | 68.7 | 57.7 | 59.4 | 28.3 | 43.2 | 44.3 | 23.1 | 73.6 | 54.4 | 46.3 | **24.0** | **47.6** |
  | 🔥 **FlexOlmo-7x7B-1T (no router training)** | 70.4 | 60.1 | 60.2 | 30.5 | 47.3 | 47.9 | 79.6 | 66.3 | 60.1 | **53.9** | 14.6 | **53.7** |
  | ⏳ [FlexOlmo-7x7B-1T-RT](https://huggingface.co/allenai/FlexOlmo-7x7B-1T-RT) | **70.8** | **59.8** | **60.4** | 30.9 | **45.1** | **46.4** | 48.5 | **80.7** | 62.2 | 54.3 | 17.2 | **52.4** |


+
  # Citation
-
  ```bibtex
- @misc{muennighoff2024olmoeopenmixtureofexpertslanguage,
+ @misc{muennighoff2024olmoeopenmixtureofexpertslanguage,
  title={OLMoE: Open Mixture-of-Experts Language Models},
  author={Niklas Muennighoff and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Jacob Morrison and Sewon Min and Weijia Shi and Pete Walsh and Oyvind Tafjord and Nathan Lambert and Yuling Gu and Shane Arora and Akshita Bhagia and Dustin Schwenk and David Wadden and Alexander Wettig and Binyuan Hui and Tim Dettmers and Douwe Kiela and Ali Farhadi and Noah A. Smith and Pang Wei Koh and Amanpreet Singh and Hannaneh Hajishirzi},
  year={2024},
 