Update README.md
README.md
---
license: apache-2.0
base_model: mistralai/Mistral-7B-v0.1
datasets:
- abacusai/MetaMathFewshot
- shahules786/orca-chat
- anon8231489123/ShareGPT_Vicuna_unfiltered
---

![Fewshot Metamath OrcaVicuna - Mistral](model_diagram.png)

This model is a variation of [abacusai/Fewshot-Metamath-OrcaVicuna-Mistral](https://huggingface.co/datasets/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral)
that builds on the idea of scaling up a model by replicating layers of the base model, in this case
[mistralai/Mistral-7B-v0.1](https://huggingface.co/datasets/mistralai/Mistral-7B-v0.1). It relies on the functionality added in
https://github.com/huggingface/peft/pull/1368 to train a model with replicated layers without much extra GPU memory: although 48 layers
carry LoRA adapters, there are only 32 original layers, so memory usage stays roughly the same as for the base 7B model.
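
As a rough illustration of how this is wired up (not the exact training configuration used for this model), the sketch below assumes the `layer_replication` option introduced by that PEFT PR; the layer ranges, LoRA rank/alpha, and target modules are illustrative placeholders.

```python
# Hypothetical sketch: expand Mistral-7B's 32 decoder layers to 48 by stacking
# two overlapping ranges of the original layers, then attach LoRA adapters.
# Ranges and LoRA hyperparameters below are illustrative, not this model's.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # 24 + 24 = 48 layers built from the 32 originals; the replicated layers
    # reuse the frozen base weights, so only the LoRA adapters add parameters.
    layer_replication=[(0, 24), (8, 32)],
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(base, config)
model.print_trainable_parameters()
```

Because the replicated layers point back at the same frozen base weights, the optimizer only sees the LoRA parameters of each copy, which is why GPU usage stays close to that of a plain 7B finetune.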

This is a demonstration model, intended to show how the approach can be used; the goal is to apply it to much larger models. For example,
models like Goliath or MegaDolphin are effectively 120B models, but with this approach they would only need the memory footprint of their
70B base model, plus a little extra for the LoRA adapter layers.
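
For a back-of-the-envelope sense of that saving (our own arithmetic, assuming 16-bit weights and ignoring activations, KV cache, and optimizer state):

```python
# Rough parameter-memory estimate at 2 bytes per parameter (bf16/fp16).
GIB = 1024**3

def weight_gib(params_in_billions: float) -> float:
    return params_in_billions * 1e9 * 2 / GIB

print(f"120B dense weights  : ~{weight_gib(120):.0f} GiB")
print(f"70B base + adapters : ~{weight_gib(70):.0f} GiB (+ a few GiB of LoRA)")
```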

In our training runs we did observe a difference in the behavior of the eval loss:

![Eval loss for the 48-layer model](eval_loss_big.png)

versus the loss curve for the original LoRA finetune of the 7B model:

![Eval loss for the 7B LoRA finetune](eval_loss_7b.png)

The larger model reached a best eval loss of 0.3915, versus 0.3971 for the 7B finetune, and did so in far fewer steps.

Overall, we think this is a promising approach to getting the benefits of much larger models without significantly more resources.
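
If you want to try an adapter trained this way, loading it is the standard PEFT flow; note that layer replication support landed relatively recently in `peft`, so an up-to-date install may be needed. The repository id below is a placeholder, not this model's actual id.

```python
# Placeholder adapter repo id -- substitute the actual repository name.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = PeftModel.from_pretrained(base, "your-org/your-layer-replicated-adapter")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("What is 17 * 23?", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```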