Update README.md
README.md
---
license: apache-2.0
base_model: mistralai/Mistral-7B-v0.1
datasets:
- abacusai/MetaMathFewshot
- shahules786/orca-chat
- anon8231489123/ShareGPT_Vicuna_unfiltered
---

![Fewshot Metamath OrcaVicuna - Mistral](model_diagram.png)

This model is a variation of [abacusai/Fewshot-Metamath-OrcaVicuna-Mistral](https://huggingface.co/datasets/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral)
that builds on the idea of scaling up a model by replicating layers of the base model, in this case
[mistralai/Mistral-7B-v0.1](https://huggingface.co/datasets/mistralai/Mistral-7B-v0.1). It relies on the functionality added in
https://github.com/huggingface/peft/pull/1368 to train a model with replicated layers without much extra GPU memory: although 48 layers
carry LoRA adapters, there are only 32 original layers, so memory usage stays roughly the same as for the base 7B model.
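
As a rough illustration of how this is wired up (not the exact training configuration used for this model), the sketch below assumes the `layer_replication` option introduced by that PEFT PR; the layer ranges, LoRA rank/alpha, and target modules are illustrative placeholders.

```python
# Hypothetical sketch: expand Mistral-7B's 32 decoder layers to 48 by stacking
# two overlapping ranges of the original layers, then attach LoRA adapters.
# Ranges and LoRA hyperparameters below are illustrative, not this model's.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # 24 + 24 = 48 layers built from the 32 originals; the replicated layers
    # reuse the frozen base weights, so only the LoRA adapters add parameters.
    layer_replication=[(0, 24), (8, 32)],
    task_type=TaskType.CAUSAL_LM,
)

model = get_peft_model(base, config)
model.print_trainable_parameters()
```

Because the replicated layers point back at the same frozen base weights, the optimizer only sees the LoRA parameters of each copy, which is why GPU usage stays close to that of a plain 7B finetune.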

This is a demonstration model, intended to show how the approach can be used; the goal is to apply it to much larger models. For example,
models like Goliath or MegaDolphin are effectively 120B models, but with this approach they would only need the memory footprint of their
70B base model, plus a little extra for the LoRA adapter layers.
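
For a back-of-the-envelope sense of that saving (our own arithmetic, assuming 16-bit weights and ignoring activations, KV cache, and optimizer state):

```python
# Rough parameter-memory estimate at 2 bytes per parameter (bf16/fp16).
GIB = 1024**3

def weight_gib(params_in_billions: float) -> float:
    return params_in_billions * 1e9 * 2 / GIB

print(f"120B dense weights  : ~{weight_gib(120):.0f} GiB")
print(f"70B base + adapters : ~{weight_gib(70):.0f} GiB (+ a few GiB of LoRA)")
```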

In our training runs we did observe a difference in the behavior of the eval loss:

![Eval loss for the 48-layer model](eval_loss_big.png)

versus the loss curve for the original LoRA finetune of the 7B model:

![Eval loss for the 7B LoRA finetune](eval_loss_7b.png)

The larger model reached a best eval loss of 0.3915, versus 0.3971 for the 7B finetune, and did so in far fewer steps.

Overall, we think this is a promising approach to getting the benefits of much larger models without significantly more resources.
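
If you want to try an adapter trained this way, loading it is the standard PEFT flow; note that layer replication support landed relatively recently in `peft`, so an up-to-date install may be needed. The repository id below is a placeholder, not this model's actual id.

```python
# Placeholder adapter repo id -- substitute the actual repository name.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
model = PeftModel.from_pretrained(base, "your-org/your-layer-replicated-adapter")
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("What is 17 * 23?", return_tensors="pt")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```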