---
license: apache-2.0
datasets:
- abacusai/MetaMathFewshot
- shahules786/orca-chat
- anon8231489123/ShareGPT_Vicuna_unfiltered
base_model: mistralai/Mistral-7B-v0.1
model-index:
- name: Fewshot-Metamath-OrcaVicuna-Mistral-10B
  results:
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: AI2 Reasoning Challenge (25-Shot)
      type: ai2_arc
      config: ARC-Challenge
      split: test
      args:
        num_few_shot: 25
    metrics:
    - type: acc_norm
      value: 56.4
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: HellaSwag (10-Shot)
      type: hellaswag
      split: validation
      args:
        num_few_shot: 10
    metrics:
    - type: acc_norm
      value: 78.12
      name: normalized accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: MMLU (5-Shot)
      type: cais/mmlu
      config: all
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 59.52
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: TruthfulQA (0-shot)
      type: truthful_qa
      config: multiple_choice
      split: validation
      args:
        num_few_shot: 0
    metrics:
    - type: mc2
      value: 50.98
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: Winogrande (5-shot)
      type: winogrande
      config: winogrande_xl
      split: validation
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 76.48
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
  - task:
      type: text-generation
      name: Text Generation
    dataset:
      name: GSM8k (5-shot)
      type: gsm8k
      config: main
      split: test
      args:
        num_few_shot: 5
    metrics:
    - type: acc
      value: 13.27
      name: accuracy
    source:
      url: https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard?query=abacusai/Fewshot-Metamath-OrcaVicuna-Mistral-10B
      name: Open LLM Leaderboard
---
|
|
|
|
|
The model was expanded from 32 to 48 decoder layers using the layer map below: each `[start, end)` pair selects a 16-layer slice of the base model, and the three slices are stacked to form the 48-layer network.

```json
{
  "layer_map": [
    [0, 16],
    [8, 24],
    [16, 32]
  ]
}
```
|
|
|
|
|
 |
|
|
|
|
|
This model is a variation of [abacusai/Fewshot-Metamath-OrcaVicuna-Mistral](https://huggingface.co/abacusai/Fewshot-Metamath-OrcaVicuna-Mistral) that builds on the idea of scaling up models by duplicating layers of the base model, in this case [mistralai/Mistral-7B-v0.1](https://huggingface.co/mistralai/Mistral-7B-v0.1). It relies on the functionality added in https://github.com/huggingface/peft/pull/1368 to train a model with replicated layers without much extra GPU memory: although 48 layers carry LoRA adapters, they are backed by only 32 original layers, so memory usage is roughly the same as for the base 7B model.
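Concretely, the replicated stack can be declared directly in the LoRA config. Here is a minimal sketch, assuming peft >= 0.10, where the PR above landed as the `layer_replication` option of `LoraConfig`; the `target_modules` list is illustrative rather than the exact training configuration:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load the frozen 32-layer base model.
base = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", torch_dtype=torch.bfloat16
)

config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    # Same layer map as above: three 16-layer slices of the 32 base layers,
    # stacked into 48 logical layers that share the frozen base weights.
    layer_replication=[(0, 16), (8, 24), (16, 32)],
)

model = get_peft_model(base, config)
model.print_trainable_parameters()  # only the 48 layers' LoRA weights train
```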
|
|
|
|
|
This is just a demonstration model, intended to show how the approach can be used; the goal is to apply it to much larger models. For example, Goliath and MegaDolphin are effectively 120B models, but with this approach they would need only the memory of their 70B base model, plus a little extra for the LoRA adapter layers.
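As a back-of-envelope illustration of the savings (assuming bf16 weights at 2 bytes per parameter; activations and optimizer state excluded):

```python
# Memory for the weights alone, in GiB.
bytes_per_param = 2                                   # bf16
materialized_120b = 120e9 * bytes_per_param / 2**30   # ~224 GiB
replicated_70b = 70e9 * bytes_per_param / 2**30       # ~130 GiB
print(f"{materialized_120b:.0f} GiB vs {replicated_70b:.0f} GiB + adapters")
```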
|
|
|
|
|
In our training runs we did find a difference in the behavior of the eval loss: |
|
|
|
|
|
 |
|
|
|
|
|
versus the loss curve for the original LoRA finetune of the 7B model:
|
|
|
|
|
 |
|
|
|
|
|
The larger model achieved a best eval loss of 0.3915 (versus 0.3971 for the 7B model) in far fewer steps.
|
|
|
|
|
Overall, we think this is a promising approach to accessing much larger models without significantly more resources. |
|
|
|
|
|
# Performance on Metrics |
|
|
|
|
|
To do a proper ablation, we compared the performance of four models, each trained for ~1 epoch on the combined datasets (MetaMath, Orca, ShareGPT). Here are the results:
|
|
|
|
|
| Model | Trainable Params | Train Loss | Eval Loss | GSM8K | TruthfulQA | |
|
|
| :-----| ------: | ---------: | -------: | ----: | ---------: | |
|
|
| Mistral 7B | 0 | - | - | 0.374 | 0.426 | |
|
|
| Mistral 10B | 0 | - | - | 0.290 | 0.407 | |
|
|
| Mistral 7B + LoRA r=12 | 31M | 0.412 | 0.366 | 0.514 | 0.499 | |
|
|
| Mistral 10B + LoRA r=8 | 31M | 0.401 | 0.363 | 0.663 | 0.540 | |
|
|
|
|
|
This ablation compares the base model (Mistral 7B), the expanded model built with the layer map described above (Mistral 10B), and LoRA fine-tunes of both: `r=12` on the base model and `r=8` on the expanded model, chosen to match trainable parameter counts. The results demonstrate quite clearly that fine-tuning the expanded model yields significantly better metrics for the same number of trainable parameters and training steps.
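The parameter match is easy to verify: LoRA adds r × (d_in + d_out) weights per adapted matrix, so the count scales linearly in both r and the number of adapted layers, and 12 × 32 = 8 × 48. A quick check (the assumption that all seven linear projections were targeted is ours, inferred from the 31M figure):

```python
# Check that the two LoRA configs train the same number of parameters.
# Shapes are Mistral-7B's linear projections (hidden 4096, kv dim 1024,
# MLP dim 14336); targeting all seven is an inference from the 31M figure.
def lora_param_count(r, n_layers, shapes):
    # Each adapted (d_in -> d_out) matrix gains A of shape (r, d_in) and
    # B of shape (d_out, r), i.e. r * (d_in + d_out) trainable parameters.
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

shapes = [
    (4096, 4096),    # q_proj
    (4096, 1024),    # k_proj (grouped-query attention)
    (4096, 1024),    # v_proj
    (4096, 4096),    # o_proj
    (4096, 14336),   # gate_proj
    (4096, 14336),   # up_proj
    (14336, 4096),   # down_proj
]
print(lora_param_count(12, 32, shapes))  # 31,457,280 -> the table's "31M"
print(lora_param_count(8, 48, shapes))   # identical, since 12*32 == 8*48
```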
|
|
# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard) |
|
|
Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_abacusai__Fewshot-Metamath-OrcaVicuna-Mistral-10B) |
|
|
|
|
|
| Metric |Value| |
|
|
|---------------------------------|----:| |
|
|
|Avg. |55.79| |
|
|
|AI2 Reasoning Challenge (25-Shot)|56.40| |
|
|
|HellaSwag (10-Shot) |78.12| |
|
|
|MMLU (5-Shot) |59.52| |
|
|
|TruthfulQA (0-shot) |50.98| |
|
|
|Winogrande (5-shot) |76.48| |
|
|
|GSM8k (5-shot) |13.27| |
|
|
|
|
|
|