Update README.md
We explore supervised multitask pre-training by proposing ***Instruction Pre-Training***, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of *Instruction Pre-Training*. ***Instruction Pre-Training* outperforms *Vanilla Pre-training* in both general pre-training from scratch and domain-adaptive continual pre-training.** In pre-training from scratch, *Instruction Pre-Training* not only improves pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, *Instruction Pre-Training* enables Llama3-8B to be comparable to or even outperform Llama3-70B.

<p align='center'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/vRdsFIVQptbNaGiZ18Lih.png" width="400">
</p>
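
To make the core idea concrete, here is a small illustrative sketch, not taken from this repo, of what instruction augmentation looks like at the data level: each raw text is paired with synthesizer-generated instruction-response pairs and serialized into a single pre-training example. The helper name and the `Question:`/`Answer:` template are assumptions for illustration; the exact formatting used for pre-training may differ.

```python
# Illustrative sketch only (not the repo's code): serialize a raw text plus
# synthesized instruction-response pairs into one instruction-augmented
# pre-training example. The concatenation template is an assumption.
def build_augmented_example(raw_text: str, qa_pairs: list[dict]) -> str:
    parts = [raw_text.strip()]
    for pair in qa_pairs:
        parts.append(f"Question: {pair['instruction']}\nAnswer: {pair['response']}")
    return "\n\n".join(parts)


raw_text = "Tea is an aromatic beverage prepared by pouring hot water over cured tea leaves."
qa_pairs = [
    {
        "instruction": "According to the passage, how is tea prepared?",
        "response": "By pouring hot water over cured tea leaves.",
    },
]
print(build_augmented_example(raw_text, qa_pairs))
```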
## Synthesize Instruction-Response Pairs from Any Raw Corpora

We conduct multitask fine-tuning on a language model to develop an instruction synthesizer capable of generating instruction-response pairs from any raw text.

<p align='center'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0889QyG59QM3rPeZlcTzZ.png" width="700">
</p>

The fine-tuning data are available at [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection).
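
If you want to poke around in that collection locally, the following is a minimal sketch (not part of the original README) using `huggingface_hub` and `datasets`; it assumes the samples are stored as JSON-lines shards, so check the dataset card for the authoritative layout and loading instructions.

```python
# Minimal sketch (assumption: the collection stores JSON-lines shards).
# Downloads the dataset repo and loads one shard to inspect its schema.
import os
from glob import glob

from datasets import load_dataset
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="instruction-pretrain/ft-instruction-synthesizer-collection",
    repo_type="dataset",
)

# Pick one shard; schemas may differ across the task-specific subsets.
shards = sorted(glob(os.path.join(local_dir, "**", "*.jsonl"), recursive=True))
ds = load_dataset("json", data_files=shards[0], split="train")
print(ds)
print(ds[0])
```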
To prompt the synthesizer to generate instruction-response pairs based on a given raw text:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer