Update README.md
We explore supervised multitask pre-training by proposing ***Instruction Pre-Training***, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of *Instruction Pre-Training*. ***Instruction Pre-Training* outperforms *Vanilla Pre-training* in both general pre-training from scratch and domain-adaptive continual pre-training.** In pre-training from scratch, *Instruction Pre-Training* not only improves pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, *Instruction Pre-Training* enables Llama3-8B to be comparable to or even outperform Llama3-70B.

<p align='center'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/vRdsFIVQptbNaGiZ18Lih.png" width="400">
</p>
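
To make the core idea concrete, here is a small illustrative sketch, not taken from this repo, of what instruction augmentation looks like at the data level: each raw text is paired with synthesizer-generated instruction-response pairs and serialized into a single pre-training example. The helper name and the `Question:`/`Answer:` template are assumptions for illustration; the exact formatting used for pre-training may differ.

```python
# Illustrative sketch only (not the repo's code): serialize a raw text plus
# synthesized instruction-response pairs into one instruction-augmented
# pre-training example. The concatenation template is an assumption.
def build_augmented_example(raw_text: str, qa_pairs: list[dict]) -> str:
    parts = [raw_text.strip()]
    for pair in qa_pairs:
        parts.append(f"Question: {pair['instruction']}\nAnswer: {pair['response']}")
    return "\n\n".join(parts)


raw_text = "Tea is an aromatic beverage prepared by pouring hot water over cured tea leaves."
qa_pairs = [
    {
        "instruction": "According to the passage, how is tea prepared?",
        "response": "By pouring hot water over cured tea leaves.",
    },
]
print(build_augmented_example(raw_text, qa_pairs))
```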
## Synthesize Instruction-Response Pairs from Any Raw Corpora

We conduct multitask fine-tuning on a language model to develop an instruction synthesizer capable of generating instruction-response pairs from any raw text.

<p align='center'>
<img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0889QyG59QM3rPeZlcTzZ.png" width="700">
</p>

The fine-tuning data are available at [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection).
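
If you want to poke around in that collection locally, the following is a minimal sketch (not part of the original README) using `huggingface_hub` and `datasets`; it assumes the samples are stored as JSON-lines shards, so check the dataset card for the authoritative layout and loading instructions.

```python
# Minimal sketch (assumption: the collection stores JSON-lines shards).
# Downloads the dataset repo and loads one shard to inspect its schema.
import os
from glob import glob

from datasets import load_dataset
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="instruction-pretrain/ft-instruction-synthesizer-collection",
    repo_type="dataset",
)

# Pick one shard; schemas may differ across the task-specific subsets.
shards = sorted(glob(os.path.join(local_dir, "**", "*.jsonl"), recursive=True))
ds = load_dataset("json", data_files=shards[0], split="train")
print(ds)
print(ds[0])
```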
To prompt the synthesizer to generate instruction-response pairs based on a given raw text:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer