---
license: apache-2.0
language:
- en
datasets:
- instruction-pretrain/ft-instruction-synthesizer-collection
---
# Instruction Pre-Training: Language Models are Supervised Multitask Learners
This repo contains the **context-based instruction synthesizer** in our paper [Instruction Pre-Training: Language Models are Supervised Multitask Learners](https://huggingface.co/papers/2406.14491).

We explore supervised multitask pre-training by proposing ***Instruction Pre-Training***, a framework that scalably augments massive raw corpora with instruction-response pairs to pre-train language models. The instruction-response pairs are generated by an efficient instruction synthesizer built on open-source models. In our experiments, we synthesize 200M instruction-response pairs covering 40+ task categories to verify the effectiveness of *Instruction Pre-Training*. ***Instruction Pre-Training* outperforms *Vanilla Pre-training* in both general pre-training from scratch and domain-adaptive continual pre-training.** In pre-training from scratch, *Instruction Pre-Training* not only improves pre-trained base models but also benefits more from further instruction tuning. In continual pre-training, *Instruction Pre-Training* enables Llama3-8B to be comparable to or even outperform Llama3-70B.

<p align='center'>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/vRdsFIVQptbNaGiZ18Lih.png" width="400">
</p>

**************************** **Updates** ****************************
* 2024/7/15: We scaled up the pre-trained tokens from 100B to 250B, with the number of synthesized instruction-response pairs reaching 500M! Below, we show the performance trend on downstream tasks throughout the pre-training process:
<p align='center'>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0okCfRkC6uALTfuNxt0Fa.png" width="700">
</p>
* 2024/6/21: Released the [paper](https://huggingface.co/papers/2406.14491), [code](https://github.com/microsoft/LMOps), and [resources](https://huggingface.co/instruction-pretrain)

## Resources
**🤗 We share our data and models with example usages. Feel free to open an issue or discussion! 🤗**

- Context-Based Instruction Synthesizer: [instruction-synthesizer](https://huggingface.co/instruction-pretrain/instruction-synthesizer)
- Fine-Tuning Data for the Synthesizer: [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection)
- General Models Pre-Trained from Scratch (on 100B tokens):
  - [InstructLM-500M](https://huggingface.co/instruction-pretrain/InstructLM-500M)
  - [InstructLM-1.3B](https://huggingface.co/instruction-pretrain/InstructLM-1.3B)
- Domain-Specific Models Pre-Trained from Llama3-8B:
  - [Finance-Llama3-8B](https://huggingface.co/instruction-pretrain/finance-Llama3-8B)
  - [Biomedicine-Llama3-8B](https://huggingface.co/instruction-pretrain/medicine-Llama3-8B)
- General Instruction-Augmented Corpora: [general-instruction-augmented-corpora](https://huggingface.co/datasets/instruction-pretrain/general-instruction-augmented-corpora)
- Domain-Specific Instruction-Augmented Corpora (no finance data to avoid ethical issues): [medicine-instruction-augmented-corpora](https://huggingface.co/datasets/instruction-pretrain/medicine-instruction-augmented-corpora)

## Synthesize Instruction-Response Pairs to Augment Any Raw Corpora
We conduct multitask fine-tuning on a language model to develop an instruction synthesizer capable of generating instruction-response pairs from any raw text. The fine-tuning data are available at [ft-instruction-synthesizer-collection](https://huggingface.co/datasets/instruction-pretrain/ft-instruction-synthesizer-collection).


<p align='center'>
    <img src="https://cdn-uploads.huggingface.co/production/uploads/66711d2ee12fa6cc5f5dfc89/0889QyG59QM3rPeZlcTzZ.png" width="700">
</p>

### Basic Usage: Synthesize instruction-response pairs based on a given raw text

**💗 Here is an amazing demo that implements our approach: [davanstrien/instruction-synthesizer](https://huggingface.co/spaces/davanstrien/instruction-synthesizer) 💗**
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("instruction-pretrain/instruction-synthesizer")
tokenizer = AutoTokenizer.from_pretrained("instruction-pretrain/instruction-synthesizer")

# Put your raw text here:
context = '''Free Fishing Weekend in NYS Slated
This weekend (June 28th-29th) New Yorkers may fish for free without a license in any of the state's 7,500 lakes and ponds or 50,000 miles of rivers and streams. In addition, there are a number of free events and fishing clinics taking place across the state to encourage New Yorkers to enjoy the great outdoors. For more information, visit'''

def parse_pred(pred):
    """Extract the list of instruction-response pairs from the prediction"""
    QA_str_list = pred.split('</END>')
    if not pred.endswith('</END>'):
        QA_str_list = QA_str_list[:-1]

    QA_list = []
    raw_questions = []
    for QA_str in QA_str_list:
        try:
            assert len(QA_str.split('<ANS>')) == 2, f'invalid QA string: {QA_str}'
            Q_str, A_str = QA_str.split('<ANS>')
            Q_str, A_str = Q_str.strip(), A_str.strip()
            assert Q_str.startswith('<QUE>'), f'invalid question string: {Q_str} in QA_str: {QA_str}'
            assert len(A_str) > 0, f'invalid answer string in QA_str: {QA_str}'
            Q_str = Q_str.replace('<QUE>', '').strip()
            assert Q_str.lower() not in raw_questions, f'duplicate question: {Q_str}'
            QA_list.append({'Q': Q_str, 'A': A_str})
            raw_questions.append(Q_str.lower())
        except AssertionError:
            # Skip malformed or duplicate pairs
            pass

    return QA_list

def get_instruction_response_pairs(context):
    '''Prompt the synthesizer to generate instruction-response pairs based on the given context'''
    prompt = f'<s> <CON> {context} </CON>\n\n'
    inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt").input_ids.to(model.device)
    outputs = model.generate(input_ids=inputs, max_new_tokens=400, do_sample=False)[0]

    pred_start = int(inputs.shape[-1])
    pred = tokenizer.decode(outputs[pred_start:], skip_special_tokens=True)
    return parse_pred(pred)

# Get the generated instruction-response pairs
instruction_response_pairs = get_instruction_response_pairs(context)

# Print out the results
print(f'# Context:\n{context}\n')
for index, pair in enumerate(instruction_response_pairs):
    print(f'## Instruction {index + 1}:\n{pair["Q"]}\n## Response {index + 1}:\n{pair["A"]}\n')
```
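
The parser above expects the synthesizer to emit each pair as `<QUE> ... <ANS> ... </END>`. The following is a minimal sketch of that format, using a hand-written prediction string (hypothetical, not real model output) and a simplified stand-in for `parse_pred` named `split_pairs` (an illustrative name that omits the duplicate-question check). It shows how a truncated trailing pair is dropped:

```python
def split_pairs(pred):
    """Simplified stand-in for parse_pred: split a prediction into Q/A dicts."""
    chunks = pred.split('</END>')
    # Whatever follows the last </END> is either empty or a truncated pair.
    chunks = chunks[:-1]
    pairs = []
    for chunk in chunks:
        parts = chunk.split('<ANS>')
        if len(parts) != 2:
            continue  # not exactly one question and one answer
        question, answer = parts[0].strip(), parts[1].strip()
        if not question.startswith('<QUE>') or not answer:
            continue  # missing question tag or empty answer
        pairs.append({'Q': question[len('<QUE>'):].strip(), 'A': answer})
    return pairs

# Hypothetical prediction text, written by hand for illustration only.
sample_pred = (
    '<QUE> When is the free fishing weekend? <ANS> June 28th-29th </END>'
    '<QUE> Who may fish for free? <ANS> New Yorkers </END>'
    '<QUE> How many lakes'
)
pairs = split_pairs(sample_pred)
for pair in pairs:
    print(pair)
```

Because generation stops after `max_new_tokens`, the last pair may be cut off mid-sentence; discarding everything after the final `</END>` keeps only complete pairs.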

### Advanced Usage: Convert Raw Corpora into Instruction-Augmented Corpora at Scale
1. Set up dependencies:

```bash
git clone https://github.com/microsoft/LMOps.git
cd LMOps/instruction_pretrain
```

Install vLLM with pip or from [source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

```bash
pip install vllm
```

2. Synthesize and Templify Few-shot Examples for Pre-Training

A one-shot example consists of a piece of raw text followed by its instruction-response pairs. We conduct multi-round inference to synthesize few-shot examples: the instruction-response pairs of different raw texts share the same pattern.

Suppose there are N pieces of raw text in the corpora, and you would like to convert them into M-shot examples:

```python
from vllm import LLM, SamplingParams
from utils.read_compre import get_dataset, cook_pt_entries, run

# Put your list of raw texts here
raw_texts = [
    "Genetically and medically susceptible workers.\nThe likelihood of an individual becoming ill from a hazardous material or condition is strongly influenced by both their genetic makeup and their underlying state of health. Although the past decade has seen great advances in understanding human variation in health and genetic polymorphisms and in the diagnosis and treatment of disease, much less progress has been made in effectively using this information to protect worker health. Scientific evidence for increased susceptibility often is weak and rarely satisfies legal thresholds for sufficient risk to warrant exclusion from a particular job. When public safety is a major concern, many legally mandated exclusions are not well justified. Medical opinions about fitness to work should be based upon a systematic and credible analysis of the condition, its relationship to ability and risk for a particular job, and knowledge of possible accommodations. Conclusions should reflect the limitations of scientific knowledge and guidance from antidiscrimination legislation.",
    "Exclusive Breastfeeding for Twin Babies and Its Influencing Factors: A Study in East Java, Indonesia.\nThis study aimed to identify the factors that influence the success of exclusive breastfeeding in twins. This cross-sectional study was conducted on 184 mothers who had twins aged 6-23 months in Malang Raya, East Java, Indonesia and used the consecutive sampling technique. The data was collected through distributing questionnaires containing questions related to knowledge about exclusive breastfeeding, breastfeeding self-efficacy, and the support of family and certified health workers. Multinomial regression statistical test results show that the most influential factor for the success of exclusive breastfeeding with twins was breastfeeding self-efficacy (OR 0.111; 95% CI 0.033-0.387). A high level of breastfeeding self-efficacy can increase a mother's confidence to be able to provide exclusive breastfeeding for twins. This study suggests that nurses can provide breastfeeding counselling to improve breastfeeding self-efficacy."]


N = len(raw_texts) # Number of raw texts
M = 2  # M-shot example
max_model_len = 4096 # max sequence length of the LM you intend to pre-train
max_new_tokens = 400 # max number of tokens for the augmented instruction-response pairs

# Create a sampling params object.
sampling_params = SamplingParams(temperature=0, max_tokens=max_new_tokens)

# Load the model and tokenizer
llm = LLM(model="instruction-pretrain/instruction-synthesizer", max_model_len=max_model_len)

# 1. multi-round inference to get the prediction
prev_examples = []
BSZ = (N+M-1)//M
for round_idx in range(M):  # avoid shadowing the built-in `round`
    cur_raw_texts = raw_texts[round_idx*BSZ: (round_idx+1)*BSZ]
    # load data
    split = get_dataset(prev_examples=prev_examples, 
                        cur_raw_texts=cur_raw_texts, 
                        max_model_len=max_model_len,
                        max_new_tokens=max_new_tokens)
    prev_examples = run(split, llm, sampling_params)


# 2. templify the data for subsequent pre-training
instruction_augmented_texts = []
for idx, entry in enumerate(prev_examples):
    texts = cook_pt_entries(read_collection=entry, random_seed=idx+12345) 
                                                # change random seed for each entry for diversity
    instruction_augmented_texts.extend(texts)

# 3. print out the results
for idx, text in enumerate(instruction_augmented_texts):
    print(f'## Instruction-augmented Text {idx+1}\n{text}\n')

# Now you can use `instruction_augmented_texts` for pre-training!
```
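
The multi-round loop above splits the N raw texts into M consecutive slices of at most `BSZ = ceil(N / M)` texts, one slice per round. As a rough illustration of that arithmetic only (no model is involved, and the helper name `round_slices` is made up for this sketch):

```python
def round_slices(raw_texts, M):
    """Return the slice of raw texts that each of the M rounds would process."""
    N = len(raw_texts)
    BSZ = (N + M - 1) // M  # ceiling division, same expression as in the loop above
    return [raw_texts[r * BSZ:(r + 1) * BSZ] for r in range(M)]

# Five placeholder texts split into two rounds.
slices = round_slices(['t0', 't1', 't2', 't3', 't4'], M=2)
print(slices)  # [['t0', 't1', 't2'], ['t3', 't4']]
```

When N is not a multiple of M, the last round receives fewer texts than the earlier ones.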


## Citation
If you find our work helpful, please cite us:

Instruction Pre-Training
```bibtex
@article{cheng2024instruction,
  title={Instruction Pre-Training: Language Models are Supervised Multitask Learners},
  author={Cheng, Daixuan and Gu, Yuxian and Huang, Shaohan and Bi, Junyu and Huang, Minlie and Wei, Furu},
  journal={arXiv preprint arXiv:2406.14491},
  year={2024}
}
```

[AdaptLLM](https://huggingface.co/papers/2309.09530)
```bibtex
@inproceedings{
cheng2024adapting,
title={Adapting Large Language Models via Reading Comprehension},
author={Daixuan Cheng and Shaohan Huang and Furu Wei},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=y886UXPEZ0}
}
```