---
pipeline_tag: text-generation
license: other
language:
- en
- zh
tags:
- math
---

# InternLM-Math

<div align="center">

<img src="https://raw.githubusercontent.com/InternLM/InternLM/main/assets/logo.svg" width="200"/>
<div> </div>
<div align="center">
<b><font size="5">InternLM-Math</font></b>
<sup>
<a href="https://internlm.intern-ai.org.cn/">
<i><font size="4">HOT</font></i>
</a>
</sup>
<div> </div>
</div>

State-of-the-art bilingual open-source math reasoning LLMs.
</div>

# Introduction
- **7B and 20B Chinese and English math LMs that outperform ChatGPT.** InternLM2-Math is continue-pretrained from InternLM2-Base on ~100B high-quality math-related tokens and then SFT-ed on ~2M bilingual supervised math examples. We apply MinHash deduplication and exact number matching to decontaminate possible test-set leakage.
- **Lean as a supported language for math problem solving and theorem proving.** We are exploring combining Lean 3 with InternLM-Math for verifiable math reasoning. InternLM-Math can generate Lean code for simple math reasoning tasks like GSM8K or suggest proof tactics based on Lean states.
- **Also usable as a reward model, supporting outcome, process, and Lean reward modeling.** We supervise InternLM2-Math with various types of reward-modeling data so that it can also verify chain-of-thought processes. We also add the ability to convert a chain-of-thought process into Lean 3 code.
- **A math problem augmentation helper and code interpreter.** InternLM2-Math can help augment math reasoning problems and solve them using a code interpreter, letting you generate synthetic data faster!

# Models
| Model | Transformers (HF) | Release Date |
|---|---|---|
| **InternLM2-Math-Base-7B** | [🤗internlm/internlm2-math-base-7b](https://huggingface.co/internlm/internlm2-math-base-7b) | 2024-01-23 |
| **InternLM2-Math-Base-20B** | [🤗internlm/internlm2-math-base-20b](https://huggingface.co/internlm/internlm2-math-base-20b) | 2024-01-23 |
| **InternLM2-Math-7B** | [🤗internlm/internlm2-math-7b](https://huggingface.co/internlm/internlm2-math-7b) | 2024-01-23 |
| **InternLM2-Math-20B** | [🤗internlm/internlm2-math-20b](https://huggingface.co/internlm/internlm2-math-20b) | 2024-01-23 |

# Performance

## Pretrain Performance
We evaluate pretrained checkpoints using greedy decoding with few-shot chain-of-thought prompting. Details of pretraining will be introduced in the tech report.

| Model | GSM8K | MATH |
|------------------------|---------|--------|
| Llama2-7B | 11.8 | 3.2 |
| Llemma-7B | 36.4 | 18.0 |
| InternLM2-Base-7B | 36.5 | 8.6 |
| **InternLM2-Math-Base-7B** | **49.2** | **21.5** |
| Minerva-8B | 16.2 | 14.1 |
| InternLM2-Base-20B | 54.6 | 13.7 |
| **InternLM2-Math-Base-20B** | **63.7** | **27.3** |
| Llemma-34B | 51.5 | 25.0 |
| Minerva-62B | 52.4 | 27.6 |
| Minerva-540B | 58.8 | 33.6 |

## SFT Performance
All results are based on greedy decoding with chain-of-thought prompting. We notice that performance on the Hungary benchmark varies considerably across our checkpoints, while performance on the other benchmarks is very stable; this may be due to the small number of problems in the Hungary benchmark.

| Model | Model Type | GSM8K | MATH | Hungary |
|------------------------|----------------------|--------|--------|---------|
| Qwen-7B-Chat | General | 51.7 | 11.6 | - |
| DeepSeek-7B-Chat | General | 63.0 | 15.8 | 28.5 |
| InternLM2-Chat-7B | General | 70.7 | 23.0 | - |
| ChatGLM3-6B | General | 53.8 | 20.4 | 32 |
| MetaMath-Mistral-7B | Mathematics | 77.7 | 28.2 | 29 |
| MetaMath-Llemma-7B | Mathematics | 69.2 | 30.0 | - |
| **InternLM2-Math-7B** | Mathematics | **78.1** | **34.6** | **55** |
| InternLM2-Chat-20B | General | 79.6 | 31.9 | - |
| MetaMath-Llemma-34B | Mathematics | 75.8 | 34.8 | - |
| **InternLM2-Math-20B** | Mathematics | **82.6** | **37.7** | **66** |
| Qwen-72B | General | 78.9 | 35.2 | 52 |
| DeepSeek-67B | General | 84.1 | 32.6 | 58 |
| ChatGPT (GPT-3.5) | General | 80.8 | 34.1 | 41 |
| GPT4 (First version) | General | 92.0 | 42.5 | 68 |

# Inference

## LMDeploy
We suggest using [LMDeploy](https://github.com/InternLM/LMDeploy) (>= 0.2.1) for inference.
```python
from lmdeploy import pipeline, TurbomindEngineConfig, ChatTemplateConfig

backend_config = TurbomindEngineConfig(model_name='internlm2-chat-7b', tp=1, cache_max_entry_count=0.3)
# Use an empty system prompt / meta instruction so math prompts pass through unchanged.
chat_template = ChatTemplateConfig(model_name='internlm2-chat-7b',
                                   system='',
                                   eosys='',
                                   meta_instruction='',
                                   user='<|im_start|>user\n',
                                   assistant='<|im_start|>assistant\n',
                                   eoh='<|im_end|>\n',
                                   eoa='<|im_end|>\n',
                                   stop_words=['<|im_end|>', '<|action_end|>'])
pipe = pipeline(model_path='internlm/internlm2-math-7b',
                chat_template_config=chat_template,
                backend_config=backend_config)

problem = '1+1='
# Greedy decoding (top_k=1) with up to 1024 generated tokens.
result = pipe([problem], request_output_len=1024, top_k=1)
```
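
Each element of `result` corresponds to one input prompt. As a minimal, hedged sketch of reading the output: recent LMDeploy versions expose the generated text on the returned response objects via a `.text` attribute (verify against the API of your installed version):

```python
# Sketch, not guaranteed across every LMDeploy version: response objects
# returned by `pipe(...)` carry the generated text in `.text`.
for response in result:
    print(response.text)
```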

## Hugging Face
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-math-7b", trust_remote_code=True)
# Set `torch_dtype=torch.float16` to load the model in float16; otherwise it is loaded as float32 and might cause an OOM error.
model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-math-7b", trust_remote_code=True, torch_dtype=torch.float16).cuda()
model = model.eval()
response, history = model.chat(tokenizer, "1+1=", history=[])
print(response)
```
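
For interactive use, a streaming variant can be sketched as below, assuming `internlm2-math-7b` exposes the same `stream_chat` method as other InternLM2 chat models (an assumption; check the remote modeling code of the revision you load):

```python
# Assumed interface: `stream_chat` yields (response, history) pairs, where
# `response` is the full text generated so far, mirroring other InternLM2 chat models.
length = 0
for response, history in model.stream_chat(tokenizer, "1+1=", history=[]):
    print(response[length:], end="", flush=True)  # print only the newly generated part
    length = len(response)
```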

# Special usages
We list some of the instructions used in our SFT; you can use them to prompt the model. Other ways of prompting may also work, but the following are recommended. InternLM2-Math may combine several of these abilities, though this is not guaranteed. A short sketch of building these prompts programmatically follows the table.

| Description | Query |
| --- | --- |
| Solving question via chain-of-thought | {Question} |
| Solving question via Lean 3 | {Question}\nSolve this via Lean 3 |
| Outcome reward model | Given a question and an answer, check is it correct?\nQuestion:{Question}\nAnswer:{COT} |
| Process reward model | Given a question and an answer, check correctness of each step.\nQuestion:{Question}\nAnswer:{COT} |
| Reward model | Given a question and two answers, which one is better? \nQuestion:{Question}\nAnswer 1:{COT}\nAnswer 2:{COT} |
| Convert chain-of-thought to Lean 3 | Convert this answer into Lean3. Question:{Question}\nAnswer:{COT} |
| Convert Lean 3 to chain-of-thought | Convert this lean 3 code into a natural language problem with answers:\n{LEAN} |
| Translate question and chain-of-thought answer to a proof statement | Convert this question and answer into a proof format.\nQuestion:{Question}\nAnswer:{COT} |
| Translate proof problem to Lean 3 | Convert this natural language statement into a Lean 3 theorem statement:{Theorem} |
| Translate Lean 3 to proof problem | Convert this Lean 3 theorem statement into natural language:{STATEMENT} |
| Suggest a tactic based on Lean state | Given the Lean 3 tactic state, suggest a next tactic:\n{State} |
| Rephrase Problem | Describe this problem in another way. {STATEMENT} |
| Augment Problem | Please augment a new problem based on: {Question} |
| Augment a harder Problem | Increase the complexity of the problem: {Question} |
| Change specific numbers | Change specific numbers: {Question} |
| Introduce fractions or percentages | Introduce fractions or percentages: {Question} |
| Code Interpreter | [lagent](https://github.com/InternLM/InternLM/blob/main/agent/lagent.md) |
| In-context Learning | Question:{Question}\nAnswer:{COT}\n...Question:{Question}\nAnswer:{COT} |

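A minimal sketch of filling these templates, using the outcome reward model query as an example. The question and chain-of-thought strings are hypothetical placeholders, and `model.chat` is the Hugging Face interface shown above:

```python
# Hypothetical example values; substitute your own question and chain of thought.
question = "A book costs 5 dollars. How much do 3 books cost?"
cot = "Each book costs 5 dollars, so 3 books cost 3 * 5 = 15 dollars. The answer is 15."

# Fill the outcome-reward-model template from the table above;
# `\n` in the table denotes a literal newline in the prompt.
prompt = (
    "Given a question and an answer, check is it correct?\n"
    f"Question:{question}\n"
    f"Answer:{cot}"
)

response, history = model.chat(tokenizer, prompt, history=[])
print(response)
```
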
# Fine-tune and others
Please refer to [InternLM](https://github.com/InternLM/InternLM/tree/main).

# Known issues
Our model is still under development and will be upgraded. Known issues of InternLM-Math include:
- It sometimes skips calculation steps.
- It performs badly on Chinese fill-in-the-blank problems and English multiple-choice problems due to the SFT data composition.
- The reward-model mode could be leveraged better with assigned token probabilities.
- It sometimes code-switches between languages due to the SFT data composition.
- Some Lean-related abilities are only adapted to GSM8K-like problems (e.g., converting chain-of-thought to Lean 3), and Lean-related performance is not guaranteed.

# Citation and Tech Report
To be appended.