update Imp-v1.5-4B-phi3
- README copy.md +0 -96
- README.md +92 -0
- config.json +1 -1
README copy.md
DELETED
@@ -1,96 +0,0 @@
(Deleted: the previous model card, nearly identical to the new README.md below, referring to the model by its earlier name `imp-v1.5-4b`.)
README.md
CHANGED
@@ -1,3 +1,95 @@
---
license: apache-2.0
pipeline_tag: text-generation
datasets:
- liuhaotian/LLaVA-Pretrain
- liuhaotian/LLaVA-Instruct-150K
---

# 😈 Imp

> A very small man can cast a very large shadow.
>
> — *George R.R. Martin, A Clash of Kings*

\[Technical report (coming soon)\] \[[Demo](https://xmbot.net/imp/)\] \[[Github](https://github.com/MILVLG/imp)\]

## Introduction

The Imp project aims to provide a family of strong multimodal `small` language models (MSLMs). Our `Imp-v1.5-4B-Phi3` is a strong MSLM with only **4B** parameters, built upon a small yet powerful SLM, [Phi-3](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct) (3.8B), and a powerful visual encoder, [SigLIP](https://huggingface.co/google/siglip-so400m-patch14-384) (0.4B), and trained on a 1M-sample mixed dataset.

We release our model weights and provide an example below to run the model. A detailed technical report and the corresponding training/evaluation code will be released soon on our [GitHub repo](https://github.com/MILVLG/imp). We will keep improving the model and release new versions to further improve its performance :)

## How to use

**Install dependencies**

```bash
pip install transformers  # latest version is ok, but we recommend v4.36.0
pip install -q pillow accelerate einops
```

You can use the following code for model inference. The format of the text instruction is similar to [LLaVA](https://github.com/haotian-liu/LLaVA). A Colab notebook that runs this example is provided [here](https://colab.research.google.com/drive/1EBYky6xIPjnlPppo2gZaiNK6gEsjXgom?usp=drive_link#scrollTo=2-VpU6QzWCVZ). Note that the example can currently only be run on GPUs.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from PIL import Image

torch.set_default_device("cuda")

# Create the model
model = AutoModelForCausalLM.from_pretrained(
    "MILVLG/Imp-v1.5-4B-Phi3",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("MILVLG/Imp-v1.5-4B-Phi3", trust_remote_code=True)

# Set the inputs
text = "<|user|>\n<image>\nWhat are the colors of the bus in the image?\n<|end|>\n<|assistant|>\n"
image = Image.open("images/bus.jpg")

input_ids = tokenizer(text, return_tensors='pt').input_ids
image_tensor = model.image_preprocess(image)

# Generate the answer
output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    images=image_tensor,
    use_cache=True)[0]
print(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())
```
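
For repeated queries, it can help to wrap the steps above in a small helper. The sketch below is only an illustration built from the same calls shown in the example (the tokenizer, `model.image_preprocess`, and `model.generate`); the function name and its default arguments are ours, not part of the released API.

```python
# Minimal helper around the example above (illustrative sketch; assumes `model`
# and `tokenizer` have already been created exactly as shown earlier).
def ask_about_image(model, tokenizer, image_path, question, max_new_tokens=100):
    # Build the same chat-style prompt used in the example above.
    text = f"<|user|>\n<image>\n{question}\n<|end|>\n<|assistant|>\n"
    image = Image.open(image_path)

    input_ids = tokenizer(text, return_tensors='pt').input_ids
    image_tensor = model.image_preprocess(image)

    output_ids = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        images=image_tensor,
        use_cache=True)[0]
    return tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip()

# Example usage:
# print(ask_about_image(model, tokenizer, "images/bus.jpg",
#                       "What are the colors of the bus in the image?"))
```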

## Model evaluation

We conduct evaluation on 9 commonly used benchmarks, including 5 academic VQA benchmarks and 4 popular MLLM benchmarks, to compare our Imp model with LLaVA (7B) and existing MSLMs of similar model sizes.

| Models | Size | VQAv2 | GQA | SQA(IMG) | TextVQA | POPE | MME(P) | MMB | MMB_CN | MM-Vet |
|:----------------:|:----:|:-----:|:-----:|:--------:|:-------:|:-----:|:------:|:-----:|:------:|:------:|
| Imp-v1.5-4B-Phi3 | 4B | 81.46 | 63.51 | 77.99 | 60.16 | 86.86 | 1507.7 | 73.28 | 61.08 | 44.6 |
<!-- | [LLaVA-v1.5-lora](https://huggingface.co/liuhaotian/llava-v1.5-7b) | 7B | 79.10 | 63.00 | 68.40 | 58.20 | 86.40 | 1476.9 | 66.10 | - | 30.2 | -->

## License

This project is licensed under the Apache License 2.0 - see the [LICENSE](https://www.apache.org/licenses/LICENSE-2.0) file for details.

## About us

This project is maintained by the [MILVLG](https://github.com/MILVLG) group at Hangzhou Dianzi University (HDU), led by Prof. Zhou Yu and Prof. Jun Yu, and is mainly developed by Zhenwei Shao and Xuecheng Ouyang. We hope our model may serve as a strong baseline to inspire future research on MSLMs, as well as their derivative applications on mobile devices and robots.

## Citation

If you use our model or refer to our work in your studies, please cite:

```bibtex
@misc{imp2024,
  author = {Shao, Zhenwei and Ouyang, Xuecheng and Yu, Zhou and Yu, Jun},
  title = {Imp: An Empirical Study of Multimodal Small Language Models},
  year = {2024},
  url = {https://huggingface.co/MILVLG/imp-v1-3b}
}
```
config.json
CHANGED
@@ -1,5 +1,5 @@
 {
-  "_name_or_path": "MILVLG/
+  "_name_or_path": "MILVLG/Imp-v1.5-4B-Phi3",
   "activation_function": "gelu_new",
   "architectures": [
     "ImpPhi3ForCausalLM"
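
The only substantive change in `config.json` is the `_name_or_path` field, which now records the renamed repository. If you want to verify which checkpoint a downloaded config reports, a quick check along these lines should work (the repository id is taken from the config above, and `trust_remote_code=True` is needed because Imp ships custom modeling code):

```python
from transformers import AutoConfig

# Load only the configuration (no weights) and inspect the recorded fields.
config = AutoConfig.from_pretrained("MILVLG/Imp-v1.5-4B-Phi3", trust_remote_code=True)
print(config._name_or_path)   # expected: MILVLG/Imp-v1.5-4B-Phi3
print(config.architectures)   # expected: ['ImpPhi3ForCausalLM']
```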