pipeline_tag: text-generation
tags:
- VILA
- VLM
---

# VILA Model Card

## Model details

**Model type:**
VILA is a visual language model (VLM) pretrained with interleaved image-text data at scale, enabling multi-image VLM capabilities. VILA is deployable on the edge, including Jetson Orin and laptops, via AWQ 4-bit quantization through the TinyChat framework. We find that: (1) image-text pairs are not enough; interleaved image-text data is essential; (2) unfreezing the LLM during interleaved image-text pre-training enables in-context learning; (3) re-blending text-only instruction data is crucial to boost both VLM and text-only performance. VILA unveils appealing capabilities, including multi-image reasoning, in-context learning, visual chain-of-thought, and better world knowledge.
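As a quick orientation, the snippet below sketches single-image inference with this checkpoint. It assumes the LLaVA-style Python interface that the VILA codebase builds on (`load_pretrained_model`, `process_images`, `tokenizer_image_token`) and a locally available AWQ 4-bit checkpoint; the exact module paths, loading flags, and TinyChat integration may differ, so treat this as an illustrative sketch and follow the repository linked below for the supported workflow.

```python
# Illustrative only: assumes a LLaVA-style interface (which the VILA repo builds on)
# and that the quantized checkpoint loads locally. Paths and the prompt below are
# placeholders, not values taken from this card.
import torch
from PIL import Image

from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.mm_utils import get_model_name_from_path, process_images, tokenizer_image_token
from llava.model.builder import load_pretrained_model

model_path = "Efficient-Large-Model/VILA-13b-4bit-awq"  # placeholder checkpoint path
tokenizer, model, image_processor, _ = load_pretrained_model(
    model_path, model_base=None, model_name=get_model_name_from_path(model_path)
)

image = Image.open("demo.jpg").convert("RGB")  # placeholder image
image_tensor = process_images([image], image_processor, model.config).to(
    model.device, dtype=torch.float16
)

# The <image> placeholder marks where visual tokens are spliced into the prompt.
prompt = DEFAULT_IMAGE_TOKEN + "\nDescribe this image in detail."
input_ids = (
    tokenizer_image_token(prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt")
    .unsqueeze(0)
    .to(model.device)
)

with torch.inference_mode():
    output_ids = model.generate(input_ids, images=image_tensor, max_new_tokens=128)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True).strip())
```

In practice the prompt should also be wrapped in the model's conversation template, and multi-image prompts repeat the image placeholder once per image; see the repository for the exact templates and the TinyChat deployment path.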

**Model date:**
VILA-13b-4bit-awq was trained in Feb 2024.

**Paper or resources for more information:**
https://arxiv.org/abs/2312.07533
https://github.com/Efficient-Large-Model/VILA

```
@misc{lin2023vila,
      title={VILA: On Pre-training for Visual Language Models},
      author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
      year={2023},
      eprint={2312.07533},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}
```

## License
- The code is released under the Apache 2.0 license as found in the [LICENSE](./LICENSE) file.
- The pretrained weights are released under the [CC-BY-NC-SA-4.0 license](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en).
- The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
  - [Model License](https://github.com/facebookresearch/llama/blob/main/MODEL_CARD.md) of LLaMA
  - [Terms of Use](https://openai.com/policies/terms-of-use) of the data generated by OpenAI
  - [Dataset Licenses](https://github.com/Efficient-Large-Model/VILA/blob/main/data_prepare/LICENSE) for each dataset used during training.

**Where to send questions or comments about the model:**
https://github.com/Efficient-Large-Model/VILA/issues

## Intended use

**Primary intended uses:**
The primary use of VILA is research on large multimodal models and chatbots.

**Primary intended users:**
The primary intended users of the model are researchers and hobbyists in computer vision, natural language processing, machine learning, and artificial intelligence.

## Training dataset
See [Dataset Preparation](https://github.com/Efficient-Large-Model/VILA/blob/main/data_prepare/README.md) for more details.

## Evaluation dataset
A collection of 12 benchmarks, including 5 academic VQA benchmarks and 7 recent benchmarks specifically proposed for instruction-following LMMs.