Update README.md
README.md CHANGED

```diff
@@ -33,11 +33,17 @@ To further enhance the multimodal capabilities of the model, we use learnable cu
 
 1. Pre-training the adapter on Image Captioning tasks (LAION, CC-4M, etc.).
 2. Once the adapter has learned to map visual embeddings to the language model's textual space, we proceed to unfreeze Mistral for improved understanding of dialog formats and complex queries.
-3. The dataset consists of data in English and Russian
-
-
-
-
+3. The dataset consists of data in English and Russian and has the following structure:
+
+| Task          | Dataset Source                     | #Samples |
+| ------------- | ---------------------------------- | -------- |
+| Caption       | ShareGPT4V                         | 100K     |
+| VQA           | COCO, SAM-9K                       | 20K, 9K  |
+| WebQA         | WebData                            | 1.5K     |
+| OCRQA         | TextVQA, OCRVQA                    | 120K     |
+| Conversation  | LLaVA-v1.5-665K, OCRVQA            | 665K     |
+| DocVQA        | Proprietary data (ru)              | 20K      |
+| Text-only SFT | Proprietary data (ru), Alpaca (en) | 10K      |
 
 ### Results
 
```
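The two-stage schedule in steps 1–2 (train only the adapter on captioning, then unfreeze the language model) can be sketched as a freeze/unfreeze switch over parameter groups. This is a framework-agnostic illustration, not the repository's training code; the names `Param`, `Block`, and `configure_stage` are hypothetical stand-ins.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Param:
    # Stand-in for a trainable tensor; only the trainability flag matters here.
    requires_grad: bool = False

@dataclass
class Block:
    # Stand-in for a module (the adapter or the language model).
    params: List[Param] = field(default_factory=lambda: [Param() for _ in range(4)])

def configure_stage(adapter: Block, language_model: Block, stage: int) -> None:
    """Stage 1: adapter-only captioning pre-training (LLM frozen).
    Stage 2: unfreeze the language model for dialog / complex queries."""
    for p in adapter.params:
        p.requires_grad = True
    for p in language_model.params:
        p.requires_grad = stage >= 2

adapter, mistral = Block(), Block()

configure_stage(adapter, mistral, stage=1)
frozen = all(not p.requires_grad for p in mistral.params)  # LLM stays frozen

configure_stage(adapter, mistral, stage=2)
unfrozen = all(p.requires_grad for p in mistral.params)    # LLM now trains too
```

In a real PyTorch setup the same switch would toggle `requires_grad` on the model's parameters and rebuild the optimizer's parameter groups between stages.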
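The dataset table added in this change can be expressed as a plain data-mixture mapping; the counts and source names below are copied from the table, while the `DATA_MIXTURE` structure itself is only an illustrative sketch.

```python
# Sketch of the SFT data mixture from the README table (structure is hypothetical).
DATA_MIXTURE = {
    "Caption":       {"sources": ["ShareGPT4V"], "samples": [100_000]},
    "VQA":           {"sources": ["COCO", "SAM-9K"], "samples": [20_000, 9_000]},
    "WebQA":         {"sources": ["WebData"], "samples": [1_500]},
    "OCRQA":         {"sources": ["TextVQA", "OCRVQA"], "samples": [120_000]},
    "Conversation":  {"sources": ["LLaVA-v1.5-665K", "OCRVQA"], "samples": [665_000]},
    "DocVQA":        {"sources": ["Proprietary data (ru)"], "samples": [20_000]},
    "Text-only SFT": {"sources": ["Proprietary data (ru)", "Alpaca (en)"], "samples": [10_000]},
}

# Total examples across all tasks: 945,500
total = sum(sum(task["samples"]) for task in DATA_MIXTURE.values())
```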