PhelixZhen/Algea-VE

---
license: apache-2.0
language:
- en
pipeline_tag: visual-question-answering
---

# Algea-VE: A Tiny Multimodal Language Model with Only 0.8B Parameters


Algea-VE uses [algea-550M-base](https://huggingface.co/PhelixZhen/Algae-550M-base) as its base language model and CLIP ViT-L/14-336 as its visual encoder. It was trained on the LAION-CC-SBU dataset and then fine-tuned on llava_v1_5_mix665k. The model is very small: fine-tuning requires only 32GB of VRAM, and inference requires 3GB.
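As a rough sanity check on the sizes quoted above, the following sketch works out two numbers from the stated configuration: how many image patches a 336-pixel input yields at a patch size of 14, and how much memory 0.8B parameters occupy on their own. The helper names and the 2-bytes-per-parameter (fp16/bf16) assumption are illustrative and do not come from the Algea-VE code; actual VRAM use also depends on activations, the KV cache, and framework overhead.

```python
# Back-of-the-envelope estimates; not measured on the actual model.

IMAGE_SIZE = 336  # input resolution of CLIP ViT-L/14-336, in pixels
PATCH_SIZE = 14   # edge length of one ViT patch, in pixels


def num_patch_tokens(image_size: int = IMAGE_SIZE, patch_size: int = PATCH_SIZE) -> int:
    """Number of non-overlapping square patches the vision encoder sees per image."""
    per_side = image_size // patch_size
    return per_side * per_side


def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory taken by the weights alone, in GB (10^9 bytes).

    Assumes 2 bytes per parameter (fp16/bf16) unless told otherwise.
    """
    return num_params * bytes_per_param / 1e9


if __name__ == "__main__":
    print("patches per image:", num_patch_tokens())
    print("weights in half precision (GB):", weight_memory_gb(0.8e9))
```

Under these assumptions the weights account for roughly half of the 3GB inference figure, leaving the remainder for activations and runtime overhead.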

Because the base model was undertrained, the current model is prone to hallucination and repetition. To address this, I am training a new model of the same size with better performance.

This model is based on the llavaphi project. For usage instructions, see the [Algea-VE repository](https://github.com/phelixzhen/Algea-VE).