# LLaVA 7B — Multimodal Supervised Fine-Tuning (CPT-SFT)

- **Model type:** Vision-Language Causal Model
- **Base model:** ubitech-edg/llava-7b-cpt
- **License:** Llama 2 Community License
- **Framework:** Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 + CUDA 12.1)
## Overview
llava-7b-cpt-sft is the final multimodal supervised fine-tuned version of LLaVA 1.5 7B.
It builds upon the multimodal continual-pretrained model (ubitech-edg/llava-7b-cpt), combining rich visual grounding with instruction-following and question-answering abilities.
This stage refines both the text and image reasoning layers using synthetic QA data while retaining the full multimodal processor and vision encoder.
Training was performed on the Leonardo EuroHPC supercomputer using Axolotl and DeepSpeed ZeRO-1 with bfloat16 precision and LoRA adapters merged into the final weights.
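The LoRA merge step can be reproduced offline with `peft`; the sketch below is illustrative only, and the adapter directory name is a hypothetical placeholder rather than a published artifact.

```python
# Minimal sketch of merging LoRA adapters into the base model with peft.
# "./llava-7b-cpt-sft-lora" is a hypothetical local adapter path, not a released artifact.
import torch
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

base = LlavaForConditionalGeneration.from_pretrained(
    "ubitech-edg/llava-7b-cpt", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "./llava-7b-cpt-sft-lora")
merged = model.merge_and_unload()          # fold LoRA weights into the base parameters
merged.save_pretrained("./llava-7b-cpt-sft")
```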
## Training Setup
| Component | Specification |
|---|---|
| Objective | Multimodal supervised fine-tuning (image–text QA) |
| Base model | ubitech-edg/llava-7b-cpt |
| Adapter type | LoRA (merged into full model) |
| Precision | bfloat16 |
| Hardware | 8 nodes × 2 × NVIDIA A100 64 GB GPUs |
| Framework | Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 / CUDA 12.1) |
| Runtime | ~24 hours |
| Checkpoints | 1 per epoch |
| Vision tower | Active (unfrozen multimodal processing) |
| Dataset split | 70% train / 30% validation |
## Dataset
This multimodal SFT stage uses a synthetic QA dataset for text reasoning and can optionally be paired with the image–caption data used during the prior continual-pretraining stage.
| Dataset | Description |
|---|---|
| `axolotl_deduplicated_synthetic_qa.jsonl` | Text-based instruction-following and question-answering dataset |
| `mm_captions_chat.jsonl` | Image–caption dialogues, aligning visual grounding with natural language |
Together, these datasets enhance visual question answering, caption reasoning, and multimodal instruction following.
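For a quick sanity check of either file, a generic JSON Lines reader is enough; the snippet below only reports record counts and top-level keys, since the exact per-record schema is not documented here.

```python
# Generic JSONL inspection helper; no assumptions are made about the record schema.
import json

def inspect_jsonl(path: str, preview: int = 1) -> None:
    """Print the record count and the top-level keys of the first few records."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{path}: {len(records)} records")
    for record in records[:preview]:
        print("  keys:", sorted(record.keys()))

inspect_jsonl("axolotl_deduplicated_synthetic_qa.jsonl")
inspect_jsonl("mm_captions_chat.jsonl")
```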
## Hyperparameters
| Parameter | Value |
|---|---|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Learning rate | 0.00015 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ❌ (disabled for multimodal stability) |
| Validation set size | 0.3 |
| Evals per epoch | 1 |
| Image size | 512 |
| Resize algorithm | bilinear |
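For reference, the LoRA rows above map onto a `peft` configuration roughly as follows; this is a sketch rather than the exact Axolotl-generated config. The snippet also spells out the effective global batch size implied by the batch settings and the 16-GPU setup.

```python
# Approximate peft equivalent of the LoRA hyperparameters listed above (sketch, not the exact config).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# Effective global batch size = micro batch × gradient accumulation × GPU count
effective_batch_size = 1 * 4 * (8 * 2)  # = 64 sequences per optimizer step
```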
## Tokenizer & Processor
| Component | Description |
|---|---|
| Tokenizer type | AutoTokenizer |
| Processor type | AutoProcessor |
| Pad token | `<pad>` (ID 32001) |
| Chat template | `llava` |
The processor is fully multimodal, handling both image and text inputs with unified preprocessing.
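A quick way to verify the processor wiring is to load it and render a message through the chat template; this is a sketch, and the exact rendered prompt depends on the template shipped with the checkpoint.

```python
# Sketch: load the processor, check the pad token, and render a prompt via the llava chat template.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("ubitech-edg/llava-7b-cpt-sft")
print(processor.tokenizer.pad_token, processor.tokenizer.pad_token_id)  # expect <pad>, 32001

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe what is happening in this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)
```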
## Usage Example

Perform visual question answering or image–text chat directly with `transformers`:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "ubitech-edg/llava-7b-cpt-sft"

processor = AutoProcessor.from_pretrained(model_id)
# LLaVA checkpoints load through LlavaForConditionalGeneration rather than AutoModelForCausalLM.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("example.jpg").convert("RGB")
prompt = "USER: <image>\nDescribe what is happening in this image.\nASSISTANT:"

# Cast floating-point inputs (pixel values) to bfloat16 to match the model weights.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7, top_p=0.9)

print(processor.decode(output[0], skip_special_tokens=True))
```
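The model is loaded with `LlavaForConditionalGeneration`, the standard Transformers class for LLaVA-style checkpoints, rather than `AutoModelForCausalLM`, which does not cover multimodal LLaVA configs. Note that `do_sample=True` is required for `temperature` and `top_p` to take effect; drop it for greedy decoding.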