LLaVA 7B — Multimodal Supervised Fine-Tuning (CPT-SFT)

Model type: Vision-Language Causal Model
Base model: ubitech-edg/llava-7b-cpt
License: Llama 2 Community License
Framework: Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 + CUDA 12.1)


Overview

llava-7b-cpt-sft is the final multimodal supervised fine-tuned version of LLaVA 1.5 7B.
It builds on the continually pretrained multimodal model (ubitech-edg/llava-7b-cpt), combining rich visual grounding with instruction-following and question-answering abilities.

This stage refines both the text and image reasoning layers using synthetic QA data while retaining the full multimodal processor and vision encoder.

Training was performed on the Leonardo EuroHPC supercomputer using Axolotl and DeepSpeed ZeRO-1 in bfloat16 precision; the LoRA adapters were merged into the final weights, so the published checkpoint is a complete standalone model.
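For orientation, merging LoRA adapters into a base model is commonly done with peft's merge_and_unload, roughly as sketched below. The adapter path is hypothetical, since no separate adapter repository is published for this release:

```python
# Rough sketch of merging LoRA adapters into full weights with peft.
# The adapter path below is hypothetical; this release already ships merged.
import torch
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

base = LlavaForConditionalGeneration.from_pretrained(
    "ubitech-edg/llava-7b-cpt", torch_dtype=torch.bfloat16
)
peft_model = PeftModel.from_pretrained(base, "path/to/sft-lora-adapter")  # hypothetical
merged = peft_model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("llava-7b-cpt-sft")
```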


Training Setup

| Component | Specification |
|---|---|
| Objective | Multimodal supervised fine-tuning (image–text QA) |
| Base model | ubitech-edg/llava-7b-cpt |
| Adapter type | LoRA (merged into full model) |
| Precision | bfloat16 |
| Hardware | 8 nodes × 2 NVIDIA A100 64 GB GPUs |
| Framework | Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 / CUDA 12.1) |
| Runtime | ~24 hours |
| Checkpoints | 1 per epoch |
| Vision tower | Active (unfrozen multimodal processing) |
| Dataset split | 70% train / 30% validation |
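For reference, a ZeRO stage-1 DeepSpeed configuration of the kind Axolotl consumes looks roughly like the sketch below; the values mirror the hyperparameter table but are illustrative, not the exact training config:

```python
# Illustrative DeepSpeed ZeRO stage-1 config, written as a Python dict for
# readability; Axolotl reads the equivalent JSON file.
import json

deepspeed_config = {
    "zero_optimization": {"stage": 1},   # ZeRO-1: shard optimizer states only
    "bf16": {"enabled": True},
    "gradient_accumulation_steps": 4,
    "train_micro_batch_size_per_gpu": 1,
}

with open("zero1.json", "w") as f:
    json.dump(deepspeed_config, f, indent=2)
```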

Dataset

This multimodal SFT stage uses the synthetic QA dataset for text reasoning and optionally pairs it with image–caption data carried over from the continual-pretraining stage.

| Dataset | Description |
|---|---|
| axolotl_deduplicated_synthetic_qa.jsonl | Text-based instruction-following and question-answering dataset |
| mm_captions_chat.jsonl | Image–caption dialogues, aligning visual grounding with natural language |

Together, these datasets enhance visual question answering, caption reasoning, and multimodal instruction following.
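For illustration, a record in mm_captions_chat.jsonl plausibly follows Axolotl's ShareGPT-style conversation layout; the field names below are an assumption, not the dataset's documented schema:

```python
import json

# Hypothetical record layout for mm_captions_chat.jsonl, assuming Axolotl's
# ShareGPT-style "conversations" schema; the real field names may differ.
record = {
    "image": "images/0001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat does this picture show?"},
        {"from": "gpt", "value": "A harbor at dusk with boats moored along the pier."},
    ],
}

# Each line of the .jsonl file holds one such record.
print(json.dumps(record, ensure_ascii=False))
```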


Hyperparameters

| Parameter | Value |
|---|---|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Learning rate | 0.00015 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ (enabled) |
| Flash attention | ❌ (disabled for multimodal stability) |
| Validation set size | 0.3 |
| Evals per epoch | 1 |
| Image size | 512 |
| Resize algorithm | bilinear |
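These LoRA settings map one-to-one onto a peft LoraConfig; a minimal sketch of the equivalent adapter configuration (the task_type here is an assumption):

```python
from peft import LoraConfig

# LoRA configuration mirroring the hyperparameter table above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",  # assumption: standard causal-LM task type
)
```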

Tokenizer & Processor

| Component | Description |
|---|---|
| Tokenizer type | AutoTokenizer |
| Processor type | AutoProcessor |
| Pad token | `<pad>` (ID 32001) |
| Chat template | llava |

The processor is fully multimodal, handling both image and text inputs with unified preprocessing.
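Because the processor bundles the llava chat template, prompts can also be built with apply_chat_template rather than hand-writing the USER/ASSISTANT string; a minimal sketch using transformers' standard chat-message schema (exact rendering depends on the template shipped with the checkpoint):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("ubitech-edg/llava-7b-cpt-sft")

# Standard transformers chat-message schema; the llava template renders it
# into the "USER: <image>\n... ASSISTANT:" prompt format.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe what is happening in this image."},
        ],
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)
```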


Usage Example

Perform visual question answering or image–text chat directly with transformers:

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "ubitech-edg/llava-7b-cpt-sft"

# Load the unified multimodal processor (tokenizer + image preprocessing).
processor = AutoProcessor.from_pretrained(model_id)

# LLaVA checkpoints load via LlavaForConditionalGeneration; the llava
# architecture has no AutoModelForCausalLM mapping.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("example.jpg").convert("RGB")
prompt = "USER: <image>\nDescribe what is happening in this image.\nASSISTANT:"

# Cast floating-point inputs (the pixel values) to bfloat16 to match the model.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.bfloat16
)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=150,
        do_sample=True,  # required for temperature/top_p to take effect
        temperature=0.7,
        top_p=0.9,
    )

print(processor.decode(output[0], skip_special_tokens=True))
```