# LLaVA 7B — Multimodal Supervised Fine-Tuning (CPT-SFT)

- **Model type:** Vision-Language Causal Model
- **Base model:** ubitech-edg/llava-7b-cpt
- **License:** Llama 2 Community License
- **Framework:** Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 + CUDA 12.1)
## Overview
llava-7b-cpt-sft is the final multimodal supervised fine-tuned version of LLaVA 1.5 7B.
It builds upon the multimodal continual-pretrained model (ubitech-edg/llava-7b-cpt), combining rich visual grounding with instruction-following and question-answering abilities.
This stage refines both the text and image reasoning layers using synthetic QA data while retaining the full multimodal processor and vision encoder.
Training was performed on the Leonardo EuroHPC supercomputer using Axolotl and DeepSpeed ZeRO-1 with bfloat16 precision and LoRA adapters merged into the final weights.
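The LoRA merge step can be reproduced offline with `peft`; the sketch below is illustrative only, and the adapter directory name is a hypothetical placeholder rather than a published artifact.

```python
# Minimal sketch of merging LoRA adapters into the base model with peft.
# "./llava-7b-cpt-sft-lora" is a hypothetical local adapter path, not a released artifact.
import torch
from peft import PeftModel
from transformers import LlavaForConditionalGeneration

base = LlavaForConditionalGeneration.from_pretrained(
    "ubitech-edg/llava-7b-cpt", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "./llava-7b-cpt-sft-lora")
merged = model.merge_and_unload()          # fold LoRA weights into the base parameters
merged.save_pretrained("./llava-7b-cpt-sft")
```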
## Training Setup
| Component | Specification |
|---|---|
| Objective | Multimodal supervised fine-tuning (image–text QA) |
| Base model | ubitech-edg/llava-7b-cpt |
| Adapter type | LoRA (merged into full model) |
| Precision | bfloat16 |
| Hardware | 8 nodes × 2 × NVIDIA A100 64 GB GPUs |
| Framework | Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 / CUDA 12.1) |
| Runtime | ~24 hours |
| Checkpoints | 1 per epoch |
| Vision tower | Active (unfrozen multimodal processing) |
| Dataset split | 70% train / 30% validation |
## Dataset
This multimodal SFT stage uses a synthetic QA dataset for text reasoning and can optionally be paired with the image–caption data used during the prior continual-pretraining stage.
| Dataset | Description |
|---|---|
| `axolotl_deduplicated_synthetic_qa.jsonl` | Text-based instruction-following and question-answering dataset |
| `mm_captions_chat.jsonl` | Image–caption dialogues, aligning visual grounding with natural language |
Together, these datasets enhance visual question answering, caption reasoning, and multimodal instruction following.
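For a quick sanity check of either file, a generic JSON Lines reader is enough; the snippet below only reports record counts and top-level keys, since the exact per-record schema is not documented here.

```python
# Generic JSONL inspection helper; no assumptions are made about the record schema.
import json

def inspect_jsonl(path: str, preview: int = 1) -> None:
    """Print the record count and the top-level keys of the first few records."""
    with open(path, encoding="utf-8") as f:
        records = [json.loads(line) for line in f if line.strip()]
    print(f"{path}: {len(records)} records")
    for record in records[:preview]:
        print("  keys:", sorted(record.keys()))

inspect_jsonl("axolotl_deduplicated_synthetic_qa.jsonl")
inspect_jsonl("mm_captions_chat.jsonl")
```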
## Hyperparameters
| Parameter | Value |
|---|---|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Learning rate | 0.00015 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup steps | 10 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ❌ (disabled for multimodal stability) |
| Validation set size | 0.3 |
| Evals per epoch | 1 |
| Image size | 512 |
| Resize algorithm | bilinear |
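For reference, the LoRA rows above map onto a `peft` configuration roughly as follows; this is a sketch rather than the exact Axolotl-generated config. The snippet also spells out the effective global batch size implied by the batch settings and the 16-GPU setup.

```python
# Approximate peft equivalent of the LoRA hyperparameters listed above (sketch, not the exact config).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)

# Effective global batch size = micro batch × gradient accumulation × GPU count
effective_batch_size = 1 * 4 * (8 * 2)  # = 64 sequences per optimizer step
```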
## Tokenizer & Processor
| Component | Description |
|---|---|
| Tokenizer type | AutoTokenizer |
| Processor type | AutoProcessor |
| Pad token | `<pad>` (ID 32001) |
| Chat template | `llava` |
The processor is fully multimodal, handling both image and text inputs with unified preprocessing.
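A quick way to verify the processor wiring is to load it and render a message through the chat template; this is a sketch, and the exact rendered prompt depends on the template shipped with the checkpoint.

```python
# Sketch: load the processor, check the pad token, and render a prompt via the llava chat template.
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("ubitech-edg/llava-7b-cpt-sft")
print(processor.tokenizer.pad_token, processor.tokenizer.pad_token_id)  # expect <pad>, 32001

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe what is happening in this image."},
        ],
    }
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)
```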
## Usage Example

Perform visual question answering or image–text chat directly with `transformers`:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "ubitech-edg/llava-7b-cpt-sft"

processor = AutoProcessor.from_pretrained(model_id)
# LLaVA checkpoints load through LlavaForConditionalGeneration rather than AutoModelForCausalLM.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

image = Image.open("example.jpg").convert("RGB")
prompt = "USER: <image>\nDescribe what is happening in this image.\nASSISTANT:"

# Cast floating-point inputs (pixel values) to bfloat16 to match the model weights.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=150, do_sample=True, temperature=0.7, top_p=0.9)

print(processor.decode(output[0], skip_special_tokens=True))
```
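The model is loaded with `LlavaForConditionalGeneration`, the standard Transformers class for LLaVA-style checkpoints, rather than `AutoModelForCausalLM`, which does not cover multimodal LLaVA configs. Note that `do_sample=True` is required for `temperature` and `top_p` to take effect; drop it for greedy decoding.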