---
base_model: lmms-lab/llava-onevision-qwen2-0.5b-ov
datasets:
- Dataseeds/DataSeeds-Sample-Dataset-DSD
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- llava
- llava-onevision
- lora
- fine-tuned
- photography
- scene-analysis
- image-captioning
model-index:
- name: LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune
results:
- task:
type: image-captioning
name: Image Captioning
dataset:
name: DataSeeds.AI Sample Dataset
type: Dataseeds/DataSeeds-Sample-Dataset-DSD
metrics:
- type: bleu-4
value: 0.0246
name: BLEU-4
- type: rouge-l
value: 0.214
name: ROUGE-L
- type: bertscore
value: 0.2789
name: BERTScore F1
- type: clipscore
value: 0.326
name: CLIPScore
---
# LLaVA-OneVision-Qwen2-0.5b Fine-tuned on DataSeeds.AI Dataset
This model is a LoRA (Low-Rank Adaptation) fine-tuned version of [lmms-lab/llava-onevision-qwen2-0.5b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) specialized for photography scene analysis and description generation. The model was presented in the paper [Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery](https://huggingface.co/papers/2506.05673). The model was fine-tuned on the [DataSeeds.AI Sample Dataset (DSD)](https://huggingface.co/datasets/Dataseeds/DataSeeds.AI-Sample-Dataset-DSD) to enhance its capabilities in generating detailed, accurate descriptions of photographic content.
Code for fine-tuning and usage is available at: https://github.com/DataSeeds-ai/DSD-finetune-blip-llava
## Model Description
- **Base Model**: [LLaVA-OneVision-Qwen2-0.5b](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov)
- **Vision Encoder**: [SigLIP-SO400M-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Language Model**: Qwen2-0.5B (hidden size 896)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) with PEFT
- **Total Parameters**: ~917M (513M trainable during fine-tuning, 56% of total)
- **Multimodal Projector**: 1.84M parameters (100% trainable)
- **Precision**: BFloat16
- **Task**: Photography scene analysis and detailed image description
### LoRA Configuration
- **LoRA Rank (r)**: 32
- **LoRA Alpha**: 32
- **LoRA Dropout**: 0.1
- **Target Modules**: `v_proj`, `k_proj`, `q_proj`, `up_proj`, `gate_proj`, `down_proj`, `o_proj`
- **Tunable Components**: `mm_mlp_adapter`, `mm_language_model`
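The adapter settings above correspond roughly to the PEFT `LoraConfig` sketched below. This is an orientation aid, not the verbatim training code (see the linked GitHub repository for that); the `task_type` value is an assumption for a causal-LM backbone.

```python
from peft import LoraConfig

# Sketch of the LoRA setup described above; the exact training script lives in
# the linked GitHub repository. task_type is an assumption for a causal LM.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["v_proj", "k_proj", "q_proj", "up_proj",
                    "gate_proj", "down_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```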
## Training Details
### Dataset
The model was fine-tuned on the DataSeeds.AI Sample Dataset, a curated collection of photography images with detailed scene descriptions focusing on:
- Compositional elements and camera perspectives
- Lighting conditions and visual ambiance
- Product identification and technical details
- Photographic style and mood analysis
### Training Configuration
| Parameter | Value |
|-----------|-------|
| **Learning Rate** | 1e-5 |
| **Optimizer** | AdamW |
| **Learning Rate Schedule** | Cosine decay |
| **Warmup Ratio** | 0.03 |
| **Weight Decay** | 0.01 |
| **Batch Size** | 2 |
| **Gradient Accumulation Steps** | 8 (effective batch size: 16) |
| **Training Epochs** | 3 |
| **Max Sequence Length** | 8192 |
| **Max Gradient Norm** | 0.5 |
| **Precision** | BFloat16 |
| **Hardware** | Single NVIDIA A100 40GB |
| **Training Time** | 30 hours |
### Training Strategy
- **Validation Frequency**: Every 50 steps for precise checkpoint selection
- **Best Checkpoint**: Step 1,750 (epoch 2.9) with validation loss of 1.83
- **Mixed Precision**: BFloat16 with gradient checkpointing for memory efficiency
- **System Prompt**: Consistent template requesting scene descriptions across all samples
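For orientation, the table and strategy above map roughly onto Hugging Face `TrainingArguments` as sketched below. This is an approximation, not the verbatim training script (which is in the linked GitHub repository); the output path is hypothetical, and the 8,192-token max sequence length is enforced on the data/tokenizer side rather than here.

```python
from transformers import TrainingArguments

# Approximate mapping of the hyperparameter table and training strategy above.
training_args = TrainingArguments(
    output_dir="./llava-ov-dsd-lora",   # hypothetical output path
    learning_rate=1e-5,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # effective batch size 16
    num_train_epochs=3,
    max_grad_norm=0.5,
    bf16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",              # "evaluation_strategy" on older transformers
    eval_steps=50,                      # validation every 50 steps
    save_steps=50,
    logging_steps=10,
)
```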
## Performance
### Quantitative Results
The fine-tuned model improves on all four evaluation metrics relative to the base model, with the largest relative gain in BLEU-4 (a sketch for reproducing the text metrics with the `evaluate` library follows the table):
| Metric | Base Model | Fine-tuned | Absolute Δ | Relative Δ |
|--------|------------|------------|------------|------------|
| **BLEU-4** | 0.0199 | **0.0246** | +0.0048 | **+24.09%** |
| **ROUGE-L** | 0.2089 | **0.2140** | +0.0051 | **+2.44%** |
| **BERTScore F1** | 0.2751 | **0.2789** | +0.0039 | **+1.40%** |
| **CLIPScore** | 0.3247 | **0.3260** | +0.0013 | **+0.41%** |
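The text-similarity metrics can be reproduced approximately with the Hugging Face `evaluate` library, as sketched below. This is not necessarily the exact evaluation pipeline behind the numbers above, the example captions are hypothetical placeholders, and CLIPScore (which requires image-text alignment, e.g. via `torchmetrics`) is omitted.

```python
import evaluate

# Assumes `preds` is a list of generated captions and `refs` the matching
# ground-truth descriptions from the held-out split.
preds = ["a close-up photo of a red flower in soft light"]        # hypothetical
refs = ["a macro shot of a red flower under diffused light"]      # hypothetical

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print("BLEU-4:", bleu.compute(predictions=preds,
                              references=[[r] for r in refs], max_order=4)["bleu"])
print("ROUGE-L:", rouge.compute(predictions=preds, references=refs)["rougeL"])
bs = bertscore.compute(predictions=preds, references=refs, lang="en")
print("BERTScore F1:", sum(bs["f1"]) / len(bs["f1"]))
```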
### Key Improvements
- **Enhanced N-gram Precision**: 24% improvement in BLEU-4 indicates significantly better word sequence accuracy
- **Better Sequential Information**: ROUGE-L improvement shows enhanced capture of longer matching sequences
- **Improved Semantic Understanding**: BERTScore gains demonstrate better contextual relationships
- **Maintained Visual-Semantic Alignment**: CLIPScore preservation with slight improvement
### Inference Performance
- **Processing Speed**: 2.30 seconds per image (NVIDIA A100 40GB)
- **Memory Requirements**: Optimized for single GPU inference
## Usage
### Installation
```bash
pip install transformers torch peft pillow
```
### Basic Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

# Load the base model and processor
base_model = AutoModelForCausalLM.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    trust_remote_code=True
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune"
)

# Load and process an image
image = Image.open("your_image.jpg")
prompt = "Describe this image in detail, focusing on the composition, lighting, and visual elements."
# Depending on the processor version, the prompt may need an explicit <image>
# placeholder or a chat template; adjust if the processor raises an error.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate a description
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )
description = processor.decode(outputs[0], skip_special_tokens=True)
print(description)
```
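For deployment, the LoRA weights can optionally be merged into the base model so inference runs without the PEFT wrapper; a minimal sketch (the local save path is hypothetical):

```python
# Merge the LoRA weights into the base model and save a standalone checkpoint.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llava-ov-dsd-merged")   # hypothetical path
processor.save_pretrained("./llava-ov-dsd-merged")
```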
### Advanced Usage with Custom Prompts
```python
# Photography-specific prompts that work well with this model
prompts = [
    "Analyze the photographic composition and lighting in this image.",
    "Describe the technical aspects and visual mood of this photograph.",
    "Provide a detailed scene description focusing on the subject and environment."
]

for prompt in prompts:
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    description = processor.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Description: {description}\n")
```
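To compare against the 2.30 s/image figure reported above on your own hardware, a quick latency check can reuse `model`, `processor`, `prompt`, and `image` from the earlier examples:

```python
import time

# Rough single-image latency; results depend on GPU, image resolution,
# and generation settings, so expect deviations from the reported figure.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
start = time.perf_counter()
with torch.no_grad():
    _ = model.generate(**inputs, max_new_tokens=512)
print(f"Seconds per image: {time.perf_counter() - start:.2f}")
```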
## Model Architecture
The model maintains the LLaVA-OneVision architecture with the following components:
- **Vision Encoder**: SigLIP-SO400M with hierarchical feature extraction
- **Language Model**: Qwen2-0.5B with 24 layers, 14 attention heads
- **Multimodal Projector**: 2-layer MLP with GELU activation (mlp2x_gelu)
- **Image Processing**: Supports "anyres_max_9" aspect ratio with dynamic grid pinpoints
- **Context Length**: 32,768 tokens with sliding window attention
### Technical Specifications
- **Hidden Size**: 896
- **Intermediate Size**: 4,864
- **Attention Heads**: 14 (2 key-value heads)
- **RMS Norm Epsilon**: 1e-6
- **RoPE Theta**: 1,000,000
- **Image Token Index**: 151646
- **Max Image Grid**: Up to 2304×2304 pixels with dynamic tiling
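The values above can be cross-checked against the published configuration; the sketch below assumes standard Qwen2/LLaVA attribute names, which may differ slightly in the custom lmms-lab implementation.

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov", trust_remote_code=True
)
# Print the fields listed above, falling back gracefully if a name differs.
for field in ("hidden_size", "intermediate_size", "num_attention_heads",
              "num_key_value_heads", "rms_norm_eps", "rope_theta",
              "image_token_index", "max_position_embeddings"):
    print(field, getattr(cfg, field, "n/a"))
```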
## Training Data
The DataSeeds.AI Sample Dataset contains curated photography images with comprehensive annotations including:
- **Scene Descriptions**: Detailed textual descriptions of visual content
- **Technical Metadata**: Camera settings, composition details
- **Style Analysis**: Photographic techniques and artistic elements
- **Quality Annotations**: Professional photography standards
The dataset focuses on enhancing the model's ability to:
- Identify specific products and technical details accurately
- Describe lighting conditions and photographic ambiance
- Analyze compositional elements and camera perspectives
- Generate contextually aware scene descriptions
## Limitations and Considerations
### Model Limitations
- **Domain Specialization**: Optimized for photography; may have reduced performance on general vision-language tasks
- **Base Model Inheritance**: Inherits limitations from LLaVA-OneVision base model
- **Training Data Bias**: May reflect biases present in the DataSeeds.AI dataset
- **Language Support**: Primarily trained and evaluated on English descriptions
### Recommended Use Cases
- ✅ Photography scene analysis and description
- ✅ Product photography captioning
- ✅ Technical photography analysis
- ✅ Visual content generation for photography applications
- ⚠️ General-purpose vision-language tasks (may have reduced performance)
- ❌ Non-photographic image analysis (not optimized for this use case)
### Ethical Considerations
- The model may perpetuate biases present in photography datasets
- Generated descriptions should be reviewed for accuracy in critical applications
- Consider potential cultural biases in photographic style interpretation
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@article{abdoli2025peerranked,
title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery},
author={Sajjad Abdoli and Freeman Lewin and Gediminas Vasiliauskas and Fabian Schonholz},
journal={arXiv preprint arXiv:2506.05673},
year={2025},
}
@misc{llava-onevision-dsd-finetune-2024,
title={LLaVA-OneVision Fine-tuned on DataSeeds.AI Dataset for Photography Scene Analysis},
author={DataSeeds.AI},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune},
note={LoRA fine-tuned model for enhanced photography description generation}
}
@article{li2024llavaonevision,
title={LLaVA-OneVision: Easy Visual Task Transfer},
author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
journal={arXiv preprint arXiv:2408.03326},
year={2024}
}
@article{hu2022lora,
title={LoRA: Low-Rank Adaptation of Large Language Models},
author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
journal={arXiv preprint arXiv:2106.09685},
year={2021}
}
```
## License
This model is released under the Apache 2.0 license, consistent with the base LLaVA-OneVision model licensing terms.
## Acknowledgments
- **Base Model**: Thanks to LMMS Lab for the LLaVA-OneVision model
- **Vision Encoder**: Thanks to Google Research for the SigLIP model
- **Dataset**: GuruShots photography community for the source imagery
- **Framework**: Hugging Face PEFT library for efficient fine-tuning capabilities
---
*For questions, issues, or collaboration opportunities, please visit the [model repository](https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune) or contact the DataSeeds.AI team.*