---
base_model: lmms-lab/llava-onevision-qwen2-0.5b-ov
datasets:
- Dataseeds/DataSeeds-Sample-Dataset-DSD
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- llava
- llava-onevision
- lora
- fine-tuned
- photography
- scene-analysis
- image-captioning
model-index:
- name: LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune
results:
- task:
type: image-captioning
name: Image Captioning
dataset:
name: DataSeeds.AI Sample Dataset
type: Dataseeds/DataSeeds-Sample-Dataset-DSD
metrics:
- type: bleu-4
value: 0.0246
name: BLEU-4
- type: rouge-l
value: 0.214
name: ROUGE-L
- type: bertscore
value: 0.2789
name: BERTScore F1
- type: clipscore
value: 0.326
name: CLIPScore
---
# LLaVA-OneVision-Qwen2-0.5b Fine-tuned on DataSeeds.AI Dataset
This model is a LoRA (Low-Rank Adaptation) fine-tuned version of [lmms-lab/llava-onevision-qwen2-0.5b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov) specialized for photography scene analysis and description generation. The model was presented in the paper [Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery](https://huggingface.co/papers/2506.05673). The model was fine-tuned on the [DataSeeds.AI Sample Dataset (DSD)](https://huggingface.co/datasets/Dataseeds/DataSeeds.AI-Sample-Dataset-DSD) to enhance its capabilities in generating detailed, accurate descriptions of photographic content.
Code for fine-tuning and usage is available at: https://github.com/DataSeeds-ai/DSD-finetune-blip-llava
## Model Description
- **Base Model**: [LLaVA-OneVision-Qwen2-0.5b](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov)
- **Vision Encoder**: [SigLIP-SO400M-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Language Model**: Qwen2-0.5B (hidden size 896)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) with PEFT
- **Total Parameters**: ~917M (513M trainable during fine-tuning, 56% of total)
- **Multimodal Projector**: 1.84M parameters (100% trainable)
- **Precision**: BFloat16
- **Task**: Photography scene analysis and detailed image description
### LoRA Configuration
- **LoRA Rank (r)**: 32
- **LoRA Alpha**: 32
- **LoRA Dropout**: 0.1
- **Target Modules**: `v_proj`, `k_proj`, `q_proj`, `up_proj`, `gate_proj`, `down_proj`, `o_proj`
- **Tunable Components**: `mm_mlp_adapter`, `mm_language_model`
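The adapter settings above correspond roughly to the PEFT `LoraConfig` sketched below. This is an orientation aid, not the verbatim training code (see the linked GitHub repository for that); the `task_type` value is an assumption for a causal-LM backbone.

```python
from peft import LoraConfig

# Sketch of the LoRA setup described above; the exact training script lives in
# the linked GitHub repository. task_type is an assumption for a causal LM.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["v_proj", "k_proj", "q_proj", "up_proj",
                    "gate_proj", "down_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```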
## Training Details
### Dataset
The model was fine-tuned on the DataSeeds.AI Sample Dataset, a curated collection of photography images with detailed scene descriptions focusing on:
- Compositional elements and camera perspectives
- Lighting conditions and visual ambiance
- Product identification and technical details
- Photographic style and mood analysis
### Training Configuration
| Parameter | Value |
|-----------|-------|
| **Learning Rate** | 1e-5 |
| **Optimizer** | AdamW |
| **Learning Rate Schedule** | Cosine decay |
| **Warmup Ratio** | 0.03 |
| **Weight Decay** | 0.01 |
| **Batch Size** | 2 |
| **Gradient Accumulation Steps** | 8 (effective batch size: 16) |
| **Training Epochs** | 3 |
| **Max Sequence Length** | 8192 |
| **Max Gradient Norm** | 0.5 |
| **Precision** | BFloat16 |
| **Hardware** | Single NVIDIA A100 40GB |
| **Training Time** | 30 hours |
### Training Strategy
- **Validation Frequency**: Every 50 steps for precise checkpoint selection
- **Best Checkpoint**: Step 1,750 (epoch 2.9) with validation loss of 1.83
- **Mixed Precision**: BFloat16 with gradient checkpointing for memory efficiency
- **System Prompt**: Consistent template requesting scene descriptions across all samples
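For orientation, the table and strategy above map roughly onto Hugging Face `TrainingArguments` as sketched below. This is an approximation, not the verbatim training script (which is in the linked GitHub repository); the output path is hypothetical, and the 8,192-token max sequence length is enforced on the data/tokenizer side rather than here.

```python
from transformers import TrainingArguments

# Approximate mapping of the hyperparameter table and training strategy above.
training_args = TrainingArguments(
    output_dir="./llava-ov-dsd-lora",   # hypothetical output path
    learning_rate=1e-5,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # effective batch size 16
    num_train_epochs=3,
    max_grad_norm=0.5,
    bf16=True,
    gradient_checkpointing=True,
    eval_strategy="steps",              # "evaluation_strategy" on older transformers
    eval_steps=50,                      # validation every 50 steps
    save_steps=50,
    logging_steps=10,
)
```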
## Performance
### Quantitative Results
The fine-tuned model improves on all four evaluation metrics relative to the base model, with the largest relative gain in BLEU-4 (a sketch for reproducing the text metrics with the `evaluate` library follows the table):
| Metric | Base Model | Fine-tuned | Absolute Δ | Relative Δ |
|--------|------------|------------|------------|------------|
| **BLEU-4** | 0.0199 | **0.0246** | +0.0048 | **+24.09%** |
| **ROUGE-L** | 0.2089 | **0.2140** | +0.0051 | **+2.44%** |
| **BERTScore F1** | 0.2751 | **0.2789** | +0.0039 | **+1.40%** |
| **CLIPScore** | 0.3247 | **0.3260** | +0.0013 | **+0.41%** |
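The text-similarity metrics can be reproduced approximately with the Hugging Face `evaluate` library, as sketched below. This is not necessarily the exact evaluation pipeline behind the numbers above, the example captions are hypothetical placeholders, and CLIPScore (which requires image-text alignment, e.g. via `torchmetrics`) is omitted.

```python
import evaluate

# Assumes `preds` is a list of generated captions and `refs` the matching
# ground-truth descriptions from the held-out split.
preds = ["a close-up photo of a red flower in soft light"]        # hypothetical
refs = ["a macro shot of a red flower under diffused light"]      # hypothetical

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")

print("BLEU-4:", bleu.compute(predictions=preds,
                              references=[[r] for r in refs], max_order=4)["bleu"])
print("ROUGE-L:", rouge.compute(predictions=preds, references=refs)["rougeL"])
bs = bertscore.compute(predictions=preds, references=refs, lang="en")
print("BERTScore F1:", sum(bs["f1"]) / len(bs["f1"]))
```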
### Key Improvements
- **Enhanced N-gram Precision**: 24% improvement in BLEU-4 indicates significantly better word sequence accuracy
- **Better Sequential Information**: ROUGE-L improvement shows enhanced capture of longer matching sequences
- **Improved Semantic Understanding**: BERTScore gains demonstrate better contextual relationships
- **Maintained Visual-Semantic Alignment**: CLIPScore preservation with slight improvement
### Inference Performance
- **Processing Speed**: 2.30 seconds per image (NVIDIA A100 40GB)
- **Memory Requirements**: Optimized for single GPU inference
## Usage
### Installation
```bash
pip install transformers torch peft pillow
```
### Basic Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

# Load the base model and processor
base_model = AutoModelForCausalLM.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    trust_remote_code=True
)

# Load the LoRA adapter on top of the base model
model = PeftModel.from_pretrained(
    base_model,
    "Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune"
)

# Load and process an image
image = Image.open("your_image.jpg")
prompt = "Describe this image in detail, focusing on the composition, lighting, and visual elements."
# Depending on the processor version, the prompt may need an explicit <image>
# placeholder or a chat template; adjust if the processor raises an error.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate a description
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )
description = processor.decode(outputs[0], skip_special_tokens=True)
print(description)
```
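For deployment, the LoRA weights can optionally be merged into the base model so inference runs without the PEFT wrapper; a minimal sketch (the local save path is hypothetical):

```python
# Merge the LoRA weights into the base model and save a standalone checkpoint.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./llava-ov-dsd-merged")   # hypothetical path
processor.save_pretrained("./llava-ov-dsd-merged")
```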
### Advanced Usage with Custom Prompts
```python
# Photography-specific prompts that work well with this model
prompts = [
    "Analyze the photographic composition and lighting in this image.",
    "Describe the technical aspects and visual mood of this photograph.",
    "Provide a detailed scene description focusing on the subject and environment."
]

for prompt in prompts:
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    description = processor.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Description: {description}\n")
```
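To compare against the 2.30 s/image figure reported above on your own hardware, a quick latency check can reuse `model`, `processor`, `prompt`, and `image` from the earlier examples:

```python
import time

# Rough single-image latency; results depend on GPU, image resolution,
# and generation settings, so expect deviations from the reported figure.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
start = time.perf_counter()
with torch.no_grad():
    _ = model.generate(**inputs, max_new_tokens=512)
print(f"Seconds per image: {time.perf_counter() - start:.2f}")
```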
## Model Architecture
The model maintains the LLaVA-OneVision architecture with the following components:
- **Vision Encoder**: SigLIP-SO400M with hierarchical feature extraction
- **Language Model**: Qwen2-0.5B with 24 layers, 14 attention heads
- **Multimodal Projector**: 2-layer MLP with GELU activation (mlp2x_gelu)
- **Image Processing**: Supports "anyres_max_9" aspect ratio with dynamic grid pinpoints
- **Context Length**: 32,768 tokens with sliding window attention
### Technical Specifications
- **Hidden Size**: 896
- **Intermediate Size**: 4,864
- **Attention Heads**: 14 (2 key-value heads)
- **RMS Norm Epsilon**: 1e-6
- **RoPE Theta**: 1,000,000
- **Image Token Index**: 151646
- **Max Image Grid**: Up to 2304×2304 pixels with dynamic tiling
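The values above can be cross-checked against the published configuration; the sketch below assumes standard Qwen2/LLaVA attribute names, which may differ slightly in the custom lmms-lab implementation.

```python
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov", trust_remote_code=True
)
# Print the fields listed above, falling back gracefully if a name differs.
for field in ("hidden_size", "intermediate_size", "num_attention_heads",
              "num_key_value_heads", "rms_norm_eps", "rope_theta",
              "image_token_index", "max_position_embeddings"):
    print(field, getattr(cfg, field, "n/a"))
```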
## Training Data
The DataSeeds.AI Sample Dataset contains curated photography images with comprehensive annotations including:
- **Scene Descriptions**: Detailed textual descriptions of visual content
- **Technical Metadata**: Camera settings, composition details
- **Style Analysis**: Photographic techniques and artistic elements
- **Quality Annotations**: Professional photography standards
The dataset focuses on enhancing the model's ability to:
- Identify specific products and technical details accurately
- Describe lighting conditions and photographic ambiance
- Analyze compositional elements and camera perspectives
- Generate contextually aware scene descriptions
## Limitations and Considerations
### Model Limitations
- **Domain Specialization**: Optimized for photography; may have reduced performance on general vision-language tasks
- **Base Model Inheritance**: Inherits limitations from LLaVA-OneVision base model
- **Training Data Bias**: May reflect biases present in the DataSeeds.AI dataset
- **Language Support**: Primarily trained and evaluated on English descriptions
### Recommended Use Cases
- ✅ Photography scene analysis and description
- ✅ Product photography captioning
- ✅ Technical photography analysis
- ✅ Visual content generation for photography applications
- ⚠️ General-purpose vision-language tasks (may have reduced performance)
- ❌ Non-photographic image analysis (not optimized for this use case)
### Ethical Considerations
- The model may perpetuate biases present in photography datasets
- Generated descriptions should be reviewed for accuracy in critical applications
- Consider potential cultural biases in photographic style interpretation
## Citation
If you use this model in your research or applications, please cite:
```bibtex
@article{abdoli2025peerranked,
title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery},
author={Sajjad Abdoli and Freeman Lewin and Gediminas Vasiliauskas and Fabian Schonholz},
journal={arXiv preprint arXiv:2506.05673},
year={2025},
}
@misc{llava-onevision-dsd-finetune-2024,
title={LLaVA-OneVision Fine-tuned on DataSeeds.AI Dataset for Photography Scene Analysis},
author={DataSeeds.AI},
year={2024},
publisher={Hugging Face},
url={https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune},
note={LoRA fine-tuned model for enhanced photography description generation}
}
@article{li2024llavaonevision,
title={LLaVA-OneVision: Easy Visual Task Transfer},
author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
journal={arXiv preprint arXiv:2408.03326},
year={2024}
}
@article{hu2022lora,
title={LoRA: Low-Rank Adaptation of Large Language Models},
author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
journal={arXiv preprint arXiv:2106.09685},
year={2021}
}
```
## License
This model is released under the Apache 2.0 license, consistent with the base LLaVA-OneVision model licensing terms.
## Acknowledgments
- **Base Model**: Thanks to LMMS Lab for the LLaVA-OneVision model
- **Vision Encoder**: Thanks to Google Research for the SigLIP model
- **Dataset**: GuruShots photography community for the source imagery
- **Framework**: Hugging Face PEFT library for efficient fine-tuning capabilities
---
*For questions, issues, or collaboration opportunities, please visit the [model repository](https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune) or contact the DataSeeds.AI team.*