---
base_model: lmms-lab/llava-onevision-qwen2-0.5b-ov
datasets:
- Dataseeds/DataSeeds-Sample-Dataset-DSD
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- llava
- llava-onevision
- lora
- fine-tuned
- photography
- scene-analysis
- image-captioning
model-index:
- name: LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune
  results:
  - task:
      type: image-captioning
      name: Image Captioning
    dataset:
      name: DataSeeds.AI Sample Dataset
      type: Dataseeds/DataSeeds-Sample-Dataset-DSD
    metrics:
    - type: bleu-4
      value: 0.0246
      name: BLEU-4
    - type: rouge-l
      value: 0.214
      name: ROUGE-L
    - type: bertscore
      value: 0.2789
      name: BERTScore F1
    - type: clipscore
      value: 0.326
      name: CLIPScore
---

# LLaVA-OneVision-Qwen2-0.5b Fine-tuned on DataSeeds.AI Dataset

This model is a LoRA (Low-Rank Adaptation) fine-tune of [lmms-lab/llava-onevision-qwen2-0.5b-ov](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov), specialized for photography scene analysis and description generation. It was presented in the paper [Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery](https://huggingface.co/papers/2506.05673) and was fine-tuned on the [DataSeeds.AI Sample Dataset (DSD)](https://huggingface.co/datasets/Dataseeds/DataSeeds.AI-Sample-Dataset-DSD) to improve its ability to generate detailed, accurate descriptions of photographic content.

Usage and fine-tuning code: https://github.com/DataSeeds-ai/DSD-finetune-blip-llava


## Model Description

- **Base Model**: [LLaVA-OneVision-Qwen2-0.5b](https://huggingface.co/lmms-lab/llava-onevision-qwen2-0.5b-ov)
- **Vision Encoder**: [SigLIP-SO400M-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384)
- **Language Model**: Qwen2-0.5B
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation) with PEFT
- **Total Parameters**: ~917M (513M trainable during fine-tuning, 56% of total)
- **Multimodal Projector**: 1.84M parameters (100% trainable)
- **Precision**: BFloat16
- **Task**: Photography scene analysis and detailed image description

### LoRA Configuration

- **LoRA Rank (r)**: 32
- **LoRA Alpha**: 32
- **LoRA Dropout**: 0.1
- **Target Modules**: `v_proj`, `k_proj`, `q_proj`, `up_proj`, `gate_proj`, `down_proj`, `o_proj`
- **Tunable Components**: `mm_mlp_adapter`, `mm_language_model`
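
The settings above map onto a PEFT `LoraConfig` roughly as follows. This is an illustrative sketch rather than the exact configuration used in the training script (see the linked GitHub repository for the actual code):

```python
from peft import LoraConfig

# Sketch of a LoRA configuration matching the hyperparameters listed above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    task_type="CAUSAL_LM",
)
```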

## Training Details

### Dataset
The model was fine-tuned on the DataSeeds.AI Sample Dataset, a curated collection of photography images with detailed scene descriptions focusing on:
- Compositional elements and camera perspectives
- Lighting conditions and visual ambiance
- Product identification and technical details
- Photographic style and mood analysis

### Training Configuration

| Parameter | Value |
|-----------|-------|
| **Learning Rate** | 1e-5 |
| **Optimizer** | AdamW |
| **Learning Rate Schedule** | Cosine decay |
| **Warmup Ratio** | 0.03 |
| **Weight Decay** | 0.01 |
| **Batch Size** | 2 |
| **Gradient Accumulation Steps** | 8 (effective batch size: 16) |
| **Training Epochs** | 3 |
| **Max Sequence Length** | 8192 |
| **Max Gradient Norm** | 0.5 |
| **Precision** | BFloat16 |
| **Hardware** | Single NVIDIA A100 40GB |
| **Training Time** | 30 hours |
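
As a rough guide, these hyperparameters map onto Hugging Face `TrainingArguments` as sketched below. The original run used the LLaVA-OneVision training scripts, so the argument names here are illustrative rather than the exact ones used:

```python
from transformers import TrainingArguments

# Illustrative mapping of the table above onto TrainingArguments;
# output_dir is a hypothetical path.
training_args = TrainingArguments(
    output_dir="./llava-ov-dsd-lora",
    learning_rate=1e-5,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    warmup_ratio=0.03,
    weight_decay=0.01,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,   # effective batch size 16
    num_train_epochs=3,
    max_grad_norm=0.5,
    bf16=True,
    gradient_checkpointing=True,
)
```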

### Training Strategy
- **Validation Frequency**: Every 50 steps for precise checkpoint selection
- **Best Checkpoint**: Step 1,750 (epoch 2.9) with validation loss of 1.83
- **Mixed Precision**: BFloat16 with gradient checkpointing for memory efficiency
- **System Prompt**: Consistent template requesting scene descriptions across all samples

## Performance

### Quantitative Results

The fine-tuned model improves on the base model across all evaluation metrics, with the largest relative gain in BLEU-4:

| Metric | Base Model | Fine-tuned | Absolute Δ | Relative Δ |
|--------|------------|------------|------------|------------|
| **BLEU-4** | 0.0199 | **0.0246** | +0.0048 | **+24.09%** |
| **ROUGE-L** | 0.2089 | **0.2140** | +0.0051 | **+2.44%** |
| **BERTScore F1** | 0.2751 | **0.2789** | +0.0039 | **+1.40%** |
| **CLIPScore** | 0.3247 | **0.3260** | +0.0013 | **+0.41%** |
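
The reported scores can be reproduced in outline with the Hugging Face `evaluate` library. The sketch below uses hypothetical example data; the exact preprocessing behind the numbers above may differ, and CLIPScore, which compares images against text, is computed separately (e.g., via `torchmetrics`) and is omitted here:

```python
import evaluate

# Hypothetical example data: one generated caption and one reference description.
predictions = ["A close-up of a ceramic mug on a wooden table under warm light."]
references = ["A warm-toned close-up photograph of a ceramic mug on a wooden table."]

bleu = evaluate.load("bleu")            # uses up to 4-grams by default
rouge = evaluate.load("rouge")
bertscore = evaluate.load("bertscore")  # also requires the bert-score package

print(bleu.compute(predictions=predictions, references=references)["bleu"])
print(rouge.compute(predictions=predictions, references=references)["rougeL"])
print(bertscore.compute(predictions=predictions, references=references, lang="en")["f1"])
```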

### Key Improvements
- **Enhanced N-gram Precision**: 24% improvement in BLEU-4 indicates significantly better word sequence accuracy
- **Better Sequential Information**: ROUGE-L improvement shows enhanced capture of longer matching sequences
- **Improved Semantic Understanding**: BERTScore gains demonstrate better contextual relationships
- **Maintained Visual-Semantic Alignment**: CLIPScore preservation with slight improvement

### Inference Performance
- **Processing Speed**: 2.30 seconds per image (NVIDIA A100 40GB)
- **Memory Requirements**: Optimized for single GPU inference

## Usage

### Installation

```bash
pip install transformers torch peft pillow
```

### Basic Usage

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoProcessor
import torch
from PIL import Image

# Load base model and processor
base_model = AutoModelForCausalLM.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

processor = AutoProcessor.from_pretrained("lmms-lab/llava-onevision-qwen2-0.5b-ov")

# Load LoRA adapter
model = PeftModel.from_pretrained(
    base_model,
    "Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune"
)

# Load and process image
image = Image.open("your_image.jpg")
prompt = "Describe this image in detail, focusing on the composition, lighting, and visual elements."

# Pass text and image as keyword arguments; positional order varies between processor versions
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

# Generate description
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=True,
        temperature=0.7,
        top_p=0.9
    )

description = processor.decode(outputs[0], skip_special_tokens=True)
print(description)
```
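
For repeated inference, the LoRA weights can optionally be folded into the base model using PEFT's standard `merge_and_unload()` method, so generation runs without the adapter wrapper. This step is optional; the wrapped model above works as-is:

```python
# Optionally merge the LoRA weights into the base model for slightly faster inference.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("llava-ov-dsd-merged")  # hypothetical local path
```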

### Advanced Usage with Custom Prompts

```python
# Photography-specific prompts that work well with this model
prompts = [
    "Analyze the photographic composition and lighting in this image.",
    "Describe the technical aspects and visual mood of this photograph.",
    "Provide a detailed scene description focusing on the subject and environment."
]

for prompt in prompts:
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, temperature=0.7)
    description = processor.decode(outputs[0], skip_special_tokens=True)
    print(f"Prompt: {prompt}")
    print(f"Description: {description}
")
```

## Model Architecture

The model maintains the LLaVA-OneVision architecture with the following components:

- **Vision Encoder**: SigLIP-SO400M with hierarchical feature extraction
- **Language Model**: Qwen2-0.5B with 24 layers, 14 attention heads
- **Multimodal Projector**: 2-layer MLP with GELU activation (mlp2x_gelu)
- **Image Processing**: Supports "anyres_max_9" aspect ratio with dynamic grid pinpoints
- **Context Length**: 32,768 tokens with sliding window attention

### Technical Specifications

- **Hidden Size**: 896
- **Intermediate Size**: 4,864
- **Attention Heads**: 14 (2 key-value heads)
- **RMS Norm Epsilon**: 1e-6
- **RoPE Theta**: 1,000,000
- **Image Token Index**: 151646
- **Max Image Grid**: Up to 2304×2304 pixels with dynamic tiling
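
These values can be checked against the base model's configuration. The sketch below assumes the config is loadable through `AutoConfig`; depending on the transformers version and repository format, some fields may sit on a nested text config rather than at the top level:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained(
    "lmms-lab/llava-onevision-qwen2-0.5b-ov", trust_remote_code=True
)

# Print the fields referenced above, falling back gracefully if a field
# is nested or named differently in this repository's config.
for field in ("hidden_size", "intermediate_size", "num_attention_heads",
              "num_key_value_heads", "rms_norm_eps", "rope_theta",
              "image_token_index"):
    print(field, getattr(config, field, "not present at top level"))
```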

## Training Data

The DataSeeds.AI Sample Dataset contains curated photography images with comprehensive annotations including:

- **Scene Descriptions**: Detailed textual descriptions of visual content
- **Technical Metadata**: Camera settings, composition details
- **Style Analysis**: Photographic techniques and artistic elements
- **Quality Annotations**: Professional photography standards

The dataset focuses on enhancing the model's ability to:
- Identify specific products and technical details accurately
- Describe lighting conditions and photographic ambiance
- Analyze compositional elements and camera perspectives
- Generate contextually aware scene descriptions

## Limitations and Considerations

### Model Limitations
- **Domain Specialization**: Optimized for photography; may have reduced performance on general vision-language tasks
- **Base Model Inheritance**: Inherits limitations from LLaVA-OneVision base model
- **Training Data Bias**: May reflect biases present in the DataSeeds.AI dataset
- **Language Support**: Primarily trained and evaluated on English descriptions

### Recommended Use Cases
- ✅ Photography scene analysis and description
- ✅ Product photography captioning
- ✅ Technical photography analysis
- ✅ Visual content generation for photography applications
- ⚠️ General-purpose vision-language tasks (may have reduced performance)
- ❌ Non-photographic image analysis (not optimized for this use case)

### Ethical Considerations
- The model may perpetuate biases present in photography datasets
- Generated descriptions should be reviewed for accuracy in critical applications
- Consider potential cultural biases in photographic style interpretation

## Citation

If you use this model in your research or applications, please cite:

```bibtex
@article{abdoli2025peerranked,
    title={Peer-Ranked Precision: Creating a Foundational Dataset for Fine-Tuning Vision Models from DataSeeds' Annotated Imagery}, 
    author={Sajjad Abdoli and Freeman Lewin and Gediminas Vasiliauskas and Fabian Schonholz},
    journal={arXiv preprint arXiv:2506.05673},
    year={2025},
}

@misc{llava-onevision-dsd-finetune-2024,
  title={LLaVA-OneVision Fine-tuned on DataSeeds.AI Dataset for Photography Scene Analysis},
  author={DataSeeds.AI},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune},
  note={LoRA fine-tuned model for enhanced photography description generation}
}

@article{li2024llavaonevision,
  title={LLaVA-OneVision: Easy Visual Task Transfer},
  author={Li, Bo and Zhang, Yuanhan and Guo, Dong and Zhang, Renrui and Li, Feng and Zhang, Hao and Zhang, Kaichen and Li, Yanwei and Liu, Ziwei and Li, Chunyuan},
  journal={arXiv preprint arXiv:2408.03326},
  year={2024}
}

@article{hu2022lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu},
  journal={arXiv preprint arXiv:2106.09685},
  year={2021}
}
```

## License

This model is released under the Apache 2.0 license, consistent with the base LLaVA-OneVision model licensing terms.

## Acknowledgments

- **Base Model**: Thanks to LMMS Lab for the LLaVA-OneVision model
- **Vision Encoder**: Thanks to Google Research for the SigLIP model
- **Dataset**: GuruShots photography community for the source imagery
- **Framework**: Hugging Face PEFT library for efficient fine-tuning capabilities

---

*For questions, issues, or collaboration opportunities, please visit the [model repository](https://huggingface.co/Dataseeds/LLaVA-OneVision-Qwen2-0.5b-ov-DSD-FineTune) or contact the DataSeeds.AI team.*