+---
+license: apache-2.0
+language:
+- en
+- zh
+library_name: transformers
+base_model:
+- Qwen/Qwen2.5-VL-7B-Instruct
+pipeline_tag: image-text-to-text
+tags:
+- trl
+- VisionLanguageAttribution
+- VisualUnderstanding
+- text-generation-inference
+- AttributeCaptioning
+- VLA
+datasets:
+- prithivMLmods/blip3o-caption-mini-arrow
+- prithivMLmods/Caption3o-Opt-v3
+- prithivMLmods/Caption3o-Opt-v2
+- >-
+  Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647
+---
+![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/y7C3BvR9PCOwy6I478tkY.png)
+# **DeepCaption-VLA-7B**
+> The **DeepCaption-VLA-7B** model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, tailored for **Image Captioning** and **Vision Language Attribution**. This variant is designed to generate precise, highly descriptive captions with a focus on **defining visual properties, object attributes, and scene details** across a wide spectrum of images and aspect ratios.
+# Key Highlights
+1. **Vision Language Attribution (VLA):** Specially fine-tuned to attribute and define visual properties of objects, scenes, and environments.
+2. **Detailed Object Definitions:** Generates captions with rich attribute descriptions, aiming for more precise output than generic captioners produce.
+3. **High-Fidelity Descriptions:** Handles general, artistic, technical, abstract, and low-context images with descriptive depth.
+4. **Robust Across Aspect Ratios:** Designed to caption images accurately regardless of format: wide, tall, square, or irregular.
+5. **Variable Detail Control:** Supports both concise summaries and fine-grained attributions, depending on how the prompt is structured.
+6. **Built on the Qwen2.5-VL Architecture:** Leverages Qwen2.5-VL-7B’s multimodal reasoning for visual comprehension and instruction following.
+7. **Multilingual Capability:** Defaults to English, but can be adapted to multilingual captioning through prompt engineering.
+> model type: experimental
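Items 5 and 7 both come down to how the prompt is worded. As a minimal sketch, the helper below assembles a prompt for a chosen detail level and an optional output language. `PROMPTS` and `make_prompt` are hypothetical names, and the wording of the concise prompt and of the language suffix is an assumption; only the detailed prompt is taken from the Quick Start example in this card.

```python
# Illustrative prompts only. The "detailed" text comes from the Quick Start
# example; the "concise" text and the language suffix are assumptions.
PROMPTS = {
    "concise": "Write a one-sentence caption for this image.",
    "detailed": "Describe this image with detailed attributes and properties.",
}


def make_prompt(detail="detailed", language=None):
    """Return a user prompt for the requested detail level and language."""
    prompt = PROMPTS[detail]
    if language:
        # Ask for a specific output language through the prompt itself.
        prompt += f" Respond in {language}."
    return prompt


concise_prompt = make_prompt("concise")
chinese_prompt = make_prompt("detailed", language="Chinese")
```

How well the model follows such instructions has not been verified here; treat the prompts as starting points to adjust.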
+# Training Details
+This model was fine-tuned with a curated mix of datasets focused on **caption richness and object-attribute alignment**:
+* [prithivMLmods/blip3o-caption-mini-arrow](https://huggingface.co/datasets/prithivMLmods/blip3o-caption-mini-arrow)
+* [prithivMLmods/Caption3o-Opt-v3](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v3)
+* [prithivMLmods/Caption3o-Opt-v2](https://huggingface.co/datasets/prithivMLmods/Caption3o-Opt-v2)
+* [Multimodal-Fatima/Caltech101\_not\_background\_test\_facebook\_opt\_2.7b\_Attributes\_Caption\_ns\_5647](https://huggingface.co/datasets/Multimodal-Fatima/Caltech101_not_background_test_facebook_opt_2.7b_Attributes_Caption_ns_5647)
+The training objective emphasized **Vision Language Attribution**: defining image properties, attributes, and objects with clarity, while preserving descriptive fluency.
+---
+## SYSTEM_PROMPT
+```py
+CAPTION_SYSTEM_PROMPT = """
+You are an AI assistant that rigorously follows this response protocol:
+1. For every input image, your primary task is to write a **precise caption**. The caption must capture the **essence of the image** in clear, concise, and contextually accurate language.
+2. Along with the caption, provide a structured set of **attributes** that describe the visual elements. Attributes should include details such as objects, people, actions, colors, environment, mood, and other notable characteristics.
+3. Always include a **class_name** field. This must represent the **core theme or main subject** of the image in a compact format.
+   - Use the syntax: `{class_name==write_the_core_theme}`
+   - Example: `{class_name==dog_playing}` or `{class_name==city_sunset}`
+4. Maintain the following strict format in your output:
+   - **Caption:** <one-sentence description>
+   - **Attributes:** <comma-separated list of visual attributes>
+   - **{class_name==core_theme}**
+5. Ensure captions are **precise, neutral, and descriptive**, avoiding unnecessary elaboration or subjective interpretation unless explicitly required.
+6. Do not reference the rules or instructions in the output. Only return the formatted caption, attributes, and class_name.
+""".strip()
+```
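The format this prompt requests is regular enough to parse with a few regular expressions. The sketch below is only an illustration: `parse_caption_output` is a hypothetical helper, not part of the model or of any library, and it assumes the model follows the format exactly as written (`Caption:` and `Attributes:` labels, optionally bold, plus a `{class_name==...}` tag). Real outputs may deviate, so each field falls back to `None` or an empty list.

```python
import re

# Patterns mirror the output format described in the system prompt.
# The bold markers (**) are treated as optional.
CAPTION_RE = re.compile(r"\*{0,2}Caption:\*{0,2}[ \t]*(.+)")
ATTRIBUTES_RE = re.compile(r"\*{0,2}Attributes:\*{0,2}[ \t]*(.+)")
CLASS_NAME_RE = re.compile(r"\{class_name==([^{}]+)\}")


def parse_caption_output(text):
    """Split a formatted response into caption, attributes, and class_name."""
    caption = CAPTION_RE.search(text)
    attributes = ATTRIBUTES_RE.search(text)
    class_name = CLASS_NAME_RE.search(text)
    return {
        "caption": caption.group(1).strip() if caption else None,
        "attributes": (
            [a.strip() for a in attributes.group(1).split(",") if a.strip()]
            if attributes
            else []
        ),
        "class_name": class_name.group(1).strip() if class_name else None,
    }


# Hand-written sample in the requested format (not real model output).
sample = """- **Caption:** A brown dog runs across a grassy field.
- **Attributes:** dog, brown fur, grass, running, daylight
- **{class_name==dog_running}**"""

parsed = parse_caption_output(sample)
```

Returning empty values instead of raising keeps a batch job running when a single response does not match the format.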
+---
+# Quick Start with Transformers
+```python
+from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+from qwen_vl_utils import process_vision_info
+# Load the model; device_map="auto" places the weights on the available device(s).
+model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+    "prithivMLmods/DeepCaption-VLA-7B", torch_dtype="auto", device_map="auto"
+)
+processor = AutoProcessor.from_pretrained("prithivMLmods/DeepCaption-VLA-7B")
+# A single user turn containing one image and one text instruction.
+messages = [
+    {
+        "role": "user",
+        "content": [
+            {
+                "type": "image",
+                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
+            },
+            {"type": "text", "text": "Describe this image with detailed attributes and properties."},
+        ],
+    }
+]
+# Render the chat template to a prompt string and collect the vision inputs.
+text = processor.apply_chat_template(
+    messages, tokenize=False, add_generation_prompt=True
+)
+image_inputs, video_inputs = process_vision_info(messages)
+inputs = processor(
+    text=[text],
+    images=image_inputs,
+    videos=video_inputs,
+    padding=True,
+    return_tensors="pt",
+)
+# Move the inputs to the model's device rather than hard-coding "cuda".
+inputs = inputs.to(model.device)
+generated_ids = model.generate(**inputs, max_new_tokens=128)
+# Drop the prompt tokens so that only the newly generated tokens are decoded.
+generated_ids_trimmed = [
+    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+]
+output_text = processor.batch_decode(
+    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+)
+print(output_text)
+```
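The example above sends only a user turn. To apply the system prompt from the SYSTEM_PROMPT section, one option is to prepend a system message to the conversation. The sketch below shows that message layout. `build_messages` is a hypothetical helper, and the short `system_prompt` string is a stand-in for the full `CAPTION_SYSTEM_PROMPT`.

```python
def build_messages(image, user_text, system_prompt=None):
    """Build a chat message list with an optional system turn."""
    messages = []
    if system_prompt:
        # The system turn uses the same list-of-parts layout as the user turn.
        messages.append(
            {"role": "system", "content": [{"type": "text", "text": system_prompt}]}
        )
    messages.append(
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": user_text},
            ],
        }
    )
    return messages


messages = build_messages(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    "Describe this image with detailed attributes and properties.",
    # Stand-in text; substitute the full CAPTION_SYSTEM_PROMPT in practice.
    system_prompt="You are an AI assistant that rigorously follows this response protocol: ...",
)
```

The resulting list can take the place of `messages` in the Quick Start code. The structured caption, attributes, and class name may need more room than `max_new_tokens=128` allows; that is an assumption to check against real outputs.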
+# Intended Use
+* Generating attribute-rich image captions for research, dataset creation, and AI training.
+* Vision-language attribution for object detection, scene understanding, and dataset annotation.
+* Supporting creative, artistic, and technical applications requiring detailed descriptions.
+* Captioning across varied aspect ratios, unusual visual styles, and non-standard datasets.
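For the dataset-creation and annotation use cases, captions are often stored as one JSON object per line. The sketch below shows that pattern under stated assumptions: `write_caption_records` and `fake_caption` are hypothetical names, the record fields are illustrative, and `fake_caption` is a stub standing in for a real call to the model.

```python
import io
import json


def write_caption_records(image_paths, caption_fn, out_file):
    """Write one JSON object per image to out_file (JSON Lines format)."""
    count = 0
    for path in image_paths:
        record = {"image": path, "caption": caption_fn(path)}
        # ensure_ascii=False keeps non-English captions readable in the file.
        out_file.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
    return count


def fake_caption(path):
    """Stub captioner; replace with a function that runs the model."""
    return f"placeholder caption for {path}"


buffer = io.StringIO()
written = write_caption_records(["a.jpg", "b.jpg"], fake_caption, buffer)
```

Passing the captioner in as a function keeps the file-writing logic independent of how the model is loaded or batched.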
+# Limitations
+* May over-attribute or infer properties not explicitly visible in ambiguous images.
+* Outputs can vary in tone depending on prompt phrasing.
+* Applies no content filtering, so captions may include explicit or sensitive content; not intended for tasks that require filtered output.
+* Accuracy may degrade on synthetic or highly abstract visual domains.