BLIP Car Damage Captioner (Base/ViT-B) – v1

TL;DR: A fine-tuned BLIP base (ViT-B) model that generates structured captions for car damage imagery.
Captions typically mention color, view angle, completeness (full/partial car), and damage categories present in the scene.

  • Base model: Salesforce/blip-image-captioning-base (BSD-3)
  • Training data: CarDD (4k high-res images, 9k+ annotated instances, 6 damage categories). Images are not redistributed here; please request access from the CarDD authors.
  • Intended use: research/education; descriptive captions for car images with visible damage.

Example usage

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

repo_id = "gabrielfc2102/blip-cardd-captioner"

# Load the fine-tuned checkpoint together with its paired processor.
processor = BlipProcessor.from_pretrained(repo_id)
model = BlipForConditionalGeneration.from_pretrained(repo_id)

# Preprocess a single RGB image and generate a caption.
image = Image.open("your_test_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
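
The call above uses the model's default decoding settings. If a GPU is available, the model and inputs can be moved to it; the generation settings below (max_new_tokens, num_beams) are illustrative values, not parameters documented for this checkpoint.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Beam search with an explicit length cap; both values are illustrative.
out = model.generate(**inputs, max_new_tokens=50, num_beams=3)
print(processor.decode(out[0], skip_special_tokens=True))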

Model Details

  • Architecture: BLIP (Vision Transformer Base + text decoder), seq2seq image-to-text.
  • Libraries: transformers, torch, timm, datasets.
  • Resolution used: 384 × 384 during training.
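
As a quick check of the 384 × 384 input size, the processor's image-processing configuration can be inspected. This assumes a transformers version recent enough that BlipProcessor exposes an image_processor attribute.

from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("gabrielfc2102/blip-cardd-captioner")
# Expected to report a 384x384 target size, matching the training resolution.
print(processor.image_processor.size)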

Intended Uses & Limitations

You can use this model to:

  • Generate neutral, descriptive captions for car images with visible damage.
  • Support research on automated reporting and dataset bootstrapping.

Limitations / non-goals:

  • Not designed for detection or segmentation; outputs are natural-language descriptions only.
  • Not suitable for safety-critical or insurance decisions without human review.
  • Generalization may degrade on non-car images or distributions far from CarDD.

Data

  • Source: CarDD (Car Damage Dataset). Access must be requested from the CarDD authors; this repo does not host images.
  • Captions: programmatically composed from COCO-style annotations plus CarDD metadata (color, shooting angle, completeness), with simple pluralization rules for damage categories.
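
The exact composition template is not published here; the sketch below only illustrates the general idea, with hypothetical metadata keys (color, angle, complete) and a toy pluralization rule.

from collections import Counter

def compose_caption(meta, damage_labels):
    """Build a caption from CarDD-style metadata plus damage category names
    taken from COCO-style annotations. Key names are illustrative only."""
    completeness = "full car" if meta["complete"] else "partial car"
    counts = Counter(damage_labels)
    damages = [
        f"{n} {label}s" if n > 1 else f"one {label}"  # naive pluralization
        for label, n in sorted(counts.items())
    ]
    caption = f"a {meta['color']} car, {meta['angle']} view, {completeness}"
    if damages:
        caption += ", with " + " and ".join(damages)
    return caption

print(compose_caption(
    {"color": "white", "angle": "front-left", "complete": True},
    ["dent", "dent", "scratch"],
))
# -> "a white car, front-left view, full car, with 2 dents and one scratch"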

Training

Environment

  • Platform: Kaggle
  • Accelerator: 1× NVIDIA P100

Procedure

  • Base: Salesforce/blip-image-captioning-base
  • Epochs: 3
  • Batch size: 4
  • Optimizer: AdamW (lr 5e-5)
  • Scheduler: Linear, no warmup
  • Loss: Seq2Seq cross-entropy (labels = input_ids)
  • Image preprocessing: Resize to 384 × 384, RGB
  • Tokenization: BLIP tokenizer via BlipProcessor
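
A minimal training-loop sketch under the settings above follows. It assumes a hypothetical train_dataset that already yields BlipProcessor outputs ("pixel_values" and "input_ids"); the actual training script is not published with this card.

import torch
from torch.utils.data import DataLoader
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          get_linear_schedule_with_warmup)

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()

# `train_dataset` is assumed to yield dicts with "pixel_values" and "input_ids"
# already produced by BlipProcessor; it is not provided by this repository.
loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
num_steps = len(loader) * 3  # 3 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_steps)

for epoch in range(3):
    for batch in loader:
        pixel_values = batch["pixel_values"].to(device)
        input_ids = batch["input_ids"].to(device)
        # BLIP returns the seq2seq cross-entropy loss when labels are supplied;
        # the caption ids double as labels, as described above.
        outputs = model(pixel_values=pixel_values, input_ids=input_ids, labels=input_ids)
        outputs.loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()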

Evaluation

Sanity check on a subset of the test split (image-to-text):

  • BLEU: 0.6354
  • METEOR: 0.8254
  • ROUGE-L: 0.8050

Additional ROUGE details (same subset):

  • ROUGE-1: 0.8087
  • ROUGE-2: 0.7170
  • ROUGE-Lsum: 0.8102
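
A minimal way to compute the same metrics with the Hugging Face evaluate library, assuming parallel lists of generated captions (predictions) and programmatic reference captions (references), looks like this; the toy strings below are placeholders, not the actual test data.

import evaluate

# In practice these come from model.generate on the test subset and the
# programmatically composed reference captions.
predictions = ["a white car, front view, full car, with one dent"]
references = ["a white car, front view, full car, with one dent and one scratch"]

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))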
