BLIP Car Damage Captioner (Base/ViT-B) – v1

TL;DR: A fine-tuned BLIP base (ViT-B) model that generates structured captions for car damage imagery.
Captions typically mention color, view angle, completeness (full/partial car), and damage categories present in the scene.

  • Base model: Salesforce/blip-image-captioning-base (BSD-3)
  • Training data: CarDD (4k high-res images, 9k+ annotated instances, 6 damage categories). Images are not redistributed here; please request access from the CarDD authors.
  • Intended use: research/education; descriptive captions for car images with visible damage.

Example usage

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

repo_id = "gabrielfc2102/blip-cardd-captioner"

# Load the fine-tuned checkpoint together with its paired processor.
processor = BlipProcessor.from_pretrained(repo_id)
model = BlipForConditionalGeneration.from_pretrained(repo_id)

# Preprocess a single RGB image and generate a caption.
image = Image.open("your_test_image.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
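
The call above uses the model's default decoding settings. If a GPU is available, the model and inputs can be moved to it; the generation settings below (max_new_tokens, num_beams) are illustrative values, not parameters documented for this checkpoint.

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Beam search with an explicit length cap; both values are illustrative.
out = model.generate(**inputs, max_new_tokens=50, num_beams=3)
print(processor.decode(out[0], skip_special_tokens=True))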

Model Details

  • Architecture: BLIP (Vision Transformer Base + text decoder), seq2seq image-to-text.
  • Libraries: transformers, torch, timm, datasets.
  • Resolution used: 384 × 384 during training.
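
As a quick check of the 384 × 384 input size, the processor's image-processing configuration can be inspected. This assumes a transformers version recent enough that BlipProcessor exposes an image_processor attribute.

from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("gabrielfc2102/blip-cardd-captioner")
# Expected to report a 384x384 target size, matching the training resolution.
print(processor.image_processor.size)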

Intended Uses & Limitations

You can use this model to:

  • Generate neutral, descriptive captions for car images with visible damage.
  • Support research on automated reporting and dataset bootstrapping.

Limitations / non-goals:

  • Not designed for detection or segmentation; outputs are natural-language descriptions only.
  • Not suitable for safety-critical or insurance decisions without human review.
  • Generalization may degrade on non-car images or distributions far from CarDD.

Data

  • Source: CarDD (Car Damage Dataset). Access must be requested from the CarDD authors; this repo does not host images.
  • Captions: programmatically composed from COCO-style annotations plus CarDD metadata (color, shooting angle, completeness), with simple pluralization rules for damage categories.
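
The exact composition template is not published here; the sketch below only illustrates the general idea, with hypothetical metadata keys (color, angle, complete) and a toy pluralization rule.

from collections import Counter

def compose_caption(meta, damage_labels):
    """Build a caption from CarDD-style metadata plus damage category names
    taken from COCO-style annotations. Key names are illustrative only."""
    completeness = "full car" if meta["complete"] else "partial car"
    counts = Counter(damage_labels)
    damages = [
        f"{n} {label}s" if n > 1 else f"one {label}"  # naive pluralization
        for label, n in sorted(counts.items())
    ]
    caption = f"a {meta['color']} car, {meta['angle']} view, {completeness}"
    if damages:
        caption += ", with " + " and ".join(damages)
    return caption

print(compose_caption(
    {"color": "white", "angle": "front-left", "complete": True},
    ["dent", "dent", "scratch"],
))
# -> "a white car, front-left view, full car, with 2 dents and one scratch"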

Training

Environment

  • Platform: Kaggle
  • Accelerator: 1× NVIDIA P100

Procedure

  • Base: Salesforce/blip-image-captioning-base
  • Epochs: 3
  • Batch size: 4
  • Optimizer: AdamW (lr 5e-5)
  • Scheduler: Linear, no warmup
  • Loss: Seq2Seq cross-entropy (labels = input_ids)
  • Image preprocessing: Resize to 384 × 384, RGB
  • Tokenization: BLIP tokenizer via BlipProcessor
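
A minimal training-loop sketch under the settings above follows. It assumes a hypothetical train_dataset that already yields BlipProcessor outputs ("pixel_values" and "input_ids"); the actual training script is not published with this card.

import torch
from torch.utils.data import DataLoader
from transformers import (BlipForConditionalGeneration, BlipProcessor,
                          get_linear_schedule_with_warmup)

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device).train()

# `train_dataset` is assumed to yield dicts with "pixel_values" and "input_ids"
# already produced by BlipProcessor; it is not provided by this repository.
loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
num_steps = len(loader) * 3  # 3 epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_steps)

for epoch in range(3):
    for batch in loader:
        pixel_values = batch["pixel_values"].to(device)
        input_ids = batch["input_ids"].to(device)
        # BLIP returns the seq2seq cross-entropy loss when labels are supplied;
        # the caption ids double as labels, as described above.
        outputs = model(pixel_values=pixel_values, input_ids=input_ids, labels=input_ids)
        outputs.loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()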

Evaluation

Sanity check on a subset of the test split (image-to-text):

  • BLEU: 0.6354
  • METEOR: 0.8254
  • ROUGE-L: 0.8050

Additional ROUGE details (same subset):

  • ROUGE-1: 0.8087
  • ROUGE-2: 0.7170
  • ROUGE-Lsum: 0.8102
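
A minimal way to compute the same metrics with the Hugging Face evaluate library, assuming parallel lists of generated captions (predictions) and programmatic reference captions (references), looks like this; the toy strings below are placeholders, not the actual test data.

import evaluate

# In practice these come from model.generate on the test subset and the
# programmatically composed reference captions.
predictions = ["a white car, front view, full car, with one dent"]
references = ["a white car, front view, full car, with one dent and one scratch"]

bleu = evaluate.load("bleu")
meteor = evaluate.load("meteor")
rouge = evaluate.load("rouge")

print(bleu.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))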
