BLIP Car Damage Captioner (Base/ViT-B) v1
TL;DR: A fine-tuned BLIP base (ViT-B) model that generates structured captions for car damage imagery.
Captions typically mention color, view angle, completeness (full/partial car), and damage categories present in the scene.
- Base model: Salesforce/blip-image-captioning-base (BSD-3)
- Training data: CarDD (4k high-res images, 9k+ annotated instances, 6 damage categories); images are not redistributed here, please request access from the CarDD authors.
- Intended use: research/education; descriptive captions for car images with visible damage.
Example usage
```python
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

repo_id = "gabrielfc2102/blip-cardd-captioner"

# Load the processor (image preprocessing + tokenizer) and the fine-tuned model
processor = BlipProcessor.from_pretrained(repo_id)
model = BlipForConditionalGeneration.from_pretrained(repo_id)

# Open the image and convert to RGB, as expected by the processor
image = Image.open("your_test_image.jpg").convert("RGB")

# Preprocess, generate a caption, and decode it back to text
inputs = processor(images=image, return_tensors="pt")
out = model.generate(**inputs)
print(processor.decode(out[0], skip_special_tokens=True))
```
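By default, `generate()` uses greedy decoding with the model's default length settings. If captions come out truncated or too terse, standard `transformers` generation arguments can be passed; this is generic library usage, not a setting specific to this checkpoint:

```python
# Optional: beam search and a longer length budget (generic generate() arguments)
out = model.generate(**inputs, max_new_tokens=64, num_beams=3)
print(processor.decode(out[0], skip_special_tokens=True))
```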
Model Details
- Architecture: BLIP (Vision Transformer Base + text decoder), seq2seq image-to-text.
- Libraries: transformers, torch, timm, datasets.
- Resolution used: 384 × 384 during training (see the quick check below).
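A quick way to confirm the input resolution is to inspect the tensor the processor produces; assuming the processor config saved in this repo matches the training setup, pixel values should come out at 384 × 384:

```python
from PIL import Image
from transformers import BlipProcessor

processor = BlipProcessor.from_pretrained("gabrielfc2102/blip-cardd-captioner")
image = Image.open("your_test_image.jpg").convert("RGB")

# The processor resizes and normalizes into a (batch, channels, height, width) tensor
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)  # expected: torch.Size([1, 3, 384, 384])
```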
Intended Uses & Limitations
You can use this model to:
- Generate neutral, descriptive captions for car images with visible damage.
- Support research on automated reporting and dataset bootstrapping.
Limitations / non-goals:
- Not designed for detection or segmentation; outputs are natural-language descriptions only.
- Not suitable for safety-critical or insurance decisions without human review.
- Generalization may degrade on non-car images or distributions far from CarDD.
Data
Source: CarDD (Car Damage Dataset). The dataset requires requesting access from the authors. This repo does not host images.
Captions: Programmatically composed from COCO-style annotations + CarDD metadata (color, shooting angle, completeness), with simple pluralization rules for damage categories.
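As an illustration of what "programmatically composed" means here, a minimal sketch is below; the field names, category names, and phrasing template are assumptions for illustration, not the exact script used to build the captions:

```python
from collections import Counter

# Hypothetical CarDD-style category ids -> names (illustrative only)
CATEGORY_NAMES = {1: "dent", 2: "scratch", 3: "crack", 4: "glass shatter", 5: "lamp broken", 6: "tire flat"}

def compose_caption(annotations, metadata):
    """annotations: COCO-style dicts with a `category_id`;
    metadata: dict with assumed `color`, `angle`, and `complete` fields."""
    counts = Counter(CATEGORY_NAMES[a["category_id"]] for a in annotations)
    # Simple pluralization rule: append "s" when a category occurs more than once
    damages = [name + ("s" if n > 1 else "") for name, n in counts.items()]
    completeness = "a full view" if metadata["complete"] else "a partial view"
    return f"{completeness} of a {metadata['color']} car from the {metadata['angle']}, showing {', '.join(damages)}"

print(compose_caption(
    [{"category_id": 1}, {"category_id": 1}, {"category_id": 2}],
    {"color": "white", "angle": "front", "complete": True},
))
# -> "a full view of a white car from the front, showing dents, scratch"
```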
Training
Environment
- Platform: Kaggle
- Accelerator: 1× NVIDIA P100
Procedure
- Base: Salesforce/blip-image-captioning-base
- Epochs: 3
- Batch size: 4
- Optimizer: AdamW (lr 5e-5)
- Scheduler: Linear, no warmup
- Loss: Seq2Seq cross-entropy (labels = input_ids); see the training-loop sketch after this list
- Image preprocessing: Resize to 384 × 384, RGB
- Tokenization: BLIP tokenizer via BlipProcessor
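A minimal fine-tuning loop matching these settings might look like the sketch below. It assumes a `train_dataset` of (PIL image, caption) pairs; the dataset and collate function are placeholders, not the exact training script:

```python
import torch
from torch.utils.data import DataLoader
from transformers import BlipProcessor, BlipForConditionalGeneration, get_linear_schedule_with_warmup

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to(device)

def collate_fn(batch):
    # batch: list of (PIL image, caption string) pairs -- an assumed dataset format
    images, captions = zip(*batch)
    enc = processor(images=list(images), text=list(captions), padding=True, return_tensors="pt")
    enc["labels"] = enc["input_ids"].clone()  # labels = input_ids (seq2seq cross-entropy)
    return enc

train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True, collate_fn=collate_fn)  # train_dataset is assumed

num_epochs = 3
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=num_epochs * len(train_loader)
)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)   # forward pass returns the cross-entropy loss when labels are provided
        outputs.loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```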
Evaluation
Sanity-check on a subset of the test split (image-to-text):
- BLEU: 0.6354
- METEOR: 0.8254
- ROUGE-L: 0.8050
Additional ROUGE details (subset):
- ROUGE-1: 0.8087
- ROUGE-2: 0.7170
- ROUGE-Lsum: 0.8102
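These scores were computed on a subset of the test split and can be reproduced with the Hugging Face `evaluate` library given lists of generated and reference captions (the lists below are placeholders):

```python
import evaluate

predictions = ["a full view of a white car from the front, showing dents"]   # model outputs (placeholder)
references = [["a full view of a white car from the front, showing dents"]]  # ground-truth captions (placeholder)

bleu = evaluate.load("bleu").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)
rouge = evaluate.load("rouge").compute(predictions=predictions, references=references)

print(bleu["bleu"], meteor["meteor"])
print(rouge["rougeL"], rouge["rouge1"], rouge["rouge2"], rouge["rougeLsum"])
```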