Image-to-Text
English

ClipCap

This is an implementation of the ClipCap model — a captioning system that connects CLIP vision features to a GPT-2 language model via a learnable prefix.

The provided checkpoint (coco_prefix_best_200k.pt) was trained on 203,914 samples from the Conceptual Captions dataset using prefix tuning.

Model Architecture

  • Vision Encoder: CLIP
  • Language Model: GPT-2 (via Hugging Face Transformers)
  • Connector: Multi-Layer Perceptron (MLP) that maps the CLIP image embedding to GPT-2 prefix tokens (see the sketch after this list)
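
The sketch below illustrates how such an MLP connector can be wired between CLIP and GPT-2. It is a minimal illustration, not the exact architecture shipped in main.py: the prefix length of 10, the 512-dimensional CLIP embedding (as produced by ViT-B/32), and the two-layer MLP are assumptions and must be adjusted to match the checkpoint.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class ClipCapModel(nn.Module):
    """Minimal ClipCap-style model: an MLP maps one CLIP image embedding
    to a sequence of prefix embeddings that condition GPT-2."""

    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")
        self.gpt_dim = self.gpt.transformer.wte.weight.shape[1]  # 768 for base GPT-2
        hidden = (clip_dim + self.gpt_dim * prefix_length) // 2
        # MLP connector: CLIP embedding -> prefix_length * gpt_dim, reshaped into prefix tokens.
        self.clip_project = nn.Sequential(
            nn.Linear(clip_dim, hidden),
            nn.Tanh(),
            nn.Linear(hidden, self.gpt_dim * prefix_length),
        )

    def forward(self, clip_embed: torch.Tensor, token_ids: torch.Tensor):
        # Project the CLIP embedding into a learned prefix, prepend it to the
        # caption token embeddings, and run GPT-2 on the combined sequence.
        prefix = self.clip_project(clip_embed).view(-1, self.prefix_length, self.gpt_dim)
        token_embeds = self.gpt.transformer.wte(token_ids)
        return self.gpt(inputs_embeds=torch.cat((prefix, token_embeds), dim=1))
```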

Usage

To use this model, define the ClipCapModel architecture as described in main.py and load the checkpoint into your model instance. You will also need to compute the CLIP embedding of the input image and pass it to the model; a sketch of this workflow is shown below.
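
The following is a minimal sketch of that workflow under several assumptions: it reuses the ClipCapModel sketch from the Model Architecture section (the real definition lives in main.py), loads CLIP ViT-B/32 through the openai/clip package to obtain the image embedding, decodes greedily, and uses a hypothetical image path example.jpg. The prefix length and other hyperparameters must match those used when the checkpoint was trained, or loading the state dict will fail.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git
from PIL import Image
from transformers import GPT2Tokenizer
# from main import ClipCapModel  # assumption: the class from main.py (or the sketch above)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the provided checkpoint into a ClipCapModel instance.
model = ClipCapModel(prefix_length=10)
model.load_state_dict(torch.load("coco_prefix_best_200k.pt", map_location=device))
model.to(device).eval()

# Compute the CLIP image embedding that the MLP connector expects as input.
clip_model, preprocess = clip.load("ViT-B/32", device=device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # hypothetical path
with torch.no_grad():
    clip_embed = clip_model.encode_image(image).float()

# Greedy decoding: feed the prefix plus the tokens generated so far,
# pick the most likely next token, stop at end-of-text or after 30 tokens.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
token_ids = torch.tensor([[tokenizer.bos_token_id]], device=device)
with torch.no_grad():
    for _ in range(30):
        logits = model(clip_embed, token_ids).logits
        next_id = logits[0, -1].argmax().view(1, 1)
        token_ids = torch.cat((token_ids, next_id), dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break

print(tokenizer.decode(token_ids[0], skip_special_tokens=True))
```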

Refer to the original ClipCap repository for preprocessing and full inference pipeline details.

Reference

Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv preprint arXiv:2111.09734.
