ClipCap
This is an implementation of the ClipCap model — a captioning system that connects CLIP vision features to a GPT-2 language model via a learnable prefix.
The provided checkpoint (coco_prefix_best_200k.pt) was trained on 203,914 samples from the Conceptual Captions dataset using prefix tuning.
Model Architecture
- Vision Encoder: CLIP
- Language Model: GPT-2 (via Hugging Face Transformers)
- Connector: a multi-layer perceptron (MLP) that maps the CLIP embedding to a sequence of GPT-2 prefix embeddings (see the sketch below)
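The sketch below shows one way these pieces fit together. It assumes a CLIP embedding size of 512 (ViT-B/32), a prefix length of 10, and GPT-2 small; the class and parameter names are illustrative assumptions, so treat main.py as the authoritative definition.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel


class MLP(nn.Module):
    """Two-layer MLP that maps a CLIP embedding to a sequence of prefix embeddings."""

    def __init__(self, clip_dim: int, prefix_length: int, gpt_dim: int):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt_dim = gpt_dim
        self.model = nn.Sequential(
            nn.Linear(clip_dim, (prefix_length * gpt_dim) // 2),
            nn.Tanh(),
            nn.Linear((prefix_length * gpt_dim) // 2, prefix_length * gpt_dim),
        )

    def forward(self, clip_embed: torch.Tensor) -> torch.Tensor:
        # (batch, clip_dim) -> (batch, prefix_length, gpt_dim)
        return self.model(clip_embed).view(-1, self.prefix_length, self.gpt_dim)


class ClipCapModel(nn.Module):
    """GPT-2 conditioned on a learned prefix projected from CLIP image features."""

    def __init__(self, prefix_length: int = 10, clip_dim: int = 512):
        super().__init__()
        self.prefix_length = prefix_length
        self.gpt = GPT2LMHeadModel.from_pretrained("gpt2")
        gpt_dim = self.gpt.transformer.wte.weight.shape[1]  # 768 for GPT-2 small
        self.clip_project = MLP(clip_dim, prefix_length, gpt_dim)

    def forward(self, tokens: torch.Tensor, clip_embed: torch.Tensor) -> torch.Tensor:
        token_embeds = self.gpt.transformer.wte(tokens)
        prefix_embeds = self.clip_project(clip_embed)
        # Prepend the learned prefix to the caption token embeddings and run GPT-2.
        inputs = torch.cat((prefix_embeds, token_embeds), dim=1)
        return self.gpt(inputs_embeds=inputs).logits
```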
Usage
To use this model, define the ClipCapModel architecture as described in main.py and load the checkpoint into a model instance. You will also need to compute the CLIP embedding of the input image and pass it to the model.
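The following is a hedged usage sketch under the same assumptions as the architecture sketch above (ViT-B/32 features, prefix length 10, a ClipCapModel class matching main.py). The checkpoint's state-dict key layout is not guaranteed to match this sketch, so adjust the loading step to the actual class definition.

```python
import clip  # pip install git+https://github.com/openai/CLIP.git
import torch
from PIL import Image
from transformers import GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load CLIP for image features and the GPT-2 tokenizer for decoding.
clip_model, preprocess = clip.load("ViT-B/32", device=device)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Instantiate the captioning model and load the provided weights.
model = ClipCapModel(prefix_length=10).to(device)
state_dict = torch.load("coco_prefix_best_200k.pt", map_location=device)
model.load_state_dict(state_dict, strict=False)  # key names may differ; see main.py
model.eval()

# Encode an image with CLIP.
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    clip_embed = clip_model.encode_image(image).float()

# Greedy decoding: start from the projected prefix and append one token at a time.
generated = model.clip_project(clip_embed)
tokens = []
with torch.no_grad():
    for _ in range(30):
        logits = model.gpt(inputs_embeds=generated).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        if next_token.item() == tokenizer.eos_token_id:
            break
        tokens.append(next_token.item())
        next_embed = model.gpt.transformer.wte(next_token)
        generated = torch.cat((generated, next_embed), dim=1)

print(tokenizer.decode(tokens))
```

The original repository uses beam search for generation, which typically produces better captions than the greedy loop shown here.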
Refer to the original ClipCap repository for preprocessing and full inference pipeline details.
Reference
Mokady, R., Hertz, A., & Bermano, A. H. (2021). ClipCap: CLIP Prefix for Image Captioning. arXiv preprint arXiv:2111.09734.