---
language:
- en
library_name: transformers
license: apache-2.0
pipeline_tag: zero-shot-image-classification
tags:
- clip
---

# FG-CLIP: Fine-Grained Visual and Textual Alignment

**[FG-CLIP: Fine-Grained Visual and Textual Alignment](https://arxiv.org/abs/2505.05071)**
Chunyu Xie*, Bin Wang*, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng†, Yuhui Yin (*Equal Contribution, †Corresponding Author)
[![arXiv](https://img.shields.io/badge/arXiv-2505.05071-b31b1b.svg)](https://arxiv.org/abs/2505.05071) [![ICML](https://img.shields.io/badge/ICML-2025-blue.svg)](https://icml.cc/Conferences/2025) [![GitHub](https://img.shields.io/badge/GitHub-Repository-blue?logo=github)](https://github.com/360CVGroup/FG-CLIP)

## Model Framework

FG-CLIP is trained in two stages: the first stage leverages global-level caption-image pairs to achieve initial fine-grained alignment, while the second stage supplements these with region-level data, including detailed region captions and positive/negative region descriptions, to further refine the alignment.
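To make the two levels of supervision concrete, here is a purely illustrative sketch of what a global-level and a region-level sample could look like. The field names, box format, and values are hypothetical and do not reflect the actual FG-CLIP training data schema.

```python
# Hypothetical sample structures, for illustration only (not the real training format).

# Stage 1: global-level image-caption pair.
stage1_sample = {
    "image": "example.jpg",
    "caption": "A long, detailed description of the whole image.",
}

# Stage 2: adds region-level supervision -- detailed region captions plus
# positive/negative region descriptions (boxes assumed normalized [x1, y1, x2, y2]).
stage2_sample = {
    "image": "example.jpg",
    "caption": "A long, detailed description of the whole image.",
    "regions": [
        {
            "box": [0.10, 0.20, 0.55, 0.80],
            "caption": "a white cat with a dark patch over its left eye",
            "positive": ["the cat's white fur"],
            "negative": ["a black dog lying on a sofa"],
        }
    ],
}
```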

## Quick Start 🤗

### Load Model

```python
import torch
from PIL import Image
from transformers import (
    AutoImageProcessor,
    AutoTokenizer,
    AutoModelForCausalLM,
)

model_root = "qihoo360/fg-clip-large"
image_size = 336

model = AutoModelForCausalLM.from_pretrained(model_root, trust_remote_code=True).cuda()
device = model.device

tokenizer = AutoTokenizer.from_pretrained(model_root)
image_processor = AutoImageProcessor.from_pretrained(model_root)
```

### Retrieval

```python
img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

# NOTE Short captions: max_length=77 && walk_short_pos=True
walk_short_pos = True
captions = ["a photo of a cat", "a photo of a dog"]
caption_input = torch.tensor(
    tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids,
    dtype=torch.long,
    device=device,
)
# NOTE Long captions: max_length=248 && walk_short_pos=False
# ......
# (a long-caption variant is sketched later in this Quick Start)

with torch.no_grad():
    image_feature = model.get_image_features(image_input)
    text_feature = model.get_text_features(caption_input, walk_short_pos=walk_short_pos)

    image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)

logits_per_image = image_feature @ text_feature.T
logits_per_image = model.logit_scale.exp() * logits_per_image
probs = logits_per_image.softmax(dim=1)
print(probs)
```

### Dense Feature Visualization

```python
import math

import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt

img_root = "FG-CLIP/use_imgs/cat_dfclor.jpg"
image = Image.open(img_root).convert("RGB")
image = image.resize((image_size, image_size))

image_input = image_processor.preprocess(image, return_tensors='pt')['pixel_values'].to(device)

with torch.no_grad():
    # Per-patch dense features from the vision encoder
    dense_image_feature = model.get_image_dense_features(image_input)

    captions = ["white cat"]
    caption_input = torch.tensor(
        tokenizer(captions, max_length=77, padding="max_length", truncation=True).input_ids,
        dtype=torch.long,
        device=device,
    )
    text_feature = model.get_text_features(caption_input, walk_short_pos=True)

    text_feature = text_feature / text_feature.norm(p=2, dim=-1, keepdim=True)
    dense_image_feature = dense_image_feature / dense_image_feature.norm(p=2, dim=-1, keepdim=True)

# Similarity between each image patch and the text query
similarity = dense_image_feature.squeeze() @ text_feature.squeeze().T
similarity = similarity.cpu().numpy()
patch_size = int(math.sqrt(similarity.shape[0]))

original_shape = (patch_size, patch_size)
show_image = similarity.reshape(original_shape)

plt.figure(figsize=(6, 6))
plt.imshow(show_image)
plt.title('Similarity Visualization')
plt.axis('off')
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat.png")
```
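### Overlaying the Similarity Map

The dense-feature block above saves the similarity map at patch resolution (a 24×24 grid for the 336-pixel input). As a small follow-on sketch, assuming the variables from that block (`image`, `show_image`, `patch_size`) are still in scope, the map can be upsampled and overlaid on the resized input image; the blending settings and output path below are arbitrary choices rather than part of the official example.

```python
import numpy as np

# Upsample the patch-level similarity map to the input resolution by block
# replication, then draw it semi-transparently over the resized RGB input.
scale = image_size // patch_size                     # e.g. 336 // 24 = 14
upsampled = np.kron(show_image, np.ones((scale, scale)))

plt.figure(figsize=(6, 6))
plt.imshow(image)                                    # resized input image
plt.imshow(upsampled, alpha=0.5, cmap='jet')         # similarity heatmap on top
plt.axis('off')
plt.savefig("FG-CLIP/use_imgs/FGCLIP_dfcolor_cat_overlay.png")  # arbitrary output path
```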
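### Encoding Long Captions

The NOTE in the Retrieval section indicates that long, detailed captions should be tokenized with `max_length=248` and encoded with `walk_short_pos=False`. Below is a minimal sketch of that variant, reusing `model`, `tokenizer`, and `image_input` from the blocks above; the caption texts themselves are only illustrative.

```python
# Long-caption retrieval: max_length=248 and walk_short_pos=False (per the NOTE above).
long_captions = [
    "A white cat with a dark patch over one eye, sitting on a wooden floor and looking at the camera.",
    "A small brown dog running across a grassy field on a sunny afternoon.",
]
long_caption_input = torch.tensor(
    tokenizer(long_captions, max_length=248, padding="max_length", truncation=True).input_ids,
    dtype=torch.long,
    device=device,
)

with torch.no_grad():
    image_feature = model.get_image_features(image_input)
    long_text_feature = model.get_text_features(long_caption_input, walk_short_pos=False)

    image_feature = image_feature / image_feature.norm(p=2, dim=-1, keepdim=True)
    long_text_feature = long_text_feature / long_text_feature.norm(p=2, dim=-1, keepdim=True)

logits_per_image = model.logit_scale.exp() * image_feature @ long_text_feature.T
print(logits_per_image.softmax(dim=1))
```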

## Citation

If you find FG-CLIP useful for your research and applications, please cite using this BibTeX:

```bibtex
@article{xie2025fg,
  title={FG-CLIP: Fine-Grained Visual and Textual Alignment},
  author={Xie, Chunyu and Wang, Bin and Kong, Fanjing and Li, Jincheng and Liang, Dawei and Zhang, Gengshen and Leng, Dawei and Yin, Yuhui},
  journal={arXiv preprint arXiv:2505.05071},
  year={2025}
}
```

Code: https://github.com/360CVGroup/FG-CLIP

## License

This project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the [Apache License 2.0](./LICENSE).