MetaCLIP 2 (worldwide) was presented in MetaCLIP 2: A Worldwide Scaling Recipe.
This checkpoint corresponds to "ViT-H-14-quickgelu-worldwide" of the original implementation.
First install the Transformers library (from source for now):
pip install -q git+https://github.com/huggingface/transformers.git
Then run zero-shot image classification with the pipeline API:
import torch
from transformers import pipeline
clip = pipeline(
   task="zero-shot-image-classification",
   model="facebook/metaclip-2-worldwide-huge-quickgelu",
   torch_dtype=torch.bfloat16,
   device=0
)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
results = clip("http://images.cocodataset.org/val2017/000000039769.jpg", candidate_labels=labels)
print(results)
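The pipeline returns a list of dictionaries, each holding a `label` and a `score`. As a minimal sketch of how such a result could be consumed, the snippet below picks the highest-scoring entry. The `top_prediction` helper and the scores in `example_results` are invented for illustration and are not real output from this checkpoint.

```python
def top_prediction(results):
    """Return the (label, score) pair with the highest score."""
    # Use max() instead of assuming the list is already sorted by score.
    best = max(results, key=lambda entry: entry["score"])
    return best["label"], best["score"]


# Hypothetical data shaped like the pipeline result (scores are made up).
example_results = [
    {"score": 0.90, "label": "a photo of a cat"},
    {"score": 0.07, "label": "a photo of a dog"},
    {"score": 0.03, "label": "a photo of a car"},
]

label, score = top_prediction(example_results)
print(f"{label} ({score:.2f})")
```

With real output, `top_prediction(results)` can be called directly on the list returned by `clip(...)`.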
To handle pre- and postprocessing yourself, use the AutoModel API:
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModel
# note: after loading, verify that `model` is an instance of `MetaCLIP2Model` (`AutoModel` itself is only a factory class)
model = AutoModel.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu", torch_dtype=torch.bfloat16, attn_implementation="sdpa")
processor = AutoProcessor.from_pretrained("facebook/metaclip-2-worldwide-huge-quickgelu")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
# inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**inputs)
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)
most_likely_idx = probs.argmax(dim=1).item()
most_likely_label = labels[most_likely_idx]
print(f"Most likely label: {most_likely_label} with probability: {probs[0][most_likely_idx].item():.3f}")
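For intuition about what `logits_per_image` holds: in CLIP-style models these logits are generally cosine similarities between the image embedding and each text embedding, multiplied by a learned scale, and the softmax turns them into probabilities over the candidate labels. The sketch below reproduces that arithmetic in plain Python on toy 3-dimensional vectors. The embeddings, the `logit_scale` value of 100.0, and the helper names are illustrative assumptions, not values taken from this checkpoint.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def clip_style_probs(image_embed, text_embeds, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and several texts."""
    image_embed = l2_normalize(image_embed)
    logits = []
    for text_embed in text_embeds:
        text_embed = l2_normalize(text_embed)
        cosine = sum(a * b for a, b in zip(image_embed, text_embed))
        logits.append(logit_scale * cosine)
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Toy embeddings (assumed): the image vector points mostly along the first text vector.
image_embed = [0.9, 0.1, 0.0]
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

probs = clip_style_probs(image_embed, text_embeds)
best = max(range(len(labels)), key=lambda i: probs[i])
print(f"Most likely label: {labels[best]} with probability: {probs[best]:.3f}")
```

Because both sides are normalized, only the direction of each embedding matters; rescaling the toy image vector leaves the probabilities unchanged.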