MetaCLIP-2-Open-Scene

Part of the MetaCLIP2 Image Classification Experiments collection (Domain-Specific Downstream Tasks).
MetaCLIP-2-Open-Scene is an image classification vision-language encoder model fine-tuned from facebook/metaclip-2-worldwide-s16 for a single-label classification task. It is designed to identify and categorize various natural and urban scenes using the MetaClip2ForImageClassification architecture.
Paper: MetaCLIP 2: A Worldwide Scaling Recipe (https://huggingface.co/papers/2507.22062)
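As a quick sanity check, the size of the classification head and its label mapping can be read from the checkpoint's configuration using the standard transformers pattern; what `id2label` actually contains depends on what was saved with the checkpoint, so the comments below are expectations rather than guarantees.

```python
from transformers import AutoConfig

# Inspect the classification-head settings of the fine-tuned checkpoint.
config = AutoConfig.from_pretrained("prithivMLmods/MetaCLIP-2-Open-Scene")

print(config.num_labels)  # number of scene classes (6 for this task)
print(config.id2label)    # index-to-label mapping, if stored with the checkpoint
```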
Classification Report:

| Class | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| buildings | 0.9644 | 0.9703 | 0.9673 | 2625 |
| forest | 0.9948 | 0.9978 | 0.9963 | 2694 |
| glacier | 0.9531 | 0.9427 | 0.9479 | 2671 |
| mountain | 0.9470 | 0.9512 | 0.9491 | 2723 |
| sea | 0.9909 | 0.9920 | 0.9915 | 2758 |
| street | 0.9728 | 0.9694 | 0.9711 | 2874 |
| accuracy | | | 0.9706 | 16345 |
| macro avg | 0.9705 | 0.9706 | 0.9705 | 16345 |
| weighted avg | 0.9706 | 0.9706 | 0.9706 | 16345 |
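A report in this format can be reproduced with scikit-learn's classification_report once ground-truth labels and model predictions for the evaluation split are available. The tiny label arrays below are placeholders so the snippet runs on its own; they are not the actual evaluation data.

```python
from sklearn.metrics import classification_report

class_names = ["buildings", "forest", "glacier", "mountain", "sea", "street"]

# Placeholder class ids standing in for the real evaluation split:
# y_true holds the ground-truth labels, y_pred the model's argmax predictions.
y_true = [0, 1, 2, 3, 4, 5, 0, 1]
y_pred = [0, 1, 2, 3, 4, 5, 5, 1]

print(classification_report(y_true, y_pred, target_names=class_names, digits=4))
```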
The model classifies images into six open-scene categories: buildings, forest, glacier, mountain, sea, and street.
Install the dependencies and run the Gradio demo:

```
!pip install -q transformers torch pillow gradio
```

```python
import gradio as gr
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

# Load model and processor
model_name = "prithivMLmods/MetaCLIP-2-Open-Scene"
model = AutoModelForImageClassification.from_pretrained(model_name)
processor = AutoImageProcessor.from_pretrained(model_name)

def scene_classification(image):
    """Predicts the type of scene represented in an image."""
    image = Image.fromarray(image).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")

    with torch.no_grad():
        outputs = model(**inputs)
        logits = outputs.logits
        probs = torch.nn.functional.softmax(logits, dim=1).squeeze().tolist()

    labels = {
        "0": "buildings",
        "1": "forest",
        "2": "glacier",
        "3": "mountain",
        "4": "sea",
        "5": "street"
    }
    predictions = {labels[str(i)]: round(probs[i], 3) for i in range(len(probs))}
    return predictions

# Create Gradio interface
iface = gr.Interface(
    fn=scene_classification,
    inputs=gr.Image(type="numpy"),
    outputs=gr.Label(label="Prediction Scores"),
    title="Open Scene Classification",
    description="Upload an image to classify the scene type (e.g., forest, sea, street, mountain, etc.)."
)

# Launch the app
if __name__ == "__main__":
    iface.launch()
```
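For quick single-image predictions without the Gradio UI, the same checkpoint can also be used through the transformers pipeline API; the image path below is only a placeholder.

```python
from transformers import pipeline

# Build an image-classification pipeline on the fine-tuned checkpoint.
scene_pipe = pipeline("image-classification", model="prithivMLmods/MetaCLIP-2-Open-Scene")

# "example.jpg" is a placeholder path; top_k=6 returns a score for every scene class.
for pred in scene_pipe("example.jpg", top_k=6):
    print(f"{pred['label']}: {pred['score']:.3f}")
```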
The MetaCLIP-2-Open-Scene model is designed to classify a wide range of natural and urban environments, making it suitable as a general-purpose scene classifier for the six categories listed above.
Base model: facebook/metaclip-2-worldwide-s16