MobileCLIP2-S0's image encoder produces bad (out-of-range) outputs
Here is my code:
from open_clip.src import open_clip
import torch
from torch import nn
from PIL import Image
from open_clip.src.open_clip.mobileclip2 import reparameterize_model
model, _, preprocess = open_clip.create_model_and_transforms('MobileCLIP2-S0', pretrained='./weight/mobileclip2_s0.pt')
tokenizer = open_clip.get_tokenizer('MobileCLIP2-S0')
model = model.eval()
# For inference/model exporting purposes, please reparameterize first
image = preprocess(Image.open("./sxc.png").convert('RGB')).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat", "a person", 'a car', 'a plane', 'a tree', 'a building', 'a man'])
# with torch.no_grad(), torch.amp.autocast('cuda', dtype=torch.bfloat16):
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (image_features @ text_features.T)
print("cosine similarity:", text_probs)
text_probs = text_probs.softmax(dim=-1)
print("Label probs:", text_probs)
# fuse
model = reparameterize_model(model.eval())
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (image_features @ text_features.T)
print("fused cosine similarity:", text_probs)
text_probs = text_probs.softmax(dim=-1)
print("fused Label probs:", text_probs)
and the result is:
cosine similarity: tensor([[ 0.0299, -0.0022,  0.0222,  0.0105,  0.0420,  0.0433, -0.0268, -0.0181,-0.0086]])
Label probs: tensor([[0.1133, 0.1097, 0.1124, 0.1111, 0.1147, 0.1148, 0.1070, 0.1080, 0.1090]])
fused cosine similarity: tensor([[ 0.0299, -0.0022,  0.0222,  0.0105,  0.0420,  0.0433, -0.0268, -0.0181,-0.0086]])
fused Label probs: tensor([[0.1133, 0.1097, 0.1124, 0.1111, 0.1147, 0.1148, 0.1070, 0.1080, 0.1090]])
My test picture is of a man.
During inference, the image encoder output seems to be out of range:
image_features = tensor([[ 1.7583e+03,  1.7638e+04, -1.4968e+04,  1.1477e+04, -1.8767e+04,
                           2.4064e+04, -6.5173e+03, -1.1029e+04, -8.0888e+03, -1.9555e+04,
                           2.2610e+04, -2.8043e+04,  3.2263e+03,  1.8139e+04, -2.4444e+03,
                          -1.9288e+04,  7.4792e+03,  6.4089e+03, -2.8449e+04,  9.3333e+03,
                          -1.2177e+03,  2.2655e+04, -1.1573e+04, -1.6503e+04,  3.0122e+04,
                           2.6668e+04,  8.3911e+03,  5.3360e+03, -8.0339e+03, -1.6359e+04,
                          -4.6262e+04,  3.0837e+04, -1.3061e+03,  8.4427e+03,  1.3132e+04,
                           8.6864e+03,  4.9244e+03, -1.4098e+03, -2.7712e+04,  9.4772e+03,
                          -4.3799e+04, -1.2634e+03, -1.0169e+04,  2.5884e+04, -1.3368e+04,
                          ...(and so on...)
The text encoder output looks fine, so I will not post it here.
The weights load successfully; everything is run on CPU.
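For completeness, a quick check of the model input one could add here (sketch only, assuming the preprocess returned by create_model_and_transforms above):
# value range of the preprocessed image; it depends on the Normalize(mean, std)
# step of the preprocess pipeline returned together with the model
x = preprocess(Image.open("./sxc.png").convert('RGB'))
print(x.min().item(), x.max().item())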
Hi @SnifferCaptain,
Thanks for reporting this issue. Our S0/S2/B variants need different preprocessing/normalization than our S3/S4/L-14 variants. The normalization for the S0/S2/B variants is the same as for our v1 variants, i.e. mean=(0,0,0) and std=(1,1,1), while for the S3/S4/L-14 variants it is the OpenAI mean/std (the default OpenCLIP normalization).
model, _, preprocess = open_clip.create_model_and_transforms('MobileCLIP2-S0', pretrained='/path/to/mobileclip2_s0.pt', image_mean=(0,0,0), image_std=(1,1,1))
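You can verify which normalization a given transform applies by inspecting the returned preprocess pipeline (a minimal sketch; assumes preprocess is the torchvision Compose that create_model_and_transforms returns):
# the Normalize(mean, std) entry in the printed pipeline shows which
# normalization is applied to the input image
print(preprocess)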
@rwightman has now integrated this into OpenCLIP and the correct preprocessing is loaded when one specifies pretrained="dfndr2b".
https://github.com/mlfoundations/open_clip/blob/13b01ec788c0c706a4d9ba66e301c8793aae0f0f/src/open_clip/pretrained.py#L629-L634
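For example (a sketch, assuming the MobileCLIP2-S0 weights are fetched via the registered dfndr2b tag rather than a local checkpoint):
import open_clip

# the 'dfndr2b' pretrained tag ships the correct preprocessing config,
# so no manual image_mean/image_std override is needed
model, _, preprocess = open_clip.create_model_and_transforms('MobileCLIP2-S0', pretrained='dfndr2b')
tokenizer = open_clip.get_tokenizer('MobileCLIP2-S0')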