MobileCLIP2-S0's image encoder produces bad (out-of-range) outputs
Here is my code:
from open_clip.src import open_clip
import torch
from torch import nn
from PIL import Image
from open_clip.src.open_clip.mobileclip2 import reparameterize_model
model, _, preprocess = open_clip.create_model_and_transforms('MobileCLIP2-S0', pretrained='./weight/mobileclip2_s0.pt')
tokenizer = open_clip.get_tokenizer('MobileCLIP2-S0')
model = model.eval()
# For inference/model exporting purposes, please reparameterize first
image = preprocess(Image.open("./sxc.png").convert('RGB')).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat", "a person", 'a car', 'a plane', 'a tree', 'a building', 'a man'])
# with torch.no_grad(), torch.amp.autocast('cuda', dtype=torch.bfloat16):
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (image_features @ text_features.T)
print("cosine similarity:", text_probs)
text_probs = text_probs.softmax(dim=-1)
print("Label probs:", text_probs)
# fuse
model = reparameterize_model(model.eval())
with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (image_features @ text_features.T)
print("fused cosine similarity:", text_probs)
text_probs = text_probs.softmax(dim=-1)
print("fused Label probs:", text_probs)
and the result is:
cosine similarity: tensor([[ 0.0299, -0.0022,  0.0222,  0.0105,  0.0420,  0.0433, -0.0268, -0.0181,-0.0086]])
Label probs: tensor([[0.1133, 0.1097, 0.1124, 0.1111, 0.1147, 0.1148, 0.1070, 0.1080, 0.1090]])
fused cosine similarity: tensor([[ 0.0299, -0.0022,  0.0222,  0.0105,  0.0420,  0.0433, -0.0268, -0.0181,-0.0086]])
fused Label probs: tensor([[0.1133, 0.1097, 0.1124, 0.1111, 0.1147, 0.1148, 0.1070, 0.1080, 0.1090]])
My test picture is of a man.
During inference, the image encoder output seems to be out of range:
image_features = tensor([[ 1.7583e+03,  1.7638e+04, -1.4968e+04,  1.1477e+04, -1.8767e+04,
                           2.4064e+04, -6.5173e+03, -1.1029e+04, -8.0888e+03, -1.9555e+04,
                           2.2610e+04, -2.8043e+04,  3.2263e+03,  1.8139e+04, -2.4444e+03,
                          -1.9288e+04,  7.4792e+03,  6.4089e+03, -2.8449e+04,  9.3333e+03,
                          -1.2177e+03,  2.2655e+04, -1.1573e+04, -1.6503e+04,  3.0122e+04,
                           2.6668e+04,  8.3911e+03,  5.3360e+03, -8.0339e+03, -1.6359e+04,
                          -4.6262e+04,  3.0837e+04, -1.3061e+03,  8.4427e+03,  1.3132e+04,
                           8.6864e+03,  4.9244e+03, -1.4098e+03, -2.7712e+04,  9.4772e+03,
                          -4.3799e+04, -1.2634e+03, -1.0169e+04,  2.5884e+04, -1.3368e+04,
                          ...(and so on...)
The text encoder output looks fine, so I will not post it here.
The weights load successfully; everything is run on CPU.
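For completeness, a quick check of the model input one could add here (sketch only, assuming the preprocess returned by create_model_and_transforms above):
# value range of the preprocessed image; it depends on the Normalize(mean, std)
# step of the preprocess pipeline returned together with the model
x = preprocess(Image.open("./sxc.png").convert('RGB'))
print(x.min().item(), x.max().item())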
Hi @SnifferCaptain,
Thanks for reporting this issue. Our S0/S2/B variants need different preprocessing/normalization than our S3/S4/L-14 variants. The normalization for the S0/S2/B variants is the same as for our v1 variants, i.e. mean=(0,0,0) and std=(1,1,1), while for the S3/S4/L-14 variants it is the OpenAI mean/std (the default OpenCLIP normalization).
model, _, preprocess = open_clip.create_model_and_transforms('MobileCLIP2-S0', pretrained='/path/to/mobileclip2_s0.pt', image_mean=(0,0,0), image_std=(1,1,1))
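You can verify which normalization a given transform applies by inspecting the returned preprocess pipeline (a minimal sketch; assumes preprocess is the torchvision Compose that create_model_and_transforms returns):
# the Normalize(mean, std) entry in the printed pipeline shows which
# normalization is applied to the input image
print(preprocess)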
@rwightman has now integrated this into OpenCLIP and the correct preprocessing is loaded when one specifies pretrained="dfndr2b".
https://github.com/mlfoundations/open_clip/blob/13b01ec788c0c706a4d9ba66e301c8793aae0f0f/src/open_clip/pretrained.py#L629-L634
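For example (a sketch, assuming the MobileCLIP2-S0 weights are fetched via the registered dfndr2b tag rather than a local checkpoint):
import open_clip

# the 'dfndr2b' pretrained tag ships the correct preprocessing config,
# so no manual image_mean/image_std override is needed
model, _, preprocess = open_clip.create_model_and_transforms('MobileCLIP2-S0', pretrained='dfndr2b')
tokenizer = open_clip.get_tokenizer('MobileCLIP2-S0')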