AMOE: Agglomerative Mixture-of-Experts Vision Foundation Model

AMoE is a Mixture-of-Experts (MoE) vision foundation model distilled from DINOv3 and SigLIP2 teachers, supporting multi-resolution image understanding.

Installation

pip install torch transformers einops pillow

Quick Start

import torch
from PIL import Image
from transformers import AutoModel, AutoImageProcessor

# Load model and processor
model_id = "tiiuae/amoe" 
model = AutoModel.from_pretrained(model_id, trust_remote_code=True).to("cuda", dtype=torch.bfloat16)
processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

# Preprocess image
image = Image.open("image.jpg").convert("RGB")
inputs = processor(image, return_tensors="pt").to("cuda")
inputs["pixel_values"] = inputs["pixel_values"].to(torch.bfloat16)

# Inference
with torch.no_grad():
    outputs = model(**inputs)

# Access specialized features
# Options: 'amoe' (768d), 'siglip2' (1152d), 'dinov3' (1024d)
patch_features = outputs["patch_features"]["amoe"]    # (Batch, Tokens, 768)
summary_features = outputs["summary_features"]["siglip2"] # (Batch, 1152)
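
The summary embeddings can be compared directly across images. Below is a minimal sketch of image-to-image similarity, assuming the model and processor loaded in the Quick Start are still in scope; "cat.jpg" and "dog.jpg" are placeholder file names, and the 'siglip2' summary head is used only as an example.

import torch.nn.functional as F

def embed(path):
    """Return an L2-normalized summary embedding for one image."""
    img = Image.open(path).convert("RGB")
    batch = processor(img, return_tensors="pt").to("cuda")
    batch["pixel_values"] = batch["pixel_values"].to(torch.bfloat16)
    with torch.no_grad():
        out = model(**batch)
    return F.normalize(out["summary_features"]["siglip2"], dim=-1)

# Placeholder image paths; any two RGB images work.
emb_a = embed("cat.jpg")
emb_b = embed("dog.jpg")
similarity = (emb_a @ emb_b.T).item()  # cosine similarity in [-1, 1]
print(f"similarity: {similarity:.3f}")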

Citation

If you use AMoE in your research, please cite:

@article{chaybouti2025amoe,
  title={AMOE: Agglomerative Mixture-of-Experts Vision Foundation Models},
  author={Chaybouti, Sofian and Narayan, Sanath and Dahou, Yasser and Le Khac, Phuc H. and Singh, Ankit and Huynh, Ngoc Dung and Para, Wamiq Reyaz and Kuehne, Hilde and Hacid, Hakim},
  journal={arXiv preprint arXiv:2512.20157},
  year={2025}
}