metadata
library_name: transformers
license: mit
tags:
- vision
- image-segmentation
- pytorch
EoMT
EoMT (Encoder-only Mask Transformer) is a Vision Transformer (ViT) architecture designed for high-quality and efficient image segmentation. It was introduced in the CVPR 2025 highlight paper:
Your ViT is Secretly an Image Segmentation Model
by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus.
Key Insight: Given sufficient scale and pretraining, a plain ViT along with additional few params can perform segmentation without the need for task-specific decoders or pixel fusion modules. The same model backbone supports semantic, instance, and panoptic segmentation with different post-processing 🤗
The original implementation can be found in this repository