VER: Vision Expert Transformer for Robot Learning via Foundation Distillation and Dynamic Routing
Abstract
VER, a Vision Expert Transformer, dynamically selects task-relevant experts from a pretrained vision expert library, achieving state-of-the-art performance across diverse robotic tasks with parameter-efficient fine-tuning.
Pretrained vision foundation models (VFMs) advance robotic learning via rich visual representations, yet individual VFMs typically excel only in specific domains, limiting their generality across tasks. Distilling multiple VFMs into a unified representation for policy learning can mitigate this limitation, but it often yields inflexible task-specific feature selection and requires costly full retraining to incorporate robot-domain knowledge. We propose VER, a Vision Expert transformer for Robot learning. During pretraining, VER distills multiple VFMs into a vision expert library. It then fine-tunes only a lightweight routing network (fewer than 0.4% of the parameters) to dynamically select task-relevant experts from the pretrained library for downstream robot tasks. We further introduce Patchwise Expert Routing with Curriculum Top-K Annealing to improve both the flexibility and the precision of dynamic expert selection. Moreover, VER supports parameter-efficient fine-tuning for scalable expert utilization and adaptive integration of robot-domain knowledge. Across 17 diverse robotic tasks and multiple policy heads, VER achieves state-of-the-art performance. We find that VER reduces large-norm outliers in task-irrelevant regions (e.g., background) and concentrates on task-critical regions. Visualizations and code are available at https://yixiaowang7.github.io/ver_page/.
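To make the routing mechanism described above concrete, the snippet below is a minimal, hypothetical PyTorch sketch of patchwise expert routing with curriculum top-k annealing; it is not the authors' implementation. A lightweight linear gate scores the frozen expert library for each patch token, keeps only the top-k experts per patch, and k is annealed from many experts down to a few over training. All names and shapes (PatchwiseRouter, annealed_k, expert_feats, k_start, k_end) are illustrative assumptions.

```python
# Minimal sketch (assumed implementation, not the paper's code) of patchwise
# expert routing with curriculum top-k annealing in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchwiseRouter(nn.Module):
    """Lightweight routing network: per-patch scores over a frozen expert library."""
    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        # The gate is the only trainable component at fine-tuning time in this sketch.
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, patch_tokens: torch.Tensor, k: int) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, dim)
        logits = self.gate(patch_tokens)              # (B, P, E) per-patch expert scores
        topk_vals, topk_idx = logits.topk(k, dim=-1)  # keep k experts per patch
        weights = F.softmax(topk_vals, dim=-1)        # renormalize over the selected experts
        routing = torch.zeros_like(logits)
        routing.scatter_(-1, topk_idx, weights)       # sparse (B, P, E) routing weights
        return routing

def annealed_k(step: int, total_steps: int, k_start: int, k_end: int) -> int:
    """Curriculum top-k annealing: start with many experts, gradually narrow to a few."""
    frac = min(step / max(total_steps, 1), 1.0)
    return max(k_end, round(k_start - frac * (k_start - k_end)))

# Usage sketch: fuse frozen expert features with the per-patch routing weights.
B, P, D, E = 2, 196, 768, 8
router = PatchwiseRouter(D, E)
patch_tokens = torch.randn(B, P, D)
expert_feats = torch.randn(B, P, E, D)        # features from the frozen expert library
k = annealed_k(step=500, total_steps=1000, k_start=E, k_end=2)
routing = router(patch_tokens, k)             # (B, P, E)
fused = (routing.unsqueeze(-1) * expert_feats).sum(dim=2)  # (B, P, D) fused representation
```

Under this reading, starting with a large k lets the router see the whole expert library early in training before committing each patch to a few task-relevant experts, which matches the curriculum framing in the abstract.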
Community
The following similar papers were recommended by the Semantic Scholar API:
- UniCoD: Enhancing Robot Policy via Unified Continuous and Discrete Representation Learning (2025)
- VGGT-DP: Generalizable Robot Control via Vision Foundation Models (2025)
- VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation (2025)
- GP3: A 3D Geometry-Aware Policy with Multi-View Images for Robotic Manipulation (2025)
- TinyGiantVLM: A Lightweight Vision-Language Architecture for Spatial Reasoning under Resource Constraints (2025)
- Improving Robotic Manipulation with Efficient Geometry-Aware Vision Encoder (2025)
- Generative Visual Foresight Meets Task-Agnostic Pose Estimation in Robotic Table-Top Manipulation (2025)