Abstract
Bridge Models, instantiated as Vision Bridge Transformer (ViBT), efficiently translate data through direct modeling of input-to-output trajectories, achieving robust performance in image and video editing tasks at large scales.
We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
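The core idea — modeling the trajectory between a paired input and output rather than between noise and data — can be illustrated with a generic Brownian bridge. The sketch below is a minimal, hedged illustration of that paradigm, not the paper's actual training code: the interpolation formula, the `sigma` noise scale, and the conditional velocity target are standard Brownian-bridge constructions, and the paper's specific variance-stabilized objective may differ.

```python
import numpy as np

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """Sample an intermediate state x_t on a Brownian bridge from x0 to x1.

    Bridge marginal: x_t = (1 - t) * x0 + t * x1 + sigma * sqrt(t * (1 - t)) * eps,
    so the path is pinned to the input x0 at t=0 and the target x1 at t=1.
    """
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(np.shape(x0))
    std = sigma * np.sqrt(t * (1.0 - t))
    return (1.0 - t) * np.asarray(x0) + t * np.asarray(x1) + std * eps

def velocity_target(x1, xt, t):
    """Conditional velocity pulling x_t toward the endpoint x1: (x1 - x_t) / (1 - t).

    A network v_theta(x_t, t, condition) would regress this target; the paper's
    variance-stabilized loss would additionally reweight it across t (assumed here).
    """
    return (np.asarray(x1) - np.asarray(xt)) / (1.0 - t)

# Deterministic check (sigma=0): the bridge reduces to linear interpolation.
x0, x1 = np.zeros(4), np.ones(4)
xt = brownian_bridge_sample(x0, x1, t=0.5, sigma=0.0)
v = velocity_target(x1, xt, t=0.5)
```

With `sigma=0` the bridge collapses to straight-line interpolation between data pairs, which is why bridge models can be seen as a data-to-data counterpart of flow matching.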
Community
🚀 ViBT: The First Vision Bridge Transformer at 20B Parameters
Open-source • Data-to-Data Translation • Built for the Next Generation of Conditional Vision Models
Project Page: https://yuanshi9815.github.io/ViBT_homepage/
Paper: https://arxiv.org/abs/2511.23199
Code: https://github.com/Yuanshi9815/ViBT
HF Demo: https://huggingface.co/spaces/Yuanshi/ViBT
HF Model: https://huggingface.co/Yuanshi/ViBT
You are right! We are building on the same Brownian Bridge model paradigm to learn the distribution.
The difference is in the scaling: a stabilized training objective, a much larger Transformer-based model, and more complex tasks. There's a huge opportunity for bridge models — I expect more will come soon. 🔍
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this one:
- Visual Generation Tuning (2025)
- Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers (2025)
- iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation (2025)
- Are Image-to-Video Models Good Zero-Shot Image Editors? (2025)
- OmniRefiner: Reinforcement-Guided Local Diffusion Refinement (2025)
- SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing (2025)
- One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution (2025)
Models citing this paper: 1 · Datasets citing this paper: 0