Abstract
Bridge Models, instantiated as Vision Bridge Transformer (ViBT), efficiently translate data through direct modeling of input-to-output trajectories, achieving robust performance in image and video editing tasks at large scales.
We introduce Vision Bridge Transformer (ViBT), a large-scale instantiation of Brownian Bridge Models designed for conditional generation. Unlike traditional diffusion models that transform noise into data, Bridge Models directly model the trajectory between inputs and outputs, creating an efficient data-to-data translation paradigm. By scaling these models to 20B and 1.3B parameters, we demonstrate their effectiveness for image and video translation tasks. To support this scale, we adopt a Transformer architecture and propose a variance-stabilized velocity-matching objective for robust training. Together, these advances highlight the power of scaling Bridge Models for instruction-based image editing and complex video translation.
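The core idea — modeling the trajectory between a paired input and output rather than between noise and data — can be illustrated with a generic Brownian bridge. The sketch below is a minimal, hedged illustration of that paradigm, not the paper's actual training code: the interpolation formula, the `sigma` noise scale, and the conditional velocity target are standard Brownian-bridge constructions, and the paper's specific variance-stabilized objective may differ.

```python
import numpy as np

def brownian_bridge_sample(x0, x1, t, sigma=1.0, rng=None):
    """Sample an intermediate state x_t on a Brownian bridge from x0 to x1.

    Bridge marginal: x_t = (1 - t) * x0 + t * x1 + sigma * sqrt(t * (1 - t)) * eps,
    so the path is pinned to the input x0 at t=0 and the target x1 at t=1.
    """
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal(np.shape(x0))
    std = sigma * np.sqrt(t * (1.0 - t))
    return (1.0 - t) * np.asarray(x0) + t * np.asarray(x1) + std * eps

def velocity_target(x1, xt, t):
    """Conditional velocity pulling x_t toward the endpoint x1: (x1 - x_t) / (1 - t).

    A network v_theta(x_t, t, condition) would regress this target; the paper's
    variance-stabilized loss would additionally reweight it across t (assumed here).
    """
    return (np.asarray(x1) - np.asarray(xt)) / (1.0 - t)

# Deterministic check (sigma=0): the bridge reduces to linear interpolation.
x0, x1 = np.zeros(4), np.ones(4)
xt = brownian_bridge_sample(x0, x1, t=0.5, sigma=0.0)
v = velocity_target(x1, xt, t=0.5)
```

With `sigma=0` the bridge collapses to straight-line interpolation between data pairs, which is why bridge models can be seen as a data-to-data counterpart of flow matching.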
Community
🚀 ViBT: The First Vision Bridge Transformer at 20B Parameters
Open-source • Data-to-Data Translation • Built for the Next Generation of Conditional Vision Models
Project Page: https://yuanshi9815.github.io/ViBT_homepage/
Paper: https://arxiv.org/abs/2511.23199
Code: https://github.com/Yuanshi9815/ViBT
HF Demo: https://huggingface.co/spaces/Yuanshi/ViBT
HF Model: https://huggingface.co/Yuanshi/ViBT
You are right! We are building on the same Brownian Bridge model paradigm to learn the distribution.
The difference is in the scaling: a stabilized training objective, a much larger Transformer-based model, and more complex tasks. There's a huge opportunity for bridge models — I expect more will come soon. 🔍
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this one:
- Visual Generation Tuning (2025)
- Edit2Perceive: Image Editing Diffusion Models Are Strong Dense Perceivers (2025)
- iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation (2025)
- Are Image-to-Video Models Good Zero-Shot Image Editors? (2025)
- OmniRefiner: Reinforcement-Guided Local Diffusion Refinement (2025)
- SplitFlow: Flow Decomposition for Inversion-Free Text-to-Image Editing (2025)
- One-Step Diffusion Transformer for Controllable Real-World Image Super-Resolution (2025)
Models citing this paper: 1 · Datasets citing this paper: 0