InstructMix2Mix: Consistent Sparse-View Editing Through Multi-View Model Personalization
Abstract
InstructMix2Mix distills the editing capability of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging the latter's data-driven 3D prior for cross-view consistent edits.
We address the task of multi-view image editing from sparse input views, where the inputs can be seen as a mix of images capturing the scene from different viewpoints. The goal is to modify the scene according to a textual instruction while preserving consistency across all views. Existing methods, based on per-scene neural fields or temporal attention mechanisms, struggle in this setting, often producing artifacts and incoherent edits. We propose InstructMix2Mix (I-Mix2Mix), a framework that distills the editing capabilities of a 2D diffusion model into a pretrained multi-view diffusion model, leveraging its data-driven 3D prior for cross-view consistency. A key contribution is replacing the conventional neural field consolidator in Score Distillation Sampling (SDS) with a multi-view diffusion student, which requires novel adaptations: incremental student updates across timesteps, a specialized teacher noise scheduler to prevent degeneration, and an attention modification that enhances cross-view coherence without additional cost. Experiments demonstrate that I-Mix2Mix significantly improves multi-view consistency while maintaining high per-frame edit quality.
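As a concrete illustration of the distillation described above, here is a minimal PyTorch sketch: a 2D instruction-conditioned teacher supplies the edit signal per view, while a multi-view diffusion student, denoising all views jointly, is updated incrementally across timesteps from high to low noise. All names here (`StubDiffusion`, `imix2mix_distill`, `predict_noise`) are hypothetical placeholders, and the noise schedule is illustrative, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the real networks; the actual student is a pretrained
# multi-view diffusion model and the teacher a 2D instruction-following
# editor (e.g. an InstructPix2Pix-style model).
class StubDiffusion(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        self.net = nn.Conv2d(channels, channels, 3, padding=1)

    def predict_noise(self, x, t, cond=None):
        # A real model conditions on t (and text for the teacher);
        # the stub ignores both.
        return self.net(x)

def imix2mix_distill(student, teacher, views, instruction,
                     timesteps, num_train_steps=1000, lr=1e-4):
    """SDS-style distillation with a multi-view diffusion student.

    views: (V, C, H, W) tensor of sparse input views, denoised jointly so
    the student's data-driven 3D prior keeps the edit consistent across views.
    """
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    # Simple DDPM-like cumulative-alpha schedule (illustrative only).
    alpha_bars = torch.linspace(0.9999, 0.02, num_train_steps)
    for t in timesteps:  # incremental student updates, high -> low noise
        a = alpha_bars[t]
        noise = torch.randn_like(views)
        noisy = a.sqrt() * views + (1 - a).sqrt() * noise
        with torch.no_grad():
            # Teacher scores each view independently under the edit instruction.
            eps_teacher = teacher.predict_noise(noisy, t, cond=instruction)
        eps_student = student.predict_noise(noisy, t)
        # Pull the student's joint multi-view prediction toward the
        # teacher's edited score at this noise level.
        loss = F.mse_loss(eps_student, eps_teacher)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return student

# Toy run: 4 sparse views, timesteps swept from high to low noise.
views = torch.randn(4, 3, 64, 64)
student, teacher = StubDiffusion(), StubDiffusion()
imix2mix_distill(student, teacher, views, "make it snow",
                 timesteps=range(999, -1, -100))
```

After the sweep, the fine-tuned student itself produces the edited, mutually consistent views, replacing the per-scene neural field that conventional SDS pipelines would optimize.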
Community
I-Mix2Mix performs instruction-driven edits on a sparse set of views. The key idea is SDS with a twist: we distill a 2D editor into a pretrained multi-view diffusion model rather than a NeRF/3DGS. The student’s learned 3D prior enables multi-view consistent edits, despite the sparse input.
Check out our project page: https://danielgilo.github.io/instruct-mix2mix/
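The abstract also mentions an attention modification that improves cross-view coherence. The mechanism is not spelled out on this page; a common way to couple views in multi-view diffusion is to let each view's queries attend to the keys and values of all views, and the sketch below shows that trick purely as an assumed illustration. The `cross_view_attention` helper is hypothetical, and since the paper reports no additional cost, its actual modification may well differ from this one.

```python
import torch
import torch.nn.functional as F

def cross_view_attention(q, k, v):
    """Hypothetical cross-view attention: each view's queries attend to the
    keys/values of *all* views, so appearance is shared across viewpoints.

    q, k, v: (V, N, D) -- V views, N tokens per view, D channels.
    """
    V, N, D = k.shape
    # Share keys/values across the view dimension (no new parameters).
    k_all = k.reshape(1, V * N, D).expand(V, -1, -1)
    v_all = v.reshape(1, V * N, D).expand(V, -1, -1)
    return F.scaled_dot_product_attention(q, k_all, v_all)

# Toy usage: 4 views, 256 tokens per view, 64 channels.
q = k = v = torch.randn(4, 256, 64)
out = cross_view_attention(q, k, v)  # shape (4, 256, 64)
```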
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- Coupled Diffusion Sampling for Training-Free Multi-View Image Editing (2025)
- Training-Free Multi-View Extension of IC-Light for Textual Position-Aware Scene Relighting (2025)
- MVCustom: Multi-View Customized Diffusion via Geometric Latent Rendering and Completion (2025)
- EditCast3D: Single-Frame-Guided 3D Editing with Video Propagation and View Selection (2025)
- FlashWorld: High-quality 3D Scene Generation within Seconds (2025)
- CloseUpShot: Close-up Novel View Synthesis from Sparse-views via Point-conditioned Diffusion Model (2025)
- RapidMV: Leveraging Spatio-Angular Representations for Efficient and Consistent Text-to-Multi-View Synthesis (2025)