Geometry-Guided Reinforcement Learning for Multi-view Consistent 3D Scene Editing
Abstract
RL3DEdit uses reinforcement learning with rewards from a 3D foundation model to achieve multi-view consistent 3D editing from 2D editing priors.
Leveraging the priors of 2D diffusion models for 3D editing has emerged as a promising paradigm. However, maintaining multi-view consistency in edited results remains challenging, and the extreme scarcity of 3D-consistent editing paired data renders supervised fine-tuning (SFT), the most effective training strategy for editing tasks, infeasible. In this paper, we observe that, while generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning (RL) as a feasible solution. Motivated by this, we propose RL3DEdit, a single-pass framework driven by RL optimization with novel rewards derived from the 3D foundation model VGGT. Specifically, we leverage VGGT's robust priors learned from massive real-world data: we feed it the edited images and use the resulting confidence maps and pose-estimation errors as reward signals, effectively anchoring the 2D editing priors onto a 3D-consistent manifold via RL. Extensive experiments demonstrate that RL3DEdit achieves stable multi-view consistency and outperforms state-of-the-art methods in editing quality with high efficiency. To promote the development of 3D editing, we will release the code and model.
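The abstract describes the reward as a combination of VGGT's confidence maps and its pose-estimation errors on the edited views. The paper's exact reward shaping is not given here, so the sketch below is purely illustrative: it assumes the verifier returns per-pixel confidence maps in [0, 1] and a per-view pose error, and combines them with hypothetical weights `w_conf` and `w_pose`.

```python
import numpy as np


def consistency_reward(conf_maps, pose_errors, w_conf=1.0, w_pose=1.0):
    """Illustrative scalar reward from a frozen 3D verifier's outputs.

    conf_maps:   list of (H, W) arrays, per-pixel confidence in [0, 1]
                 for each edited view (higher = more 3D-consistent).
    pose_errors: (V,) array of per-view pose-estimation errors
                 (lower = more consistent camera geometry).
    Returns a scalar: high confidence and low pose error yield a
    higher reward for the RL policy.
    """
    conf_term = float(np.mean([c.mean() for c in conf_maps]))
    pose_term = float(np.mean(pose_errors))
    return w_conf * conf_term - w_pose * pose_term


# Toy check: a consistent edit (confident verifier, small pose error)
# should be rewarded over an inconsistent one.
consistent = consistency_reward(
    [np.full((4, 4), 0.9)] * 3, np.array([0.1, 0.1, 0.1]))
inconsistent = consistency_reward(
    [np.full((4, 4), 0.2)] * 3, np.array([0.8, 0.9, 0.7]))
```

Under this shaping, the policy is pushed toward edits the verifier can confidently reconstruct, which is the "verification is tractable" intuition the abstract states; the actual weighting and normalization in RL3DEdit may differ.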
Community
"While generating multi-view consistent 3D content is highly challenging, verifying 3D consistency is tractable, naturally positioning reinforcement learning as a feasible solution."
-- RL3DEdit
Paper: https://arxiv.org/abs/2603.03143
Project Page: https://amap-ml.github.io/RL3DEdit/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- CoreEditor: Correspondence-Constrained Diffusion for Consistent 3D Editing (2025)
- ShapeUP: Scalable Image-Conditioned 3D Editing (2026)
- DiffStyle3D: Consistent 3D Gaussian Stylization via Attention Optimization (2026)
- Intrinsic Geometry-Appearance Consistency Optimization for Sparse-View Gaussian Splatting (2026)
- One2Scene: Geometric Consistent Explorable 3D Scene Generation from a Single Image (2026)
- AnchoredDream: Zero-Shot 360° Indoor Scene Generation from a Single View via Geometric Grounding (2026)
- VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation (2026)
Good paper!
But if our 2D editing models are now being supervised entirely by the confidence maps of frozen 3D foundation models (like VGGT), aren't we just shifting the bottleneck? While the authors note that data-driven verifiers are harder to "reward-hack" than traditional methods, how can we prevent the RL policy from eventually exploiting the verifier's blind spots?
And what happens to our edits when the 3D verifier itself has inherent geometric biases for out-of-distribution scenes?
One thing I'm a bit worried about: what happens when VGGT hits tricky, highly reflective textures?
If the internal pose estimator gets tripped up there, I'm concerned the RL policy might just take the easy way out. Could it end up spawning adversarial artifacts just to game the confidence score, rather than actually learning true 3D consistency?
Thank you for your interest and insightful comments! You are right. In our experiments, we did observe that in certain scenarios VGGT does not output confidence maps that reflect 3D consistency as stably as we might hope. However, the priors VGGT has learned from millions of training images remain exceptionally powerful. While imperfect, they are more than sufficient to drive meaningful progress in 3D editing, especially given the current scarcity of 3D data.
Our empirical results demonstrate that the RL policy can achieve 3D-consistent outputs. Regarding the more complex corner cases you mention, we believe addressing them may be slightly premature for the field right now. Currently, 3D editing models still struggle to execute complex editing instructions even in simple scenes. It may be more practical to mature the foundational capabilities of 3D editing before tackling these edge cases.
Thank you!
that anchor-based rl loop, where an anchor image preserves editing priors while vggt rewards pull the other views toward 3d coherence, is the neat trick here. my big worry is how sensitive the pipeline is to vggt's quality in scenes with shiny metals, translucence, or heavy occlusion where depth and pose signals can be noisy. an ablation on which vggt cues matter most (depth vs pose confidence vs confidence maps) would really help tease apart where the gains come from. i also wonder how this scales to scenes with many views or dynamic edits, given the single-pass design. btw the arxivlens breakdown does a nice job unpacking section 3 and the reward shaping they rely on.
Thanks for the interest and the insightful comments!
Regarding robustness: compared to a traditional reprojection loss, VGGT is pre-trained on massive datasets that inherently cover the challenging cases you mention (e.g., shiny metals and heavy occlusion), making it more robust overall. Naturally, there are still cases where it falls short, but as we noted, the successful convergence of the RL training validates the feasibility of using VGGT as a reward source.
To clarify the ablation question: we actually do not use depth maps or "pose confidence" in our pipeline. Instead, the main paper includes individual ablations on the two signals we do use: depth/point-cloud confidence and pose error.
As for scaling to scenes with many views, we've discussed two potential strategies in Section 5. Moving to dynamic editing scenarios is an interesting point—this might be more effectively addressed by incorporating video editing models to assist the pipeline, which we leave as an avenue for future work.
We really appreciate your recognition of our work and the ArxivLens breakdown!