MultiCrafter: High-Fidelity Multi-Subject Generation via Spatially Disentangled Attention and Identity-Aware Reinforcement Learning
Abstract
The MultiCrafter framework improves multi-subject image generation by addressing attribute leakage through explicit positional supervision, utilizing a Mixture-of-Experts architecture, and aligning with human preferences via online reinforcement learning.
Multi-subject image generation aims to synthesize user-provided subjects in a single image while preserving subject fidelity, ensuring prompt consistency, and aligning with human aesthetic preferences. However, existing methods, particularly those built on the In-Context-Learning paradigm, are limited by their reliance on simple reconstruction-based objectives, which leads both to severe attribute leakage that compromises subject fidelity and to a failure to align with nuanced human preferences. To address this, we propose MultiCrafter, a framework that ensures high-fidelity, preference-aligned generation. First, we find that the root cause of attribute leakage is significant entanglement of attention between different subjects during generation. We therefore introduce explicit positional supervision to separate the attention regions of different subjects, effectively mitigating attribute leakage. To enable the model to accurately plan the attention regions of different subjects across diverse scenarios, we employ a Mixture-of-Experts architecture that increases the model's capacity and lets different experts specialize in different scenarios. Finally, we design a novel online reinforcement learning framework to align the model with human preferences, featuring a scoring mechanism that accurately assesses multi-subject fidelity and a more stable training strategy tailored to the MoE architecture. Experiments validate that our framework significantly improves subject fidelity while aligning better with human preferences.
Community
This paper introduces MultiCrafter, a framework for multi-subject-driven customized image generation.
Motivation
Existing In-Context-Learning-based methods suffer from two main issues:
- Attribute Leakage: When generating multiple subjects, their distinct features (like facial characteristics or clothing) often blend and fuse, a problem the authors attribute to "attention bleeding" between subjects. This severely compromises the identity and fidelity of each individual subject.
- Failure to Align with Human Preferences: Models trained with simple reconstruction objectives do not capture nuanced human preferences for aesthetic quality or precise alignment with text prompts.
The MultiCrafter Solution
To tackle these challenges, MultiCrafter proposes three core innovations:
- Identity-Disentangled Attention Regularization: This technique uses explicit positional supervision to force the model to learn separate attention regions for each subject, effectively reducing attribute leakage and enhancing subject fidelity (a minimal loss sketch follows this list).
- Efficient Adaptive Expert Tuning: It incorporates a Mixture-of-Experts (MoE-LoRA) architecture to increase the model's capacity to handle diverse subjects and spatial layouts without increasing inference overhead (see the layer sketch below).
- Identity-Preserving Preference Optimization: The framework adds a novel online reinforcement learning stage to align outputs with human preferences, featuring a "Multi-ID Alignment Reward" for accurate fidelity scoring and the stable Group Sequence Policy Optimization (GSPO) algorithm for training (reward and objective sketches below).
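To make the first idea concrete, here is a minimal PyTorch sketch of a masked-attention penalty: attention mass that a subject's reference tokens place outside that subject's ground-truth region is driven toward zero. The function name, tensor layout, and exact loss form are illustrative assumptions, not the paper's implementation.

```python
import torch

def attention_disentanglement_loss(attn_maps, subject_masks):
    """Penalize attention mass a subject places outside its own region.

    attn_maps:     (num_subjects, num_image_tokens), each row normalized
                   to sum to 1 (attention from a subject's reference
                   tokens onto the image tokens).
    subject_masks: (num_subjects, num_image_tokens) binary masks marking
                   each subject's target region in the latent grid.
    """
    # Attention that leaks outside the subject's own region is the
    # quantity the positional supervision drives toward zero.
    leakage = (attn_maps * (1.0 - subject_masks)).sum(dim=-1)
    return leakage.mean()

# Toy usage: 2 subjects over an 8x8 latent grid (64 image tokens),
# each assigned one half of the grid.
attn = torch.softmax(torch.randn(2, 64), dim=-1)
masks = torch.zeros(2, 64)
masks[0, :32] = 1.0
masks[1, 32:] = 1.0
loss = attention_disentanglement_loss(attn, masks)
```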
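A common way to realize MoE-LoRA, and a plausible reading of "Efficient Adaptive Expert Tuning", is to attach several low-rank adapters to a frozen base projection and route each token to a small subset of them, so the added inference cost per token stays a single low-rank update. The expert count, rank, and top-1 routing below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MoELoRALinear(nn.Module):
    """Frozen base linear layer plus routed LoRA experts (illustrative)."""

    def __init__(self, base: nn.Linear, num_experts: int = 4,
                 rank: int = 8, top_k: int = 1):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # base weights stay frozen
        in_f, out_f = base.in_features, base.out_features
        self.router = nn.Linear(in_f, num_experts)
        # Standard LoRA init: zero "up" matrices so the initial delta is 0.
        self.down = nn.Parameter(torch.randn(num_experts, in_f, rank) * 0.01)
        self.up = nn.Parameter(torch.zeros(num_experts, rank, out_f))
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate = self.router(x)                         # (..., num_experts)
        weights, idx = gate.topk(self.top_k, dim=-1)  # route each token
        weights = weights.softmax(dim=-1)
        # For clarity this evaluates every expert's low-rank update and
        # gathers the selected ones; a production kernel would compute
        # only the routed experts.
        delta = torch.einsum('...d,edr,ero->...eo', x, self.down, self.up)
        idx_exp = idx.unsqueeze(-1).expand(*idx.shape, delta.shape[-1])
        selected = torch.gather(delta, -2, idx_exp)   # (..., top_k, out_f)
        return self.base(x) + (weights.unsqueeze(-1) * selected).sum(-2)

# Usage: wrap an attention projection with 4 LoRA experts.
layer = MoELoRALinear(nn.Linear(64, 64))
out = layer(torch.randn(2, 16, 64))  # (batch, tokens, features)
```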
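The "Multi-ID Alignment Reward" itself is not spelled out in this summary. One plausible construction, sketched below under that assumption, matches each reference identity embedding to a detected-face embedding in the generated image with Hungarian matching and averages the matched cosine similarities, discounting missing subjects; the function name and penalty term are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def multi_id_alignment_reward(ref_embs: np.ndarray,
                              gen_embs: np.ndarray) -> float:
    """Hypothetical multi-subject fidelity score.

    ref_embs: (num_refs, dim) identity embeddings of the reference subjects.
    gen_embs: (num_detected, dim) embeddings of faces detected in the
              generated image (e.g. from an off-the-shelf face recognizer).
    """
    ref = ref_embs / np.linalg.norm(ref_embs, axis=1, keepdims=True)
    gen = gen_embs / np.linalg.norm(gen_embs, axis=1, keepdims=True)
    sim = ref @ gen.T                          # pairwise cosine similarity
    rows, cols = linear_sum_assignment(-sim)   # maximize total similarity
    matched = sim[rows, cols].mean() if len(rows) else 0.0
    # Discount the score when some reference subjects were not found.
    return float(matched) * (len(rows) / len(ref_embs))

# Usage with random embeddings standing in for a face recognizer's output.
reward = multi_id_alignment_reward(np.random.randn(2, 512),
                                   np.random.randn(3, 512))
```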
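GSPO stabilizes training by importance-weighting whole sampled generations rather than individual tokens, with a length-normalized ratio and group-normalized rewards. The sketch below follows the public GSPO formulation; how MultiCrafter defines per-step log-probabilities over diffusion sampling trajectories is an assumption not covered by this summary.

```python
import torch

def gspo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
              rewards: torch.Tensor, lengths: torch.Tensor,
              clip_eps: float = 3e-4) -> torch.Tensor:
    """Group Sequence Policy Optimization surrogate for one sampled group.

    logp_new / logp_old: (G,) summed log-probabilities of each generation
                         under the current / behavior policy.
    rewards:             (G,) scalar rewards (e.g. identity fidelity plus
                         aesthetic and prompt-alignment terms).
    lengths:             (G,) number of steps per generation.
    """
    # Group-relative advantages: normalize rewards within the group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Sequence-level importance ratio with length normalization; it stays
    # close to 1, hence the much tighter clip range than token-level PPO.
    ratio = torch.exp((logp_new - logp_old) / lengths)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * adv, clipped * adv).mean()
```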
Project Page: https://wutao-cs.github.io/MultiCrafter/
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MOSAIC: Multi-Subject Personalized Generation via Correspondence-Aware Alignment and Disentanglement (2025)
- ComposeMe: Attribute-Specific Image Prompts for Controllable Human Image Generation (2025)
- FocusDPO: Dynamic Preference Optimization for Multi-Subject Personalized Image Generation via Adaptive Focus (2025)
- TIDE: Achieving Balanced Subject-Driven Image Generation via Target-Instructed Diffusion Enhancement (2025)
- CoAR: Concept Injection into Autoregressive Models for Personalized Text-to-Image Generation (2025)
- Lynx: Towards High-Fidelity Personalized Video Generation (2025)
- UMO: Scaling Multi-Identity Consistency for Image Customization via Matching Reward (2025)