DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Abstract
The frequency-DeCoupled pixel diffusion framework improves image generation efficiency and quality by separating high-frequency details from low-frequency semantics, achieving superior performance compared to existing pixel diffusion models.
Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This approach avoids the limitations of VAEs in two-stage latent diffusion, offering higher model capacity. Existing pixel diffusion models suffer from slow training and inference, as they usually model both high-frequency signals and low-frequency semantics within a single diffusion transformer (DiT). To pursue a more efficient pixel diffusion paradigm, we propose the frequency-DeCoupled pixel diffusion framework. With the intuition to decouple the generation of high- and low-frequency components, we leverage a lightweight pixel decoder to generate high-frequency details conditioned on semantic guidance from the DiT. This frees the DiT to specialize in modeling low-frequency semantics. In addition, we introduce a frequency-aware flow-matching loss that emphasizes visually salient frequencies while suppressing insignificant ones. Extensive experiments show that DeCo achieves superior performance among pixel diffusion models, attaining an FID of 1.62 (256x256) and 2.22 (512x512) on ImageNet, closing the gap with latent diffusion methods. Furthermore, our pretrained text-to-image model achieves a leading overall score of 0.86 on GenEval in system-level comparison. Code is publicly available at https://github.com/Zehong-Ma/DeCo.
Community
DeCo: Frequency-Decoupled Pixel Diffusion for End-to-End Image Generation
Arxiv: https://arxiv.org/abs/2511.19365
Project Page: https://zehong-ma.github.io/DeCo
Code Repository: https://github.com/Zehong-Ma/DeCo
Huggingface Space: https://14467288703cf06a3c.gradio.live/
🖼️ Background
- Pixel diffusion aims to generate images directly in pixel space in an end-to-end fashion. This avoids the two-stage training and the inevitable low-level artifacts of VAEs.
- Current pixel diffusion models suffer from slow training since a single Diffusion Transformer (DiT) is required to jointly model complex high-frequency signals and low-frequency semantics. Modeling complex high-frequency signals, especially high-frequency noise, can distract the DiT from learning low-frequency semantics.
- JiT proposes that high-dimensional noise may distract the model from learning low-dimensional data, which is also a form of high-frequency interference. Additionally, the intrinsic noise (e.g., camera noise) in the clean image is also high-frequency noise that requires modeling. Our DeCo jointly models these high-frequency signals (the Gaussian noise in JiT, intrinsic camera noise, and high-frequency details) in an end-to-end manner.
- Motivation: The paper proposes the frequency-DeCoupled (DeCo) framework to separate the modeling of high- and low-frequency components. A lightweight Pixel Decoder is introduced to model the high-frequency components, thereby freeing the DiT to specialize in modeling low-frequency semantics.
💡 Method
- The DiT operates on a downsampled, low-resolution input to generate low-frequency semantic conditions. The Pixel Decoder then takes the full-resolution input and uses the DiT's semantic condition as guidance to predict the velocity. The AdaLN-Zero interaction mechanism modulates the dense features in the Pixel Decoder with the DiT output (see the sketch after this list).
- The paper also proposes a frequency-aware flow-matching loss. It applies adaptive weights to different frequency components. These weights are derived from the normalized reciprocal of JPEG quantization tables, which assigns higher weights to perceptually more important low-frequency components and suppresses insignificant high-frequency noise (a loss sketch follows below).
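To make the interaction mechanism concrete, here is a minimal PyTorch sketch of how an AdaLN-Zero block in the Pixel Decoder could consume the DiT's semantic condition. The names and sizes (`AdaLNZeroBlock`, `hidden_dim`, `cond_dim`) are illustrative assumptions, not the repository's actual API; the zero-initialized conditioning projection is the defining trait of AdaLN-Zero.

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Sketch of an AdaLN-Zero block for the Pixel Decoder (assumed design).

    `cond` stands in for the DiT's semantic condition, broadcast/upsampled
    to the decoder's dense token grid.
    """

    def __init__(self, hidden_dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )
        # Zero init => shift/scale/gate start at zero, so the block starts
        # as an identity map: the defining trait of AdaLN-Zero.
        self.ada = nn.Linear(cond_dim, 3 * hidden_dim)
        nn.init.zeros_(self.ada.weight)
        nn.init.zeros_(self.ada.bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, hidden_dim) dense pixel tokens; cond: (B, N, cond_dim).
        shift, scale, gate = self.ada(cond).chunk(3, dim=-1)
        h = self.norm(x) * (1 + scale) + shift  # modulate with DiT semantics
        return x + gate * self.mlp(h)           # gated residual update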
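And a hedged sketch of the frequency-aware flow-matching loss. The 8x8 block size, the orthonormal DCT, and the mean-normalization of the reciprocal quantization table are assumptions chosen to match the JPEG analogy above, not the paper's exact recipe.

```python
import math
import torch

# Standard JPEG luminance quantization table (8x8). Small entries mark
# perceptually important frequencies, so their reciprocals get large weights.
JPEG_Q = torch.tensor([
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
], dtype=torch.float32)
FREQ_W = 1.0 / JPEG_Q
FREQ_W = FREQ_W / FREQ_W.mean()  # normalized reciprocal weights (assumed)

def dct_mat(n: int = 8) -> torch.Tensor:
    """Orthonormal DCT-II basis matrix; row k is frequency k."""
    i = torch.arange(n, dtype=torch.float32)
    m = torch.cos(math.pi * (2 * i[None, :] + 1) * i[:, None] / (2 * n))
    m = m * math.sqrt(2.0 / n)
    m[0] /= math.sqrt(2.0)
    return m

def freq_aware_fm_loss(v_pred: torch.Tensor, v_target: torch.Tensor) -> torch.Tensor:
    """Flow-matching loss with per-frequency weights on the velocity error.

    v_pred, v_target: (B, C, H, W) with H and W divisible by 8.
    """
    err = v_pred - v_target
    # Split into non-overlapping 8x8 blocks: (B, C, H/8, W/8, 8, 8).
    blocks = err.unfold(2, 8, 8).unfold(3, 8, 8)
    d = dct_mat(8).to(err.device)
    coef = d @ blocks @ d.T            # 2D DCT of each block
    w = FREQ_W.to(err.device)
    return (w * coef.pow(2)).mean()    # frequency-weighted squared error
```

In training, this would stand in for the plain MSE between predicted and target velocities, so low-frequency errors dominate the objective while high-frequency noise contributes little gradient.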
📈 Experiments
- The authors trained the DeCo-XL model with a DiT patch size of 16 on ImageNet 256x256 and 512x512. DeCo-XL achieves a leading FID of 1.62 on ImageNet 256x256 and 2.22 on ImageNet 512x512. Under the same setting of 50 Heun steps at 600 epochs, DeCo's FID of 1.69 surpasses JiT's 1.86.
- To test scaling ability in text-to-image generation, a DeCo-XXL model was trained on the BLIP3o dataset (36M pretraining images + 60k instruction-tuning samples). It achieves an overall score of 0.86 on GenEval and a competitive average score of 81.4 on DPG-Bench.
Pretty cool, and it could possibly generalize to higher resolutions. Are there any plans to expand this further? ControlNets, maybe? All recent SOTA models completely fail at adhering precisely to control signals; SDXL is the last model capable of doing it properly... and this approach seems able to incorporate ControlNets naturally.