D2D: Detector-to-Differentiable Critic for Improved Numeracy in Text-to-Image Generation
Abstract
A novel framework, Detector-to-Differentiable (D2D), transforms non-differentiable detection models into differentiable critics to improve object counting accuracy in text-to-image diffusion models with minimal impact on image quality.
Text-to-image (T2I) diffusion models have achieved strong performance in semantic alignment, yet they still struggle with generating the correct number of objects specified in prompts. Existing approaches typically incorporate auxiliary counting networks as external critics to enhance numeracy. However, since these critics must provide gradient guidance during generation, they are restricted to regression-based models that are inherently differentiable, thus excluding detector-based models with superior counting ability, whose count-via-enumeration nature is non-differentiable. To overcome this limitation, we propose Detector-to-Differentiable (D2D), a novel framework that transforms non-differentiable detection models into differentiable critics, thereby leveraging their superior counting ability to guide numeracy generation. Specifically, we design custom activation functions to convert detector logits into soft binary indicators, which are then used to optimize the noise prior at inference time with pre-trained T2I models. Our extensive experiments on SDXL-Turbo, SD-Turbo, and Pixart-DMD across four benchmarks of varying complexity (low-density, high-density, and multi-object scenarios) demonstrate consistent and substantial improvements in object counting accuracy (e.g., boosting up to 13.7% on D2D-Small, a 400-prompt, low-density benchmark), with minimal degradation in overall image quality and computational overhead.
Community
We show a new way to convert high-accuracy, but non-differentiable, detectors into differentiable critics to improve numeracy in text-to-image generation.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- Vision-Language Models as Differentiable Semantic and Spatial Rewards for Text-to-3D Generation (2025)
- Aligning Visual Foundation Encoders to Tokenizers for Diffusion Models (2025)
- MaskAttn-SDXL: Controllable Region-Level Text-To-Image Generation (2025)
- No Concept Left Behind: Test-Time Optimization for Compositional Text-to-Image Generation (2025)
- Free Lunch Alignment of Text-to-Image Diffusion Models without Preference Image Pairs (2025)
- TTOM: Test-Time Optimization and Memorization for Compositional Video Generation (2025)
- Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback (2025)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any Paper on Hugging Face checkout this Space
You can directly ask Librarian Bot for paper recommendations by tagging it in a comment:
@librarian-bot
recommend
Models citing this paper 0
No model linking this paper
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper