When Do Diffusion Models Learn to Generate Multiple Objects?
Abstract
Diffusion models struggle with multi-object generation due to scene complexity rather than concept imbalance, with counting being particularly challenging in low-data regimes.
Text-to-image diffusion models achieve impressive visual fidelity, yet they remain unreliable in multi-object generation. Despite extensive empirical evidence of these failures, the underlying causes remain unclear. We begin by asking how much of this limitation arises from the data itself. To disentangle data effects, we consider two regimes across different dataset sizes: (1) concept generalization, where each individual concept is observed during training under potentially imbalanced data distributions, and (2) compositional generalization, where specific combinations of concepts are systematically held out. To study these regimes, we introduce mosaic (Multi-Object Spatial relations, AttrIbution, Counting), a controlled framework for dataset generation. By training diffusion models on mosaic, we find that scene complexity, rather than concept imbalance, plays the dominant role, and that counting is uniquely difficult to learn in low-data regimes. Moreover, compositional generalization collapses as more concept combinations are held out during training. These findings highlight fundamental limitations of diffusion models and motivate stronger inductive biases and data design for robust multi-object compositional generation.
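As a rough illustration of the two data regimes described above, the sketch below shows one way such training splits could be constructed: a skewed concept-frequency distribution for concept generalization, and fully withheld concept combinations for compositional generalization. All names and parameters here (OBJECTS, COLORS, COUNTS, imbalance, holdout_frac) are illustrative assumptions, not the paper's actual mosaic implementation.

```python
# Hypothetical sketch of the two data regimes from the abstract; the vocabulary
# and function names are assumptions for illustration, not the mosaic codebase.
import itertools
import random

OBJECTS = ["cube", "sphere", "cone"]   # assumed concept vocabulary
COLORS = ["red", "blue", "green"]
COUNTS = [1, 2, 3, 4]

# Every (object, color, count) triple the generator could render.
ALL_COMBOS = list(itertools.product(OBJECTS, COLORS, COUNTS))


def concept_generalization_split(imbalance: float, n_train: int, seed: int = 0):
    """Regime 1: every concept appears in training, but with a skewed
    (geometrically decaying) frequency distribution."""
    rng = random.Random(seed)
    weights = [imbalance ** i for i in range(len(ALL_COMBOS))]
    train = rng.choices(ALL_COMBOS, weights=weights, k=n_train)
    return train, ALL_COMBOS  # evaluate on the full combination grid


def compositional_generalization_split(holdout_frac: float, seed: int = 0):
    """Regime 2: specific concept combinations are withheld entirely from
    training and only appear at evaluation time."""
    rng = random.Random(seed)
    combos = ALL_COMBOS[:]
    rng.shuffle(combos)
    n_holdout = int(holdout_frac * len(combos))
    held_out, train = combos[:n_holdout], combos[n_holdout:]
    return train, held_out


def to_prompt(combo):
    """Turn a (object, color, count) triple into a simple text prompt."""
    obj, color, count = combo
    plural = obj if count == 1 else obj + "s"
    return f"{count} {color} {plural}"


if __name__ == "__main__":
    train, held_out = compositional_generalization_split(holdout_frac=0.25)
    print("example held-out prompt:", to_prompt(held_out[0]))
```

Under this reading, varying n_train and holdout_frac corresponds to the different dataset sizes and held-out fractions the abstract refers to when it says compositional generalization collapses as more combinations are withheld.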
Community
Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- Erasure or Erosion? Evaluating Compositional Degradation in Unlearned Text-To-Image Diffusion Models (2026)
- Repurposing Geometric Foundation Models for Multi-view Diffusion (2026)
- Delta-K: Boosting Multi-Instance Generation via Cross-Attention Augmentation (2026)
- Disentangled Textual Priors for Diffusion-based Image Super-Resolution (2026)
- GramSR: Visual Feature Conditioning for Diffusion-Based Super-Resolution (2026)
- Novel View Synthesis as Video Completion (2026)
- BlendFusion -- Scalable Synthetic Data Generation for Diffusion Model Training (2026)
