arxiv:2507.04151

Unlocking Compositional Control: Self-Supervision for LVLM-Based Image Generation

Published on Jul 5, 2025

Authors:

Abstract

Hi-SSLVLM advances text-to-image synthesis through a two-stage self-supervised learning approach that improves compositional accuracy and semantic consistency without relying on large annotated datasets.

AI-generated summary

This paper introduces Hierarchical Self-Supervised LVLM (Hi-SSLVLM), a novel generative model designed to significantly advance text-to-image synthesis, particularly for complex and compositionally challenging prompts. Traditional methods often grapple with the high cost of meticulously curated paired image-text datasets and struggle with precise control over fine-grained visual attributes and intricate spatial relationships. Our Hi-SSLVLM addresses these limitations through a unique two-stage self-supervised learning strategy. The first stage, Multi-Granularity Visual-Language Grounding, enables the Large Vision-Language Model (LVLM) backbone to autonomously generate and align hierarchical captions (global and local) to images, cultivating a deep internal semantic understanding without reliance on extensive human annotation. The second stage, Self-Refinement and Guided Image Generation, leverages this acquired knowledge by an Internal Compositional Planning (ICP) mechanism, where the LVLM first formulates detailed textual sub-prompts to guide the image generation process, complemented by a novel Semantic Consistency Loss for precise output alignment. Comprehensive experiments against leading baselines, including Janus-Pro-1B, Stable Diffusion XL 1.0, DeepFloyd IF v1.0, and ControlNet-XL, on multi-dimensional benchmarks such as Gemini-2.0-Flash and InternVL3-78B, demonstrate Hi-SSLVLM's superior performance across all fine-grained metrics. An in-depth ablation study confirms the critical role of each proposed component. Furthermore, human evaluations corroborate our quantitative findings, highlighting Hi-SSLVLM's enhanced fidelity to prompt, compositional accuracy, and overall aesthetic quality, marking a significant step towards more controllable and semantically consistent open-ended text-to-image generation.

View arXiv page View PDF Add to collection

Community

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment

Upvote

Models citing this paper 0

No model linking this paper

Cite arxiv.org/abs/2507.04151 in a model README.md to link it from this page.

Datasets citing this paper 0

No dataset linking this paper

Cite arxiv.org/abs/2507.04151 in a dataset README.md to link it from this page.

Spaces citing this paper 0

No Space linking this paper

Cite arxiv.org/abs/2507.04151 in a Space README.md to link it from this page.

Collections including this paper 0

No Collection including this paper

Add this paper to a collection to link it from this page.