
Fudan University and ByteDance propose a new layout-to-image paradigm: controllable, layout-guided image generation on the MM-DiT architecture

#News ·2025-01-07

This post introduces the paper "CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation", a new paradigm proposed by Fudan University and ByteDance that supports layout-controlled image generation on the MM-DiT architecture.

Example results

Paper introduction

Task background

Layout-to-image (L2I) is a controllable image generation technique driven by layout information, that is, the spatial position and description of each entity in the image. For example, a user might specify the following entities and their locations: Iron Man standing on a rock, holding a drawing board with the word "CreatiLayout" written on it in a hand-drawn font, with the sea and a sunset in the background. A layout-to-image model generates an image that matches this specification.
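To make "layout information" concrete, here is a minimal sketch of how such a layout could be represented as data. The class names, the normalized `(x_min, y_min, x_max, y_max)` box convention, and the coordinate values are assumptions chosen for illustration; they are not the paper's actual data format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entity:
    """One entity in a layout: what it is and where it goes (hypothetical schema)."""
    description: str
    # Bounding box as (x_min, y_min, x_max, y_max), normalized to [0, 1].
    bbox: Tuple[float, float, float, float]


@dataclass
class Layout:
    """A global caption plus the entities placed on the canvas."""
    caption: str
    entities: List[Entity]

    def validate(self) -> None:
        """Raise ValueError if any box is degenerate or outside the canvas."""
        for e in self.entities:
            x0, y0, x1, y1 = e.bbox
            if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
                raise ValueError(f"invalid bbox for {e.description!r}: {e.bbox}")


# The Iron Man example from the text; the coordinates are made up.
layout = Layout(
    caption="Iron Man stands on a rock holding a drawing board, "
            "with the sea and a sunset in the background.",
    entities=[
        Entity("Iron Man standing on a rock", (0.30, 0.15, 0.70, 0.85)),
        Entity('a drawing board with "CreatiLayout" in a hand-drawn font',
               (0.40, 0.45, 0.65, 0.65)),
        Entity("a rock", (0.20, 0.80, 0.80, 1.00)),
    ],
)
layout.validate()
```

Each entity pairs a free-text description with a region, which is what lets a model control both what is drawn and where.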

Layout-to-image further unlocks the capabilities of text-to-image models, giving users more precise control and more room for creative expression. It has broad application prospects in game development, animation production, interior design, creative design, and similar scenarios.

Previous layout-to-image models have the following main problems:

Layout data problem: Existing layout datasets are small, closed-set, and only coarsely annotated. This limits how well models generalize to open-set entities and how accurately they render entities with complex attributes.
Model architecture problem: Previous methods focused on U-Net architectures such as SD1.5 and SDXL. With the rise of MM-DiT, however, text-to-image models such as SD3 and FLUX have reached a new level of visual quality and text adherence. Applying the U-Net layout control paradigm directly to MM-DiT weakens the accuracy of layout control, so a new framework designed for MM-DiT is needed to integrate layout information efficiently and realize its full potential.
User experience problem: Many existing methods accept only bounding boxes as the way to specify entity locations and cannot handle more flexible inputs such as center points, masks, sketches, or plain text descriptions, which limits the user experience. These methods also do not support refining a user's layout by adding, removing, or modifying entities.

Method introduction

To address these problems in data, model architecture, and user experience, CreatiLayout proposes targeted solutions that achieve higher-quality and more controllable layout-to-image generation.

Large-scale, fine-grained layout dataset

LayoutSAM: CreatiLayout builds an automatic layout annotation pipeline and uses it to produce LayoutSAM, a large-scale layout dataset containing 2.7 million image-text pairs and 10.7 million entity annotations (roughly four entities per image on average). LayoutSAM is filtered from the SAM dataset and features open-set entities, fine-grained annotations, and high image quality. Each entity has a bounding box and a detailed description covering complex attributes such as color, shape, and texture. This gives the model the data it needs to better understand and learn layout information. On top of this, CreatiLayout builds LayoutSAM-Eval, a layout-to-image evaluation benchmark that assesses a model's layout control, image quality, and text adherence.

A model architecture that treats layout as a modality

SiamLayout: CreatiLayout proposes the SiamLayout framework, which introduces layout information into MM-DiT. It effectively alleviates competition between modalities, strengthens the guiding role of the layout, and achieves more accurate layout control than alternative network designs. The core design points are:

Treat layout as an independent modality, on equal footing with the text and image modalities, so that layout information guides image content more strongly.
Let the layout modality interact with the image modality through MM-DiT's native MM-Attention, which preserves its strengths in cross-modal interaction.
Decouple the three modalities (image, text, and layout) into two siamese branches, an image-text interaction branch and an image-layout interaction branch, so that text and layout each play their own role without interfering with each other.
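The decoupling into two branches can be illustrated with a toy, pure-Python sketch. It is only a schematic under strong assumptions: attention is single-head with no learned projections, and the two branch outputs are merged by a weighted sum with a hypothetical weight `alpha`. The text above does not say how the branches are fused, so that step in particular is an assumption and not the paper's implementation.

```python
import math
import random
from typing import List

Matrix = List[List[float]]  # one row per token


def softmax(row: List[float]) -> List[float]:
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def joint_attention(tokens: Matrix) -> Matrix:
    """Single-head self-attention over a concatenated token sequence.

    Queries, keys, and values are the tokens themselves (no learned
    projections), which keeps the sketch short.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(dim)])
    return out


def siamese_block(image: Matrix, text: Matrix, layout: Matrix, alpha: float = 1.0) -> Matrix:
    """Update image tokens using two separate branches (schematic only)."""
    n = len(image)
    # Branch 1: image tokens attend jointly with text tokens only.
    img_from_text = joint_attention(image + text)[:n]
    # Branch 2: image tokens attend jointly with layout tokens only.
    img_from_layout = joint_attention(image + layout)[:n]
    # ASSUMPTION: fuse the two image streams with a weighted sum.
    return [[a + alpha * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(img_from_text, img_from_layout)]


random.seed(0)
dim = 4
image = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(6)]
text = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
layout = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
fused = siamese_block(image, text, layout)
```

Because text and layout tokens never share a softmax, neither can crowd the other out of the attention weights, which is one way to read the "modality competition" point above.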

A layout designer that supports layout generation and optimization

LayoutDesigner: CreatiLayout proposes LayoutDesigner, which uses a large language model for layout planning. It generates and optimizes layouts from a variety of user inputs (center points, masks, sketches, or text descriptions), which allows more flexible input, and it provides layout optimization functions such as adding, deleting, and modifying entities. This makes it easier for users to express their design intent and to obtain more harmonious, aesthetically pleasing layouts.
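As a rough illustration of what "adding, deleting, and modifying entities" means at the data level, the sketch below applies such edits to a layout stored as a plain dictionary. The function names and the dictionary schema are hypothetical. LayoutDesigner itself relies on a large language model to decide which edits to make; this sketch only shows the edits being applied.

```python
from typing import Dict, Tuple

BBox = Tuple[float, float, float, float]  # normalized (x_min, y_min, x_max, y_max)
Layout = Dict[str, BBox]                  # entity description -> bounding box


def add_entity(layout: Layout, description: str, bbox: BBox) -> Layout:
    """Return a copy of the layout with one more entity."""
    if description in layout:
        raise KeyError(f"entity already present: {description}")
    new = dict(layout)
    new[description] = bbox
    return new


def remove_entity(layout: Layout, description: str) -> Layout:
    """Return a copy of the layout without the given entity."""
    if description not in layout:
        raise KeyError(f"no such entity: {description}")
    new = dict(layout)
    del new[description]
    return new


def modify_entity(layout: Layout, description: str, bbox: BBox) -> Layout:
    """Return a copy of the layout with one entity moved or resized."""
    if description not in layout:
        raise KeyError(f"no such entity: {description}")
    new = dict(layout)
    new[description] = bbox
    return new


# Illustrative edit sequence with made-up coordinates.
draft: Layout = {"a wooden bench": (0.10, 0.60, 0.60, 0.90)}
step1 = add_entity(draft, "a red pencil", (0.65, 0.70, 0.90, 0.80))
step2 = modify_entity(step1, "a wooden bench", (0.05, 0.55, 0.55, 0.90))
step3 = remove_entity(step2, "a red pencil")
```

Returning a new dictionary at each step keeps every intermediate layout available, which is convenient if a user wants to undo an edit.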

Experimental results

Comparison with SOTA methods on layout-to-image generation

On fine-grained, open-set layout-to-image generation, CreatiLayout outperforms previous SOTA methods at rendering region-level attributes such as spatial position, color, texture, and shape. In overall image quality, CreatiLayout also shows better visual quality and text adherence. The visualizations further confirm these advantages, for example the more accurate rendering of the text "HELLO FRIENDS" and the correct colors of the pencils and benches. You can try CreatiLayout's layout-to-image capabilities in the project demo.

Comparison with SOTA methods on layout generation and optimization

Quantitative and qualitative experiments on layout planning compare the layout generation and optimization capabilities of different layout planners at different granularities of user input. LayoutDesigner performs well on layout planning from global captions, center points, and bounding boxes, achieving 100% format accuracy, which indicates that it reliably produces layouts in the required format. In addition, images generated from layouts planned by LayoutDesigner are of higher quality and more aesthetically pleasing. By contrast, layouts generated by Llama3.1 often miss key elements, and layouts generated by GPT4 often violate basic physical laws; images generated from these sub-optimal layouts have poor quality and low text adherence.
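The idea of format accuracy can be sketched as the fraction of raw model outputs that parse into a well-formed layout. The JSON schema below (a list of objects with a `description` string and a four-number normalized `bbox`) is an assumption made for illustration; the article does not specify the format the benchmark actually checks.

```python
import json
from typing import List


def is_well_formed(raw: str) -> bool:
    """Check one raw LLM output against the assumed layout schema."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, list) or not data:
        return False
    for item in data:
        if not isinstance(item, dict):
            return False
        desc = item.get("description")
        bbox = item.get("bbox")
        if not isinstance(desc, str) or not isinstance(bbox, list) or len(bbox) != 4:
            return False
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in bbox):
            return False
        x0, y0, x1, y1 = bbox
        if not (0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1):
            return False
    return True


def format_accuracy(outputs: List[str]) -> float:
    """Fraction of outputs that are well-formed layouts (0.0 for an empty list)."""
    if not outputs:
        return 0.0
    return sum(is_well_formed(o) for o in outputs) / len(outputs)


# Made-up outputs: one valid, one not JSON, one missing its bounding box.
outputs = [
    '[{"description": "a blue bench", "bbox": [0.1, 0.5, 0.6, 0.9]}]',
    "Sure! Here is your layout: bench at the left.",
    '[{"description": "a yellow pencil"}]',
]
accuracy = format_accuracy(outputs)
```

A check like this only measures whether the output is parseable and structurally valid; it says nothing about whether the planned layout is plausible, which is why the article also compares the quality of the resulting images.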
