#News ·2025-01-07
CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation is a paper from Fudan University and ByteDance that proposes a new paradigm for layout-to-image generation. It supports layout-controllable image generation on the MM-DiT architecture.

Effect example

Layout-to-Image (L2I) is a controllable image generation technique conditioned on layout information, that is, the spatial position and description of each entity in the image. For example, a user might specify the descriptions and spatial locations of the following entities: Iron Man standing on a rock, holding a drawing board with the word "CreatiLayout" written on it in a hand-drawn font, with the sea and a sunset in the background. A Layout-to-Image model then generates an image that matches this specification.
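The kind of layout described above can be written down as a small data structure. The following is a minimal sketch, not CreatiLayout's actual input format: the class names, the normalized (x0, y0, x1, y1) box convention, and the example coordinates are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical box convention: (x0, y0, x1, y1), normalized to [0, 1].
Box = Tuple[float, float, float, float]


@dataclass
class Entity:
    description: str  # free-form text describing the entity
    box: Box          # where the entity should appear in the image


@dataclass
class Layout:
    global_caption: str     # description of the whole image
    entities: List[Entity]  # one entry per controlled entity


def to_pixels(box: Box, width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized box to integer pixel coordinates."""
    x0, y0, x1, y1 = box
    return (round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height))


# Illustrative layout for the Iron Man example; the coordinates are made up.
layout = Layout(
    global_caption=("Iron Man is standing on a rock holding a drawing board, "
                    "with the sea and sunset in the background."),
    entities=[
        Entity("Iron Man standing on a rock", (0.25, 0.125, 0.75, 1.0)),
        Entity('a drawing board with the word "CreatiLayout" '
               "in a hand-drawn font", (0.375, 0.5, 0.625, 0.75)),
    ],
)

for entity in layout.entities:
    print(entity.description, to_pixels(entity.box, 1024, 1024))
```

Normalized coordinates keep the layout independent of the output resolution; a generator that works in pixels can convert at the last step.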

Layout-to-Image further unlocks the capabilities of Text-to-Image models, giving users more precise control and richer channels for creative expression. It has broad application prospects in game development, animation production, interior design, creative design, and similar scenarios.
Previous Layout-to-Image methods have shortcomings in several areas, including data, model design, and user experience. To address them, CreatiLayout proposes targeted solutions that achieve higher-quality, more controllable layout-to-image generation.
LayoutSAM

CreatiLayout builds an automatic layout annotation pipeline and uses it to construct LayoutSAM, a large-scale layout dataset containing 2.7 million image-text pairs and 10.7 million entity annotations (about four entities per image on average). LayoutSAM is filtered from the SAM dataset and features open-set entities, fine-grained annotations, and high image quality. Each entity has a bounding box and a detailed description covering complex attributes such as color, shape, and texture. This gives the model the data it needs to better understand and learn layout information. On top of this dataset, CreatiLayout builds LayoutSAM-Eval, a Layout-to-Image evaluation benchmark that comprehensively measures a model's layout control, image quality, and text adherence.

SiamLayout

CreatiLayout proposes the SiamLayout framework, which introduces layout information into MM-DiT. It effectively alleviates competition between modalities, strengthens the guiding role of the layout, and achieves more accurate layout control than alternative network designs. The core design points are:

LayoutDesigner

CreatiLayout proposes LayoutDesigner, which uses a large language model for layout planning. It generates and optimizes layouts from several kinds of user input (center points, masks, sketches, or a text description), which makes input more flexible, and it provides layout optimization functions such as adding, deleting, and modifying entities. This makes it easier for users to express their design intent and produces more harmonious, aesthetically pleasing layouts.
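To make the add, delete, and modify operations concrete, here is a minimal sketch of them as plain functions over a layout stored as a list of dictionaries. It does not model the large language model that LayoutDesigner uses to decide which edits to make; the function names, the dictionary keys, and the box values are hypothetical.

```python
from copy import deepcopy


def add_entity(layout, description, box):
    """Return a new layout with one more entity appended."""
    new_layout = deepcopy(layout)
    new_layout.append({"description": description, "box": list(box)})
    return new_layout


def delete_entity(layout, index):
    """Return a new layout without the entity at `index`."""
    return [deepcopy(e) for i, e in enumerate(layout) if i != index]


def modify_entity(layout, index, description=None, box=None):
    """Return a new layout with the entity at `index` updated."""
    new_layout = deepcopy(layout)
    if description is not None:
        new_layout[index]["description"] = description
    if box is not None:
        new_layout[index]["box"] = list(box)
    return new_layout


# Illustrative values only; boxes are normalized (x0, y0, x1, y1).
layout = [{"description": "a bench", "box": [0.1, 0.5, 0.6, 0.9]}]
with_pencil = add_entity(layout, "a pencil", (0.2, 0.2, 0.3, 0.4))
recolored = modify_entity(with_pencil, 0, description="a red bench")
bench_only = delete_entity(recolored, 1)
```

Each function returns a copy instead of mutating its input, so earlier versions of the layout stay available if an edit needs to be undone.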



On fine-grained, open-set layout-to-image generation, CreatiLayout outperforms previous SOTA methods in rendering region-level attributes such as spatial positioning, color, texture, and shape. In overall image quality, CreatiLayout also shows better visual quality and text adherence. The visualizations further confirm these advantages, for example the more accurate rendering of the text "HELLO FRIENDS" and the generation of pencils and benches in different colors. You can try CreatiLayout's Layout-to-Image capabilities yourself in the project demo.



Quantitative and qualitative experiments on layout planning demonstrate the layout generation and optimization capabilities of different layout planners across user inputs of different granularity. LayoutDesigner performs well on layout planning from global captions, center points, and bounding boxes, achieving 100% format accuracy, which indicates that the layouts it produces comply with the required format. In addition, images generated from layouts planned by LayoutDesigner are of higher quality and more aesthetically pleasing. By contrast, layouts generated by Llama3.1 often lack key elements, while layouts generated by GPT4 often violate basic physical laws; images generated from these sub-optimal layouts have poor image quality and low text adherence.
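One way to read "format accuracy" is as the fraction of planner outputs that parse into a well-formed layout. The sketch below works under that assumption; the JSON schema (a list of objects with "description" and "box" keys) and the validity rules are hypothetical and are not the benchmark's actual definition.

```python
import json


def is_valid_layout(text):
    """Check that `text` is a JSON list of entities with a description and a box."""
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, list) or not data:
        return False
    for item in data:
        if not isinstance(item, dict):
            return False
        description = item.get("description")
        box = item.get("box")
        if not isinstance(description, str) or not description.strip():
            return False
        if not isinstance(box, list) or len(box) != 4:
            return False
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not all(isinstance(v, (int, float)) and not isinstance(v, bool)
                   for v in box):
            return False
        x0, y0, x1, y1 = box
        # Assumed convention: normalized coordinates with positive area.
        if not (0 <= x0 < x1 <= 1 and 0 <= y0 < y1 <= 1):
            return False
    return True


def format_accuracy(outputs):
    """Fraction of planner outputs that are well-formed layouts."""
    if not outputs:
        return 0.0
    return sum(is_valid_layout(o) for o in outputs) / len(outputs)


outputs = [
    '[{"description": "a bench", "box": [0.1, 0.5, 0.6, 0.9]}]',
    '[{"description": "a pencil", "box": [0.4, 0.2]}]',  # box has only 2 values
]
print(format_accuracy(outputs))
```

A check like this only measures whether a layout is well-formed. Whether the layout is complete or physically plausible, the failure modes noted above for Llama3.1 and GPT4, has to be assessed separately.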