Beihang | MV-Adapter, the first multi-functional plug-and-play adapter: easily achieve multi-view consistent image generation

#News · 2025-01-07

This article is reprinted with the authorization of the AIGC Studio official account; please contact the source for reprint permission.

Beihang presents MV-Adapter, the first multi-functional plug-and-play adapter for multi-view image generation. It enhances T2I models and their derivatives without changing the original network structure or feature space. MV-Adapter achieves multi-view image generation at resolutions up to 768 on SDXL and demonstrates excellent adaptability and versatility. It can also be extended to arbitrary-view generation, opening the door to a wider range of applications.

Row 1 of the figure below shows the results of combining MV-Adapter with personalized T2I models, distilled few-step T2I models, and ControlNet, demonstrating its adaptability. Row 2 shows results under various control signals, including view-guided and geometry-guided generation from text or image input, demonstrating its versatility.

[Image]

Related links

  • Code: https://github.com/huanngzh/MV-Adapter
  • Paper: https://arxiv.org/abs/2412.03632
  • Home page: https://huanngzh.github.io/MV-Adapter-Page/
  • Try: https://huggingface.co/spaces/VAST-AI/MV-Adapter-I2MV-SDXL
  • ComfyUI: https://github.com/huanngzh/ComfyUI-MVAdapter

Paper introduction

MV-Adapter: easily generate multi-view consistent images

Abstract

Existing multi-view image generation methods often make invasive modifications to pre-trained text-to-image (T2I) models and require full fine-tuning, which leads to the following issues:

  1. High computational cost, especially for large base models and high-resolution images.
  2. Degraded image quality, caused by optimization difficulties and the scarcity of high-quality 3D data.

The paper presents the first adapter-based multi-view image generation solution and introduces the MV-Adapter, a versatile plug-and-play adapter that enhances T2I models and their derivatives without changing the original network structure or feature space. By updating fewer parameters, MV-Adapter enables efficient training and preserves the prior knowledge embedded in the pre-trained model, reducing the risk of overfitting.
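To make the efficiency claim concrete, here is a minimal PyTorch sketch of adapter-style training, our illustration rather than the paper's code: the pre-trained backbone is frozen, only the newly added adapter parameters are optimized, and a zero-initialized output projection makes the combined model start from the unchanged pre-trained behavior.

```python
import torch
import torch.nn as nn

class AdapterBlock(nn.Module):
    """Hypothetical adapter layer trained alongside a frozen backbone block."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)  # zero init: adapter starts as a no-op
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)
        return self.proj(out)

# Stand-in for one pre-trained transformer block of the T2I U-Net.
backbone = nn.TransformerEncoderLayer(d_model=320, nhead=8, batch_first=True)
adapter = AdapterBlock(320)

for p in backbone.parameters():
    p.requires_grad_(False)  # the pre-trained weights are never updated

# Only the (much smaller) adapter receives gradients.
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

x = torch.randn(2, 77, 320)    # dummy token sequence
y = backbone(x) + adapter(x)   # residual combination of both branches
loss = y.pow(2).mean()         # placeholder loss, for illustration only
loss.backward()
optimizer.step()
```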

To effectively model 3D geometric knowledge within the adapter, the paper introduces innovative designs, including duplicated self-attention layers and a parallel attention architecture, which allow the adapter to inherit the powerful priors of the pre-trained model to model new 3D knowledge. In addition, a unified condition encoder is proposed that seamlessly integrates camera parameters and geometric information, facilitating text- and image-based 3D generation and texturing applications.
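The parallel attention idea can be sketched as follows; this is an assumption based on the description above, not the released implementation. The new multi-view attention layer duplicates the structure and weights of the frozen spatial self-attention, runs in parallel with it, and contributes through a zero-initialized projection, so the pre-trained feature space is left intact at the start of training.

```python
import copy
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    def __init__(self, spatial_attn: nn.MultiheadAttention, dim: int):
        super().__init__()
        self.spatial_attn = spatial_attn            # frozen, pre-trained branch
        self.mv_attn = copy.deepcopy(spatial_attn)  # duplicated weights as initialization
        self.out = nn.Linear(dim, dim)
        nn.init.zeros_(self.out.weight)             # zero init: no-op at the start
        nn.init.zeros_(self.out.bias)

    def forward(self, x: torch.Tensor, num_views: int) -> torch.Tensor:
        # x: (batch * num_views, tokens, dim), views of one object grouped together
        spatial, _ = self.spatial_attn(x, x, x)     # per-view attention, unchanged

        # Rearrange so tokens from all views of one object attend to each other.
        bv, n, d = x.shape
        mv = x.reshape(bv // num_views, num_views * n, d)
        mv, _ = self.mv_attn(mv, mv, mv)
        mv = mv.reshape(bv, n, d)

        return spatial + self.out(mv)               # parallel branches, summed

attn = nn.MultiheadAttention(320, num_heads=8, batch_first=True)
block = ParallelAttention(attn, 320)
x = torch.randn(2 * 4, 64, 320)    # 2 objects x 4 views, 64 tokens per view
y = block(x, num_views=4)          # same shape as x
```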

MV-Adapter achieves multi-view generation at 768 resolution on Stable Diffusion XL (SDXL) and demonstrates strong adaptability and versatility. It can also be extended to arbitrary-view generation, enabling a wider range of applications. MV-Adapter sets a new quality standard for multi-view image generation and, thanks to its efficiency, adaptability, and versatility, opens up new possibilities.

Method introduction

MV-Adapter is a plug-and-play adapter that learns multi-view priors, transfers to derivatives of T2I models without special tuning, and enables them to generate multi-view consistent images under a variety of conditions. At inference time, MV-Adapter consists of a condition guider (yellow) and decoupled attention layers (blue) that can be inserted directly into a personalized or distilled T2I model to form a multi-view generator.

MV-Adapter consists of two parts:

  1. A condition guider that encodes camera or geometric conditions;
  2. Decoupled attention layers, comprising a multi-view attention layer that learns multi-view consistency and an optional image cross-attention layer that supports image-conditioned generation.

For image-conditioned generation, a pre-trained U-Net is used to encode the reference image and extract fine-grained information.
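As a rough illustration of the condition guider, camera parameters can be rasterized into per-pixel ray maps and encoded by a lightweight convolutional network into multi-scale features that are added to the matching U-Net encoder features. The 6-channel ray-map encoding and the channel widths below are our assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ConditionGuider(nn.Module):
    """Lightweight encoder turning a per-view condition map into multi-scale residuals."""
    def __init__(self, in_ch: int = 6, widths=(320, 640, 1280)):
        super().__init__()
        blocks, prev = [], in_ch
        for w in widths:
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),  # downsample
                nn.SiLU(),
                nn.Conv2d(w, w, kernel_size=3, padding=1),
            ))
            prev = w
        self.blocks = nn.ModuleList(blocks)

    def forward(self, cond_map: torch.Tensor):
        feats, h = [], cond_map
        for block in self.blocks:
            h = block(h)
            feats.append(h)  # one residual feature per U-Net resolution
        return feats

guider = ConditionGuider()
ray_map = torch.randn(4, 6, 96, 96)  # 4 views; ray origin + direction per pixel (assumed encoding)
residuals = guider(ray_map)          # add these to the matching U-Net encoder features
```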

Results

Text to multiple views

[Image]

Image to multiple views

[Image]

Sketch to multiple views (using ControlNet)

[Image]

Text-conditioned 3D generation

[Image]

Image-conditioned 3D generation

[Image]

Text-conditioned texture generation

[Image]

Image-conditioned texture generation

[Image]

Trying it in ComfyUI

Integrating the MV-Adapter into ComfyUI allows users to generate multi-view consistent images from text prompts or a single image directly within the ComfyUI interface. See the link above for details.

  • Supports integration with SDXL LoRA
  • Generates multi-view consistent images from text prompts or a single image

[Images]
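Beyond ComfyUI, the hosted demo linked above can also be driven programmatically with gradio_client. The sketch below only discovers the Space's current API rather than hard-coding a call, since the exact endpoint signature is not reproduced here and may change:

```python
from gradio_client import Client

# Connect to the public Hugging Face Space for MV-Adapter (image-to-multi-view, SDXL).
client = Client("VAST-AI/MV-Adapter-I2MV-SDXL")

# Prints the available endpoints and the arguments they expect.
client.view_api()

# result = client.predict(..., api_name="/generate")  # fill in from view_api(); endpoint name is an assumption
```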
