Adobe proposes InstructMove, enabling instruction-based image editing by observing how things move in videos

#News ·2025-01-07

This article is reprinted with the authorization of the AIGC Studio public account; please contact the source for permission to reprint.

InstructMove is an instruction-based image editing model trained on pairs of video frames with editing instructions generated by a multimodal LLM. The model excels at non-rigid edits, such as adjusting a subject's pose or expression and changing the viewpoint, while maintaining consistency with the source image. In addition, the method supports precise local editing by integrating masks, human poses, and other control mechanisms.


Related links

  • Paper: http://arxiv.org/abs/2412.12087v1
  • Home page: https://ljzycmd.github.io/projects/InstructMove/

Paper introduction

Instruction-based image manipulation by watching how things move

Abstract

This paper introduces a novel dataset construction pipeline that samples pairs of frames from videos and uses a multimodal large language model (MLLM) to generate editing instructions for training an instruction-based image manipulation model. Video frames inherently preserve the identity of subjects and scenes, ensuring consistent content preservation during editing. In addition, video data captures diverse natural dynamics, such as non-rigid subject motion and complex camera movements, that are otherwise difficult to model, making it an ideal source for building scalable datasets. Using this approach, the authors create a new dataset to train InstructMove, a model capable of complex instruction-based manipulations that are difficult to achieve with synthetically generated datasets. The model demonstrates state-of-the-art performance on tasks such as adjusting subject poses, rearranging elements, and changing camera perspectives.

Method

Dataset construction pipeline:

  1. First, sample suitable frame pairs from videos so that the transformation between them is realistic and moderate.
  2. Prompt a multimodal large language model (MLLM) with these frame pairs to generate detailed editing instructions.
  3. This process yields a large-scale dataset of realistic image pairs with precise editing instructions; a minimal sketch of the pipeline follows this list.
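The sketch below is illustrative only, not the authors' actual code. It assumes OpenCV for video decoding, and the fixed frame gap, the prompt text, and the `query_mllm` helper are hypothetical placeholders for whatever multimodal LLM API is used.

```python
# Minimal sketch of the frame-pair sampling and instruction-generation steps.
import cv2


def query_mllm(prompt, images):
    """Hypothetical MLLM call; replace with your multimodal LLM API of choice."""
    raise NotImplementedError


def sample_frame_pair(video_path: str, gap_frames: int = 30):
    """Return a (source, target) frame pair separated by a moderate time gap."""
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()
    if len(frames) <= gap_frames:
        return None  # clip too short for the requested gap
    source = frames[0]
    target = frames[gap_frames]  # moderate motion: not identical, not a scene cut
    return source, target


INSTRUCTION_PROMPT = (
    "You are given two frames from the same video. Describe, as an imperative "
    "editing instruction, how to transform the first image into the second "
    "(pose, expression, camera viewpoint, layout)."
)


def build_training_example(video_path: str):
    pair = sample_frame_pair(video_path)
    if pair is None:
        return None
    source, target = pair
    instruction = query_mllm(INSTRUCTION_PROMPT, images=[source, target])
    return {"source": source, "target": target, "instruction": instruction}
```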

Overview of the model architecture for instruction-based image editing. The source and target images are first encoded into latent representations z_s and z_e using a pre-trained encoder. The target latent z_e is then converted into a noised latent z_e^t through the forward diffusion process. The source latent and the noised target latent are concatenated along the width dimension to form the model input, which is fed into the denoising U-Net ϵθ to predict a noise map. The right half of the output (corresponding to the noised target input) is cropped and compared with the noise that was added. A sketch of this training step is shown below.
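The following is a hedged, PyTorch-style sketch of that training step. The `vae_encode` and `unet` callables and the `alphas_cumprod` noise schedule stand in for the pre-trained encoder, the denoising U-Net ϵθ, and the diffusion schedule; they are assumptions, not the authors' actual interfaces.

```python
import torch
import torch.nn.functional as F


def training_step(source_img, target_img, text_emb, vae_encode, unet, alphas_cumprod):
    # Encode both images into latent space with the pre-trained encoder.
    z_s = vae_encode(source_img)   # source latent, shape (B, C, H, W)
    z_e = vae_encode(target_img)   # target latent, shape (B, C, H, W)

    # Forward diffusion: noise the target latent at a random timestep t.
    # (Assumes alphas_cumprod lives on the same device as the latents.)
    t = torch.randint(0, alphas_cumprod.shape[0], (z_e.shape[0],), device=z_e.device)
    noise = torch.randn_like(z_e)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    z_e_t = a_bar.sqrt() * z_e + (1 - a_bar).sqrt() * noise

    # Concatenate the clean source latent and the noised target latent
    # along the width dimension to form the model input.
    x = torch.cat([z_s, z_e_t], dim=-1)    # shape (B, C, H, 2W)

    # Predict the noise map conditioned on the instruction embedding,
    # then keep only the right half (the noised target portion).
    eps_pred = unet(x, t, text_emb)
    eps_pred_target = eps_pred[..., z_e.shape[-1]:]

    # Standard epsilon-prediction loss against the noise that was added.
    return F.mse_loss(eps_pred_target, noise)
```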

Results

Qualitative comparison with state-of-the-art image editing methods, including both description-based and instruction-based methods. Existing methods struggle with complex edits such as non-rigid transformations (e.g., changes in pose and expression), object repositioning, and viewpoint adjustments: they often either fail to follow the editing instructions or produce inconsistent images, such as identity shifts. In contrast, the paper's approach, trained on real video frames with natural transformations, handles these edits successfully while maintaining consistency with the original input image.

Qualitative results of the method with additional controls:

  1. The model can use masks to specify which part of the image to edit, enabling local adjustments and resolving ambiguities in the instruction (see the sketch after this list).
  2. When combined with ControlNet, the model can accept additional inputs, such as human poses or sketches, to enable precise editing of subject poses or object positions, a level of control that previous methods could not achieve.
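One common way to integrate a mask at sampling time is latent blending, sketched below. Whether InstructMove uses exactly this blending rule is an assumption; the paper only states that masks can be integrated to localize edits. `denoise_step` and `add_noise` are illustrative placeholders.

```python
import torch


def masked_denoising_loop(z_T, z_source, mask, timesteps, denoise_step, add_noise):
    """
    z_T:       initial noise latent for the edited image
    z_source:  clean latent of the source image
    mask:      1 inside the editable region, 0 elsewhere (broadcastable to latents)
    """
    z = z_T
    for t in timesteps:                      # e.g. reversed(range(T))
        z = denoise_step(z, t)               # one denoising update of the edit branch
        # Outside the mask, overwrite with the source latent noised to level t,
        # so unedited regions stay consistent with the input image.
        z_src_t = add_noise(z_source, t)
        z = mask * z + (1 - mask) * z_src_t
    return z
```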

Conclusion

This paper presents a method that samples video frame pairs and uses an MLLM to generate editing instructions for training an instruction-based image editing model. Unlike existing datasets that rely on synthetically generated target images, this approach leverages supervision signals from videos and MLLMs to support complex edits, such as non-rigid transformations and viewpoint changes, while maintaining content consistency. Future work could focus on improving the filtering techniques, whether through better MLLMs or human-in-the-loop processes, and on combining video data with other datasets to further enhance image editing capabilities.
