MimicBrush: Zero-shot Image Editing and Reference Imitation

Take a look at MimicBrush, a project released by Alibaba last month. It showcases diverse editing results: the user only needs to specify the area to be edited in the source image (i.e., the white mask) and provide an in-the-wild reference image that indicates the expected outcome. MimicBrush automatically captures the semantic correspondence between the two images and completes the edit in a single execution.

Three use cases

1. Local area editing

2. Texture transfer

3. Post-processing optimization

Running

A demo can be run on Hugging Face; I changed this guy's hat to green 👒:
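If you want to prepare the demo inputs programmatically, the white mask marking the edit region is just a binary image. Here is a minimal sketch using Pillow and NumPy; the file names and rectangle coordinates are hypothetical placeholders, not part of the project.

```python
# Minimal sketch: build a white mask over the region to edit (e.g., the hat).
# File names and coordinates are hypothetical placeholders.
import numpy as np
from PIL import Image

source = Image.open("source.jpg").convert("RGB")
w, h = source.size

# Start with an all-black mask and paint the edit region white.
mask = np.zeros((h, w), dtype=np.uint8)
x0, y0, x1, y1 = 120, 40, 320, 180  # bounding box around the hat (example values)
mask[y0:y1, x0:x1] = 255

Image.fromarray(mask).save("mask.png")  # white = area MimicBrush should regenerate
```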

Features

Image editing is a practical yet challenging task given the diverse needs of users, and one of the hardest parts is accurately describing what the edited image should look like. MimicBrush proposes a new form of editing, mimicry editing, to help users unleash their creativity more conveniently. To edit an area of interest in an image, users can draw inspiration directly from in-the-wild references (e.g., relevant images encountered online) without having to handle the adaptation between the reference and the source image. This design requires the system to automatically understand the intended effect of the reference image and carry out the edit. To learn this, MimicBrush randomly selects two frames from a video clip, masks some regions of one frame, and restores the masked regions using information from the other frame. Trained in this self-supervised manner, the model learns to capture the semantic correspondence between different images.
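The self-supervised setup can be illustrated with a short sketch of the pair construction: sample two frames from a clip, mask one, and treat the other as the reference. This is an illustrative approximation under my own assumptions (grid masking, simple blanking), not the project's actual dataloader.

```python
# Illustrative sketch of the self-supervised pair construction
# (not the official dataloader; the masking strategy is simplified).
import random
import numpy as np

def make_training_pair(frames: list, grid: int = 4, drop_prob: float = 0.5):
    """frames: list of HxWx3 uint8 arrays decoded from one video clip."""
    ref, src = random.sample(frames, 2)          # two frames of the same scene
    h, w, _ = src.shape
    mask = np.zeros((h, w), dtype=np.uint8)

    # Randomly mask grid cells of the source frame; the model must restore
    # them using cues found in the reference frame.
    ch, cw = h // grid, w // grid
    for i in range(grid):
        for j in range(grid):
            if random.random() < drop_prob:
                mask[i*ch:(i+1)*ch, j*cw:(j+1)*cw] = 255

    masked_src = src.copy()
    masked_src[mask == 255] = 0                  # blank out masked pixels
    return ref, masked_src, mask, src            # src is the reconstruction target
```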

Technology

The training process of MimicBrush is as follows:

First, two frames are randomly selected from a video sequence as the reference image and the source image. The source image is then masked and data augmentation is applied. Next, the noisy image latents, the mask, the background latents, and the depth latents are fed into the mimic U-Net, while the augmented reference image is fed into the reference U-Net. The two U-Nets are trained to restore the masked area of the source image; the attention keys and values from the reference U-Net are concatenated into the mimic U-Net's attention layers to assist in synthesizing the masked area, as sketched below.
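The key mechanism is injecting the reference U-Net's attention keys and values into the mimic U-Net's attention. Below is a stripped-down PyTorch sketch of that idea; the layer sizes, hook points, and class name are assumptions for illustration, not the repository's actual code.

```python
# Stripped-down sketch of reference key/value injection into an attention layer.
# Shapes and wiring are illustrative, not MimicBrush's actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class KVInjectedAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads = heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

    def forward(self, x, ref_k=None, ref_v=None):
        # x: (B, N, dim) tokens from the mimic U-Net
        # ref_k / ref_v: (B, M, dim) keys/values exported by the reference U-Net
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        if ref_k is not None and ref_v is not None:
            k = torch.cat([k, ref_k], dim=1)   # concatenate along the token axis
            v = torch.cat([v, ref_v], dim=1)

        def split(t):  # (B, N, dim) -> (B, heads, N, dim // heads)
            b, n, d = t.shape
            return t.view(b, n, self.heads, d // self.heads).transpose(1, 2)

        out = F.scaled_dot_product_attention(split(q), split(k), split(v))
        b, h, n, dh = out.shape
        out = out.transpose(1, 2).reshape(b, n, h * dh)
        return self.to_out(out)
```

With this wiring, tokens in the mimic U-Net can attend to the reference features directly, which is how the masked region picks up appearance cues from the reference image.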

Comparison