InstantID preserves high facial fidelity while remaining compatible with other models.

After sharing Tencent's PhotoMaker yesterday, let's look at a similar project today, InstantID: https://instantid.github.io/

The title of the paper is "InstantID: Zero-shot Identity-Preserving Generation in Seconds".

Introduction

In personalized image synthesis, methods such as Textual Inversion, DreamBooth, and LoRA have made significant progress. However, their real-world use is limited by high storage requirements, lengthy fine-tuning, and the need for multiple reference images. Existing ID embedding-based methods need only a single forward pass, but they face their own challenges: they either require extensive fine-tuning across many model parameters, lack compatibility with community pre-trained models, or fail to maintain high facial fidelity.

InstantID addresses these issues. Its plug-and-play module handles image personalization in various styles from a single facial image while preserving high fidelity. To achieve this, the development team designed a novel IdentityNet that applies strong semantic and weak spatial conditions, combining facial and landmark images with text prompts to guide image generation. InstantID performs well and runs efficiently, which makes it valuable in real-world applications where identity preservation is critical. It can also be integrated as an adaptable plugin into popular pre-trained text-to-image diffusion models such as SD1.5 and SDXL.

Methodology

The goal of InstantID is to generate customized images in different poses or styles from a single reference ID image while ensuring high fidelity. The figure below provides an overview of the InstantID method. It consists of three key components:

  1. An ID embedding that captures robust semantic facial information;
  2. A lightweight adapter module equipped with decoupled cross-attention, making it easy to use images as visual prompts;
  3. An IdentityNet, which encodes detailed features of the reference facial image and provides additional spatial control.
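
The second component can be illustrated with a small example. The following is a minimal sketch, assuming the decoupled cross-attention takes the common form in which text tokens and image tokens each get their own key/value branch and the two attention outputs are summed. The learned query/key/value projections are omitted, and the function names and the `image_scale` parameter are hypothetical, not taken from the InstantID code.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention on plain lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


def decoupled_cross_attention(queries, text_kv, image_kv, image_scale=1.0):
    """Sum a text branch and an image branch that share the same queries.

    text_kv and image_kv are (keys, values) pairs. image_scale is an
    assumed knob that weights the image-prompt branch.
    """
    text_out = attention(queries, text_kv[0], text_kv[1])
    image_out = attention(queries, image_kv[0], image_kv[1])
    return [
        [t + image_scale * i for t, i in zip(t_row, i_row)]
        for t_row, i_row in zip(text_out, image_out)
    ]


# One query, one text token, one image token (toy values).
result = decoupled_cross_attention(
    queries=[[1.0, 0.0]],
    text_kv=([[1.0, 0.0]], [[2.0, 3.0]]),
    image_kv=([[0.0, 1.0]], [[10.0, 20.0]]),
    image_scale=0.5,
)
print(result)  # prints [[7.0, 13.0]]
```

Keeping the image branch separate means that setting `image_scale` to 0 in this sketch recovers plain text cross-attention, which is one way to see why such an adapter can be added without altering the text pathway.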

InstantID differs from previous methods in the following ways:

  1. The UNet is not trained, so the generative capabilities of the original text-to-image model are retained and the method stays compatible with pre-trained models and ControlNets available in the community;
  2. No tuning is required at test time, so there is no need to collect multiple images of a specific character for fine-tuning; a single inference pass on one image is enough;
  3. Better facial fidelity is achieved while retaining the editability of the text.

Results

Place your face into any style

InstantID supports both stylized and realistic styles.

Editability and multi-reference images

These examples demonstrate the robustness, editability, and compatibility of InstantID. The first column shows results driven by the reference image alone, with the prompt left empty during inference. Columns 2-4 demonstrate editability through text prompts. Columns 5-9 show compatibility with existing ControlNets such as canny and depth.

The number of reference images also affects the result. With multiple reference images, InstantID uses the mean of their ID embeddings as the image prompt. Even with a single reference image, InstantID achieves good results.
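
The averaging step for multiple reference images can be sketched as follows. This is a minimal illustration, assuming each ID embedding is a plain list of floats of equal length; the function name `average_id_embedding` is hypothetical, and the face encoder that would produce the embeddings is not shown.

```python
def average_id_embedding(embeddings):
    """Return the element-wise mean of several ID embeddings.

    embeddings is a non-empty list of equal-length float lists,
    one per reference image (an assumed representation).
    """
    if not embeddings:
        raise ValueError("at least one embedding is required")
    dim = len(embeddings[0])
    if any(len(e) != dim for e in embeddings):
        raise ValueError("all embeddings must have the same length")
    count = len(embeddings)
    return [sum(e[i] for e in embeddings) / count for i in range(dim)]


# Two toy "embeddings" from two reference images.
print(average_id_embedding([[1.0, 2.0], [3.0, 4.0]]))  # prints [2.0, 3.0]
```

With a single reference image the mean is just that image's embedding, so the same code path covers both cases.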

Identity and style interpolation

InstantID can interpolate between two different characters.
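
The source does not say how the interpolation is computed. One plausible reading, shown here purely as an assumption, is a linear blend of the two characters' ID embeddings before they are used as the image prompt. The function name `interpolate_identity` and the blend weight `t` are hypothetical.

```python
def interpolate_identity(embedding_a, embedding_b, t):
    """Linearly blend two ID embeddings (assumed approach, not confirmed).

    t = 0.0 returns embedding_a, t = 1.0 returns embedding_b.
    """
    if len(embedding_a) != len(embedding_b):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be between 0 and 1")
    return [(1.0 - t) * a + t * b for a, b in zip(embedding_a, embedding_b)]


# Halfway between two toy embeddings.
print(interpolate_identity([0.0, 2.0], [4.0, 6.0], 0.5))  # prints [2.0, 4.0]
```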

InstantID also flexibly supports incorporating identity attributes into non-human characters.

Comparison

InstantID is compared with existing state-of-the-art methods that do not require fine-tuning: IP-Adapter (IPA), IP-Adapter-FaceID, and the recent PhotoMaker. Among them, PhotoMaker requires training LoRA parameters of the UNet. Both PhotoMaker and IP-Adapter-FaceID achieve good fidelity but show a significant drop in text control. In contrast, InstantID maintains better fidelity while preserving good text editability (better fusion of face and style).

Comparison of InstantID with pre-trained character LoRAs. InstantID achieves results competitive with LoRAs without any training.

Comparison of InstantID with InsightFace Swapper (also known as ROOP or Refactor). In non-realistic styles, InstantID is more flexible at blending the face with the background.