Tencent's PhotoMaker - Faster, More Realistic, and More Controllable AI Avatars

A few days ago, Boss Shenshen shared the PhotoMaker and InstantID projects with me. Today, let's look at PhotoMaker first. Project address: https://github.com/TencentARC/PhotoMaker

The research team is mainly from Nankai University, ARC Lab of Tencent PCG, and the University of Tokyo.

The title of the paper is "PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding."

Introduction

Text-to-image generation has made significant progress in synthesizing realistic human photos from text prompts. However, existing personalized generation methods cannot meet three requirements at the same time: high efficiency, ID fidelity, and flexible text controllability. Tencent's PhotoMaker can. It encodes an arbitrary number of input ID images into a stacked ID embedding that retains the ID information. This embedding serves as a unified ID representation: it comprehensively captures the features of a single input ID, and it can also hold the features of different IDs so that they can be merged later.

Method

Our method converts several input images with the same identity into a stacked ID embedding. This embedding can be regarded as a unified representation of the identity to be generated. During the inference stage, the images constituting the stacked ID embedding can come from different identities. Subsequently, we can synthesize these customized identities in various contexts.

We obtain text embeddings from the text encoder and image embeddings from the image encoder.
We fuse each image embedding with the embedding of the corresponding category word (e.g., "man" or "woman") to obtain a fused embedding.
We concatenate all fused embeddings along the length dimension to form the stacked ID embedding.
We feed the stacked ID embedding into all cross-attention layers of the diffusion model, which adaptively merge the ID content.
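
The fuse-then-stack steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's code: embeddings are plain lists of floats, and the `fuse` function is a hypothetical stand-in that averages two vectors, whereas the real model uses learned layers for fusion. The names `fuse` and `stack_id_embedding` are made up for this example.

```python
from typing import List

Vector = List[float]


def fuse(image_emb: Vector, class_emb: Vector) -> Vector:
    """Hypothetical fusion: element-wise mean of the two embeddings.

    The real model learns this fusion; the mean is only a placeholder.
    """
    if len(image_emb) != len(class_emb):
        raise ValueError("embeddings must share the same dimension")
    return [(a + b) / 2.0 for a, b in zip(image_emb, class_emb)]


def stack_id_embedding(image_embs: List[Vector], class_emb: Vector) -> List[Vector]:
    """Fuse every image embedding with the category embedding, then
    concatenate the results along the length dimension (one row per image)."""
    return [fuse(emb, class_emb) for emb in image_embs]


# Three toy image embeddings of dimension 4 and one category embedding.
image_embs = [
    [1.0, 0.0, 2.0, 4.0],
    [3.0, 2.0, 0.0, 0.0],
    [5.0, 4.0, 2.0, 2.0],
]
class_emb = [1.0, 2.0, 0.0, 0.0]

stacked = stack_id_embedding(image_embs, class_emb)
print(len(stacked), len(stacked[0]))  # N images x embedding dimension
```

The point of the sketch is the shape: N input images give a stacked embedding of length N, so adding more ID images simply lengthens the sequence that the cross-attention layers see.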

It is worth noting that, while we use images of the same ID with masked backgrounds during training, we can directly input images of different IDs without background distortion during inference, thus creating a new identity.

Results

Recontextualization

We demonstrate what PhotoMaker can generate from basic prompts. The prompt used for each result is shown below the image.

Bringing characters from artworks/old photos into reality

Given an artwork, a sculpture, or an old photo of a person as input, PhotoMaker can bring characters from the last century, or even from ancient times, into the present day and "take photos" of them. The prompt used for each result is shown below the image.

Stylization

PhotoMaker can not only generate realistic human photos but also produce stylized results while preserving the identity's features. The prompts used are shown in the first row.

Change age or gender

By simply replacing the category word in the prompt (e.g., swapping "man" for "woman"), our method can change gender and age while maintaining the original identity.
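
Replacing a category word is an ordinary text substitution on the prompt. The helper below is a hypothetical sketch (the name `swap_class_word` and the sample prompts are made up for illustration); it matches whole words only, so replacing "man" does not touch "woman".

```python
import re


def swap_class_word(prompt: str, old: str, new: str) -> str:
    """Replace the first whole-word occurrence of `old` with `new`."""
    pattern = r"\b" + re.escape(old) + r"\b"
    result, replaced = re.subn(pattern, new, prompt, count=1)
    if replaced == 0:
        raise ValueError(f"category word {old!r} not found in prompt")
    return result


base_prompt = "a photo of a man wearing a hat"
print(swap_class_word(base_prompt, "man", "woman"))  # change gender
print(swap_class_word(base_prompt, "man", "boy"))    # change age
```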

Identity Mixing

If the user provides images of different identities as input, PhotoMaker can effectively integrate the features of those identities to form a new identity.

For identity mixing, PhotoMaker can adjust the merging ratio in two ways: by controlling how many images of each identity go into the input image pool, or through prompt weighting.

First, let's look at how PhotoMaker customizes a new identity by controlling the proportion of each identity's images in the input image pool.
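
As a rough illustration of the pool-proportion idea, the sketch below builds an input pool from a chosen number of images per identity and reports each identity's share. The function names and file names are hypothetical, and how the share maps to the visual result is up to the model, not this code.

```python
from typing import Dict, List


def build_image_pool(images_by_id: Dict[str, List[str]], counts: Dict[str, int]) -> List[str]:
    """Take `counts[identity]` images from each identity and pool them."""
    pool: List[str] = []
    for identity, n in counts.items():
        available = images_by_id[identity]
        if n > len(available):
            raise ValueError(f"not enough images for identity {identity!r}")
        pool.extend(available[:n])
    return pool


def identity_shares(counts: Dict[str, int]) -> Dict[str, float]:
    """Fraction of the pool contributed by each identity."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the pool must contain at least one image")
    return {identity: n / total for identity, n in counts.items()}


# Hypothetical file names for two identities.
images_by_id = {
    "A": ["a1.png", "a2.png", "a3.png"],
    "B": ["b1.png", "b2.png"],
}
counts = {"A": 3, "B": 1}

pool = build_image_pool(images_by_id, counts)
print(pool)
print(identity_shares(counts))
```

With three images of A and one of B, identity A makes up 75% of the pool; changing the counts shifts the mix toward one identity or the other.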

Next, with prompt weighting, PhotoMaker multiplies the embeddings of the images belonging to a specific identity by a coefficient to control that identity's share in the new identity.
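
The coefficient-based approach can be pictured as scaling selected rows of the stacked embedding. The following is a minimal sketch under assumptions: embeddings are plain lists of floats, `owners` records which identity each row came from, and the function name `weight_embeddings` is made up for this example.

```python
from typing import Dict, List

Vector = List[float]


def weight_embeddings(stacked: List[Vector], owners: List[str],
                      weights: Dict[str, float]) -> List[Vector]:
    """Multiply each row by the coefficient of the identity it belongs to.

    Identities without an entry in `weights` keep a coefficient of 1.0.
    """
    if len(stacked) != len(owners):
        raise ValueError("need exactly one owner label per embedding")
    return [
        [weights.get(owner, 1.0) * x for x in emb]
        for emb, owner in zip(stacked, owners)
    ]


# Two toy rows: the first from identity "A", the second from identity "B".
toy_stacked = [[1.0, 2.0], [3.0, 4.0]]
toy_owners = ["A", "B"]

weighted = weight_embeddings(toy_stacked, toy_owners, {"A": 0.5})
print(weighted)  # rows of "A" are halved, rows of "B" are unchanged
```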

Comparison

Compared with other methods, PhotoMaker simultaneously offers high-quality and diverse generation, promising editability, high inference efficiency, and strong identity fidelity.