Two models from Tencent that make Avatars talk: V-Express and MuseTalk

We have previously introduced several technologies that make avatars talk.

Today, we will share two more from Tencent:

V-Express

V-Express aims to generate a talking-head video controlled by a reference image, audio, and a sequence of V-Kps (keypoint) images.

Link: https://github.com/tencent-ailab/V-Express

From: Tencent

3 Scenarios

  1. If there is a photo of A and another speaking video of A in a different scene, the model can generate a speaking video consistent with the given video.

  2. If there is only one photo and any speaking audio, the model can generate vivid mouth movements for a fixed face.

  3. If there is a photo of A and a speaking video of B:

  • The model can generate vivid mouth movements for a fixed face.

  • The model can generate vivid mouth movements accompanied by slight facial movements.

  • The model can generate a video with the same actions as the target video, where the character's lip shapes are synchronized to match the target audio.

Model Architecture

The backbone of V-Express is a denoising U-Net that denoises multi-frame noisy latents under the given conditions. Its architecture is very similar to that of SD v1.5, the main difference being that each Transformer block contains four attention layers instead of two. The first is a self-attention layer, just as in SD v1.5. The second and third are cross-attention layers: the second, the reference attention layer, encodes the relationship with the reference image, and the third, the audio attention layer, encodes the relationship with the audio. These first three attention layers all operate spatially. The fourth, the motion attention layer, is a temporal self-attention layer that captures the temporal relationships between video frames.
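To make this concrete, below is a minimal PyTorch sketch of such a four-attention Transformer block. It is not the official V-Express code: the class and argument names are made up, and the reference and audio features are assumed to already be projected to the block's hidden dimension.

```python
import torch.nn as nn


class FourAttentionBlock(nn.Module):
    """Hypothetical block: self-, reference-, audio-, and motion-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # 1) spatial self-attention, as in SD v1.5
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 2) reference attention: cross-attention over reference-image features
        self.ref_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 3) audio attention: cross-attention over audio features
        self.audio_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # 4) motion attention: temporal self-attention across video frames
        self.motion_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(4))

    def forward(self, x, ref_feats, audio_feats, num_frames: int):
        # x: (batch*frames, tokens, dim) spatial tokens of each frame's latent
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]

        h = self.norms[1](x)
        x = x + self.ref_attn(h, ref_feats, ref_feats, need_weights=False)[0]

        h = self.norms[2](x)
        x = x + self.audio_attn(h, audio_feats, audio_feats, need_weights=False)[0]

        # move the frame axis into the sequence position so attention is temporal
        bf, t, d = x.shape
        b = bf // num_frames
        xt = x.view(b, num_frames, t, d).permute(0, 2, 1, 3).reshape(b * t, num_frames, d)
        h = self.norms[3](xt)
        xt = xt + self.motion_attn(h, h, h, need_weights=False)[0]
        return xt.view(b, t, num_frames, d).permute(0, 2, 1, 3).reshape(bf, t, d)
```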

In addition, V-Express contains three key modules: ReferenceNet, V-Kps Guider, and Audio Projection, which are used respectively to encode the reference image, V-Kps images, and audio.
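Under the same caveat (hypothetical module and parameter names, not the official implementation), the sketch below shows how these three encoders could feed the denoising U-Net: the V-Kps guidance is added to the noisy latents, while the reference and audio features become the keys and values of the reference and audio attention layers.

```python
import torch.nn as nn
import torch.nn.functional as F


class VExpressConditioning(nn.Module):
    """Hypothetical stand-ins for ReferenceNet, V-Kps Guider, and Audio Projection."""

    def __init__(self, latent_dim: int = 320, audio_dim: int = 384):
        super().__init__()
        # ReferenceNet stand-in: turns the reference-image latent into features
        # used as keys/values by the reference attention layers.
        self.reference_net = nn.Conv2d(4, latent_dim, kernel_size=3, padding=1)
        # V-Kps Guider stand-in: encodes keypoint images into a 4-channel map
        # that is added to the noisy latents (ControlNet-style guidance).
        self.kps_guider = nn.Conv2d(3, 4, kernel_size=3, padding=1)
        # Audio Projection stand-in: maps audio-encoder features to the
        # dimension expected by the audio attention layers.
        self.audio_proj = nn.Linear(audio_dim, latent_dim)

    def forward(self, noisy_latents, ref_latent, kps_images, audio_feats):
        # noisy_latents: (B*F, 4, h, w)   ref_latent:  (B*F, 4, h, w)
        # kps_images:    (B*F, 3, H, W)   audio_feats: (B*F, T_a, audio_dim)
        ref_feats = self.reference_net(ref_latent)        # (B*F, C, h, w)
        ref_feats = ref_feats.flatten(2).transpose(1, 2)  # (B*F, h*w, C) tokens
        kps_map = F.interpolate(self.kps_guider(kps_images),
                                size=noisy_latents.shape[-2:])
        audio_ctx = self.audio_proj(audio_feats)          # (B*F, T_a, C)
        unet_input = noisy_latents + kps_map              # inject V-Kps guidance
        return unet_input, ref_feats, audio_ctx
```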

MuseTalk

MuseTalk is a real-time, high-quality lip-synchronization tool based on latent-space inpainting.

Link: https://github.com/TMElyralab/MuseTalk

From: Tencent

Scenarios

  • MuseV + MuseTalk brings portrait photos to life!
  • Video dubbing
  • Some interesting videos!

Model Architecture

MuseTalk is trained in the latent space, where images are encoded by a frozen VAE and audio is encoded by a frozen whisper-tiny model. The architecture of the generative network draws inspiration from the U-Net of stable-diffusion-v1-4, fusing audio embeddings into image embeddings through cross-attention. Although its architecture is very similar to Stable Diffusion's, what makes MuseTalk unique is that it is not a diffusion model: it performs single-step inpainting in the latent space.
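As a rough illustration of this single-step idea, the sketch below strings the pieces together: mask the mouth region, encode the masked frame and a reference frame with the frozen VAE, extract whisper-tiny audio features, and let a U-Net-like generator predict the full latent in one forward pass. The helper names (vae, audio_encoder, generator) and the diffusers-style encode/decode calls are assumptions for illustration, not MuseTalk's exact code.

```python
import torch


@torch.no_grad()
def lipsync_frame(vae, audio_encoder, generator, frame, ref_frame, audio_chunk):
    # 1) mask the mouth region (here, crudely: zero out the lower half of the face crop)
    masked = frame.clone()
    masked[..., frame.shape[-2] // 2 :, :] = 0.0

    # 2) encode with the frozen VAE (SD-style latent scaling factor 0.18215)
    z_masked = vae.encode(masked).latent_dist.mode() * 0.18215
    z_ref = vae.encode(ref_frame).latent_dist.mode() * 0.18215

    # 3) frozen whisper-tiny features for the audio window aligned to this frame
    audio_feats = audio_encoder(audio_chunk)  # (B, T_audio, D)

    # 4) single-step generation: no noise, no timestep loop; the U-Net-like
    #    generator sees the masked + reference latents and fuses the audio
    #    embeddings via cross-attention, predicting the full latent directly
    z_in = torch.cat([z_masked, z_ref], dim=1)  # (B, 8, h, w)
    z_out = generator(z_in, encoder_hidden_states=audio_feats)

    # 5) decode back to pixels with the frozen VAE decoder
    return vae.decode(z_out / 0.18215).sample
```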