Making Avatars Move - InstructAvatar, EMO, Follow-Your-Emoji

Following yesterday's two lip-sync projects from Tencent, today I will share three more.

InstructAvatar

Intro: InstructAvatar is a tool for generating avatars with text-guided emotion and motion control.

Link: https://github.com/wangyuchi369/InstructAvatar

Team: Peking University

Scenarios

  • Emotional speech control

  • Facial motion control

Model architecture

InstructAvatar consists of two components: a VAE H that disentangles motion information from videos, and a motion generator G that produces motion latents conditioned on audio and text instructions. Because the model handles two types of data (emotional talking and facial motion), two switches are designed for the instruction and audio inputs. At inference time the VAE's motion encoder is discarded: the predicted motion latents are obtained by iteratively denoising Gaussian noise, and the VAE decoder combines them with the user-provided portrait to render the final video.
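To make the inference path concrete, here is a minimal sketch. The `motion_generator`, its `denoise_step` method, and `vae_decoder` are hypothetical handles standing in for the paper's components, not InstructAvatar's actual API.

```python
import torch

@torch.no_grad()
def instructavatar_inference(portrait, audio_feat, instruction_emb,
                             motion_generator, vae_decoder,
                             num_steps=50, num_frames=64, latent_dim=256,
                             device="cuda"):
    """Illustrative outline of the inference path described above."""
    # The VAE's motion encoder is discarded at inference time; instead we
    # start from Gaussian noise in the motion latent space.
    z = torch.randn(1, num_frames, latent_dim, device=device)

    # Iteratively denoise, conditioned on audio and the text instruction
    # (a DDIM-style loop; the real sampler details live in the project).
    for t in reversed(range(num_steps)):
        z = motion_generator.denoise_step(z, t,
                                          audio=audio_feat,
                                          instruction=instruction_emb)

    # The VAE decoder combines the predicted motion latents with the
    # user-provided portrait to render the final video frames.
    return vae_decoder(portrait, z)  # e.g. (1, num_frames, 3, H, W)
```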

EMO

Intro: EMO is a tool that generates expressive human portrait videos via an Audio2Video diffusion model under weak conditions.

Link: https://github.com/HumanAIGC/EMO

Team: Alibaba

Scenarios

  1. Singing
  • Make portraits sing

  • Different languages and portrait styles

  • Rap

  2. Speaking
  • Converse with different characters

  • Cross-actor performance

Model architecture

The framework proposed by EMO has two main stages. In the first, the frame-encoding stage, ReferenceNet extracts features from the reference image and the motion frames. In the subsequent diffusion stage, a pre-trained audio encoder produces audio embeddings, and facial-region masks are combined with multi-frame noise to control the generation of the facial imagery. The backbone network then performs the denoising. Inside the backbone, two attention mechanisms are applied: reference attention, which is crucial for preserving the character's identity, and audio attention, which modulates the character's motion. A temporal module additionally operates along the time dimension to adjust the speed of motion.
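As an illustration of how reference attention, audio attention, and the temporal module could sit together in one backbone block, here is a minimal PyTorch sketch. Layer names, tensor shapes, and the residual layout are assumptions, not EMO's implementation.

```python
import torch
import torch.nn as nn

class EMOStyleBlock(nn.Module):
    """Sketch of one denoising-backbone block: reference attention for
    identity, audio attention for motion, temporal attention over frames."""

    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.norm_ref = nn.LayerNorm(dim)
        self.norm_aud = nn.LayerNorm(dim)
        self.norm_tmp = nn.LayerNorm(dim)
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, ref_feat, audio_feat):
        # x:          (B, F, N, C)  spatial tokens of F noisy frames
        # ref_feat:   (B, M, C)     ReferenceNet features of the reference image
        # audio_feat: (B, F, A, C)  per-frame audio-encoder embeddings
        B, F, N, C = x.shape
        xs = x.reshape(B * F, N, C)

        # Reference attention: attend to ReferenceNet features to keep identity.
        ref = ref_feat.unsqueeze(1).expand(B, F, -1, C).reshape(B * F, -1, C)
        xs = xs + self.ref_attn(self.norm_ref(xs), ref, ref)[0]

        # Audio attention: audio embeddings modulate mouth and head motion.
        aud = audio_feat.reshape(B * F, -1, C)
        xs = xs + self.audio_attn(self.norm_aud(xs), aud, aud)[0]

        # Temporal attention: attend across frames for each spatial token,
        # which keeps motion smooth and lets the model adjust its pace.
        xt = xs.reshape(B, F, N, C).permute(0, 2, 1, 3).reshape(B * N, F, C)
        h = self.norm_tmp(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        return xt.reshape(B, N, F, C).permute(0, 2, 1, 3)
```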

Follow-Your-Emoji

Intro: Follow-Your-Emoji is a diffusion-based portrait-animation framework that animates a reference portrait using target landmark sequences. The main challenges in portrait animation are maintaining the identity of the reference portrait and conveying the target expression while preserving temporal consistency and realism.

Link: https://follow-your-emoji.github.io/

Team: The University of Hong Kong, Tencent, Tsinghua University

Scenarios

  • Single action + multiple portraits

  • Single portrait + multiple actions

Model architecture

First, the landmark encoder extracts features from the expression-aware landmark sequence, and these features are fused with multi-frame noise.

Then, frames of the input latent sequence are randomly masked using a progressive strategy.

Finally, this latent sequence is concatenated with the fused multi-frame noise and fed into the denoising U-Net to generate the video (see the sketch below).
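A compact sketch of these three steps follows. The tensor shapes and the `landmark_encoder` / `denoising_unet` handles are illustrative assumptions, not the project's actual code.

```python
import torch

def prepare_and_denoise(landmark_seq, input_latents, landmark_encoder,
                        denoising_unet, timestep, mask_ratio=0.5):
    """Sketch of the landmark-fusion, masking, and denoising steps above."""
    # 1) Encode the expression-aware landmark sequence and fuse the
    #    features with multi-frame noise (here, by simple addition).
    lmk_feat = landmark_encoder(landmark_seq)          # (B, F, C, h, w)
    fused = lmk_feat + torch.randn_like(lmk_feat)

    # 2) Randomly mask frames of the input latent sequence; a progressive
    #    schedule would grow mask_ratio over the course of training.
    B, F = input_latents.shape[:2]
    keep = (torch.rand(B, F, device=input_latents.device) > mask_ratio)
    masked = input_latents * keep[:, :, None, None, None].float()

    # 3) Concatenate along channels and denoise with the U-Net.
    unet_in = torch.cat([masked, fused], dim=2)
    return denoising_unet(unet_in, timestep)
```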

The appearance network and the image-prompt injection module help the model preserve the identity of the reference portrait, while temporal attention maintains temporal consistency.

During training, the facial fine loss guides the U-Net to focus more on the generation of faces and expressions.
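One plausible way to realize such a loss is to up-weight the reconstruction error inside the facial region. The sketch below illustrates that general idea under assumed shapes; the source of `face_mask` (e.g., derived from landmarks) and the weighting scheme are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def facial_fine_loss(pred, target, face_mask, face_weight=10.0):
    """Face-weighted reconstruction loss: errors inside the facial region
    are up-weighted so the U-Net focuses on faces and expressions."""
    # pred, target: (B, F, C, H, W); face_mask: (B, F, 1, H, W) in [0, 1]
    per_pixel = F.mse_loss(pred, target, reduction="none")
    weight = 1.0 + face_weight * face_mask
    return (per_pixel * weight).mean()
```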

During inference, following AniPortrait, a motion-alignment module aligns the target landmarks with the reference portrait. Keyframes are then generated first, and long videos are predicted with a progressive strategy (sketched below).
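The keyframes-then-fill idea could look roughly like this; `generate_clip` is a hypothetical wrapper around the diffusion sampler, and the real strategy may differ in detail.

```python
def progressive_inference(aligned_landmarks, reference, generate_clip,
                          keyframe_stride=8):
    """Generate sparse keyframes first, then fill in each interval
    conditioned on its two bounding keyframes."""
    num_frames = len(aligned_landmarks)
    key_idx = list(range(0, num_frames, keyframe_stride))
    if key_idx[-1] != num_frames - 1:
        key_idx.append(num_frames - 1)

    # Stage 1: generate the temporally sparse keyframes in one pass.
    keyframes = generate_clip(reference,
                              [aligned_landmarks[i] for i in key_idx])
    frames = dict(zip(key_idx, keyframes))

    # Stage 2: fill in each interval between neighbouring keyframes.
    for a, b in zip(key_idx[:-1], key_idx[1:]):
        clip = generate_clip(reference, aligned_landmarks[a:b + 1],
                             first=frames[a], last=frames[b])
        for offset, frame in enumerate(clip):
            frames.setdefault(a + offset, frame)

    return [frames[i] for i in range(num_frames)]
```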