Following yesterday's two lip-sync projects from Tencent, today I will share three more.
InstructAvatar
Intro: InstructAvatar is a tool for generating avatars with text-guided emotion and motion control.
Link: https://github.com/wangyuchi369/InstructAvatar
Team: Peking University
Scene:
Emotional speech control
Facial motion control
Model architecture:
InstructAvatar consists of two components: a VAE H that disentangles motion information from videos, and a motion generator G that produces motion latents conditioned on audio and instructions. Because the model is trained on two types of data, two switches are designed in the instruction and audio inputs. During inference, the motion encoder of the VAE is discarded, and the motion latent is predicted by iteratively denoising Gaussian noise. Combined with the user-provided portrait, this latent is passed through the VAE decoder to generate the final video.
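To make that inference flow concrete, here is a minimal PyTorch sketch, assuming hypothetical module names, feature dimensions, and a generic DDPM denoising loop (this is not the authors' code or sampler): Gaussian noise is iteratively denoised into a motion latent conditioned on audio and instruction features, which the VAE decoder would then turn into video together with the user-provided portrait.

```python
# Minimal sketch of the inference flow described above (hypothetical module
# names and dimensions; not the authors' implementation).
import torch
import torch.nn as nn


class MotionGenerator(nn.Module):
    """Stand-in for the motion generator G: predicts the noise in the motion
    latent given the noisy latent, timestep, audio and instruction features."""

    def __init__(self, latent_dim=256, cond_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2 * cond_dim + 1, 512),
            nn.SiLU(),
            nn.Linear(512, latent_dim),
        )

    def forward(self, z_t, t, audio_feat, instr_feat):
        t_emb = t.float().view(-1, 1) / 1000.0
        x = torch.cat([z_t, audio_feat, instr_feat, t_emb], dim=-1)
        return self.net(x)  # predicted noise (epsilon)


@torch.no_grad()
def generate_motion_latent(G, audio_feat, instr_feat, steps=50, latent_dim=256):
    """Plain DDPM-style loop: start from Gaussian noise and iteratively
    refine it into a motion latent conditioned on audio + instruction."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    z = torch.randn(audio_feat.size(0), latent_dim)  # start from pure noise
    for i in reversed(range(steps)):
        t = torch.full((z.size(0),), i, dtype=torch.long)
        eps = G(z, t, audio_feat, instr_feat)
        # Standard DDPM posterior mean update.
        z = (z - betas[i] / torch.sqrt(1.0 - alpha_bars[i]) * eps) / torch.sqrt(alphas[i])
        if i > 0:
            z = z + torch.sqrt(betas[i]) * torch.randn_like(z)
    # The VAE decoder combines this latent with the portrait to render video.
    return z
```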
EMO
Intro: EMO is a tool that generates expressive human portrait videos via an Audio2Video diffusion model under weak conditions.
Link: https://github.com/HumanAIGC/EMO
Team: Alibaba
Scene:
Singing
Make portraits sing
Different languages and portrait styles
Rap
Speaking
Converse with different characters
Cross-actor performance
Model architecture:
The framework proposed by EMO mainly consists of two stages. In the initial frame-encoding stage, ReferenceNet extracts features from the reference image and motion frames. In the subsequent diffusion stage, a pre-trained audio encoder processes the audio embedding, and facial region masks are combined with multi-frame noise to control the generation of facial imagery. The backbone network then performs denoising. Inside the backbone, two attention mechanisms are applied: reference attention and audio attention, which are crucial for maintaining character identity and modulating character motion, respectively. In addition, a temporal module operates on the temporal dimension and adjusts the speed of motion.
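As a rough illustration of how those two attention mechanisms and the temporal module might be stacked inside one backbone block, here is a hedged PyTorch sketch; the shapes, module names, and additive-residual layout are my assumptions, not Alibaba's implementation.

```python
# Rough sketch (not EMO's code) of one backbone block stacking the mechanisms
# described above: reference attention for identity, audio attention for
# motion, followed by a temporal pass across frames.
import torch
import torch.nn as nn


class BackboneBlock(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ref_attn = nn.MultiheadAttention(dim, heads, batch_first=True)    # identity
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)  # motion
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])

    def forward(self, x, ref_feat, audio_feat):
        # x:          (batch, frames, tokens, dim)  noisy video features
        # ref_feat:   (batch, ref_tokens, dim)      ReferenceNet features
        # audio_feat: (batch, frames, audio_tokens, dim)
        b, f, n, d = x.shape
        h = x.reshape(b * f, n, d)

        h = h + self.self_attn(self.norms[0](h), self.norms[0](h), self.norms[0](h))[0]
        # Reference attention: every frame attends to the same reference features.
        ref = ref_feat.repeat_interleave(f, dim=0)
        h = h + self.ref_attn(self.norms[1](h), ref, ref)[0]
        # Audio attention: each frame attends to its own audio embedding.
        aud = audio_feat.reshape(b * f, -1, d)
        h = h + self.audio_attn(self.norms[2](h), aud, aud)[0]

        # Temporal module: attend across frames at each spatial token.
        h = h.reshape(b, f, n, d).permute(0, 2, 1, 3).reshape(b * n, f, d)
        h = h + self.temporal_attn(self.norms[3](h), self.norms[3](h), self.norms[3](h))[0]
        return h.reshape(b, n, f, d).permute(0, 2, 1, 3)
```

The idea the sketch tries to capture is that every frame cross-attends to the same reference features to keep identity stable, while the audio and temporal passes modulate the character's motion and its pacing.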
Follow-Your-Emoji
Intro: Follow-Your-Emoji is a diffusion-based portrait animation framework that animates a reference portrait with target landmark sequences. The main challenges in portrait animation are maintaining the identity of the reference portrait and conveying the target expression while preserving temporal consistency and realism.
Link: https://follow-your-emoji.github.io/
Team: The University of Hong Kong, Tencent, Tsinghua University
Scene:
Single action + multiple portraits
Single portrait + multiple actions
Model architecture:
First, the landmark encoder extracts features from the expression-aware landmark sequence, and these features are fused with multi-frame noise (see the sketch after these steps).
Then, the frames of the input latent sequence are randomly masked with a progressive strategy.
Finally, the masked latent sequence is concatenated with the fused multi-frame noise and fed into the denoising U-Net to generate the video.
The appearance network and image prompt injection module help the model maintain the identity of the reference portrait, while temporal attention maintains temporal consistency.
During training, a facial fine-grained loss guides the U-Net to focus more on the generation of facial regions and expressions.
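Here is a loose sketch of the training-time flow from the steps above, assuming simple additive fusion of landmark features with the noise and channel-wise concatenation before the U-Net; the names and details are illustrative, not the authors' implementation.

```python
# Loose sketch (assumed names) of preparing the denoising U-Net input:
# fuse landmark features with multi-frame noise, progressively mask the
# input latent sequence, and concatenate the two along channels.
import torch


def prepare_unet_input(landmark_feat, latent_video, step, total_steps):
    """
    landmark_feat: (b, f, c, h, w) features from the landmark encoder
    latent_video:  (b, f, c, h, w) latents of the input video
    step/total_steps drive the (assumed) progressive masking ratio.
    """
    noise = torch.randn_like(latent_video)   # multi-frame noise
    fused = noise + landmark_feat            # fuse landmark features into the noise

    # Progressive strategy: mask a growing fraction of frames over training.
    b, f = latent_video.shape[:2]
    mask_ratio = min(1.0, step / total_steps)
    keep = (torch.rand(b, f) > mask_ratio).float().view(b, f, 1, 1, 1)
    masked_latents = latent_video * keep     # randomly masked latent sequence

    # Channel-wise concatenation feeds both signals to the denoising U-Net.
    return torch.cat([masked_latents, fused], dim=2)
```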
During inference, following AniPortrait, a motion alignment module aligns the target landmarks with the reference portrait; keyframes are generated first, and long videos are then predicted with a progressive strategy.
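The keyframes-first, progressive-prediction idea can be sketched as follows. Note that model.generate and its anchors argument are assumed interfaces for illustration only, not the paper's actual API.

```python
# Illustrative sketch of "generate keyframes first, then progressively
# predict the frames in between" (hypothetical model interface).
import torch


@torch.no_grad()
def generate_long_video(model, portrait, landmarks, keyframe_stride=8):
    """
    landmarks: (f, ...) aligned target landmark sequence for the full clip.
    model.generate(portrait, landmarks, anchors=...) is an assumed interface
    that denoises a short segment, optionally conditioned on anchor frames.
    """
    num_frames = landmarks.shape[0]

    # Stage 1: generate sparse keyframes covering the whole sequence.
    key_idx = list(range(0, num_frames, keyframe_stride))
    keyframes = model.generate(portrait, landmarks[key_idx])

    # Stage 2: progressively fill the frames between each pair of keyframes,
    # conditioning on already-generated anchors for temporal consistency.
    frames = {idx: keyframes[i] for i, idx in enumerate(key_idx)}
    for a, b in zip(key_idx[:-1], key_idx[1:]):
        segment = model.generate(
            portrait,
            landmarks[a:b + 1],
            anchors=(frames[a], frames[b]),
        )
        for offset, frame in enumerate(segment):
            frames[a + offset] = frame

    return torch.stack([frames[i] for i in sorted(frames)])
```

Generating sparse keyframes first pins down identity and long-range expression, and each in-between segment is conditioned on its surrounding keyframes so long videos stay temporally consistent.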