Today, I'll take a look at Baidu's Hallo.
The proposed network architecture integrates diffusion-based generative models, a UNet-based denoiser, temporal alignment techniques, and a reference network. Its hierarchical audio-driven visual synthesis provides adaptive control over the diversity of expressions and poses, making personalization for different identities more effective.
Scene
Tribute to classic films
Virtual characters
Real characters
Motion control (pose, expression, lips)
Singing
Across different actors
Trial use
The model can be run on Hugging Face.
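As a rough sketch of what a local trial might look like, assuming the released weights are hosted on the Hugging Face Hub and that the repository ships an inference script; the repository id, script name, and flags below are illustrative assumptions, so check the official README for the real ones:

```python
# Sketch only: download the released Hallo weights from the Hugging Face Hub
# and call the repository's inference script. Repo id, script path and flags
# are assumptions for illustration, not confirmed from the official docs.
import subprocess
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Hypothetical model repository id on the Hub.
weights_dir = snapshot_download(repo_id="fudan-generative-ai/hallo",
                                local_dir="pretrained_models")

# Hypothetical CLI: one reference portrait plus one driving audio clip.
subprocess.run([
    "python", "scripts/inference.py",
    "--source_image", "examples/reference.jpg",
    "--driving_audio", "examples/speech.wav",
], check=True)
```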
Technology
Specifically, Hallo takes a reference image containing a portrait together with a corresponding audio input to drive the animation of that portrait. Optional visual synthesis weights can be used to balance the relative contributions of the lips, expression, and pose. The ReferenceNet encodes global visual texture information to achieve consistent and controllable character animation. The face encoder and audio encoder separately generate high-fidelity portrait identity features and encode the audio into motion information. The hierarchical audio-driven visual synthesis module establishes the relationship between the audio and the visual components (lips, expression, pose), and a UNet denoiser is used in the diffusion process.
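To make the hierarchical weighting concrete, here is a minimal PyTorch-style sketch of how audio-conditioned cross-attention outputs for lips, expression, and pose could be blended with adjustable weights inside the denoising UNet. The module names, shapes, and weighting scheme are my assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch (not the official implementation): audio features attend to
# three separate visual components, and user-tunable weights balance their
# contributions before the result is fed back into the UNet denoiser.
import torch
import torch.nn as nn

class HierarchicalAudioVisualSynthesis(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per visual component (assumed structure).
        self.lip_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.expr_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pose_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens, audio_tokens, weights=(1.0, 1.0, 1.0)):
        """visual_tokens: (B, N, dim) UNet features; audio_tokens: (B, T, dim)."""
        w_lip, w_expr, w_pose = weights  # optional visual synthesis weights
        lip, _ = self.lip_attn(visual_tokens, audio_tokens, audio_tokens)
        expr, _ = self.expr_attn(visual_tokens, audio_tokens, audio_tokens)
        pose, _ = self.pose_attn(visual_tokens, audio_tokens, audio_tokens)
        # Weighted sum lets the user trade off lip sync vs. expression vs. pose.
        return visual_tokens + w_lip * lip + w_expr * expr + w_pose * pose

# Usage sketch: plug into one denoising step of the UNet.
block = HierarchicalAudioVisualSynthesis(dim=320)
visual = torch.randn(1, 4096, 320)   # flattened spatial features (assumed size)
audio = torch.randn(1, 50, 320)      # projected audio features (assumed size)
out = block(visual, audio, weights=(1.2, 1.0, 0.8))  # emphasize the lips slightly
```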
Comparison
Quantitative comparison with existing portrait image animation methods on the HDTF dataset. Hallo's method performs strongly, generating high-quality, temporally coherent talking-head animations with superior lip-sync performance.
Qualitative comparison with existing methods on the HDTF dataset.