Today, I'll take a look at Baidu's Hallo.
The proposed network architecture integrates diffusion-based generative models, a UNet-based denoiser, temporal alignment techniques, and a reference network. Its hierarchical audio-driven visual synthesis provides adaptive control over the diversity of expressions and poses, making personalization for different identities more effective.
Scene
Tribute to classic films
Virtual characters
Real characters
Motion control (pose, expression, lips)
Singing
Across different actors
Trial use
The model can be run on Hugging Face.
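As a rough sketch of what a local trial might look like, assuming the released weights are hosted on the Hugging Face Hub and that the repository ships an inference script; the repository id, script name, and flags below are illustrative assumptions, so check the official README for the real ones:

```python
# Sketch only: download the released Hallo weights from the Hugging Face Hub
# and call the repository's inference script. Repo id, script path and flags
# are assumptions for illustration, not confirmed from the official docs.
import subprocess
from huggingface_hub import snapshot_download  # pip install huggingface_hub

# Hypothetical model repository id on the Hub.
weights_dir = snapshot_download(repo_id="fudan-generative-ai/hallo",
                                local_dir="pretrained_models")

# Hypothetical CLI: one reference portrait plus one driving audio clip.
subprocess.run([
    "python", "scripts/inference.py",
    "--source_image", "examples/reference.jpg",
    "--driving_audio", "examples/speech.wav",
], check=True)
```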
Technology
Specifically, Hallo takes a reference image containing a portrait together with a corresponding audio input to drive the animation of that portrait. Optional visual synthesis weights can be used to balance the relative contributions of the lips, expression, and pose. The ReferenceNet encodes global visual texture information to achieve consistent and controllable character animation. The face encoder and audio encoder separately generate high-fidelity portrait identity features and encode the audio into motion information. The hierarchical audio-driven visual synthesis module establishes the relationship between the audio and the visual components (lips, expression, pose), and a UNet denoiser is used in the diffusion process.
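To make the hierarchical weighting concrete, here is a minimal PyTorch-style sketch of how audio-conditioned cross-attention outputs for lips, expression, and pose could be blended with adjustable weights inside the denoising UNet. The module names, shapes, and weighting scheme are my assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch (not the official implementation): audio features attend to
# three separate visual components, and user-tunable weights balance their
# contributions before the result is fed back into the UNet denoiser.
import torch
import torch.nn as nn

class HierarchicalAudioVisualSynthesis(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # One cross-attention block per visual component (assumed structure).
        self.lip_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.expr_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.pose_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, visual_tokens, audio_tokens, weights=(1.0, 1.0, 1.0)):
        """visual_tokens: (B, N, dim) UNet features; audio_tokens: (B, T, dim)."""
        w_lip, w_expr, w_pose = weights  # optional visual synthesis weights
        lip, _ = self.lip_attn(visual_tokens, audio_tokens, audio_tokens)
        expr, _ = self.expr_attn(visual_tokens, audio_tokens, audio_tokens)
        pose, _ = self.pose_attn(visual_tokens, audio_tokens, audio_tokens)
        # Weighted sum lets the user trade off lip sync vs. expression vs. pose.
        return visual_tokens + w_lip * lip + w_expr * expr + w_pose * pose

# Usage sketch: plug into one denoising step of the UNet.
block = HierarchicalAudioVisualSynthesis(dim=320)
visual = torch.randn(1, 4096, 320)   # flattened spatial features (assumed size)
audio = torch.randn(1, 50, 320)      # projected audio features (assumed size)
out = block(visual, audio, weights=(1.2, 1.0, 0.8))  # emphasize the lips slightly
```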
Comparison
Quantitative comparison with existing portrait image animation methods on the HDTF dataset. Hallo's method performs strongly, generating high-quality, temporally coherent talking-head animations with superior lip-sync performance.
Qualitative comparison with existing methods on the HDTF dataset.