ByteDance's Loopy audio-driven portrait animation model

Loopy is an audio-driven avatar animation model jointly developed by ByteDance and Zhejiang University. No code has been released yet; only the paper and its project page are available: https://loopyavatar.github.io/

Loopy is an end-to-end video diffusion model conditioned solely on audio. It leverages long-term motion information from the training data to learn natural motion patterns and to strengthen the correlation between the audio and the avatar's motion. This removes the need for the manually specified spatial motion templates that existing methods use to constrain inference, producing more lifelike, higher-quality results across a wide range of scenarios.
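Since no code has been released, the following is only a minimal, hypothetical PyTorch sketch of what an audio-only conditioned denoising step could look like: audio features (e.g. wav2vec-style tokens) conditioning a video diffusion backbone through cross-attention. Every name here (`DenoisingUNet`, tensor shapes, dimensions) is an illustrative assumption, not Loopy's actual implementation.

```python
# Hypothetical sketch of one audio-conditioned denoising step; Loopy's real
# architecture is not public, so all module names and shapes are assumptions.
import torch
import torch.nn as nn

class DenoisingUNet(nn.Module):
    """Stand-in for a video diffusion backbone with audio cross-attention."""
    def __init__(self, latent_dim: int = 4, cond_dim: int = 768):
        super().__init__()
        self.proj_in = nn.Conv3d(latent_dim, 64, kernel_size=3, padding=1)
        self.cross_attn = nn.MultiheadAttention(64, num_heads=8, kdim=cond_dim,
                                                vdim=cond_dim, batch_first=True)
        self.proj_out = nn.Conv3d(64, latent_dim, kernel_size=3, padding=1)

    def forward(self, noisy_latents, audio_tokens):
        # noisy_latents: (B, C, T, H, W); audio_tokens: (B, L, cond_dim)
        h = self.proj_in(noisy_latents)
        b, c, t, hh, ww = h.shape
        # Flatten spatio-temporal positions into queries over audio tokens.
        q = h.permute(0, 2, 3, 4, 1).reshape(b, t * hh * ww, c)
        attn, _ = self.cross_attn(q, audio_tokens, audio_tokens)
        h = (q + attn).reshape(b, t, hh, ww, c).permute(0, 4, 1, 2, 3)
        return self.proj_out(h)  # predicted noise, same shape as the input

model = DenoisingUNet()
noisy = torch.randn(1, 4, 16, 32, 32)   # a 16-frame latent clip
audio = torch.randn(1, 50, 768)         # e.g. wav2vec-style audio features
noise_pred = model(noisy, audio)
```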

Examples of generated videos

Technical framework of Loopy

This framework removes the face locator and speed layer modules commonly used in existing methods. Instead, flexible and natural motion is generated through the proposed inter-clip and intra-clip temporal layers and an audio-to-latents module, as sketched below.
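As a rough illustration of the dual temporal design described above, this hedged PyTorch sketch applies intra-clip attention over the current clip's frames, then inter-clip attention over motion context abstracted from preceding clips. The class, shapes, and dimensions are assumptions based on the paper's description, not released code.

```python
# Assumed sketch of inter-/intra-clip temporal attention; not Loopy's code.
import torch
import torch.nn as nn

class TemporalLayers(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 8):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, clip_feats, motion_feats):
        # clip_feats:   (B*HW, T, C) features of the current clip's frames
        # motion_feats: (B*HW, M, C) context abstracted from preceding clips
        # Intra-clip: each spatial location attends over its own clip's frames.
        h, _ = self.intra(clip_feats, clip_feats, clip_feats)
        x = clip_feats + h
        # Inter-clip: the current clip also attends over long-term motion
        # context, learning motion patterns from data rather than relying on
        # hand-specified motion templates.
        h, _ = self.inter(x, motion_feats, motion_feats)
        return x + h

layers = TemporalLayers()
clip = torch.randn(32 * 32, 16, 64)    # 16 current frames per spatial location
motion = torch.randn(32 * 32, 64, 64)  # 64 tokens of long-term motion context
out = layers(clip, motion)             # (1024, 16, 64)
```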

Comparison with other methods