At the end of last month, Microsoft released a paper titled "GAIA: Zero-shot Talking Avatar Generation," which presents its lip-sync (talking avatar) video generation technology. The paper is available here: https://arxiv.org/abs/2311.15230.
The demo and code page is currently inaccessible, but you can still view some of the examples.
How it works
The working principle of GAIA is illustrated in the following figure:
GAIA consists of a VAE (variational autoencoder) and a diffusion model. The VAE encodes each video frame into a disentangled representation (i.e., separate motion and appearance representations) and reconstructs the original frame from that representation. The diffusion model is then optimized to generate motion sequences conditioned on the audio sequence and a random frame from the video clip. During inference, the diffusion model takes the input audio sequence and a reference portrait image as conditions and generates a motion sequence, which is then decoded into a video by the VAE decoder.
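To make this two-stage pipeline more concrete, below is a minimal PyTorch-style sketch of the inference flow described above. All class and function names (MotionAppearanceVAE, MotionDiffusion, generate_talking_video), the network internals, and the tensor shapes are illustrative assumptions rather than the paper's actual code; in particular, the diffusion sampler is collapsed to a single step instead of an iterative denoising loop.

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins for GAIA's components; names and shapes are assumptions, not the paper's code.

class MotionAppearanceVAE(nn.Module):
    """Encodes a frame into disentangled motion/appearance latents and decodes them back."""
    def __init__(self, motion_dim=64, appearance_dim=256):
        super().__init__()
        self.motion_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(motion_dim))
        self.appearance_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(appearance_dim))
        self.decoder = nn.Sequential(
            nn.Linear(motion_dim + appearance_dim, 3 * 64 * 64),
            nn.Unflatten(1, (3, 64, 64)),
        )

    def encode(self, frame):
        return self.motion_enc(frame), self.appearance_enc(frame)

    def decode(self, motion, appearance):
        return self.decoder(torch.cat([motion, appearance], dim=-1))


class MotionDiffusion(nn.Module):
    """Predicts a motion-latent sequence conditioned on audio features and a reference motion latent."""
    def __init__(self, motion_dim=64, audio_dim=80):
        super().__init__()
        self.net = nn.Linear(audio_dim + motion_dim, motion_dim)

    @torch.no_grad()
    def sample(self, audio_feats, ref_motion):
        # A real sampler would run an iterative denoising loop; one linear step keeps the sketch short.
        ref = ref_motion.unsqueeze(1).expand(-1, audio_feats.shape[1], -1)
        return self.net(torch.cat([audio_feats, ref], dim=-1))


def generate_talking_video(vae, diffusion, reference_frame, audio_feats):
    """Zero-shot inference: audio + a single portrait image -> a frame sequence."""
    motion_ref, appearance = vae.encode(reference_frame)    # disentangle the reference portrait
    motion_seq = diffusion.sample(audio_feats, motion_ref)  # audio-driven motion latents, (B, T, D)
    frames = [vae.decode(motion_seq[:, t], appearance)      # appearance stays fixed, motion varies per frame
              for t in range(motion_seq.shape[1])]
    return torch.stack(frames, dim=1)                       # (B, T, 3, H, W)


if __name__ == "__main__":
    vae, diffusion = MotionAppearanceVAE(), MotionDiffusion()
    portrait = torch.randn(1, 3, 64, 64)   # reference portrait image
    audio = torch.randn(1, 25, 80)         # e.g. 25 frames of mel-spectrogram features
    video = generate_talking_video(vae, diffusion, portrait, audio)
    print(video.shape)                     # torch.Size([1, 25, 3, 64, 64])
```

The point the sketch is meant to reflect is the disentanglement: the appearance latent comes from the single reference portrait and is held fixed, while only the motion latents change over time, driven by the audio.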
Effect demonstration
GAIA is compared qualitatively against state-of-the-art audio-driven methods. The results show that GAIA achieves better naturalness, lip-sync quality, visual quality, and motion diversity. In contrast, the baseline methods tend to rely heavily on the reference image, so they often produce only slight movements (for example, when the eyes in the reference image are closed, most baselines generate closed-eye results) or inaccurate lip sync.
Comparison with other technologies
Below is a comparison of GAIA with other technologies:
Naturalness: GAIA outperforms the other methods, producing more natural-looking videos.
Lip-sync quality: GAIA achieves better alignment between lip movements and the driving speech.
Visual quality: GAIA offers higher visual quality with clearer details.
Motion diversity: GAIA excels in motion diversity, creating more vivid and dynamic videos.