Yesterday, Meta released Audio2Photoreal. The code, dataset, demo, and paper are all available.
https://people.eecs.berkeley.edu/~evonne_ng/projects/audio2photoreal/
Introduction
The paper proposes a framework for generating photorealistic full-body virtual humans that can dynamically and naturally gesture based on dialogues in two-person interactions, all driven by the speaker's voice.
From the voice input, a variety of gestural motions can be generated for the virtual humans, covering the face, body, and hands. The key to the paper's method is combining the sample diversity of vector quantization with the high-frequency detail obtained through diffusion, which yields more dynamic and expressive motion. These generated motions are visualized with highly photorealistic avatars that can convey critical nuances in gesture (e.g., smirking vs. smiling).
An innovative multi-view dialogue dataset is introduced for photorealistic reconstruction. Experiments show that the model can generate appropriate and diverse gestures, outperforming methods that rely solely on diffusion or vector quantization. Additionally, perceptual evaluations highlight the importance of photorealism (compared to meshes) in accurately assessing subtle motion details in conversational gestures.
Method Overview
The method takes dialogue audio as input and generates the corresponding facial expression codes and body gesture poses. These output motions are then fed into a trained avatar renderer to produce a photorealistic video.
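In code, the overall flow looks roughly like this. Everything below is a stand-in I wrote to make the pipeline concrete; the classes, shapes, and dimensions are invented for illustration and do not match the released repo's API:

```python
import torch

class FaceDiffusionStub:
    """Stand-in for the face diffusion model."""
    def sample(self, audio, num_frames=90, code_dim=256):
        # In the real model: denoise facial expression codes conditioned on audio
        # (and on lip geometry from a pre-trained lip regressor).
        return torch.randn(num_frames, code_dim)

class GuidePoseTransformerStub:
    """Stand-in for the VQ-Transformer that produces coarse guiding poses."""
    def sample(self, audio, num_keyframes=4, pose_dim=104):
        return torch.randn(num_keyframes, pose_dim)

class BodyDiffusionStub:
    """Stand-in for the body diffusion model that infills full-rate motion."""
    def sample(self, audio, guide_poses, num_frames=90):
        return torch.randn(num_frames, guide_poses.shape[-1])

def generate_motion(audio):
    face_codes = FaceDiffusionStub().sample(audio)               # (T, face_code_dim)
    guide_poses = GuidePoseTransformerStub().sample(audio)       # (T_low, pose_dim)
    body_poses = BodyDiffusionStub().sample(audio, guide_poses)  # (T, pose_dim)
    # face_codes and body_poses are what gets handed to the photoreal avatar renderer.
    return face_codes, body_poses

face, body = generate_motion(torch.randn(16000 * 3))  # e.g. 3 s of 16 kHz dialogue audio
print(face.shape, body.shape)
```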
Motion Generation
(a) Given dialogue audio A, a diffusion network generates facial motion F, conditioned on the output of a lip regression network L that predicts lip geometry synchronized with the speech (a rough sketch of this conditioning follows the list).
(b) For body-hand poses, guiding poses P are first generated autoregressively at a low frame rate using a VQ-Transformer.
(c) A pose diffusion model then takes these guiding poses together with the audio and generates a high-frequency motion sequence J (also sketched below).
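A minimal sketch of the conditioning in step (a), assuming a simple concatenation of noisy face codes, a diffusion-timestep embedding, per-frame audio features, and the lip regressor's output; the dimensions, module names, and architecture here are illustrative guesses, not the paper's exact design:

```python
import torch
import torch.nn as nn

class FaceDenoiser(nn.Module):
    """Illustrative face-diffusion denoiser: predicts clean expression codes
    from noisy ones, conditioned on audio features and regressed lip geometry.
    Dimensions and architecture are assumptions, not the paper's exact design."""
    def __init__(self, face_dim=256, audio_dim=128, lip_dim=64, hidden=512):
        super().__init__()
        self.time_embed = nn.Sequential(
            nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.net = nn.Sequential(
            nn.Linear(face_dim + audio_dim + lip_dim + hidden, hidden),
            nn.SiLU(),
            nn.Linear(hidden, face_dim))

    def forward(self, noisy_face, t, audio_feat, lip_geom):
        # noisy_face: (B, T, face_dim), audio_feat: (B, T, audio_dim),
        # lip_geom: (B, T, lip_dim) from the lip regression network L.
        temb = self.time_embed(t.view(-1, 1, 1).float())   # (B, 1, hidden)
        temb = temb.expand(-1, noisy_face.shape[1], -1)    # broadcast over frames
        x = torch.cat([noisy_face, audio_feat, lip_geom, temb], dim=-1)
        return self.net(x)                                 # predicted clean face codes

B, T = 2, 90
pred = FaceDenoiser()(torch.randn(B, T, 256), torch.randint(0, 1000, (B,)),
                      torch.randn(B, T, 128), torch.randn(B, T, 64))
print(pred.shape)  # torch.Size([2, 90, 256])
```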
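And a sketch of steps (b) and (c): a stand-in for the VQ-Transformer autoregressively samples codebook indices from audio and decodes them into coarse guiding poses, which are then upsampled to the diffusion model's frame rate to serve as conditioning. The codebook size, pose dimension, frame rates, GRU backbone, and nearest-frame upsampling are all assumptions for illustration:

```python
import torch
import torch.nn as nn

CODEBOOK_SIZE, POSE_DIM, LOW_FPS, HIGH_FPS = 1024, 104, 1, 30   # assumed values

class GuidePoseSampler(nn.Module):
    """Stand-in for the VQ-Transformer (a GRU here, to keep the sketch short):
    autoregressively samples codebook indices from audio, then decodes them
    into coarse guiding poses."""
    def __init__(self, audio_dim=128, hidden=256):
        super().__init__()
        self.codebook = nn.Embedding(CODEBOOK_SIZE, POSE_DIM)   # learned pose codebook
        self.backbone = nn.GRU(audio_dim + POSE_DIM, hidden, batch_first=True)
        self.head = nn.Linear(hidden, CODEBOOK_SIZE)

    @torch.no_grad()
    def sample(self, audio_feat, temperature=1.0):
        # audio_feat: (B, T_low, audio_dim), one feature vector per guiding-pose step
        B, T, _ = audio_feat.shape
        prev, h, poses = torch.zeros(B, POSE_DIM), None, []
        for t in range(T):
            inp = torch.cat([audio_feat[:, t], prev], dim=-1).unsqueeze(1)
            out, h = self.backbone(inp, h)
            logits = self.head(out[:, 0]) / temperature
            idx = torch.multinomial(logits.softmax(-1), 1).squeeze(-1)  # stochastic draw
            prev = self.codebook(idx)
            poses.append(prev)
        return torch.stack(poses, dim=1)                        # (B, T_low, POSE_DIM)

def upsample_guides(guides, factor=HIGH_FPS // LOW_FPS):
    """Nearest-frame upsampling so the guiding poses align with the 30 fps
    frames that the body diffusion model denoises."""
    return guides.repeat_interleave(factor, dim=1)

guides = GuidePoseSampler().sample(torch.randn(2, 4, 128))      # 4 guiding-pose steps
dense = upsample_guides(guides)                                 # (2, 120, 104)
# The body diffusion denoiser would take (noisy_poses, t, audio_feat, dense)
# and predict clean 30 fps body/hand poses, analogous to the face sketch above.
print(guides.shape, dense.shape)
```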
Diversity of guiding pose sequences
Given the input dialogue audio (the audio of the character being predicted is marked in gold), the transformer P generates diverse samples of guiding pose sequences, with varying listening reactions (top), speaking gestures (middle), and interjections (bottom).
By sampling from a rich learned pose codebook, P can produce "extreme" poses with high diversity between samples, such as pointing, scratching, clapping, etc.
These diverse poses are then used to condition the body diffusion model J.
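The diversity comes from the fact that, at each autoregressive step, the transformer predicts a distribution over codebook entries and a sample is drawn from it, so different random draws give different guiding-pose sequences. A tiny, self-contained illustration of temperature/top-k sampling over codebook logits (the exact sampling scheme used by the released code is an assumption on my part):

```python
import torch

def sample_codebook_index(logits, temperature=1.0, top_k=50):
    """Draw one codebook index from the transformer's logits.
    Higher temperature / larger top_k -> more varied guiding poses;
    temperature near 0 collapses toward the single most likely pose."""
    logits = logits / max(temperature, 1e-6)
    if top_k is not None:
        kth = torch.topk(logits, top_k).values[..., -1, None]   # k-th largest logit
        logits = logits.masked_fill(logits < kth, float("-inf"))
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).squeeze(-1)

logits = torch.randn(3, 1024)           # batch of 3, codebook of 1024 entries
print(sample_codebook_index(logits))    # repeated calls give different indices
```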
Results
Generated gestures synchronized with the dialogue audio:
During the character's listening periods (top), the model produces appropriately still poses, so the avatar looks like it is listening attentively.
In contrast, during the speaking period (bottom), the model generates a variety of gestures that move in sync with the audio.
Comparison with other methods
Correlation between audio and motion: given the audio (top), the L2 distance of each pose from the average neutral pose is plotted over 400 frames. The curve for this method's rendered avatars (orange) closely tracks the large motion spikes also visible in the ground truth (e.g., a hand flick coinciding with an "ugh" sound), while LDA [2] (pink) fails to capture these sharp spikes.
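That curve is straightforward to reproduce for any generated or ground-truth pose sequence: take the per-frame pose vector and measure its L2 distance from a neutral reference pose. Here is a small NumPy sketch; using the per-sequence mean as the "neutral" pose is my simplification, and the paper may define the neutral reference differently:

```python
import numpy as np

def motion_energy(poses, neutral=None):
    """poses: (T, D) array of per-frame pose vectors (e.g. joint angles).
    Returns the per-frame L2 distance from a neutral reference pose."""
    if neutral is None:
        neutral = poses.mean(axis=0)    # assumption: mean pose as the neutral reference
    return np.linalg.norm(poses - neutral, axis=1)

poses = np.random.randn(400, 104)       # 400 frames, 104-D pose vector (illustrative)
energy = motion_energy(poses)
print(energy.shape)                     # (400,) -- spikes correspond to large gestures
```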
Demo
I also ran it myself: https://colab.research.google.com/drive/1lnX3d-3T3LaO3nlN6R8s6pPvVNAk5mdK?usp=sharing