Last month, we introduced several technologies that let an avatar speak or make facial expressions. Today, let's take a look at some technologies for pose-driven characters.
We previously shared some technologies that make images dance 💃, such as:
This time, we will introduce two related technologies released by Tencent:
MusePose
MusePose is a pose-driven image-to-video framework for virtual human generation.
Link: https://github.com/TMElyralab/MusePose
Like the previously introduced MuseTalk, it appears to be developed by an internal team at Tencent. According to the project team, MusePose is the final module of the Muse open-source series: combined with MuseV and MuseTalk, they hope the community can join them in moving towards a vision where virtual humans with full-body motion and interaction capabilities can be generated end-to-end, and they ask readers to stay tuned for the next milestone.
Scenarios:
Model Architecture:
MusePose is a framework that generates videos from images under control signals such as poses. The currently released model is an implementation of AnimateAnyone, built by optimizing Moore-AnimateAnyone.
It also supports ComfyUI: https://github.com/TMElyralab/Comfyui-MusePose
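To make the pose-driven image-to-video flow more concrete, here is a minimal sketch of the overall pipeline. All names and function bodies below are hypothetical placeholders, not the real MusePose API (the actual repository is driven by config files and command-line scripts); the sketch only illustrates the three conceptual steps: estimate poses from a driving video, align them to the reference image, then generate the frames.

```python
# Hypothetical sketch only -- names and bodies are placeholders, not MusePose code.
from typing import List, Tuple

Pose = List[Tuple[float, float]]  # one frame's 2D body keypoints


def extract_pose_sequence(driving_video: str, num_frames: int = 24) -> List[Pose]:
    """Placeholder for running a pose estimator on each frame of a driving video."""
    return [[(0.5, 0.5)] for _ in range(num_frames)]


def align_poses(poses: List[Pose], reference_image: str) -> List[Pose]:
    """Placeholder for rescaling the skeletons to the reference subject's proportions."""
    return poses


def generate_video(reference_image: str, poses: List[Pose]) -> List[str]:
    """Placeholder for the diffusion model: appearance comes from the reference
    image, per-frame motion comes from the aligned pose sequence."""
    return [f"frame_{i:03d}" for i in range(len(poses))]


poses = extract_pose_sequence("driving_dance.mp4")
poses = align_poses(poses, "avatar.png")
frames = generate_video("avatar.png", poses)
print(f"{len(frames)} frames generated")  # 24 frames generated
```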
Follow-Your-Pose
Follow-Your-Pose is the official implementation of the paper "Follow-Your-Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos."
Link: https://github.com/mayuelala/FollowYourPose
Developed by the same author as the previously introduced Follow-Your-Emoji, it is joint work between Tsinghua University (Tsinghua Shenzhen International Graduate School), the Hong Kong University of Science and Technology (HKUST), and Tencent AI Lab.
Scenarios:
Model Architecture:
The Follow-Your-Pose model is trained with a two-stage strategy:
Stage 1: Train the pose encoder Ep to learn pose control.
Stage 2: Train the temporal modules, namely temporal self-attention (SA) and cross-frame self-attention.
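As a rough illustration of how such a two-stage strategy can be wired up, the sketch below freezes and unfreezes parameter groups per stage using toy PyTorch modules; the module names and shapes are placeholders, not the actual Follow-Your-Pose code.

```python
import torch.nn as nn


class ToyFollowYourPose(nn.Module):
    """Toy stand-in: each attribute represents a block of the real model."""

    def __init__(self):
        super().__init__()
        self.sd_backbone = nn.Linear(8, 8)            # pretrained Stable Diffusion weights
        self.pose_encoder = nn.Linear(8, 8)           # Ep, trained in stage 1
        self.temporal_self_attn = nn.Linear(8, 8)     # temporal SA, trained in stage 2
        self.cross_frame_self_attn = nn.Linear(8, 8)  # cross-frame SA, trained in stage 2


def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag


def configure_stage(model: ToyFollowYourPose, stage: int) -> None:
    set_trainable(model, False)  # keep the pretrained backbone frozen in both stages
    if stage == 1:
        set_trainable(model.pose_encoder, True)        # learn pose control
    else:
        set_trainable(model.temporal_self_attn, True)  # learn temporal coherence
        set_trainable(model.cross_frame_self_attn, True)


model = ToyFollowYourPose()
configure_stage(model, stage=1)  # only the pose encoder receives gradients
configure_stage(model, stage=2)  # only the temporal modules receive gradients
```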
During inference, a temporally coherent video is generated by providing a text description of the target character and a sequence of action poses. Most parameters of the pre-trained Stable Diffusion model are kept frozen, including the pseudo-3D convolution layers and the cross-attention (CA) and feed-forward network (FFN) modules.
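Conceptually, inference then looks like the toy loop below: the text embedding is shared across frames while each frame gets its own pose features, and every module runs with gradients disabled. The modules, shapes, and the "scheduler" update are drastically simplified stand-ins, not the project's actual sampling code.

```python
import torch
import torch.nn as nn

# Tiny stand-ins for the frozen UNet (pseudo-3D conv, CA, FFN blocks) and the pose encoder.
unet = nn.Linear(4 + 4 + 4, 4)   # predicts noise from latent + text + pose features
pose_encoder = nn.Linear(4, 4)   # Ep, encodes one skeleton map per frame
for p in list(unet.parameters()) + list(pose_encoder.parameters()):
    p.requires_grad = False      # everything stays frozen at inference time


@torch.no_grad()
def sample_video(text_emb: torch.Tensor, poses: torch.Tensor, num_steps: int = 25) -> torch.Tensor:
    """Denoise one latent per frame; the prompt is shared, the pose differs per frame."""
    num_frames = poses.shape[0]
    latents = torch.randn(num_frames, 4)
    pose_feats = pose_encoder(poses)
    for _ in range(num_steps):
        cond = torch.cat([latents, text_emb.expand(num_frames, -1), pose_feats], dim=1)
        noise_pred = unet(cond)
        latents = latents - 0.1 * noise_pred  # stand-in for a real scheduler step
    return latents  # in the real model these latents are decoded into RGB frames


frames = sample_video(text_emb=torch.randn(1, 4), poses=torch.randn(24, 4))
print(frames.shape)  # torch.Size([24, 4]): one toy latent per video frame
```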