MusePose and Follow-Your-Pose: Tencent's Newly Released Pose-Driven Character Animation Tools

Last month, we introduced several technologies that let an avatar speak or make facial expressions. Now let's take a look at some technologies for pose-driven character animation.

We have previously shared several technologies that make images dance 💃.

This time, we will introduce two related technologies released by Tencent:

MusePose

MusePose is a pose-driven image-to-video framework for virtual human generation.

Link: https://github.com/TMElyralab/MusePose

Like the previously introduced MuseTalk, it appears to be developed by an internal team at Tencent. MusePose is the final module of the Muse open-source series; combined with MuseV and MuseTalk, the team hopes the community can join them in moving toward a vision where virtual humans with full-body motion and interaction capabilities can be generated end-to-end, and they ask readers to stay tuned for the next milestone.

Scenarios:

Model Architecture:

MusePose is a framework that generates videos from images under control signals such as poses. The currently released model is an optimized version of Moore-AnimateAnyone and reproduces the functionality of AnimateAnyone.

It also supports ComfyUI: https://github.com/TMElyralab/Comfyui-MusePose
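
To make the data flow concrete, below is a minimal, hypothetical sketch of what a pose-driven image-to-video pipeline of this kind consumes and produces: a reference image supplies the character's appearance, a per-frame pose sequence (usually aligned to the reference first) supplies the motion, and the generator outputs one frame per pose. The types and function stubs are illustrative placeholders, not MusePose's actual API; the repository's own scripts and configs are the real entry points.

```python
# Illustrative sketch only -- NOT the MusePose API. It shows the shape of a
# pose-driven image-to-video pipeline: reference image + pose sequence -> video.
from dataclasses import dataclass

import numpy as np


@dataclass
class PoseFrame:
    """2D body keypoints for one frame (e.g., from a DWPose-style detector)."""
    keypoints: np.ndarray  # shape (num_joints, 2), in image coordinates


def align_poses(poses: list[PoseFrame], ref_h: int, ref_w: int) -> list[PoseFrame]:
    """Placeholder for the alignment step: rescale/translate the driving poses
    so their proportions and position match the reference character."""
    raise NotImplementedError("stand-in for the project's pose-alignment step")


def generate_video(reference_image: np.ndarray, poses: list[PoseFrame]) -> list[np.ndarray]:
    """Placeholder for the diffusion-based generator: appearance comes from the
    reference image, motion from the aligned poses; one RGB frame per pose."""
    raise NotImplementedError("stand-in for the pose-conditioned video generator")
```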

Follow-Your-Pose

Follow-Your-Pose is the official implementation of the paper "Follow-Your-Pose: Pose-Guided Text-to-Video Generation using Pose-Free Videos."

Link: https://github.com/mayuelala/FollowYourPose

Developed by the same author as the previously introduced Follow-Your-Emoji, it is joint research by Tsinghua University, Tsinghua Shenzhen International Graduate School, the Hong Kong University of Science and Technology (HKUST), and Tencent AI Lab.

Scenarios:

Model Architecture:

The Follow-Your-Pose model architecture includes a two-stage training strategy:

  1. Stage 1: train the pose encoder Ep to learn pose control.
  2. Stage 2: train the temporal module, i.e., the temporal self-attention (SA) and cross-frame self-attention layers (a minimal sketch of the resulting freezing pattern follows this list).
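
The practical consequence of this recipe is a parameter-freezing pattern: the pretrained Stable Diffusion weights stay fixed throughout, and each stage only updates its newly added module. Below is a minimal PyTorch sketch of that pattern, assuming simplified stand-in modules (`unet`, `pose_encoder`, `temporal_attn`) rather than the paper's actual classes.

```python
import torch.nn as nn

# Simplified stand-ins: in the real model these are the pretrained Stable
# Diffusion UNet, the pose encoder Ep, and the temporal/cross-frame attention.
unet = nn.Sequential(nn.Conv2d(4, 64, 3, padding=1), nn.SiLU(), nn.Conv2d(64, 4, 3, padding=1))
pose_encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.SiLU(), nn.Conv2d(16, 4, 3, padding=1))
temporal_attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True)


def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Toggle gradient computation for every parameter of a module."""
    for p in module.parameters():
        p.requires_grad = trainable


# Stage 1: learn pose control -- the pretrained UNet stays frozen and only the
# pose encoder receives gradients.
set_trainable(unet, False)
set_trainable(temporal_attn, False)
set_trainable(pose_encoder, True)
stage1_params = [p for p in pose_encoder.parameters() if p.requires_grad]

# Stage 2: learn temporal coherence -- only the temporal self-attention /
# cross-frame attention layers are optimized.
set_trainable(pose_encoder, False)
set_trainable(temporal_attn, True)
stage2_params = [p for p in temporal_attn.parameters() if p.requires_grad]
```

Only `stage1_params` (and then `stage2_params`) are handed to the optimizer, which keeps training cost far below fine-tuning the whole UNet.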

During inference, a temporally coherent video is generated by providing a text description of the target character and a sequence of poses. Most parameters of the pretrained Stable Diffusion model are frozen, including the pseudo-3D convolution layers and the cross-attention (CA) and feed-forward network (FFN) modules.
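
For context, a "pseudo-3D convolution" is a common way of reusing frozen image-model convolutions on video: the pretrained 2D convolution is applied to every frame independently, and many variants append a temporal 1D convolution afterwards. The snippet below is a generic sketch of that idea under those assumptions, not the exact layer used in Follow-Your-Pose.

```python
import torch
import torch.nn as nn


class Pseudo3DConv(nn.Module):
    """Generic sketch: apply a (possibly pretrained and frozen) 2D convolution
    frame-by-frame to a video tensor. Only the spatial part is shown; some
    variants add a temporal 1D convolution afterwards."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.spatial = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        b, c, f, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b * f, c, h, w)  # fold frames into the batch
        x = self.spatial(x)                                    # per-frame 2D convolution
        return x.reshape(b, f, -1, h, w).permute(0, 2, 1, 3, 4)


latents = torch.randn(1, 4, 8, 32, 32)    # dummy video latent: (B, C, frames, H, W)
print(Pseudo3DConv(4, 4)(latents).shape)  # -> torch.Size([1, 4, 8, 32, 32])
```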