Animate Anyone - bringing character images to life with animation

A new paper titled "Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation" was recently released. The code hasn't been open-sourced yet, so you can't try it out at the moment, but you can read the paper first: https://arxiv.org/abs/2311.17117

Check out the results first

Their method is summarized as follows: 

First, the pose sequence is encoded by the Pose Guider and fused with multi-frame noise.
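Since the official code isn't out yet, here's a minimal PyTorch sketch of what such a pose guider could look like. The channel sizes, the /8 downsampling to latent resolution, and the zero-initialized final layer are my assumptions (loosely following ControlNet-style condition encoders), not the paper's exact spec:

```python
import torch
import torch.nn as nn

class PoseGuider(nn.Module):
    """Sketch: a small conv stack that maps a rendered pose image
    (e.g. an OpenPose skeleton) down to latent resolution so it can
    be added to the per-frame noise latents."""
    def __init__(self, pose_channels=3, latent_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(pose_channels, 16, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, latent_channels, 3, padding=1),
        )
        # Zero-init the last conv so training starts from "no guidance".
        nn.init.zeros_(self.net[-1].weight)
        nn.init.zeros_(self.net[-1].bias)

    def forward(self, pose):
        return self.net(pose)

# Fuse an encoded pose sequence with multi-frame noise latents.
frames, h, w = 8, 512, 512
pose_seq = torch.randn(frames, 3, h, w)          # one pose image per frame
noise = torch.randn(frames, 4, h // 8, w // 8)   # multi-frame latent noise
guider = PoseGuider()
latents = noise + guider(pose_seq)               # fused input for the UNet
print(latents.shape)                             # torch.Size([8, 4, 64, 64])
```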

Then, the denoising process for video generation is performed by the Denoising UNet. Each computation block of the Denoising UNet consists of spatial attention, cross attention, and temporal attention, as shown in the dashed box on the right of the architecture figure. The reference image is integrated in two ways (a code sketch follows the list below):

  1. Detailed features are extracted via ReferenceNet and used for spatial attention.
  2. Semantic features are extracted via the CLIP image encoder and used for cross attention.

Temporal attention, the third block, operates along the temporal dimension, i.e. across frames.
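To make the three attention types concrete, here's a minimal sketch of one such computation block. The feature dimensions, the way ReferenceNet features join the spatial attention (here: appended to the key/value tokens), and the reshape used for temporal attention are assumptions based on the paper's description, not the unreleased official code:

```python
import torch
import torch.nn as nn

class DenoisingBlock(nn.Module):
    """Sketch of one computation block: spatial attention over frame +
    reference features, cross attention against CLIP image embeddings,
    and temporal attention across frames."""
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x, ref_feat, clip_emb):
        # x: (frames, tokens, dim) UNet features, one row per video frame
        # ref_feat: (1, tokens, dim) ReferenceNet features of the reference image
        # clip_emb: (1, clip_tokens, dim) CLIP image-encoder embeddings
        f, t, d = x.shape
        # 1) Spatial attention: each frame attends over itself plus the
        #    ReferenceNet features, injecting detailed appearance.
        kv = torch.cat([x, ref_feat.expand(f, -1, -1)], dim=1)
        x = x + self.spatial(x, kv, kv, need_weights=False)[0]
        # 2) Cross attention: query the CLIP semantic embedding.
        c = clip_emb.expand(f, -1, -1)
        x = x + self.cross(x, c, c, need_weights=False)[0]
        # 3) Temporal attention: reshape so attention runs across frames
        #    at each spatial location, smoothing motion over time.
        xt = x.permute(1, 0, 2)                  # (tokens, frames, dim)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]
        return xt.permute(1, 0, 2)

block = DenoisingBlock()
out = block(torch.randn(8, 64, 320), torch.randn(1, 64, 320), torch.randn(1, 1, 320))
print(out.shape)  # torch.Size([8, 64, 320])
```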

Finally, the VAE decoder decodes the result into a video clip.
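As a rough sketch, decoding denoised latents back into frames with diffusers' AutoencoderKL could look like the snippet below. The checkpoint id and the 0.18215 latent scaling factor follow Stable Diffusion conventions and are assumptions, since Animate Anyone's own weights aren't public:

```python
import torch
from diffusers import AutoencoderKL

# Load a pretrained Stable Diffusion VAE (assumed stand-in checkpoint).
vae = AutoencoderKL.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="vae"
)

latents = torch.randn(8, 4, 64, 64)  # 8 denoised frame latents
with torch.no_grad():
    # Undo SD's latent scaling, then decode each frame to RGB.
    frames = vae.decode(latents / 0.18215).sample  # (8, 3, 512, 512) in [-1, 1]
frames = (frames.clamp(-1, 1) + 1) / 2             # map to [0, 1] for export
print(frames.shape)
```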

Check out the effects on different types of characters

Real person

Cartoon character

Humanoid

You can also take a look at the paper's comparison of different technical approaches.