Alibaba's DreaMoving: A character video generation framework based on diffusion models.

Alibaba released a paper last month titled "DreaMoving: A Human Video Generation Framework based on Diffusion Models."

Although the code has not been released, the paper and demo videos are available on the project page: https://dreamoving.github.io/dreamoving/

Abstract

DreaMoving is a diffusion-model-based controllable video generation framework for producing high-quality customized human videos. Given a target identity and a sequence of poses, DreaMoving can generate a video of the target identity dancing in any scene, driven by the pose sequence. To achieve this, DreaMoving introduces a "Video ControlNet" for motion control and a "Content Guider" for preserving identity. The proposed model is easy to use and can be adapted to most stylized diffusion models to generate diverse results.

Four input methods

The demo page shows DreaMoving results under four input configurations (a minimal code sketch follows this list):

  • A text prompt only.
  • A text prompt plus a face image.
  • A face image plus a clothing image.
  • A stylized reference image.
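Since the code has not been released, the snippet below is only a sketch of how these four conditioning setups could be described as input bundles; every field name, prompt, and file path in it is hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DreaMovingInputs:
    """Hypothetical bundle of the conditioning signals shown in the demos.

    A pose (or depth) guidance sequence drives the motion in every setup;
    the remaining fields are optional and map to the four demo configurations.
    """
    pose_sequence: str                      # path to the guidance sequence (assumed)
    text_prompt: Optional[str] = None       # scene / appearance description
    face_image: Optional[str] = None        # identity reference
    clothing_image: Optional[str] = None    # garment reference
    style_image: Optional[str] = None       # stylized reference image

# The four demo configurations, expressed with the hypothetical fields above.
examples = [
    DreaMovingInputs("poses.mp4", text_prompt="a girl dancing on the beach"),
    DreaMovingInputs("poses.mp4", text_prompt="a girl dancing in a garden",
                     face_image="face.png"),
    DreaMovingInputs("poses.mp4", face_image="face.png", clothing_image="dress.png"),
    DreaMovingInputs("poses.mp4", style_image="style_reference.png"),
]

for cfg in examples:
    print(cfg)
```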


Results

DreaMoving can generate high-quality, high-fidelity videos given a guidance sequence and simple content descriptions such as text and reference images. Specifically, identity is controlled precisely through face reference images, motion is controlled through the pose sequence, and the overall appearance of the video is controlled through the text prompt.

DreaMoving also generalizes to scenarios that do not exist in real life.

Overview of the architecture

The Video ControlNet is an image ControlNet with motion blocks injected after each U-Net block; it processes the control sequence (pose or depth) into additional temporal residuals. The denoising U-Net is derived from the Stable Diffusion U-Net, with motion blocks added for video generation. The Content Guider converts the input text prompt and appearance references (a face image and, optionally, clothing) into content embeddings for cross-attention.
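Since no code is available, the following PyTorch sketch is only one possible reading of that description, with toy stand-ins for the U-Net blocks, the ControlNet branch, and the Content Guider; the shapes, block counts, and module names are all assumptions rather than details from the paper.

```python
import torch
import torch.nn as nn

B, F, C, H, W = 1, 8, 4, 32, 32   # batch, frames, latent channels, height, width (assumed)
D = 768                           # content-embedding width (assumed)

class MotionBlock(nn.Module):
    """Temporal self-attention across the frame axis (heavily simplified)."""
    def __init__(self, channels):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)

    def forward(self, x):                               # x: (B, F, C, H, W)
        b, f, c, h, w = x.shape
        seq = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, f, c)
        out, _ = self.attn(seq, seq, seq)               # attend over frames at each spatial location
        return x + out.reshape(b, h, w, f, c).permute(0, 3, 4, 1, 2)

class SpatialBlock(nn.Module):
    """Stand-in for a Stable Diffusion U-Net block with cross-attention to content embeddings."""
    def __init__(self, channels, ctx_dim):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.cross = nn.MultiheadAttention(channels, num_heads=4, kdim=ctx_dim,
                                           vdim=ctx_dim, batch_first=True)

    def forward(self, x, ctx):                          # ctx: (B, L, D) content embeddings
        b, f, c, h, w = x.shape
        x = self.conv(x.reshape(b * f, c, h, w)).reshape(b, f, c, h, w)
        q = x.permute(0, 1, 3, 4, 2).reshape(b, f * h * w, c)
        out, _ = self.cross(q, ctx, ctx)                # cross-attention on the content embeddings
        return x + out.reshape(b, f, h, w, c).permute(0, 1, 4, 2, 3)

class ContentGuider(nn.Module):
    """Fuses text features and appearance features (face, optional clothing) into content embeddings."""
    def __init__(self, dim):
        super().__init__()
        self.text_proj = nn.Linear(dim, dim)
        self.image_proj = nn.Linear(dim, dim)

    def forward(self, text_tokens, image_tokens):       # (B, Lt, D), (B, Li, D)
        return torch.cat([self.text_proj(text_tokens),
                          self.image_proj(image_tokens)], dim=1)

class VideoControlNet(nn.Module):
    """Image ControlNet with a motion block after each block; outputs temporal residuals."""
    def __init__(self, channels, ctx_dim, num_blocks=2):
        super().__init__()
        self.control_in = nn.Conv2d(3, channels, 3, padding=1)  # pose/depth frames -> latent channels
        self.blocks = nn.ModuleList(SpatialBlock(channels, ctx_dim) for _ in range(num_blocks))
        self.motion = nn.ModuleList(MotionBlock(channels) for _ in range(num_blocks))

    def forward(self, control, ctx):                    # control: (B, F, 3, H, W)
        b, f, c, h, w = control.shape
        x = self.control_in(control.reshape(b * f, c, h, w)).reshape(b, f, -1, h, w)
        residuals = []
        for block, motion in zip(self.blocks, self.motion):
            x = motion(block(x, ctx))                   # motion block injected after each block
            residuals.append(x)
        return residuals

class DenoisingUNet(nn.Module):
    """Stable-Diffusion-style denoising network (flattened) with motion blocks for video."""
    def __init__(self, channels, ctx_dim, num_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList(SpatialBlock(channels, ctx_dim) for _ in range(num_blocks))
        self.motion = nn.ModuleList(MotionBlock(channels) for _ in range(num_blocks))

    def forward(self, latents, ctx, residuals):
        x = latents
        for block, motion, res in zip(self.blocks, self.motion, residuals):
            x = motion(block(x, ctx)) + res             # add the Video ControlNet temporal residual
        return x                                        # noise prediction (simplified)

# One denoising step on random stand-in inputs.
guider, controlnet, unet = ContentGuider(D), VideoControlNet(C, D), DenoisingUNet(C, D)
latents = torch.randn(B, F, C, H, W)                    # noisy video latents
poses = torch.randn(B, F, 3, H, W)                      # pose/depth control sequence
text_feats = torch.randn(B, 16, D)                      # stand-in text-encoder features
face_feats = torch.randn(B, 4, D)                       # stand-in face/clothing image features
ctx = guider(text_feats, face_feats)
noise_pred = unet(latents, ctx, controlnet(poses, ctx))
print(noise_pred.shape)                                 # torch.Size([1, 8, 4, 32, 32])
```

The point of the sketch is the data flow: the Content Guider's embeddings are shared by both networks through cross-attention, while the Video ControlNet contributes per-block temporal residuals that the motion-augmented denoising U-Net adds at each denoising step.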

Demo

A demo is available on Hugging Face: https://huggingface.co/spaces/jiayong/Dreamoving