Alibaba's EchoMimic - Generating Portrait Videos

Yesterday we introduced Alibaba's speech model; today we introduce Alibaba's video model, EchoMimic, which generates portrait videos. It can drive generation with audio alone, with facial landmarks alone, or with audio combined with selected facial landmarks. It was released just last week.

Introduction

EchoMimic generates portrait videos driven by audio alone, by facial landmarks alone, or by audio combined with selected facial landmarks.

The field of portrait image animation has made significant progress in generating realistic, dynamic portraits driven by audio input. Traditional methods typically use either audio or facial keypoints to drive image-to-video generation. While these methods can produce satisfactory results, each has drawbacks: audio-only methods can be unstable because audio is a weak control signal, whereas keypoint-only methods, though more stable, can look unnatural because the keypoints over-constrain the motion. EchoMimic addresses this with a novel training strategy that trains on audio and facial landmarks simultaneously. As a result, EchoMimic can generate portrait videos from audio alone, from facial landmarks alone, or from audio combined with a selected subset of landmarks.
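The idea of combining the two control signals can be illustrated with a toy sketch. This is not EchoMimic's actual code; the function name, the flat-concatenation scheme, and the modality-dropout step are all illustrative assumptions, meant only to show how a model might be trained to accept audio alone, landmarks alone, or both together.

```python
import random

def build_condition(audio_emb, landmarks, selected_ids, train=False, rng=None):
    """Build one conditioning vector from an audio embedding and a
    *selected subset* of facial landmarks (hypothetical scheme).

    audio_emb: list of floats (audio features for one frame).
    landmarks: dict mapping landmark id -> (x, y) coordinates.
    selected_ids: which landmark ids to keep, e.g. only the mouth region.

    During training, one modality is randomly zeroed out so the model
    learns to work with audio-only, landmark-only, or combined inputs.
    """
    rng = rng or random
    audio = list(audio_emb)
    lmk = [coord for i in selected_ids for coord in landmarks[i]]
    if train:
        mode = rng.choice(["both", "audio_only", "landmarks_only"])
        if mode == "audio_only":
            lmk = [0.0] * len(lmk)        # drop the landmark signal
        elif mode == "landmarks_only":
            audio = [0.0] * len(audio)    # drop the audio signal
    return audio + lmk

# Inference-time example: audio plus only landmark id 1 selected.
cond = build_condition([0.1, 0.2], {0: (1.0, 2.0), 1: (3.0, 4.0)}, [1])
# cond == [0.1, 0.2, 3.0, 4.0]
```

Zeroing out a modality at random during training is the same trick used for classifier-free guidance in diffusion models; here it would let a single network serve all three driving scenarios described above.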

Scenarios

Audio-driven (Chinese)

Audio-driven (English)

Audio-driven (Singing)

Facial landmark-driven

Audio + Selected Facial Landmark-driven

Usage in ComfyUI

EchoMimic is also supported in ComfyUI.

Comparison