Today, I came across StoryDiffusion, a project released by Nankai University and ByteDance. By introducing a consistency self-attention mechanism, it can generate comics in various styles while keeping character appearance and costumes consistent, enabling coherent storytelling.
Features
Cartoon character generation
StoryDiffusion can create stunningly consistent cartoon-style characters.
Multi-character generation
StoryDiffusion can maintain the identities of multiple characters at the same time, keeping them consistent across a series of images.
Long video generation
StoryDiffusion generates high-quality videos with an image semantic motion predictor, conditioning on either the consistent images it generates or user-provided images.
Video editing demonstration
These demos illustrate the performance of the motion predictor.
Method
Structure of consistency self-attention
StoryDiffusion's generation pipeline produces theme-consistent images.
To create theme-consistent images that describe stories, StoryDiffusion integrates a consistency self-attention mechanism into a pre-trained text-to-image diffusion model.
StoryDiffusion divides the story text into multiple prompts and uses these prompts to batch-generate images.
Consistency self-attention establishes connections between multiple images generated in batches to maintain thematic consistency.
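To make this concrete, here is a minimal PyTorch sketch of the idea, not the authors' implementation: for each image in the batch, tokens randomly sampled from the other images are appended to the keys and values of self-attention, so appearance features are shared across the batch. The `to_q`, `to_k`, `to_v` projections and the `sampling_rate` parameter are stand-ins I am assuming for the pre-trained attention projections and the paper's sampling ratio.

```python
import torch
import torch.nn.functional as F

def consistency_self_attention(x, to_q, to_k, to_v, sampling_rate=0.5):
    """Sketch of consistency self-attention over a batch of images.

    x: (B, N, C) token features, one row per image generated in the batch.
    to_q / to_k / to_v: the (pre-trained) attention projections, assumed
    to be callables such as nn.Linear layers.
    """
    B, N, C = x.shape
    outputs = []
    for i in range(B):
        if B > 1:
            # Gather tokens from every other image in the batch.
            others = torch.cat([x[j] for j in range(B) if j != i], dim=0)
            # Randomly sample a fraction of those tokens.
            num_sampled = int(others.shape[0] * sampling_rate)
            idx = torch.randperm(others.shape[0])[:num_sampled]
            # Keys/values see the image's own tokens plus the sampled
            # cross-image tokens, which is what ties the batch together.
            kv = torch.cat([x[i], others[idx]], dim=0)
        else:
            kv = x[i]  # single image: plain self-attention
        q = to_q(x[i])                 # (N, C)
        k = to_k(kv)                   # (N + sampled, C)
        v = to_v(kv)
        attn = F.softmax(q @ k.transpose(0, 1) / C ** 0.5, dim=-1)
        outputs.append(attn @ v)       # (N, C)
    return torch.stack(outputs, dim=0)  # (B, N, C)

# Usage with placeholder projections and random features:
B, N, C = 4, 64, 320
to_q, to_k, to_v = (torch.nn.Linear(C, C) for _ in range(3))
out = consistency_self_attention(torch.randn(B, N, C), to_q, to_k, to_v)
print(out.shape)  # torch.Size([4, 64, 320])
```

Because the mechanism only changes which tokens the keys and values attend over, it can be dropped into the existing self-attention layers of a pre-trained model without any additional training.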
Structure of the motion predictor
StoryDiffusion's video pipeline generates transition videos from the theme-consistent images obtained as described above (Section 3.1 of the paper).
To effectively model large character motions, StoryDiffusion encodes the conditioning images into an image semantic space that captures spatial information, and predicts the transition embeddings in that space.
These predicted embeddings are then decoded using a video generation model and serve as control signals in cross-attention to guide the generation of each frame.
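The post does not include a reference implementation, but a hedged sketch of the idea might look like the following: the start and end key frames are assumed to already be encoded into semantic embeddings (for example by a CLIP-style image encoder), and a small transformer predicts one embedding per transition frame; a video diffusion model would then consume these embeddings as cross-attention conditions. The class name `SemanticMotionPredictor`, the dimensions, and the learnable per-frame queries are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class SemanticMotionPredictor(nn.Module):
    """Sketch of a transition-frame embedding predictor (assumed design)."""

    def __init__(self, dim=768, num_frames=16, num_layers=4, num_heads=8):
        super().__init__()
        self.num_frames = num_frames
        # One learnable query per frame to be predicted.
        self.frame_queries = nn.Parameter(torch.randn(num_frames, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, start_emb, end_emb):
        # start_emb, end_emb: (B, dim) semantic embeddings of the key frames.
        B, _ = start_emb.shape
        queries = self.frame_queries.unsqueeze(0).expand(B, -1, -1)  # (B, F, dim)
        cond = torch.stack([start_emb, end_emb], dim=1)              # (B, 2, dim)
        # The transformer mixes the two conditions with the frame queries
        # and predicts an embedding for every intermediate frame.
        tokens = torch.cat([cond, queries], dim=1)                   # (B, 2+F, dim)
        out = self.transformer(tokens)
        return out[:, 2:, :]                                         # (B, F, dim)

# Usage: the predicted per-frame embeddings would be passed to the video
# model's cross-attention as control signals.
predictor = SemanticMotionPredictor()
start, end = torch.randn(2, 768), torch.randn(2, 768)
transition = predictor(start, end)
print(transition.shape)  # torch.Size([2, 16, 768])
```

Predicting the transition in a semantic embedding space rather than in pixel space is what lets the method handle large motions between key frames.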
Example
I ran an example with Huacheng myself: