"AutoStory: Minimal Input for High-Quality, Diverse Storytelling Images"

Today, Bob shared a paper from Zhejiang University titled "AutoStory: Generating Diverse Storytelling Images with Minimal Human Efforts." https://github.com/aim-uofa/AutoStory

The code is said to be open-sourced later; for now, let's take a look at its results and the paper. https://arxiv.org/pdf/2311.11243.pdf

【Abstract】

Story visualization aims to generate a series of images that match the textual description of a story, requiring the generated images to be high quality, faithful to the text, and consistent in character identity. This paper proposes an automated story visualization system that can generate diverse, high-quality, and consistent sets of story images with minimal human interaction.

Specifically, it uses the understanding and planning capabilities of large language models for layout planning, then leverages large-scale text-to-image models to generate complex story images based on the layout.

The research team empirically found that sparse control conditions (such as bounding boxes) are suitable for layout planning, while dense control conditions (such as sketches and keypoints) are suitable for generating high-quality image content. Therefore, they designed a dense condition generation module that converts simple bounding box layouts into sketch or keypoint control conditions for final image generation, not only improving image quality but also allowing users to interact easily and intuitively.
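To make the bounding-box-to-dense-condition idea concrete, here is a minimal sketch (not the authors' code) of how such a module might composite per-subject conditions: sketches or keypoint maps obtained for each subject separately are resized and pasted into that subject's box on a blank canvas. All names, sizes, and the white stand-in images are assumptions for illustration only.

```python
# Minimal illustration (not the authors' implementation): paste per-subject
# sketch/keypoint images into their bounding boxes to form one dense control map.
from PIL import Image

def compose_dense_condition(canvas_size, layout, subject_conditions):
    """layout: list of (subject_name, (x0, y0, x1, y1)) boxes in pixels.
    subject_conditions: dict mapping subject_name -> PIL.Image sketch/keypoint map."""
    canvas = Image.new("L", canvas_size, color=0)  # black background
    for name, (x0, y0, x1, y1) in layout:
        condition = subject_conditions[name].convert("L")
        condition = condition.resize((x1 - x0, y1 - y0))  # fit the subject's box
        canvas.paste(condition, (x0, y0))
    return canvas

# Hypothetical usage with a two-subject layout on a 512x512 canvas.
layout = [("cat", (32, 160, 240, 480)), ("dog", (272, 160, 480, 480))]
conditions = {"cat": Image.new("L", (256, 256), 255),  # stand-ins for real sketches
              "dog": Image.new("L", (256, 256), 255)}
dense_map = compose_dense_condition((512, 512), layout, conditions)
```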

In addition, the paper proposes a simple yet effective method for generating multi-view consistent character images, eliminating the need to manually collect or draw character images. This enables consistent story visualization even when only text is provided as input.

【Overall Process】

Users only need to provide a brief instruction describing the story and, optionally, a few images for each character:

(a) Condition preparation stage: generate bounding-box layouts corresponding to the text prompts, as well as dense conditions such as sketches or keypoints; this stage is carried out by steps (c) and (d) below.
(b) Conditioned image generation stage: use a multi-subject customized model to generate story images under the guidance of the prepared conditions.
(c) Story-to-layout conversion: use a large language model (LLM) for prompt and layout generation.
(d) Dense condition generation: use existing perception models to extract dense control signals from subject images generated by single-subject customized models. (An illustrative sketch of the intermediate data follows this list.)
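Purely as an illustration of the data handed from stage (a) to stage (b), the per-panel condition could be organized as below. The field names, values, and normalized-coordinate convention are assumptions, not the paper's actual format.

```python
# Assumed, illustrative schema (not the paper's actual format) for the data that
# condition preparation (a) hands to conditioned image generation (b).
panel_condition = {
    "global_prompt": "a cat and a dog playing in a sunny garden, storybook style",
    "subjects": [
        # name, local prompt, and bounding box in normalized [0, 1] coordinates
        {"name": "cat", "local_prompt": "a fluffy orange cat", "box": [0.05, 0.30, 0.45, 0.95]},
        {"name": "dog", "local_prompt": "a small brown dog", "box": [0.55, 0.30, 0.95, 0.95]},
    ],
    "dense_condition_path": "dense_condition.png",  # composed sketch/keypoint map
}

def box_to_pixels(box, width=512, height=512):
    """Convert a normalized box into integer pixel coordinates for compositing."""
    x0, y0, x1, y1 = box
    return int(x0 * width), int(y0 * height), int(x1 * width), int(y1 * height)

print([box_to_pixels(s["box"]) for s in panel_condition["subjects"]])
```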

【Secret to Consistent Character Image Generation】

To generate multiple identity-consistent images of a single character, a single image of the character is first generated, as in (c); an image translation model conditioned on viewpoint is then applied, as in (a), to obtain multi-view images. Next, sketch conditions are extracted from these images, as in (b), and used as conditions to enhance the diversity of the final character images. In (d), a training-free consistency modeling method is introduced to improve identity consistency.
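The steps above map onto a simple control flow. The skeleton below is only a placeholder sketch (every function raises NotImplementedError), since the actual components (the text-to-image model, the viewpoint-conditioned translation model, the sketch extractor, and the training-free consistency step) are not reproduced here; all names are hypothetical.

```python
# Placeholder control-flow skeleton of the character-image pipeline described
# above. Every function body is a stub; the real system plugs in concrete models.

def generate_base_image(character_prompt: str):
    """Generate one image of the character with a text-to-image model."""
    raise NotImplementedError("plug in a text-to-image model")

def lift_to_multiview(image, viewpoints):
    """Turn the single image into multi-view images with a viewpoint-conditioned model."""
    raise NotImplementedError("plug in a viewpoint-conditioned image translation model")

def extract_sketches(images):
    """Extract a sketch condition from each multi-view image."""
    raise NotImplementedError("plug in an edge/sketch detector such as PiDiNet")

def generate_consistent_views(character_prompt: str, sketches):
    """Regenerate the character under each sketch; identity is kept stable by the
    paper's training-free consistency modeling (not reproduced here)."""
    raise NotImplementedError("plug in a sketch-conditioned text-to-image model")

def build_character_image_set(character_prompt: str, viewpoints):
    base = generate_base_image(character_prompt)
    views = lift_to_multiview(base, viewpoints)
    sketches = extract_sketches(views)
    return generate_consistent_views(character_prompt, sketches)
```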

【Comparison】

You can see a comparison between this paper's method and other methods, with AutoStory's results in the last row.

【More Implementation Details】

Detailed prompts for the large language models: as described in Section 3 of the main text, LLMs are used to complete the following four steps (an illustrative layout prompt appears after the list).

Step 1: Story Generation

Step 2: Frame Segmentation

Step 3: Global Prompt Generation

Step 4: Layout Generation
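For the layout-generation step, a prompt of roughly the following shape could be used. This is not the paper's actual prompt (those are given in its appendix); the wording, JSON format, and example reply below are assumptions that only show the general idea of asking the LLM for per-subject boxes in a machine-readable form.

```python
# Illustrative prompt and parser for Step 4 (layout generation); not the paper's
# actual prompt. The example reply is hard-coded so the snippet runs without an API call.
import json

LAYOUT_PROMPT = """You are a layout planner for story illustrations.
For the panel described below, output JSON of the form
{{"subjects": [{{"name": "...", "box": [x0, y0, x1, y1]}}]}}
where boxes use normalized coordinates in [0, 1].
Panel description: {panel_description}
"""

def parse_layout(llm_reply: str):
    """Parse the LLM's JSON reply into (name, box) pairs."""
    data = json.loads(llm_reply)
    return [(s["name"], tuple(s["box"])) for s in data["subjects"]]

prompt = LAYOUT_PROMPT.format(panel_description="a cat and a dog playing in a garden")
example_reply = ('{"subjects": [{"name": "cat", "box": [0.05, 0.3, 0.45, 0.95]}, '
                 '{"name": "dog", "box": [0.55, 0.3, 0.95, 0.95]}]}')
print(parse_layout(example_reply))
```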

【Regarding Main Results】

The figure below shows story image generation results with different characters, storylines, and image styles using the method presented in this paper, along with the character images used to train each story's customization model. The story visualization results in the first two columns on the left were obtained using character images provided by users.

【Intermediate Result Visualization】

Visualizing the intermediate process of generating a single story image: individual character images are first generated from the local prompts produced by the LLM, as shown in (a) and (b).

Then, perception models including Grounding-SAM, PidiNet, and HRNet are used to obtain keypoints for human characters or sketches for non-human characters, as shown in (c) and (d).
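As a rough illustration of this extraction step, the sketch below assumes the controlnet_aux package: PiDiNet produces the sketches, OpenPose is substituted here for HRNet as an easy-to-run keypoint extractor, and the Grounding-SAM cropping step is omitted for brevity.

```python
# Rough illustration (assuming controlnet_aux) of extracting dense conditions:
# keypoints for human characters, sketches for non-human characters.
from PIL import Image
from controlnet_aux import OpenposeDetector, PidiNetDetector

sketch_detector = PidiNetDetector.from_pretrained("lllyasviel/Annotators")
pose_detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

def extract_condition(subject_image: Image.Image, is_human: bool) -> Image.Image:
    """Return a keypoint map for humans, otherwise a sketch of the subject."""
    return pose_detector(subject_image) if is_human else sketch_detector(subject_image)
```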

Subsequently, the keypoints or sketches of individual subjects are combined into dense conditions for story image generation based on the layout generated by LLMs, as shown in (e).

Finally, based on these dense conditions, prompts, and layouts, story images are generated, as shown in (f).
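To illustrate only the conditioning mechanism of this final step, the snippet below uses a generic single-ControlNet call via diffusers. The paper's own multi-subject customized model is not reproduced; the checkpoints, prompt, and file path are assumptions made for the sake of a runnable example.

```python
# Generic ControlNet-conditioned generation (stand-in for the paper's
# multi-subject customized model), conditioned on the composed dense map.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

control_image = Image.open("dense_condition.png")  # composed sketch/keypoint canvas (assumed path)

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/control_v11p_sd15_scribble", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

panel = pipe(
    prompt="a cat and a dog playing in a sunny garden, storybook illustration",
    image=control_image,
    num_inference_steps=30,
).images[0]
panel.save("panel_01.png")
```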

【More Stories】

There are more story visualization results in the figure below. It can be seen that even when generating long stories, AutoStory can produce high-quality, text-aligned, and identity-consistent story images.