Stable Cascade: A Faster and Better Open-Source Image Generation Model

Today I checked out Stable Cascade, released by Stability AI (one of the companies behind Stable Diffusion). Both its generation speed and its image quality are exceptional.

  • The latest version of ComfyUI already supports Stable Cascade.
  • Just update ComfyUI to the latest version and download the corresponding models, and you can run the workflow.
  • You can use this demo workflow to try out its high-speed, high-quality generation: https://gist.github.com/comfyanonymous/0f09119a342d0dd825bb2d99d19b781c


Model Introduction

This model is built on the Würstchen architecture. Its main difference from Stable Diffusion is that it operates in a much smaller latent space, which allows faster inference and lower training costs.

Stable Diffusion uses a compression factor of 8, resulting in 1024x1024 images being encoded as 128x128. Stable Cascade achieves a compression factor of 42, meaning that 1024x1024 images can be encoded as 24x24 while maintaining clear reconstruction quality. Then, a text-conditioned model is trained in this highly compressed latent space. A previous version of this architecture achieved a 16x cost reduction compared to Stable Diffusion 1.5. Stable Cascade has achieved impressive results both visually and in evaluations, performing best in prompt alignment and aesthetic quality in almost all comparisons.
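To make the compression figures concrete, here is a quick sanity check of the latent resolutions (a minimal sketch; the factors 8 and 42 are the spatial compression ratios quoted above):

```python
# Spatial compression: a W x H image maps to roughly (W/f) x (H/f) latents.
def latent_side(image_side: int, compression_factor: int) -> int:
    return image_side // compression_factor

# Stable Diffusion: factor 8 -> 1024x1024 becomes 128x128 latents.
print(latent_side(1024, 8))     # 128
# Stable Cascade: factor 42 -> 1024x1024 becomes roughly 24x24 latents.
print(latent_side(1024, 42))    # 24

# The latent area shrinks by (42/8)^2 ≈ 27.6x relative to Stable Diffusion,
# which is where the faster inference and cheaper training come from.
print(round((42 / 8) ** 2, 1))  # 27.6
```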

All known extensions, such as fine-tuning, LoRA, ControlNet, IP-Adapter, and LCM, are also possible with this architecture. Training and inference code for some of these (fine-tuning, ControlNet, LoRA) has already been released. ComfyUI does not support these extensions yet, but support should arrive soon.

ControlNet

LoRA

Model Overview

Stable Cascade consists of three models: Stage A, Stage B, and Stage C, which form the cascaded generation process that gives "Stable Cascade" its name. Stages A and B handle image compression, playing a role similar to the VAE in Stable Diffusion, but, as mentioned earlier, at a much higher compression ratio. Stage C is responsible for generating the small 24x24 latents from the text prompt. The following figure intuitively illustrates this process. Note that Stage A is a VAE, while Stages B and C are both diffusion models.
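The three-stage flow can be sketched with shape-only placeholder functions. This is not the real model code: the function names are hypothetical, and the channel counts (and the intermediate Stage B shape) are placeholders, not values from the release. Only the 24x24 latent and the 1024x1024 output come from the text above.

```python
# Shape-only sketch of the cascade; shapes are (channels, height, width).
def stage_c(prompt: str) -> tuple:
    """Stage C (diffusion): text prompt -> highly compressed 24x24 latent."""
    return (16, 24, 24)  # channel count is a placeholder

def stage_b(latent_c: tuple) -> tuple:
    """Stage B (diffusion): upsample the Stage C latent toward pixel space."""
    assert latent_c[1:] == (24, 24)
    return (4, 256, 256)  # intermediate shape is a placeholder

def stage_a(latent_b: tuple) -> tuple:
    """Stage A (VAE decoder): decode the latent into the final RGB image."""
    return (3, 1024, 1024)

image_shape = stage_a(stage_b(stage_c("a photo of a cat")))
print(image_shape)  # (3, 1024, 1024)
```

The point is simply that text conditioning happens only in Stage C, on a tiny latent, while Stages B and A progressively decompress it back to full resolution.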

For this release, Stability AI provides two checkpoints for Stage C, two for Stage B, and one for Stage A. Stage C comes in a 1 billion parameter version and a 3.6 billion parameter version; the official recommendation is the 3.6 billion parameter version, since most of the fine-tuning work focused on it. The two versions of Stage B have 700 million and 1.5 billion parameters respectively. Both achieve excellent results, but the 1.5 billion parameter version reconstructs small details better. Therefore, using the larger variant at each stage yields the best results. Finally, Stage A contains 20 million parameters and is kept fixed due to its small size.
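The released checkpoints and the "use the larger variant" advice from the paragraph above can be summarized in a small lookup. The parameter counts are the ones stated in the release; the table structure and helper are just for illustration.

```python
# Released checkpoints and their parameter counts, as listed above.
CHECKPOINTS = {
    "Stage C": [1.0e9, 3.6e9],  # 1B and 3.6B versions
    "Stage B": [700e6, 1.5e9],  # 700M and 1.5B versions
    "Stage A": [20e6],          # single 20M-parameter VAE
}

def recommended(stage: str) -> float:
    """The official advice: pick the larger variant at each stage."""
    return max(CHECKPOINTS[stage])

for stage, variants in CHECKPOINTS.items():
    print(f"{stage}: use the {recommended(stage) / 1e6:.0f}M-parameter checkpoint")
```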

Comparison

Stable Cascade (30 inference steps) was compared with Playground v2 (50 inference steps), SDXL (50 inference steps), SDXL Turbo (1 inference step), and Würstchen v2 (30 inference steps).

Stable Cascade's focus on efficiency is reflected in its architecture and its more compressed latent space. Even though its largest configuration has 1.4 billion more parameters than Stable Diffusion XL, it still offers faster inference.