CatVTON virtual try-on released by Sun Yat-sen University and Meitu

Many virtual try-on methods have been shared before.

Today we introduce a new project released by Sun Yat-sen University and Meitu: CatVTON 🐈

Virtual try-on methods based on diffusion models can produce realistic results, but they typically copy the backbone network as a ReferenceNet or add extra image encoders to process the conditional inputs, which drives up training and inference costs. In this work, the team rethinks whether ReferenceNets and image encoders are necessary, streamlines the interaction between clothing and person, and proposes CatVTON, a simple and efficient diffusion model for virtual try-on. By simply concatenating the inputs in the spatial dimension, CatVTON seamlessly transfers in-store or worn garments of any category onto a target person.
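
To make "concatenating inputs in the spatial dimension" concrete, here is a minimal PyTorch sketch (shapes and variable names are illustrative assumptions, not the authors' code): the encoded garment and the encoded, masked target person are placed side by side in one feature map, so a single diffusion backbone can relate them through its own attention layers instead of a ReferenceNet or an extra image encoder.

```python
import torch

# Hypothetical VAE latents of shape (batch, channels, height, width).
person_latent  = torch.randn(1, 4, 64, 48)  # masked target-person image, encoded
garment_latent = torch.randn(1, 4, 64, 48)  # in-store or worn garment, encoded

# "Concatenation in the spatial dimension": stack along the width axis,
# producing one 64x96 latent instead of two separate conditioning branches.
unet_input = torch.cat([person_latent, garment_latent], dim=-1)
print(unet_input.shape)  # torch.Size([1, 4, 64, 96])

# Because both halves live in the same feature map, the backbone's
# self-attention can exchange information between person and garment
# directly; no ReferenceNet copy or extra image encoder is needed.
```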

The efficiency of the CatVTON model is reflected in three aspects:

  1. Lightweight network: only the original diffusion modules are used, with no additional network modules. The text encoder and the cross-attention modules used for text injection are removed from the backbone, cutting a further 167.02M parameters.

  2. Parameter-efficient training: experiments identified the modules that matter for virtual try-on, so only 49.57M parameters (about 5.51% of the backbone's parameters) need to be trained to reach high-quality try-on results (a sketch of this freezing strategy follows this list).

  3. Simplified inference: CatVTON drops all unnecessary conditions and preprocessing steps, including pose estimation, human parsing, and text input. A garment reference, a target person image, and a mask are all it needs to complete the virtual try-on.
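
As a rough illustration of point 2, the snippet below freezes a Stable-Diffusion-style UNet and re-enables gradients only for its self-attention layers. The checkpoint name and the `attn1` naming follow the Hugging Face diffusers convention (`attn1` is self-attention, `attn2` is cross-attention); this is an assumption for illustration, not CatVTON's released training code, and the exact trainable count depends on the backbone used.

```python
from diffusers import UNet2DConditionModel

# Any diffusers SD checkpoint works for this illustration; CatVTON's actual
# base model may differ (assumption).
unet = UNet2DConditionModel.from_pretrained(
    "stabilityai/stable-diffusion-2-1", subfolder="unet"
)

# Freeze everything, then unfreeze only the self-attention projections.
unet.requires_grad_(False)
for name, param in unet.named_parameters():
    if ".attn1." in name:  # attn1 = self-attention in diffusers UNet blocks
        param.requires_grad = True

trainable = sum(p.numel() for p in unet.parameters() if p.requires_grad)
total = sum(p.numel() for p in unet.parameters())
print(f"trainable: {trainable / 1e6:.2f}M of {total / 1e6:.2f}M total")
```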

Try it out:

ComfyUI Workflow

Gradio App


Structure

CatVTON achieves high-quality try-on results by simply concatenating the conditional image (garment or reference person) and the target person image in the spatial dimension, which keeps them in the same feature space throughout the diffusion process. During training, only the self-attention parameters, which provide global interaction, are learnable; the cross-attention modules used for text interaction are omitted, and no extra conditions such as pose or parsing are required. Together, these choices keep the network lightweight, with very few trainable parameters, and simplify inference.
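
Putting the pieces together, the toy training step below shows the overall shape of the idea under the assumptions above: person and garment latents are concatenated spatially, a mask marks the try-on region, and the denoiser receives no text, pose, or parsing inputs. The `TinyDenoiser` is a stand-in module and the noise schedule is schematic; this is a sketch of the concept, not CatVTON's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyDenoiser(nn.Module):
    """Stand-in for the text-free diffusion backbone (illustrative, not CatVTON's UNet)."""
    def __init__(self, channels: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels + 1, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x, t):
        # A real backbone would also embed the timestep t; omitted for brevity.
        return self.net(x)

denoiser = TinyDenoiser()

# The only inputs needed: person latent, garment latent, and a mask.
person  = torch.randn(1, 4, 64, 48)   # VAE latent of the target person
garment = torch.randn(1, 4, 64, 48)   # VAE latent of the garment
mask    = torch.ones(1, 1, 64, 48)    # try-on region on the person side

# Spatial concatenation; the mask is zero-padded on the garment half,
# since nothing is inpainted there.
x = torch.cat([person, garment], dim=-1)
m = torch.cat([mask, torch.zeros_like(mask)], dim=-1)

# One schematic noise-prediction step (a real scheduler would scale by t).
noise = torch.randn_like(x)
t = torch.randint(0, 1000, (1,))
noisy = x + noise
pred = denoiser(torch.cat([noisy, m], dim=1), t)
loss = F.mse_loss(pred, noise)
loss.backward()
print(f"toy loss: {loss.item():.4f}")
```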

Comparison

Structural Comparison

We show a simple structural comparison of different virtual try-on methods below. Our method relies neither on garment warping nor on a heavy ReferenceNet for extra clothing encoding; it simply concatenates the clothing image and the person image as input and achieves high-quality try-on results.

Efficiency Comparison

We use two concentric circles to represent each method: the outer circle stands for the total number of parameters and the inner circle for the trainable parameters, with areas proportional to the parameter counts. On the VITON-HD dataset, CatVTON achieves a lower FID with fewer total parameters, fewer trainable parameters, and less memory usage.