Meta's multi-modal model Transfusion: Transformer + Diffusion

In a new research paper, researchers from Meta and USC introduce Transfusion, a technique that enables a single model to seamlessly handle both discrete and continuous modalities, such as text and images.

Transfusion is a method for training a single model on both discrete and continuous modalities without quantization or separate modules. The core idea is to train one transformer with two objectives at once: language modeling for text and diffusion modeling for images. During training, the model receives both textual and image data, and the language-modeling loss and the diffusion loss are applied simultaneously.
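A minimal sketch of how such a joint objective might look in PyTorch. The function and tensor names are illustrative, not taken from the paper's code, and the balancing coefficient is a hyperparameter chosen here only for the example:

```python
import torch
import torch.nn.functional as F

# lm_logits:    (batch, seq_len, vocab_size) next-token predictions for text positions
# text_targets: (batch, seq_len) token labels, with -100 marking image positions to ignore
# noise_pred:   (batch, n_patches, patch_dim) predicted noise for image patches
# noise_target: (batch, n_patches, patch_dim) the Gaussian noise actually added during diffusion
def transfusion_loss(lm_logits, text_targets, noise_pred, noise_target, lambda_img=5.0):
    # Language-modeling loss, computed only on text positions (ignore_index skips image patches)
    lm_loss = F.cross_entropy(
        lm_logits.transpose(1, 2), text_targets, ignore_index=-100
    )
    # DDPM-style diffusion loss: mean squared error between predicted and true noise
    diffusion_loss = F.mse_loss(noise_pred, noise_target)
    # A single scalar objective; lambda_img balances the two terms
    return lm_loss + lambda_img * diffusion_loss
```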

Diffusion models and next-token-prediction autoregressive models represent the best worlds for generating continuous and discrete data respectively. This inspired us to develop a new multi-modal method that combines the best of both worlds in a natural and simple way.

-- Chunting Zhou, Co-author

Technology

Meta's Transfusion uses a single transformer architecture to handle both text and images.

Rather than running diffusion at the pixel level, Transfusion uses a variational autoencoder (VAE) to compress each image into a latent representation, which is then broken down into an 8×8 grid of patches.
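A rough illustration, in plain PyTorch, of how a VAE latent could be split into a sequence of patch vectors. The shapes and the `patchify` helper are assumptions made for this example: a 16×16 latent grid with 2×2 patches yields an 8×8 grid of 64 patch vectors:

```python
import torch

def patchify(latent, patch_size=2):
    # latent: (batch, channels, height, width), standing in for the VAE encoder output
    b, c, h, w = latent.shape
    # Split the latent grid into non-overlapping patch_size x patch_size patches
    patches = latent.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
    # -> (batch, channels, h/p, w/p, p, p); flatten each patch into a single vector
    patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * patch_size * patch_size)
    return patches  # (batch, num_patches, patch_dim), ready to feed into the transformer

latent = torch.randn(1, 8, 16, 16)  # a dummy latent in place of a real VAE output
print(patchify(latent).shape)       # torch.Size([1, 64, 32])
```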

In terms of attention mechanisms, Transfusion applies causal attention to text tokens to prevent information leakage from future tokens, while using bidirectional attention among the patches of an image, since image patches have no inherent ordering.
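A small sketch of how such a mixed attention mask could be built. A single image span is assumed for simplicity, and `build_attention_mask` is a hypothetical helper, not the paper's implementation:

```python
import torch

# Returns a boolean mask where True means "this query position may attend to that key position".
# is_image marks which sequence positions are image patches (all assumed to belong to one image).
def build_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    seq_len = is_image.shape[0]
    # Start from a standard causal mask: each position sees itself and everything before it
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Let image patches attend to every other patch of the same image (bidirectional block)
    img_idx = is_image.nonzero(as_tuple=True)[0]
    mask[img_idx.unsqueeze(1), img_idx.unsqueeze(0)] = True
    return mask

# Example: 3 text tokens, then 4 image patches, then 2 more text tokens
is_image = torch.tensor([0, 0, 0, 1, 1, 1, 1, 0, 0], dtype=torch.bool)
print(build_attention_mask(is_image).int())
```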

Image generation results

The researchers also evaluated image generation on its own, comparing Transfusion with dedicated image generation models. Transfusion outperformed popular models such as DALL-E 2 and Stable Diffusion XL, while additionally being able to generate text.