Berkeley's Large World Model (LWM)

Today, I read a paper titled "World Model on Million-Length Video and Language with Blockwise RingAttention," a study on large world models from UC Berkeley.

Introduction

The Large World Model (LWM) is a general-purpose, large-context, multimodal autoregressive model. It is trained with RingAttention on a large dataset of diverse long videos and books, enabling it to understand and generate language, images, and video.

Current language models fall short in understanding aspects of the world that are difficult to describe in text and also struggle with complex, lengthy tasks. Video sequences provide valuable temporal information missing from language and static images, making them an attractive choice for joint modeling with language. Such models can develop an understanding of human textual knowledge and the physical world, thereby broadening AI's capabilities in assisting humans.

However, learning from million-length video and language sequences is challenging due to memory constraints, computational cost, and the scarcity of suitable datasets. To address these challenges, the team curated a large and diverse dataset of videos and books, used Blockwise RingAttention to train on long sequences scalably, and progressively increased the context size from 4K to 1M tokens.
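Since Blockwise RingAttention is the core enabler here, a minimal single-process sketch of the idea follows: the sequence is split into blocks as if sharded across devices, each "device" keeps its query block fixed while key/value blocks rotate around a ring, and an online softmax accumulates partial results so the full attention matrix is never materialized. This NumPy simulation is illustrative only, not the paper's distributed implementation.

```python
import numpy as np

def ring_attention(q, k, v, num_blocks):
    """Single-process simulation of blockwise ring attention.

    q, k, v: (seq_len, d) arrays. The sequence is split into
    `num_blocks` blocks, as if each block lived on its own device.
    Each "device" keeps its query block and consumes key/value
    blocks in ring order, one hop per step, accumulating softmax
    statistics online so no (seq_len x seq_len) matrix is built.
    """
    seq_len, d = q.shape
    qs, ks, vs = (np.split(x, num_blocks) for x in (q, k, v))

    outputs = []
    for i in range(num_blocks):
        qi = qs[i]
        m = np.full((qi.shape[0], 1), -np.inf)  # running row max
        l = np.zeros((qi.shape[0], 1))          # running normalizer
        acc = np.zeros_like(qi)                 # unnormalized output
        for step in range(num_blocks):
            j = (i + step) % num_blocks         # ring hop: next k/v block
            scores = qi @ ks[j].T / np.sqrt(d)
            m_new = np.maximum(m, scores.max(axis=-1, keepdims=True))
            p = np.exp(scores - m_new)
            correction = np.exp(m - m_new)      # rescale old accumulators
            l = l * correction + p.sum(axis=-1, keepdims=True)
            acc = acc * correction + p @ vs[j]
            m = m_new
        outputs.append(acc / l)
    return np.concatenate(outputs)

# Sanity check against ordinary full attention on a toy sequence.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
scores = q @ k.T / np.sqrt(8)
ref = np.exp(scores - scores.max(-1, keepdims=True))
ref = (ref / ref.sum(-1, keepdims=True)) @ v
assert np.allclose(ring_attention(q, k, v, num_blocks=4), ref)
```

In the real distributed setting each block's key/value shard is sent to the next device in the ring while attention for the current shard is computed, overlapping communication with computation.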

LWM Functionality

  • LWM can accurately retrieve facts from contexts exceeding 1M tokens (a retrieval-probe sketch follows this list).
  • LWM can answer questions about YouTube videos over an hour long.
  • LWM can hold conversations that incorporate images.
  • LWM can generate videos and images from text.
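As context for the first bullet, here is a hedged sketch of the kind of "needle in a haystack" probe typically used to test fact retrieval at long context lengths. The helper functions and the commented-out model call are hypothetical illustrations, not LWM's actual evaluation harness.

```python
def build_haystack(filler: str, needle: str, total_chars: int, depth: float) -> str:
    """Embed `needle` at relative position `depth` (0.0-1.0) in repeated filler."""
    body = (filler * (total_chars // len(filler) + 1))[:total_chars]
    pos = int(len(body) * depth)
    return body[:pos] + "\n" + needle + "\n" + body[pos:]

def retrieval_prompt(haystack: str, question: str) -> str:
    return f"{haystack}\n\nQuestion: {question}\nAnswer:"

# Usage: sweep context lengths and needle depths, then score whether
# the model's answer contains the planted fact.
needle = "The magic number for the experiment is 7481."
prompt = retrieval_prompt(
    build_haystack("The sky was clear and the wind was calm. ", needle,
                   total_chars=200_000, depth=0.5),
    "What is the magic number for the experiment?",
)
# answer = model.generate(prompt)   # hypothetical model call
# assert "7481" in answer
```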

Training the Large World Model

The paper's overview diagram illustrates the multimodal training process of the Large World Model:

  • The first stage, LLM context extension, uses the Books3 dataset to expand the context size from 32K to 1M tokens.
  • The second stage, vision-language training, focuses on training with visual and video content of varying lengths (see the schedule sketch below).

The pie chart details how 495B tokens are distributed among images, short videos, and long videos, alongside 33B tokens of text data. The bottom panel demonstrates the model's interactive ability to understand and answer questions about a complex multimodal world.
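To make the two-stage curriculum concrete, below is an illustrative schedule encoded as plain data. The Books3 endpoints (32K and 1M) come from the description above; the intermediate context sizes, data mixes, and field names are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative two-stage training schedule for LWM-style context growth.
# Stage 1 endpoints (32K -> 1M on Books3) are from the description above;
# intermediate context sizes and the stage-2 mixes are assumed placeholders.

STAGE_1_TEXT = [  # LLM context extension on Books3
    {"context_tokens": 32_768,    "dataset": "Books3"},
    {"context_tokens": 131_072,   "dataset": "Books3"},
    {"context_tokens": 262_144,   "dataset": "Books3"},
    {"context_tokens": 524_288,   "dataset": "Books3"},
    {"context_tokens": 1_048_576, "dataset": "Books3"},
]

STAGE_2_VISION_LANGUAGE = [  # joint training on mixed-length visual data
    {"context_tokens": 8_192,     "data_mix": ["text-image pairs"]},
    {"context_tokens": 32_768,    "data_mix": ["short videos", "images"]},
    {"context_tokens": 1_048_576, "data_mix": ["long videos", "short videos", "images"]},
]

def run_schedule(stages):
    """Walk the curriculum, warm-starting each stage from the previous one."""
    for stage in stages:
        source = stage.get("dataset") or stage["data_mix"]
        print(f"train at context={stage['context_tokens']:,} on {source}")

run_schedule(STAGE_1_TEXT + STAGE_2_VISION_LANGUAGE)
```

The design point is that each stage reuses the weights of the previous one, so the expensive long-context steps are only taken after the model is already competent at shorter contexts.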

More results (with comparisons)

LWM answers questions about YouTube videos

Image generation from text

Video generation from text

  • A ball thrown in the air

  • Slow motion flower petals falling on the ground

  • A burning campfire in a forest

  • A boat sailing on a stormy ocean