Google's newly launched Genie model — AI-generated games

Last week, Google published a paper titled "Genie: Generative Interactive Environments," presenting the first generative interactive environment trained in an unsupervised manner from unlabelled internet videos. The model can generate a variety of controllable virtual worlds from prompts such as text, synthetic images, photographs, and even sketches.

With 11 billion parameters, Genie can be considered a foundation world model. It consists of a spatiotemporal video tokenizer, an autoregressive dynamics model, and a simple yet scalable latent action model. Although training uses no ground-truth action labels or any of the other domain-specific requirements common in the world-model literature, Genie lets users manipulate the generated environments frame by frame. Moreover, the learned latent action space enables agents to imitate behaviors from videos they have never seen, paving the way for training general-purpose agents in the future.


Introduction

Genie was trained on a dataset of more than 200,000 hours of 2D platformer game videos, yielding an 11-billion-parameter world model. Through unsupervised learning, Genie acquires a variety of latent actions that control characters consistently.

This model can transform any image into a playable 2D world. For example, Genie can bring human-designed creations to life, such as the beautiful artwork from Seneca and Caspian, two of the youngest world creators in history.

The latent action space learned by Genie is not only diverse and consistent but also interpretable: after a few attempts, humans can usually work out how the latent actions map to semantically meaningful behaviors (such as walking left, walking right, or jumping).

Technology

Genie combines a spatiotemporal video tokenizer, an autoregressive dynamics model, and a latent action model to generate controllable video environments. It is trained on video data alone, without action labels, inferring the latent actions between frames through unsupervised learning and thereby enabling frame-by-frame control of the generated video sequences. To alleviate the quadratic memory cost that Vision Transformers incur on video, Genie employs a memory-efficient ST-transformer in all of its components. The model consists of three parts, as shown below:

  1. Video tokenizer: analyzes video frames and converts them into a sequence of discrete representative tokens, capturing the spatiotemporal information in the video.

  2. Latent action model: learns and infers the actions and changes between frames, which are not explicitly labeled in the training data.

  3. Dynamics model: predicts the next video frame from the current frame and the inferred latent action, generating the next frame in the video sequence.


This design enables Genie to learn, from videos alone and without relying on external annotations, how to control and generate dynamic environments, providing a strong foundation for creating complex video simulations and interactive experiences.
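To make this division of labor concrete, here is a minimal PyTorch sketch of the three-part structure. The class boundaries follow the description above, but every concrete choice (layer sizes, an 8-action codebook, nearest-neighbour quantization, the toy dynamics head) is an illustrative assumption rather than the paper's implementation, which builds all three components from ST-transformer blocks and trains them with VQ-VAE-style objectives at far larger scale.

```python
import torch
import torch.nn as nn


class VideoTokenizer(nn.Module):
    """Maps each frame to a grid of discrete tokens (VQ-style, simplified)."""

    def __init__(self, codebook_size=1024, dim=64):
        super().__init__()
        self.encoder = nn.Conv2d(3, dim, kernel_size=8, stride=8)  # 8x8 patches
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, frames):                     # frames: (B, T, 3, H, W)
        B, T, _, _, _ = frames.shape
        z = self.encoder(frames.flatten(0, 1))     # (B*T, dim, H/8, W/8)
        z = z.flatten(2).transpose(1, 2)           # (B*T, N, dim)
        # Nearest-codebook lookup gives one discrete token id per patch.
        cb = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        tokens = torch.cdist(z, cb).argmin(-1)     # (B*T, N)
        return tokens.view(B, T, -1)               # (B, T, N)


class LatentActionModel(nn.Module):
    """Infers a discrete latent action explaining the change between two frames."""

    def __init__(self, num_actions=8, dim=64, codebook_size=1024):
        super().__init__()
        self.embed = nn.Embedding(codebook_size, dim)
        self.head = nn.Linear(2 * dim, num_actions)

    def forward(self, tokens_t, tokens_t1):        # (B, N) token ids for frames t, t+1
        h = torch.cat([self.embed(tokens_t).mean(1),
                       self.embed(tokens_t1).mean(1)], dim=-1)
        return self.head(h).argmax(-1)             # (B,) latent action id


class DynamicsModel(nn.Module):
    """Predicts the next frame's tokens from the current tokens and a latent action."""

    def __init__(self, codebook_size=1024, num_actions=8, dim=64):
        super().__init__()
        self.token_embed = nn.Embedding(codebook_size, dim)
        self.action_embed = nn.Embedding(num_actions, dim)
        self.out = nn.Linear(dim, codebook_size)

    def forward(self, tokens_t, action):           # (B, N) tokens, (B,) action ids
        h = self.token_embed(tokens_t) + self.action_embed(action)[:, None, :]
        return self.out(h).argmax(-1)              # (B, N) predicted next-frame tokens


# Wiring the three parts together on a dummy two-frame clip.
frames = torch.rand(1, 2, 3, 64, 64)               # two 64x64 RGB frames
tokenizer, lam, dynamics = VideoTokenizer(), LatentActionModel(), DynamicsModel()
tokens = tokenizer(frames)                         # (1, 2, 64)
action = lam(tokens[:, 0], tokens[:, 1])           # latent action between the frames
next_tokens = dynamics(tokens[:, 0], action)       # tokens for the predicted next frame
```

In the actual model, the latent action model is itself trained with a decoder that reconstructs the next frame, which is what forces the inferred latent action to carry meaningful information; that training detail, along with the ST-transformer backbone, is omitted from this toy version.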

The keys to training the model are data and compute. The team trained a classifier to filter the video dataset for high-quality segments and ran scaling experiments. The results show that performance improves steadily as the number of parameters and the batch size increase; the final model contains 11 billion parameters.
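The filtering step can be pictured as a simple pass over candidate clips with a learned scorer. The sketch below shows that general shape only; `quality_classifier`, the clip names, and the 0.5 cutoff are hypothetical placeholders, not details from the paper.

```python
from typing import Callable, Iterable, List


def filter_clips(clips: Iterable[str],
                 quality_classifier: Callable[[str], float],
                 threshold: float = 0.5) -> List[str]:
    """Keep only the clips whose predicted quality score clears the threshold."""
    return [clip for clip in clips if quality_classifier(clip) >= threshold]


# Example usage with a stand-in scorer in place of a trained classifier.
scores = {"clip_a.mp4": 0.9, "clip_b.mp4": 0.2}
kept = filter_clips(scores, lambda clip: scores[clip])
print(kept)  # ['clip_a.mp4']
```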

Results

Playing from image prompts: Genie can be prompted with images generated by text-to-image models, hand-drawn sketches, or real-world photographs. In each case, the paper shows the prompt frame and the second frame after taking the same four consecutive latent actions, and in each case significant character movement is visible, even though some of the images are visually quite different from the training data.
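As a rough usage illustration, the snippet below prompts the toy components from the Technology sketch with a single image and rolls them forward through four latent actions. The action ids are arbitrary, and in the full model the predicted token grid would be decoded back into pixels after each step.

```python
import torch

# Reuses the toy VideoTokenizer and DynamicsModel classes from the sketch above.
tokenizer, dynamics = VideoTokenizer(), DynamicsModel()

prompt = torch.rand(1, 1, 3, 64, 64)      # any single image: photo, sketch, render
tokens = tokenizer(prompt)[:, 0]          # (1, N) tokens for the prompt frame

for action_id in [2, 2, 5, 5]:            # four consecutive latent actions (arbitrary ids)
    action = torch.tensor([action_id])
    tokens = dynamics(tokens, action)     # tokens for the next generated frame
```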

Genie is broadly applicable and is not limited to two dimensions. The team also trained Genie on robotics data (RT-1). Although this data contains no action labels, Genie demonstrates that it can learn a controllable, action-conditioned simulation environment from it. This represents a promising step towards developing a general world model for artificial general intelligence (AGI).

Genie currently runs at about 1 FPS (frame per second), which is still far from real-time playability or a genuine gaming experience. Even so, considering the complexity and depth of the model's processing, this is already a fairly impressive achievement.