A comparison of Tora from Alibaba and MotionCtrl from Tencent, two models for controlling the trajectories of objects in generated video.
- Tencent's MotionCtrl was open-sourced at the end of last year.
- Alibaba's Tora paper was released just last week; the code has not yet been open-sourced.
Effect comparison:
Tora
MotionCtrl
Method comparison:
Overview of Tora architecture:
To achieve trajectory control in DiT-based video generation, Tora introduces two new modules: the Trajectory Extractor and the Motion-guidance Fuser.
- The Trajectory Extractor uses a 3D motion VAE to embed trajectory vectors into the same latent space as the video clips, preserving motion information across consecutive frames, then extracts hierarchical motion features with stacked convolutional layers.
- The Motion-guidance Fuser uses adaptive normalization layers to inject these multi-level motion conditions into the corresponding DiT blocks, ensuring that the generated videos consistently follow the defined trajectories.
Tora's design inherits the scalability of DiT and can produce high-resolution, motion-controllable, long-duration videos.
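To make the adaptive-normalization injection concrete, here is a minimal PyTorch sketch of a fuser-style block. The class and variable names, dimensions, and the exact conditioning shape are illustrative assumptions, not Tora's actual implementation: the idea is that each DiT block's normalized tokens get a scale and shift predicted from that level's motion features.

```python
import torch
import torch.nn as nn

class MotionGuidanceFuser(nn.Module):
    """Sketch of adaptive-normalization injection of motion features
    into a DiT block (names and shapes are hypothetical)."""
    def __init__(self, hidden_dim: int, motion_dim: int):
        super().__init__()
        # Normalize without learned affine; the affine comes from the motion condition.
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        # Predict a per-channel scale and shift from the motion condition.
        self.to_scale_shift = nn.Linear(motion_dim, 2 * hidden_dim)

    def forward(self, tokens: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, hidden_dim); motion: (batch, motion_dim)
        scale, shift = self.to_scale_shift(motion).chunk(2, dim=-1)
        # Broadcast the motion-conditioned affine over the token sequence.
        return self.norm(tokens) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# One fuser per DiT block would inject the matching level of motion features.
fuser = MotionGuidanceFuser(hidden_dim=64, motion_dim=16)
out = fuser(torch.randn(2, 10, 64), torch.randn(2, 16))
print(out.shape)  # torch.Size([2, 10, 64])
```

Predicting the affine parameters from the condition (rather than concatenating features) keeps the DiT block's token shape unchanged, which is what lets the same injection pattern repeat at every level of the stack.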
MotionCtrl architecture:
MotionCtrl extends LVDM's denoising U-Net with a Camera Motion Control Module (CMCM) and an Object Motion Control Module (OMCM).
- The CMCM integrates the camera pose sequence (RT) with LVDM's temporal transformer: RT is appended to the input of the second self-attention module, and a custom lightweight fully connected layer extracts camera pose features for subsequent processing.
- The OMCM derives multi-scale features from the trajectories (Trajs) using convolutional layers and downsampling, and spatially integrates them into LVDM's convolutional layers to guide object motion.
Given a text prompt, LVDM then generates a video from noise in which the background and object motions follow the specified camera poses and trajectories.
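The CMCM's conditioning step can be sketched as follows. This is a simplified stand-in, not MotionCtrl's code: it assumes each frame's camera pose is a flattened 3x4 [R|T] matrix (12 values), concatenates it to the temporal tokens, and projects back with a lightweight fully connected layer before a self-attention pass that stands in for the second self-attention module.

```python
import torch
import torch.nn as nn

class CameraMotionControl(nn.Module):
    """Sketch of CMCM-style camera conditioning (illustrative names;
    pose_dim=12 assumes a flattened 3x4 [R|T] matrix per frame)."""
    def __init__(self, hidden_dim: int, pose_dim: int = 12):
        super().__init__()
        # Lightweight FC layer mapping the pose-augmented tokens back to hidden_dim.
        self.proj = nn.Linear(hidden_dim + pose_dim, hidden_dim)
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads=4, batch_first=True)

    def forward(self, frames: torch.Tensor, rt: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, hidden_dim); rt: (batch, num_frames, pose_dim)
        # Append the camera poses to the temporal tokens, then project.
        x = self.proj(torch.cat([frames, rt], dim=-1))
        out, _ = self.attn(x, x, x)  # stands in for the second self-attention
        return out

cmcm = CameraMotionControl(hidden_dim=64)
out = cmcm(torch.randn(2, 16, 64), torch.randn(2, 16, 12))
print(out.shape)  # torch.Size([2, 16, 64])
```

Concatenating RT along the feature dimension, rather than adding it as extra tokens, ties each frame's token directly to that frame's camera pose before temporal attention mixes information across frames.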