Last year, Meta released the Segment Anything Model (SAM), which received great feedback from the community. Yesterday, Meta released SAM 2, a unified model for real-time, promptable object segmentation in both images and videos, achieving state-of-the-art performance.
Meta has shared the code and model weights under a permissive Apache 2.0 license, and also released the SA-V dataset, which includes approximately 51,000 real-world videos and over 600,000 spatio-temporal masks (masklets). (Real open source!!)
SAM 2 can segment any object in any video or image, even objects and visual domains it has never seen before, so it can be applied across scenarios without per-task customization. It has many potential practical applications: for example, its output masks can be combined with generative video models to create new video effects and unlock new creative applications. SAM 2 can also speed up annotation tools for visual data, helping build better computer vision systems. A simple example of turning a mask into a video effect is sketched below.
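To make the "masks to video effects" idea concrete, here is a tiny, hypothetical NumPy sketch (not part of SAM 2 itself); the function name and the random stand-in data are my own illustration of applying a per-frame mask as a simple effect.

```python
import numpy as np

def highlight_object(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Keep the masked object in color and render the background in grayscale.

    frame: (H, W, 3) uint8 RGB image; mask: (H, W) boolean object mask,
    e.g. one frame's mask produced by a segmentation model such as SAM 2.
    """
    gray = frame.mean(axis=2, keepdims=True)            # (H, W, 1) luminance-ish
    gray = np.repeat(gray, 3, axis=2).astype(frame.dtype)
    keep = mask.astype(bool)[..., None]                  # broadcast over RGB channels
    return np.where(keep, frame, gray)

# Toy usage with random data standing in for a real frame and a real mask.
frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
mask = np.zeros((480, 640), dtype=bool)
mask[100:300, 200:400] = True
styled = highlight_object(frame, mask)
```

Run per frame over a video's masklet and you get a basic "spotlight the tracked object" effect.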
Web preview
The web-based demo of SAM 2 (https://sam2.metademolab.com/) offers a preview that lets you segment and track objects in a video and apply effects. I tracked a football ⚽️ and a watch ⌚️, and the results held up well as the frames went by. The model is also open source on GitHub if you want to set up your own environment: https://github.com/facebookresearch/segment-anything-2.
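If you want to try the released model yourself, the snippet below is a minimal sketch of single-image prompting modeled on the repository's README-style API at the time of writing. The checkpoint and config filenames, the example photo, and the click coordinates are assumptions that may differ from your checkout, and a CUDA GPU is assumed.

```python
# Sketch only: verify filenames and API against the segment-anything-2 repo you install.
import numpy as np
import torch
from PIL import Image
from sam2.build_sam import build_sam2
from sam2.sam2_image_predictor import SAM2ImagePredictor

checkpoint = "./checkpoints/sam2_hiera_large.pt"  # downloaded separately (placeholder name)
model_cfg = "sam2_hiera_l.yaml"                   # placeholder config name
predictor = SAM2ImagePredictor(build_sam2(model_cfg, checkpoint))

image = np.array(Image.open("photo.jpg").convert("RGB"))  # any RGB image

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    predictor.set_image(image)
    # one positive click (x, y) on the object of interest
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
        multimask_output=True,
    )

print(masks.shape, scores)  # candidate masks with predicted quality scores
```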
Technical framework
Segment Anything Model 2 (SAM 2) is a foundation model for promptable visual segmentation in images and videos. Meta extended SAM to video by treating an image as a single-frame video; the model is a simple transformer architecture with streaming memory, which enables real-time video processing.
Meta built a model-in-the-loop data engine that improves the model and the data through user interaction, and used it to collect the SA-V dataset (shown in the figure below), the largest video segmentation dataset to date. Trained on this data, SAM 2 performs strongly across a wide range of tasks and visual domains.
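As a rough illustration of the streaming, prompt-then-propagate workflow on video, here is a hedged sketch modeled on the repository's video predictor API. Exact method names and arguments may differ between releases, and the frame directory, checkpoint/config names, and click coordinates are placeholders.

```python
# Sketch only: prompt one frame, then let the streaming memory propagate the masklet.
import numpy as np
import torch
from sam2.build_sam import build_sam2_video_predictor

predictor = build_sam2_video_predictor("sam2_hiera_l.yaml",
                                       "./checkpoints/sam2_hiera_large.pt")

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    # a directory of extracted JPEG frames (placeholder path)
    state = predictor.init_state(video_path="video_frames/")

    # prompt a single frame with one positive click on the object to track
    predictor.add_new_points(
        inference_state=state, frame_idx=0, obj_id=1,
        points=np.array([[210, 350]], dtype=np.float32),
        labels=np.array([1], dtype=np.int32),
    )

    # propagate the prompt through the video to get a mask per frame (a masklet)
    masklet = {}
    for frame_idx, object_ids, mask_logits in predictor.propagate_in_video(state):
        masklet[frame_idx] = (mask_logits[0] > 0.0).cpu().numpy()
```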
Architecture evolution
Compared with SAM, SAM 2 adds an occlusion head that predicts whether the target object is visible in the current frame, so segmentation and tracking can continue even when the object is temporarily occluded.
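To illustrate the idea of an occlusion head (a deliberately simplified toy module, not Meta's implementation), the sketch below pairs per-pixel mask logits with a single "is the object visible in this frame" logit:

```python
import torch
import torch.nn as nn

class TinyMaskHeadWithOcclusion(nn.Module):
    """Illustrative only: a mask head that also scores object visibility,
    mimicking the concept of SAM 2's occlusion head."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        # per-pixel mask logits from the frame embedding
        self.mask_head = nn.Conv2d(embed_dim, 1, kernel_size=1)
        # one logit: is the tracked object visible in this frame?
        self.occlusion_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(embed_dim, 1),
        )

    def forward(self, frame_embedding: torch.Tensor):
        mask_logits = self.mask_head(frame_embedding)         # (B, 1, H, W)
        visible_logit = self.occlusion_head(frame_embedding)  # (B, 1)
        return mask_logits, visible_logit

# toy usage: a low visibility score means "object occluded, suppress the mask"
x = torch.randn(2, 256, 64, 64)
masks, visible = TinyMaskHeadWithOcclusion()(x)
print(masks.shape, torch.sigmoid(visible))
```

When the visibility score is low, a tracker can suppress the mask for that frame instead of hallucinating one.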
Use cases
SAM 2 can be applied directly to practical use cases, such as tracking objects to create video effects (left image), or segmenting moving cells in microscope footage to assist scientific research (right image).
In the future, SAM 2 could be part of larger AI systems, recognizing everyday items through AR glasses and providing reminders and instructions to users.
Comparison
In the comparison, both methods are initialized with the same first-frame mask of the T-shirt, produced by SAM. SAM 2 then accurately tracks just that object part throughout the video, while the baseline over-segments, pulling in the person's head instead of sticking to the T-shirt.
For image segmentation, SAM 2 (right image) is also more accurate than the original SAM (left image).