Video Object Co-Segmentation


In this project, we propose a novel approach for object co-segmentation in arbitrary videos by sampling, tracking and matching object proposals via a Regulated Maximum Weight Clique (RMWC) extraction scheme. The approach prunes noisy segments in each video by selecting object proposal tracklets that are spatially salient and temporally consistent, and it iteratively extracts weighted groupings of objects with similar shape and appearance (within and across videos). The object regions obtained from the video sets are used to initialize per-pixel segmentation, which produces the final co-segmentation results. Our approach is general in the sense that it can handle multiple objects, temporary occlusions, and objects going in and out of view. Additionally, it makes no prior assumption on the commonality of objects in the video collection. The proposed method is evaluated on a publicly available multi-class video object co-segmentation dataset and demonstrates improved performance compared to state-of-the-art methods.

Figure 1: The framework of the proposed method. 

The proposed approach has the following advantages:

1. The proposed method employs object tracklets to obtain spatially salient and temporally consistent object regions for co-segmentation, whereas most previous co-segmentation methods simply cluster pixel-level or region-level features. The perceptual grouping of pixels before matching reduces segment fragmentation and leads to a simpler matching problem.

2. The proposed approach does not rely on approximate solutions for object groups. The grouping problem is modeled as a Regulated Maximum Weight Clique (RMWC) problem for which an optimal solution is available. The use of only the salient object tracklets for grouping keeps the computational cost low.

3. Unlike the state-of-the-art single video object segmentation method, the proposed method can handle occluded objects and objects that enter or leave the field of view, because the object tracklets are temporally local and an object is not required to remain continuously in view. Furthermore, there is no limit on the number of object classes in each video or on the number of common object classes in the video collection. The proposed approach can therefore be used to extract objects in an unsupervised fashion from general video collections.

4. The proposed method differs from the Maximum Weight Clique formulation already explored in video object segmentation: the weight of a clique is not simply the sum of its node weights but is regulated by an intra-clique consistency term. The extracted cliques are therefore more globally consistent, which encourages similar objects from different videos to be grouped together.
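
The difference between a plain clique weight and a regulated one can be illustrated with a small sketch. The regulated form used here (the node-weight sum scaled by the mean pairwise similarity inside the clique) is only an assumed stand-in for the intra-clique consistency term, not the formula from the paper; the function names, tracklet names and all numbers are hypothetical.

```python
from itertools import combinations

def plain_weight(clique, node_w):
    """Classic maximum-weight-clique objective: the sum of node weights."""
    return sum(node_w[n] for n in clique)

def regulated_weight(clique, node_w, sim):
    """Assumed regulated objective: the node-weight sum scaled by the mean
    pairwise similarity inside the clique (a single node is left unscaled)."""
    pairs = list(combinations(sorted(clique), 2))
    if not pairs:
        return plain_weight(clique, node_w)
    consistency = sum(sim[p] for p in pairs) / len(pairs)
    return plain_weight(clique, node_w) * consistency

# Hypothetical toy data: three mutually connected tracklets.
# Similarity keys are (u, v) tuples with u < v.
node_w = {"a": 1.0, "b": 1.0, "c": 0.9}
sim = {("a", "b"): 0.9, ("a", "c"): 0.3, ("b", "c"): 0.3}

tight = {"a", "b"}        # two very similar tracklets
loose = {"a", "b", "c"}   # adds a tracklet that resembles neither

print(plain_weight(tight, node_w), plain_weight(loose, node_w))
print(regulated_weight(tight, node_w, sim), regulated_weight(loose, node_w, sim))
```

Under these assumed numbers the plain sum prefers the larger, less coherent clique, while the regulated score prefers the tight pair, which is the behaviour point 4 describes.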


The Framework

The proposed framework consists of two stages (as shown in Figure 1):

1. Object Tracklet Generation. In this stage, we generate a number of object proposals for each frame, use each proposal as a starting point, and track it backward and forward through the whole video sequence. From the resulting track set we keep the reliable tracklets (those with high similarity over time) and apply non-maximum suppression to remove noisy or overlapping proposals.

Figure 2: Object Proposal Tracking

2. Multiple Object Co-Segmentation by Regulated Maximum Weight Cliques. A graph is built in which every tracklet from every video in the collection is a node. The nodes are weighted by their appearance and motion scores, and the edges are weighted by tracklet similarity. Edges with weight below a threshold are removed. A Regulated Maximum Weight Clique extraction algorithm then finds objects ranked by a score that combines intra-group consistency and Video Object Scores. The object regions obtained from the video sets are used to initialize per-pixel segmentation, which produces the final co-segmentation results.
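
The second stage can be sketched on toy data as follows. This is a minimal illustration under stated assumptions, not the released implementation: the clique score (node-weight sum times mean pairwise similarity) is an assumed stand-in for the regulated weight, cliques are found by exhaustive enumeration (practical only for tiny graphs), and every name and number is hypothetical.

```python
from itertools import combinations

def build_graph(nodes, sim, threshold):
    """Adjacency sets over `nodes`, dropping edges below the threshold."""
    adj = {n: set() for n in nodes}
    for (u, v), s in sim.items():
        if s >= threshold and u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    return adj

def all_cliques(adj):
    """Yield every clique (including single nodes) as a sorted tuple."""
    nodes = sorted(adj)
    for k in range(1, len(nodes) + 1):
        for combo in combinations(nodes, k):
            if all(v in adj[u] for u, v in combinations(combo, 2)):
                yield combo

def score(clique, node_w, sim):
    """Assumed regulated score: weight sum times mean pairwise similarity."""
    total = sum(node_w[n] for n in clique)
    pairs = list(combinations(clique, 2))
    if not pairs:
        return total
    return total * sum(sim[p] for p in pairs) / len(pairs)

def extract_groups(node_w, sim, threshold, max_groups):
    """Repeatedly take the best-scoring clique and remove its nodes."""
    remaining, groups = set(node_w), []
    while remaining and len(groups) < max_groups:
        adj = build_graph(remaining, sim, threshold)
        best = max(all_cliques(adj), key=lambda c: score(c, node_w, sim))
        groups.append((best, score(best, node_w, sim)))
        remaining -= set(best)
    return groups

# Hypothetical tracklets: A1/A2 and B1/B2 are the same objects seen in two
# videos, N is a noisy tracklet. Similarity keys are (u, v) with u < v.
node_w = {"A1": 0.9, "A2": 0.8, "B1": 0.7, "B2": 0.7, "N": 0.2}
sim = {("A1", "A2"): 0.9, ("B1", "B2"): 0.8, ("A1", "B1"): 0.2, ("A2", "N"): 0.4}

groups = extract_groups(node_w, sim, threshold=0.5, max_groups=2)
for members, value in groups:
    print(members, round(value, 2))
```

Because the assumed score is not monotonic in clique size, the sketch scores every clique rather than only maximal ones; a real system would need a more efficient search than this enumeration.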


Figure 3: Results on MOViCS Dataset 

Table 1: Quantitative Results on MOViCS Dataset

Video Set Ours 1 Ours 2 VCS [23] ICS [7]
Chicken vs. turtle
Zebra vs. lion
Giraffe vs. elephant


Safari Dataset

We have collected a challenging video co-segmentation dataset consisting of 9 videos. Five types of animals appear in these videos, and the goal is to segment the animals and assign the same label to animals of the same type.

Figure 4: Safari Dataset Structure 

Figure 5: Safari Dataset Results 

Table 2: Quantitative Results on Safari Dataset

Object Buffalo Elephant Giraffe Lion Sheep
Baseline [23]


Source Code

Safari Dataset

Related Publication

Dong Zhang, Omar Javed, Mubarak Shah, Video Object Co-Segmentation by Regulated Maximum Weight Cliques, European Conference on Computer Vision (ECCV) 2014, Zurich, Switzerland, Sep. 9-12, 2014.