Video Object Segmentation
Introduction
In this project, our goal is to detect the primary object in a video and delineate it from the background in all frames. Video object segmentation is a well-studied problem in the computer vision community and is a prerequisite for a variety of high-level vision applications, including content-based video retrieval, video summarization, activity understanding, and targeted content replacement. Both fully automatic methods and methods requiring manual initialization have been proposed for video object segmentation. Among the latter, some need annotations of object segments in key frames for initialization; optimization techniques employing motion and appearance constraints then propagate the segments to all frames. Others require an accurate object region annotation only for the first frame and employ region tracking to segment the remaining frames into object and background regions. These semi-automatic techniques generally give good segmentation results. However, most computer vision applications involve processing large amounts of video data, which makes manual initialization cost-prohibitive. Consequently, a large number of automatic methods have also been proposed. A subset of these employs motion grouping for object segmentation; others use appearance cues to segment each frame first and then apply both appearance and motion constraints for a bottom-up final segmentation. However, none of these automatic methods has an explicit model of how an object looks or moves, and therefore the segments usually do not correspond to a particular object but only to image regions that exhibit coherent appearance or motion.
In this project, we present an approach that is inspired by the aforementioned methods but attempts to remove their shortcomings. In general, an object's shape and appearance vary slowly from frame to frame. The intuition, therefore, is that the sequence of object proposals in a video with high `object-ness' and high similarity across frames is likely to correspond to the primary object. To this end, we use optical flow to track the evolution of object shape, and compute the difference between the predicted and actual shape (along with appearance) to measure the similarity of object proposals across frames. Object-ness is measured using appearance together with a motion-based criterion that emphasizes high optical flow gradients at the boundaries between object proposals and the background. The primary object proposal selection problem is then formulated as the longest path problem in a directed acyclic graph (DAG), for which an optimal solution exists in linear time. Note that if the temporal order of object proposal locations across frames is not used, many frames may end up with no proposal associated with the primary object (please see Figure 1). The proposed method not only uses object proposals from a particular frame (please see Figure 2), but also expands the proposal set using predictions from the proposals of neighboring frames. The combination of proposal expansion and the predicted-shape-based similarity criterion results in temporally dense and spatially accurate primary object proposal extraction. We have evaluated the proposed approach on several challenging benchmark videos, where it outperforms both unsupervised and supervised state-of-the-art methods.
Figure 1: Primary object region selection in the object proposal domain
Figure 2: Object proposal examples
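To make the predicted-shape similarity concrete, the following is a minimal sketch (helper names are illustrative, not from the paper's code) that warps a proposal's binary mask with a dense optical-flow field, such as one computed by cv2.calcOpticalFlowFarneback, and scores its overlap with a proposal in the next frame:

```python
import numpy as np

def warp_mask(mask, flow):
    """Predict the next-frame mask by moving each foreground pixel along its
    optical-flow vector (flow[..., 0] = x displacement, flow[..., 1] = y)."""
    h, w = mask.shape
    warped = np.zeros_like(mask)
    ys, xs = np.nonzero(mask)
    nx = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, w - 1)
    ny = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, h - 1)
    warped[ny, nx] = 1
    return warped

def shape_similarity(mask_t, mask_t1, flow_t):
    """Overlap (IoU) between the flow-predicted mask and the actual proposal
    mask in the next frame: high when the shape evolves smoothly."""
    pred = warp_mask(mask_t, flow_t)
    inter = np.logical_and(pred, mask_t1).sum()
    union = np.logical_or(pred, mask_t1).sum()
    return inter / union if union else 0.0
```

A proposal sequence whose masks track the flow field scores high under this measure, while proposals that jump between unrelated regions score low.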
Method
The Framework
The proposed framework consists of three stages (as shown in Figure 3):
Figure 3: The framework
1. Generation of object proposals per frame, followed by expansion of each frame's proposal set based on the object proposals in adjacent frames.
2. Generation of a layered DAG from all the object proposals in the video. The longest path in the graph fulfills the goal of maximizing object-ness and similarity scores, and represents the most likely set of proposals denoting the primary object in the video.
3. The primary object proposals are used to build object and background models using Gaussian mixtures, and a graph-cuts based optimization method is used to obtain refined per-pixel segmentation.
Layered DAG Structure
Figure 4: Graph Structure
We want to extract, from the set of all proposals obtained from the video, object proposals with high object-ness likelihood, high appearance similarity, and smoothly varying shape. Also, since we want to extract only the primary object, we extract at most a single proposal per frame. With these objectives in mind, the layered DAG is formed as follows. Each object proposal is represented by two nodes, a `beginning node' and an `ending node', and there are two types of edges: unary edges and binary edges. A directed unary edge is built from each beginning node to its corresponding ending node; its weight measures the object-ness of the proposal (the details of the unary weight function are given in our paper). All the beginning nodes in a frame form one layer, as do the ending nodes, so each video frame is represented by two layers in the graph. Directed binary edges are built from every ending node to the beginning nodes in later layers; their weights measure the appearance and shape similarity between the corresponding object proposals across frames (the binary weight functions are introduced in detail in our paper).
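The exact weight functions are defined in the paper; the sketch below is only a plausible stand-in consistent with the description above (names and formulas are assumptions, not the paper's). It scores object-ness by the boundary flow-gradient criterion and builds the binary weight from color-histogram intersection combined with the shape_similarity helper from the earlier sketch:

```python
import numpy as np
import cv2

def motion_objectness(mask, flow):
    """Mean optical-flow gradient magnitude along the proposal boundary:
    large when the proposal moves differently from its surroundings."""
    grads = [cv2.Sobel(flow[..., c], cv2.CV_32F, dx, dy, ksize=3)
             for c in (0, 1) for dx, dy in ((1, 0), (0, 1))]
    grad_mag = np.sqrt(sum(g ** 2 for g in grads))
    m = mask.astype(np.uint8)
    boundary = m - cv2.erode(m, np.ones((3, 3), np.uint8))  # 1-px boundary
    return float(grad_mag[boundary > 0].mean()) if boundary.any() else 0.0

def binary_weight(mask_t, hist_t, mask_t1, hist_t1, flow_t):
    """Appearance similarity (intersection of normalized color histograms)
    combined with the flow-predicted shape overlap."""
    appearance = np.minimum(hist_t, hist_t1).sum()
    return appearance * shape_similarity(mask_t, mask_t1, flow_t)
```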
Figure 4 illustrates the graph structure. It shows frames i-1, i, and i+1, with corresponding layers 2i-3, 2i-2, 2i-1, 2i, 2i+1, and 2i+2. For simplicity, only 3 object proposals are shown per layer; in practice there are usually hundreds of object proposals per frame, and the number of proposals need not be the same across frames. The yellow nodes are "beginning nodes", the green nodes are "ending nodes", the green edges are unary edges whose weights indicate object-ness, and the red edges are binary edges whose weights indicate appearance and shape similarity (only some of the binary edges are shown for simplicity). There are also a virtual source node s and a sink node t, connected to the graph by zero-weight edges (black edges). Note that it is not necessary to build binary edges from an ending node to the beginning nodes of all later layers; in practice, building binary edges to only the next three frames is sufficient for most videos.
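A minimal construction sketch of this layered DAG, assuming precomputed object-ness and pairwise similarity scores (the data layout and the convention that s connects to every beginning node and every ending node connects to t are assumptions):

```python
from collections import defaultdict

def build_layered_dag(objness, similarity, lookahead=3):
    """objness[f][i]: object-ness of proposal i in frame f.
    similarity[(f, i, g, j)]: similarity between proposal i of frame f and
    proposal j of frame g (g > f). Returns node -> [(node, weight), ...]."""
    edges = defaultdict(list)
    n_frames = len(objness)
    for f in range(n_frames):
        for i, w in enumerate(objness[f]):
            b, e = (f, i, 'b'), (f, i, 'e')
            edges['s'].append((b, 0.0))   # source -> any beginning node
            edges[b].append((e, w))       # unary edge: object-ness
            edges[e].append(('t', 0.0))   # any ending node -> sink
            # Binary edges only to the next `lookahead` frames.
            for g in range(f + 1, min(f + 1 + lookahead, n_frames)):
                for j in range(len(objness[g])):
                    sim = similarity.get((f, i, g, j), 0.0)
                    edges[e].append(((g, j, 'b'), sim))
    return edges
```

Because every edge points to a later layer (or from s / to t), the graph is acyclic by construction, which is what permits the linear-time longest-path solution described next.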
Dynamic Programming Solution
This problem can be solved by dynamic programming in linear time: the computational complexity of the algorithm is O(n+m), where n is the number of nodes and m is the number of edges. Both the unary and binary edge weights are L1-normalized in our experiments.
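A sketch of the linear-time solution over the edge dictionary produced by build_layered_dag above: compute a topological order (Kahn's algorithm), then relax edges in that order, touching each node and edge once, hence O(n+m):

```python
from collections import defaultdict

def longest_path(edges):
    """Longest s-to-t path in a DAG given as node -> [(node, weight), ...]."""
    # Kahn's algorithm for a topological order.
    indeg = defaultdict(int)
    for u in list(edges):
        for v, _ in edges[u]:
            indeg[v] += 1
    order, stack = [], [u for u in list(edges) if indeg[u] == 0]
    while stack:
        u = stack.pop()
        order.append(u)
        for v, _ in edges[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    # Relax edges in topological order, tracking predecessors.
    dist = defaultdict(lambda: float('-inf'))
    dist['s'], pred = 0.0, {}
    for u in order:
        for v, w in edges[u]:
            if dist[u] + w > dist[v]:
                dist[v], pred[v] = dist[u] + w, u
    # Walk back from t to s; the (frame, index, 'b') nodes on the path
    # identify the selected primary proposals, at most one per frame.
    path, node = [], 't'
    while node != 's':
        path.append(node)
        node = pred[node]
    return list(reversed(path))
```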
Per-pixel Video Object Segmentation
Once the primary object proposals have been obtained for a video, the results are further refined by a graph-based method to obtain per-pixel segmentation. We define a spatio-temporal graph by connecting frames temporally with optical flow displacement. Each node in the graph is a pixel in a frame, and edges connect the 8 neighbors within a frame and the forward-backward 18 neighbors in adjacent frames. We use a graph-cuts based minimization method to obtain the optimal solution, and thus the final segmentation results.
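The full method builds Gaussian-mixture object/background models from the selected proposals and cuts the spatio-temporal graph described above. As a rough per-frame approximation (it omits the temporal edges), OpenCV's grabCut, which likewise couples GMM color models with a graph cut, can be seeded from a primary proposal mask:

```python
import numpy as np
import cv2

def refine_frame(frame_bgr, proposal_mask, iters=5):
    """Per-frame GMM + graph-cut refinement seeded by the primary proposal.
    frame_bgr: 8-bit 3-channel frame; proposal_mask: binary mask."""
    # Seed: proposal pixels as probable foreground, the rest as probable
    # background; grabCut refines these labels in place.
    mask = np.where(proposal_mask > 0,
                    cv2.GC_PR_FGD, cv2.GC_PR_BGD).astype(np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(frame_bgr, mask, None, bgd, fgd, iters,
                cv2.GC_INIT_WITH_MASK)
    return np.logical_or(mask == cv2.GC_FGD, mask == cv2.GC_PR_FGD)
```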
Results
We show results on several publicly available datasets.
SegTrack Dataset
Figure 7: Results on SegTrack Dataset
[Table: per-video quantitative comparison on the SegTrack dataset — Ours vs. methods [14], [13], [20], and [6].]
GaTech Video Segmentation Dataset
Figure 9: Results on GaTech Dataset