CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing
Kevin Duarte, Yogesh S Rawat, Mubarak Shah. CapsuleVOS: Semi-Supervised Video Object Segmentation Using Capsule Routing. International Conference on Computer Vision (ICCV 2019), Seoul, South Korea, Oct 27-Nov 2, 2019.
In this work, we propose a capsule-based approach for video object segmentation. Given a video and the segmentation mask of one (or multiple) object(s) in the first frame, our method segments the object throughout the video. Our network, CapsuleVOS, segments several frames at once, conditioned on a reference frame and segmentation mask. The conditioning is performed through a novel attention-based capsule routing algorithm.
Two challenging issues in VOS are 1) the segmentation of small objects and 2) object occlusion. Our method addresses both of these issues. We introduce a novel zooming module that allows the network to process small spatial regions of the video. Furthermore, our network utilizes a memory module to effectively track objects when they move out of the frame or are occluded.
The zooming module allows for fine-grained segmentation of small objects.
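The idea of processing a small spatial region can be illustrated with a minimal sketch. This is not the CapsuleVOS zooming module itself: it only assumes that a region of interest is derived from a binary object mask, padded by a margin, and cropped from the frame. The function names `mask_bounding_box` and `crop`, and the `margin` parameter, are hypothetical and chosen for illustration.

```python
def mask_bounding_box(mask, margin=1):
    """Return (top, left, bottom, right) around the nonzero pixels of a 2D mask.

    bottom and right are exclusive. Returns None for an empty mask.
    Assumption: the mask is a non-empty list of equal-length rows of 0/1 values.
    """
    height, width = len(mask), len(mask[0])
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(width) if any(row[c] for row in mask)]
    # Pad the tight box by `margin` pixels, clamped to the frame borders.
    top = max(rows[0] - margin, 0)
    bottom = min(rows[-1] + 1 + margin, height)
    left = max(cols[0] - margin, 0)
    right = min(cols[-1] + 1 + margin, width)
    return top, left, bottom, right


def crop(frame, box):
    """Crop a 2D frame (list of rows) to the given box."""
    top, left, bottom, right = box
    return [row[left:right] for row in frame[top:bottom]]


# Usage: a 6x6 mask with a small object of three pixels.
mask = [[0] * 6 for _ in range(6)]
mask[2][3] = mask[3][3] = mask[3][4] = 1
frame = [[r * 6 + c for c in range(6)] for r in range(6)]
box = mask_bounding_box(mask, margin=1)
patch = crop(frame, box)
```

A small object then occupies a much larger fraction of the cropped patch than of the full frame, which is the intuition behind zooming in before segmenting.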
- Quantitative Results
- Qualitative Results
Our method is end-to-end, and we demonstrate its effectiveness on the YouTube-VOS dataset. We achieve state-of-the-art results while running almost twice as fast as the next fastest method.
Below we plot a comparison of the performance and speed of previous VOS methods on YouTube-VOS. Because our network segments 8 frames at once, it segments videos at a higher fps than contemporary methods.
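To see why segmenting 8 frames at once reduces the number of network passes, consider the following minimal sketch. It assumes non-overlapping clips and ignores how the reference frame and mask are passed between clips. The function name `split_into_clips` is hypothetical.

```python
def split_into_clips(num_frames, clip_len=8):
    """Group frame indices into consecutive clips of at most `clip_len` frames.

    Assumption: clips do not overlap; the last clip may be shorter.
    """
    return [
        list(range(start, min(start + clip_len, num_frames)))
        for start in range(0, num_frames, clip_len)
    ]


# Usage: a 20-frame video needs 3 clip-level passes instead of 20 per-frame passes.
clips = split_into_clips(20)
num_passes = len(clips)
```

Under these assumptions, the number of forward passes drops by roughly a factor of the clip length, although each pass processes more data, so the real speedup depends on the cost of a clip-level pass.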
Below we include qualitative results that demonstrate our network’s capabilities. We include examples with multiple objects as well as multiple instances of the same object.
The following results illustrate the effectiveness of our zooming module. Using our zooming module, CapsuleVOS is able to successfully segment extremely small objects.
In the qualitative results below, we show the effect of our memory module. The network is able to segment objects after partial occlusion, after full occlusion, and after they leave and reenter the frame.