Visual-textual Capsule Routing for Text-based Video Segmentation
Bruce McIntosh, Kevin Duarte, Yogesh S Rawat, Mubarak Shah. Visual-textual Capsule Routing for Text-based Video Segmentation. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 9942-9951. [Supplementary] [BibTex]
In this work, we present a capsule-based approach to merge visual and textual data for the task of actor and action segmentation from a sentence. Given a video and a textual description of an actor within the video, the goal of this problem is to segment the actor throughout the video. The novelty of our approach is 1) the merging of the visual and textual inputs through a routing-by-agreement algorithm, and 2) the localization of actors for all frames in the input video.
Previous methods for merging visual and textual data in convolutional neural networks involve either the concatenation of visual and textual features followed by a convolution operation, or the multiplication of the visual and textual features (also known as dynamic filtering). We leverage the ability of capsule networks to model entities through high-dimensional coincidence filtering to merge these modalities. By creating a set of video capsules and a set of sentence capsules, each of which represents the objects within its respective input, we can use routing-by-agreement to find the objects present in both the video and the sentence. Our Visual-Textual routing algorithm does exactly this, and allows the creation of an object-based visual-textual capsule representation. Our network is depicted below:
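To make the contrast between these fusion strategies concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It shows concatenation followed by a 1x1 convolution, dynamic filtering, and a generic routing-by-agreement step in which video capsules and sentence capsules vote for a shared set of output capsules. All function names, tensor shapes, and the softmax-based routing rule are assumptions chosen for illustration; the actual Visual-Textual routing algorithm may differ in how votes are formed and how agreement is measured.

```python
import numpy as np

def concat_fusion(video_feat, text_feat, weight):
    """Concatenate a tiled sentence vector with each pixel, then apply a 1x1 conv.

    video_feat: (H, W, Cv), text_feat: (Ct,), weight: (Cv + Ct, Cout).
    """
    h, w, _ = video_feat.shape
    tiled = np.broadcast_to(text_feat, (h, w, text_feat.shape[0]))
    fused = np.concatenate([video_feat, tiled], axis=-1)
    return fused @ weight  # a 1x1 convolution is a per-pixel matrix product

def dynamic_filter_fusion(video_feat, text_filter):
    """Multiply visual features by a sentence-derived filter of shape (Cv,)."""
    return video_feat @ text_filter  # response map of shape (H, W)

def squash(v, eps=1e-8):
    """Shrink each vector so its length lies in [0, 1)."""
    sq = np.sum(v ** 2, axis=-1, keepdims=True)
    return (sq / (1.0 + sq)) * v / np.sqrt(sq + eps)

def route_by_agreement(video_votes, sentence_votes, iters=3):
    """Generic routing over votes from both modalities (illustrative only).

    video_votes: (Nv, K, D), sentence_votes: (Ns, K, D); returns (K, D).
    """
    votes = np.concatenate([video_votes, sentence_votes], axis=0)  # (N, K, D)
    logits = np.zeros(votes.shape[:2])                             # (N, K)
    for _ in range(iters):
        # Each input capsule distributes its vote across the K outputs.
        c = np.exp(logits - logits.max(axis=1, keepdims=True))
        c /= c.sum(axis=1, keepdims=True)
        out = squash(np.sum(c[..., None] * votes, axis=0))         # (K, D)
        # Votes that agree with an output capsule raise their coupling to it.
        logits = logits + np.sum(votes * out[None], axis=-1)
    return out
```

In this sketch, an output capsule becomes strongly active only when votes from both the video capsules and the sentence capsules point in a similar direction, which is the intuition behind using agreement to find objects present in both inputs.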
Whereas previous methods for actor/action video segmentation from a sentence tend to segment only a single frame when given a sequence of frames, our method simultaneously segments all frames that are input to the network. Given T frames, the network outputs T segmentation maps corresponding to the input sentence. To train and evaluate our method on full-video segmentation, we have extended the A2D dataset with bounding-box annotations for each actor. The annotations and the documentation of the annotation format are available here: Json and README.
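As an illustration of how per-frame bounding boxes could serve as training targets for all T frames, the sketch below rasterizes one box per frame into T binary masks. The helper name `boxes_to_masks` and the `(x1, y1, x2, y2)` pixel format with exclusive lower-right corner are assumptions made for this example; the actual annotation format is the one described in the README.

```python
import numpy as np

def boxes_to_masks(boxes, height, width):
    """Turn T per-frame boxes into a (T, height, width) stack of binary masks.

    boxes: sequence of (x1, y1, x2, y2) integer tuples (hypothetical format),
           or None for frames in which the actor is not annotated.
    """
    masks = np.zeros((len(boxes), height, width), dtype=np.uint8)
    for t, box in enumerate(boxes):
        if box is None:
            continue  # leave the mask for this frame empty
        x1, y1, x2, y2 = box
        # Clamp the box to the frame before filling it in.
        x1, x2 = max(0, x1), min(width, x2)
        y1, y2 = max(0, y1), min(height, y2)
        masks[t, y1:y2, x1:x2] = 1
    return masks

# Two frames: one annotated box and one frame without the actor.
targets = boxes_to_masks([(1, 1, 3, 4), None], height=5, width=6)
```

Targets built this way are rectangular by construction, which is consistent with a network trained on them producing more box-like segmentations.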
We train and evaluate our method on the A2D dataset. We achieve state-of-the-art results when compared to recent methods for actor and action segmentation from a sentence.
Following previous work, we also evaluate our trained model on the JHMDB dataset:
We also present some qualitative results of our network below. Each sentence query is shown in the color of its corresponding segmentation. The first row contains the segmentations from the network trained using only pixel-wise annotations, and the second row contains the segmentations from the network trained using bounding-box annotations on all frames. The segmentations from the network trained using bounding boxes are more box-like, but the extra training data leads to fewer mis-segmentations and under-segmentations, as seen in the second example.
The following are some video examples of our segmentation network.