
Are all Frames Equal? Active Sparse Labeling for Video Action Detection

 

Publication

Aayush J Rana, Yogesh S Rawat. Are all Frames Equal? Active Sparse Labeling for Video Action Detection. 36th Conference on Neural Information Processing Systems (NeurIPS 2022).

Abstract

Video action detection requires annotations at every frame, which drastically increases the labeling cost. In this work, we focus on efficient labeling of videos for action detection to minimize this cost. We propose active sparse labeling (ASL), a novel active learning strategy for video action detection. Sparse labeling reduces the annotation cost but poses two main challenges: 1) how to estimate the utility of annotating a single frame for action detection, since detection is performed at the video level, and 2) how to use these sparse labels for action detection, which requires annotations on all frames. This work addresses these challenges within a simple active learning framework. For the first challenge, we propose a novel frame-level scoring mechanism aimed at selecting the most informative frames in a video. Next, we introduce a novel loss formulation which enables training of an action detection model with these sparsely selected frames. We evaluate the proposed approach on two action detection benchmark datasets, UCF-101-24 and J-HMDB-21, and observe that active sparse labeling can be very effective in saving annotation cost. We demonstrate that the proposed approach performs better than random selection, outperforms all other baselines, and achieves performance comparable to a fully supervised approach using merely 10% of the annotations.

Overview

In this work, we focus on reducing the annotation effort for video action detection. Existing work on label-efficient learning for action detection mostly focuses on semi-supervised or weakly-supervised approaches. These methods rely on separate (often external) actor detectors and tube-linking steps coupled with weakly-supervised multiple instance learning or pseudo-annotations, which limits their practicality for general use. We argue that the lack of a selection criterion for annotating only informative data is one of the limitations of these methods. Motivated by this, we propose active sparse labeling (ASL), which bridges the gap between high performance and low annotation cost. ASL performs partial instance annotation (sparse labeling) via frame-level selection, where the goal is to annotate the most informative frames, which are expected to be most useful for the action detection task.


Figure 1: Overview of the proposed approach. It consists of two phases: training and selection. During training, the network is trained on the existing annotations from the training set using the MGW-loss, which handles the sparse annotations. During the iterative APU selection phase, the trained network predicts localizations on each frame of the training videos. Using these predictions, APU computes a score for each frame in a video to rank the frames, and the top K frames are sent to the oracle for annotation.
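The loop below is a minimal sketch of this train-and-select cycle, assuming dictionary-based bookkeeping; `train_fn`, `score_fn`, and `oracle_annotate` are hypothetical placeholders for the paper's training procedure, APU scoring, and human annotation step, not actual components released with the paper.

```python
# Minimal sketch of the iterative train/select cycle in Figure 1 (hypothetical helpers).
def active_sparse_labeling(model, videos, annotations, n_cycles, k,
                           train_fn, score_fn, oracle_annotate):
    """Iteratively train on sparse labels, score unlabeled frames, and query the oracle.

    videos: dict mapping video id -> list of frames.
    annotations: dict mapping (video id, frame index) -> ground-truth annotation.
    """
    for _ in range(n_cycles):
        # 1) Train the detector on the currently annotated (sparse) frames.
        train_fn(model, videos, annotations)

        # 2) Score every unannotated frame using the trained model's predictions.
        candidates = []
        for vid_id, frames in videos.items():
            for frame_idx, frame in enumerate(frames):
                if (vid_id, frame_idx) in annotations:
                    continue  # already labeled
                prediction = model(frame)                    # per-frame localization
                score = score_fn(prediction, vid_id, frame_idx)
                candidates.append((score, vid_id, frame_idx))

        # 3) Send the top-K most informative frames to the oracle for annotation.
        candidates.sort(key=lambda c: c[0], reverse=True)
        for _, vid_id, frame_idx in candidates[:k]:
            annotations[(vid_id, frame_idx)] = oracle_annotate(vid_id, frame_idx)
    return model, annotations
```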

We make the following contributions:

  • We propose active sparse labeling (ASL), a novel active learning (AL) strategy for action detection where each instance is partially annotated to reduce the labeling cost. To the best of our knowledge, this is the first work focused on AL for video action detection.
  • We propose a novel scoring mechanism for selecting an informative and diverse set of frames.
  • We also propose a novel training objective which helps in effectively learning from sparse labels; a simplified sketch of this idea follows below.
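As a rough illustration of learning from sparse frame labels, the snippet below shows a generic masked per-frame loss that fully supervises annotated frames and down-weights pseudo-labeled ones. This is only a simplified stand-in under assumed tensor shapes, not the MGW-loss formulation used in the paper; `sparse_frame_loss` and `pseudo_weight` are hypothetical names.

```python
import torch
import torch.nn.functional as F

def sparse_frame_loss(preds, targets, annotated_mask,
                      pseudo_targets=None, pseudo_weight=0.1):
    """Per-frame localization loss under sparse labels (illustrative, not the MGW-loss).

    preds, targets, pseudo_targets: (T, H, W) localization maps for one clip.
    annotated_mask: (T,) boolean tensor, True where a frame has ground truth.
    """
    if annotated_mask.any():
        # Full supervision only on the sparsely annotated frames.
        labeled = F.binary_cross_entropy_with_logits(
            preds[annotated_mask], targets[annotated_mask])
    else:
        labeled = preds.sum() * 0.0  # clip has no annotated frames

    # Weak supervision (e.g., pseudo-labels) on the remaining frames, down-weighted.
    if pseudo_targets is not None and (~annotated_mask).any():
        unlabeled = F.binary_cross_entropy_with_logits(
            preds[~annotated_mask], pseudo_targets[~annotated_mask])
        return labeled + pseudo_weight * unlabeled
    return labeled
```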

Evaluations

We evaluate our approach on the UCF-101-24 and J-HMDB-21 datasets for spatio-temporal action detection and on the YouTube-VOS dataset for video object segmentation to demonstrate that our AL selection method generalizes to other video tasks. For spatio-temporal action detection, we compute the spatial IoU for each frame per class to obtain the frame average precision, and the spatio-temporal IoU per video per class to obtain the video average precision. These are then averaged to obtain the f-mAP and v-mAP scores at various thresholds. For video object segmentation, we report the average IoU (J score) and the average boundary similarity (F score).
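For reference, the sketch below shows one common formulation of the per-frame spatial IoU and the spatio-temporal tube IoU that underlie these f-mAP and v-mAP computations (detections above the IoU threshold are then scored with standard average precision). It is an illustrative reimplementation under assumed box and tube formats, not the evaluation code used in the paper.

```python
import numpy as np

def box_iou(box_a, box_b):
    """Spatial IoU between two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def spatio_temporal_iou(pred_tube, gt_tube):
    """IoU between two action tubes, each a dict {frame index: box}.

    Combines temporal overlap with the mean spatial IoU on shared frames,
    a common definition of video-level IoU for v-mAP.
    """
    shared = set(pred_tube) & set(gt_tube)
    if not shared:
        return 0.0
    temporal_iou = len(shared) / len(set(pred_tube) | set(gt_tube))
    spatial_iou = np.mean([box_iou(pred_tube[f], gt_tube[f]) for f in shared])
    return temporal_iou * spatial_iou
```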

Table 1: Comparison with state-of-the-art methods. We report v-mAP and f-mAP scores of our approach using only 10% of the annotations. ‘Video’ uses video-level class annotations and ‘Partial’ uses sparse temporal and spatial annotations. V: video labels, P: points, B: bounding box, O: off-the-shelf detector. f@ denotes f-mAP@.

For UCF-101-24, we initialize with 1% of labeled frames and train the action detection model with a step size of 5% in each cycle. We achieve results very close to full annotation (v-mAP@0.5: 73.20 vs. 75.12) using only 10% of annotated frames, a roughly 90% reduction in annotation cost. For J-HMDB-21, we initialize with 3% of labels, as it is a relatively small dataset and training an initial model with just 1% of labels is challenging. Here, we obtain results comparable to 100% annotation with only 9% of labels. We also outperform prior weakly- and semi-supervised methods, as our ASL learns from pseudo-labels in the sparse annotation setting while the AL cycle selects frames that are spatio-temporally useful for action detection.

Table 2: Comparison of the proposed method with baseline AL methods on the YouTube-VOS dataset using STCN [84]. A = Aghdam et al. [53], G = Gal et al. [73]. * denotes methods extended to video object segmentation using the same network as ours.

We test the generalization of the proposed cost and loss functions on the video object segmentation task using YouTube-VOS 2019. Table 2 shows that our selection approach achieves better J and F scores for video segmentation than the baseline AL methods and random frame selection.
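As a quick reference for the J metric, the sketch below computes the mean region IoU over a mask track, assuming binary mask arrays; the boundary F score (contour precision/recall) is omitted, and this is not the official benchmark evaluation code.

```python
import numpy as np

def j_score(pred_masks, gt_masks):
    """Region similarity J: mean IoU between predicted and ground-truth masks.

    pred_masks, gt_masks: boolean arrays of shape (T, H, W) for one object track.
    """
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        union = np.logical_or(pred, gt).sum()
        if union == 0:
            continue  # skip frames where both masks are empty
        ious.append(np.logical_and(pred, gt).sum() / union)
    return float(np.mean(ious)) if ious else 1.0
```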

Qualitative Results

Figure 2: Analysis of frame selection using different methods. The x-axis represents all frames of the video, and each row represents a baseline method; the markers in each row indicate the frames selected by that method. For both samples, our method selects well-distributed frames centered around the action region; Gal et al. [73] (G*) selects frames around the same region since it has no distance measure, and Aghdam et al. [53] (A*) selects slightly more distributed frames, but they are not from the crucial action region. [G*: Gal et al. [73], A*: Aghdam et al. [53], Rand: Random, Equi: Equidistant]

Conclusion

We demonstrate the effectiveness of the proposed approach in reducing annotation cost for video action detection on two different datasets, UCF-101-24 and J-HMDB-21, cutting the annotation cost by ~90% with only a marginal drop in performance. We also evaluate the proposed approach on video object segmentation and demonstrate its generalization capability.