Real-Time Temporal Action Localization in Untrimmed Videos by Sub-Action Discovery
Figure 1: (a) Automatically discovered sub-actions in diverse instances of two actions. The number of sub-actions is automatically determined and they are found to be semantically meaningful. Each row shows one example of “clean and jerk” and “long jump” actions from diverse videos. (b) A typical untrimmed video consists of many background segments with one or more actions. We group short segments from untrimmed video into sub-actions whose temporal structure is exploited for temporal action localization.
This paper presents a computationally efficient approach for temporal action detection in untrimmed videos that outperforms state-of-the-art methods by a large margin. We exploit the temporal structure of actions by modeling an action as a sequence of sub-actions. A novel and fully automatic sub-action discovery algorithm is proposed, where the number of sub-actions for each action as well as their types are automatically determined from the training videos. We find that the discovered sub-actions are semantically meaningful. To localize an action, an objective function combining appearance, duration and temporal structure of sub-actions is optimized as a shortest path problem in a network flow formulation. A significant benefit of the proposed approach is that it enables real-time action localization (40 fps) in untrimmed videos. We demonstrate state-of-the-art results on THUMOS’14 and MEXaction2 datasets.
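To make the localization step concrete, here is a minimal sketch of optimizing an ordered sub-action labeling as a shortest-path-style dynamic program. The score matrices, the single `switch_cost` penalty (a stand-in for the paper's duration and temporal-structure terms), and the function name are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def localize_action(scores, bg_scores, switch_cost=0.5):
    """Toy DP for finding the best action interval in an untrimmed video.

    scores:    (T, K) array; scores[t, k] is the appearance score of
               segment t for sub-action k (sub-actions occur in order 0..K-1).
    bg_scores: (T,) array; background score of each segment.
    switch_cost: penalty for advancing to the next sub-action (illustrative
               stand-in for duration/temporal-structure terms).
    Returns (best_score, start, end) of the highest-scoring interval that
    traverses all sub-actions 0..K-1 in order.
    """
    T, K = scores.shape
    NEG = -np.inf
    # dp[t, k]: best score of an interval ending at segment t whose last
    # segment is labeled sub-action k, having visited 0..k in order.
    dp = np.full((T, K), NEG)
    start = np.zeros((T, K), dtype=int)  # backpointer to interval start
    for t in range(T):
        for k in range(K):
            stay = dp[t - 1, k] if t > 0 else NEG            # extend sub-action k
            advance = dp[t - 1, k - 1] - switch_cost if (t > 0 and k > 0) else NEG
            fresh = 0.0 if k == 0 else NEG                   # open a new interval
            best_prev = max(stay, advance, fresh)
            dp[t, k] = best_prev + scores[t, k] - bg_scores[t]
            if best_prev == fresh:
                start[t, k] = t
            elif best_prev == stay:
                start[t, k] = start[t - 1, k]
            else:
                start[t, k] = start[t - 1, k - 1]
    end = int(np.argmax(dp[:, K - 1]))  # must end in the last sub-action
    return float(dp[end, K - 1]), int(start[end, K - 1]), end
```

Because each segment transitions only to the same or the next sub-action, the DP runs in O(TK) time, which is consistent with the real-time claim; the paper's network-flow formulation additionally folds in learned duration priors.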
An important fact about actions is that they are usually composed of multiple semantic sub-actions (Figure 1(b)). While the sub-actions may vary in appearance and duration (e.g., the length of the “approach” run in the “long jump” action), a given action nearly always consists of the same set of sub-actions in a consistent order. Thus, we choose to model an action as a series of sequential sub-actions and train a separate classifier for each sub-action.
An important issue, in the context of modeling an action using sub-actions, is how to determine the number of sub-actions for each action. One obvious solution is to manually identify a set of sub-actions for each action and generate training sets by annotating each sub-action in every video; that would be a daunting task. Instead, we propose an automatic method to discover sub-actions for each action. Our approach consists of three main steps. First, temporal segments of all training videos of an action are clustered into different parts. Second, similar parts are merged to obtain candidate sub-actions. Finally, boundaries between candidate sub-actions are adjusted to obtain the final sub-actions. Sub-actions discovered in this way are consistent and semantically meaningful (Figure 1(a)).
Our key assumption is that all the video clips of an action share the same sequence of sub-actions. The goal is to design an approach that can automatically find the appropriate number of sub-actions for each action in an unsupervised manner. Sub-actions should correspond to different semantic parts and be consistent across video clips of the same action. Moreover, the sub-actions of an action should occur in a specific order.
Since the number of sub-actions in an action is unknown, we first cluster the segments of each video of an action into a fixed number of sequential parts, which serve as candidate sub-actions. Second, similar candidate sub-actions are merged through hierarchical agglomerative clustering. Finally, the sub-action boundaries are refined iteratively in an E-M manner.
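The first two discovery steps above can be sketched as follows. This is a simplified illustration with assumed inputs (`video_feats`, `k_init`, `merge_thresh` are all hypothetical names), and it replaces full hierarchical agglomerative clustering with a greedy merge of temporally adjacent parts; the E-M boundary refinement is omitted:

```python
import numpy as np

def discover_subactions(video_feats, k_init=3, merge_thresh=0.5):
    """Simplified sketch of sub-action discovery.

    video_feats: list of (T_i, D) feature arrays, one per training video
                 of the same action.
    Step 1: split each video's segments into k_init equal sequential parts
            and pool a descriptor per part.
    Step 2: greedily merge adjacent parts whose pooled descriptors are
            similar (stand-in for hierarchical agglomerative clustering).
    Returns the discovered number of sub-actions and the part groupings.
    """
    # Step 1: equal sequential partition, mean-pooled part descriptors
    pooled = []
    for f in video_feats:
        parts = np.array_split(f, k_init)  # preserves temporal order
        pooled.append(np.stack([p.mean(axis=0) for p in parts]))
    mean_parts = np.mean(pooled, axis=0)   # (k_init, D), averaged over videos
    # Step 2: merge adjacent part groups with small feature distance
    groups = [[i] for i in range(k_init)]
    merged = True
    while merged and len(groups) > 1:
        merged = False
        for j in range(len(groups) - 1):
            a = mean_parts[groups[j]].mean(axis=0)
            b = mean_parts[groups[j + 1]].mean(axis=0)
            if np.linalg.norm(a - b) < merge_thresh:
                groups[j] = groups[j] + groups.pop(j + 1)
                merged = True
                break
    return len(groups), groups
```

With features whose first parts look alike, the over-segmented initial partition collapses to fewer groups, which mirrors how the number of sub-actions is determined automatically rather than fixed by hand.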
Figure 2: An example of segmenting the “high jump” action into several sub-actions. (a) Rows represent videos of the same action with different lengths. Temporal segments within a video are represented by key frames. The number on top of a frame is the ground-truth index of the sub-action within the action; this action has two sub-actions. (b) In the first step, all segments of each video of an action are clustered into k_l (in this case 3) sequential parts, shown by borders of different colors (blue, green and red). However, as can be seen, the first sub-action is broken into two parts. (c) In the second step, hierarchical agglomerative clustering merges similar parts, so the first two parts in (b) are merged. However, in the first clip, one segment is incorrectly merged with the first part. (d) The partitioning results after boundary adjustment; the partitions are updated iteratively.
We evaluate the proposed approach on two challenging temporal action localization datasets: THUMOS’14 and MEXaction2. The qualitative and quantitative results can be seen below:
Figure 3: Temporal action detection results on THUMOS’14. α is the Intersection-over-Union (IoU) threshold.
Figure 4: Temporal action detection results on MEXaction2. α is the Intersection-over-Union (IoU) threshold.
Rui Hou, Rahul Sukthankar and Mubarak Shah, Real-Time Temporal Action Localization in Untrimmed Videos by Sub-Action Discovery, British Machine Vision Conference (BMVC), 2017.