TinyVIRAT: Low-resolution Video Action Recognition
Ugur Demir, Yogesh S Rawat, Mubarak Shah. TinyVIRAT: Low-resolution Video Action Recognition. arXiv preprint arXiv:2007.07355 (2020).
Existing research in action recognition is mostly focused on high-quality videos where the action is distinctly visible. In real-world surveillance environments, the actions in videos are captured at a wide range of resolutions. Most activities occur at a distance at a low resolution, and recognizing such activities is a challenging problem. In this work, we focus on recognizing tiny actions in videos. We introduce a benchmark dataset, TinyVIRAT, which contains natural low-resolution activities. The actions in TinyVIRAT videos have multiple labels and are extracted from surveillance videos, which makes them realistic and more challenging. We propose a novel method for recognizing tiny actions in videos that utilizes a progressive generative approach to improve the quality of low-resolution actions. The proposed method also includes a weakly trained attention mechanism which helps focus on the activity regions in the video. We perform extensive experiments to benchmark the proposed TinyVIRAT dataset and observe that the proposed method significantly improves action recognition performance over the baselines. We also evaluate the proposed approach on synthetically resized action recognition datasets and achieve state-of-the-art results compared with existing methods.
Dataset Details
We introduce the TinyVIRAT dataset, which is based on the VIRAT dataset and targets real-life tiny action recognition. VIRAT is a natural candidate for low-resolution actions, but it contains a wide variety of actor sizes and is very complex, since actions can happen at any time and in any spatial position. To focus solely on the low-resolution action recognition problem, we crop small action clips from VIRAT videos.
In the VIRAT dataset, actors can perform multiple actions, and actions can start and end at different times. Before deciding which actions are tiny, we merged spatio-temporally overlapping actions and created multi-label action clips. We split these clips wherever the set of labels changes over time, which ensures that the resulting clips are trimmed. We then selected clips that are spatially smaller than 128×128 pixels. Finally, long videos were split into smaller chunks, and actions without enough samples were removed from the dataset. We use the same train and validation split as the VIRAT dataset.
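The filtering steps above (keep spatially tiny clips, chunk long clips, drop rare classes) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the 128×128 size cap comes from the text, while the chunk length and the minimum-sample threshold are hypothetical placeholder values.

```python
from collections import Counter

# Hypothetical clip records: spatial size, length, and action labels.
clips = [
    {"w": 90, "h": 60, "frames": 300, "labels": ["walking"]},
    {"w": 200, "h": 150, "frames": 120, "labels": ["riding"]},
    {"w": 50, "h": 50, "frames": 45, "labels": ["walking", "carrying"]},
]

MAX_SIZE = 128     # spatial cap stated in the text
CHUNK_LEN = 150    # illustrative chunk length (frames)
MIN_SAMPLES = 2    # illustrative rarity threshold

# 1) Keep only spatially tiny clips.
tiny = [c for c in clips if c["w"] <= MAX_SIZE and c["h"] <= MAX_SIZE]

# 2) Split long clips into fixed-length chunks.
chunks = []
for c in tiny:
    for start in range(0, c["frames"], CHUNK_LEN):
        length = min(CHUNK_LEN, c["frames"] - start)
        chunks.append({**c, "start": start, "frames": length})

# 3) Drop labels with too few samples, then clips left with no labels.
counts = Counter(label for c in chunks for label in c["labels"])
kept = {label for label, n in counts.items() if n >= MIN_SAMPLES}
dataset = []
for c in chunks:
    labels = [label for label in c["labels"] if label in kept]
    if labels:
        dataset.append({**c, "labels": labels})
```

With the toy records above, the 200×150 clip is rejected for size, the 300-frame clip is split into two chunks, and the rare "carrying" label is pruned.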
TinyVIRAT has 7,663 training and 5,166 testing videos with 26 action labels. Table 1 shows statistics from TinyVIRAT and several other datasets. Figure 2 shows the number of samples per action class and the distribution of the videos by spatial size and Figure 1 shows some sample videos from the dataset.
- Table 1: Dataset statistics. ANF: Average number of frames, ML: Multi-label, NC: Number of classes, and NV: Number of videos.
- Figure 1: Sample video frames for actions from the TinyVIRAT dataset. The dataset contains low-resolution videos of varying sizes. TinyVIRAT is a multi-label dataset, and each action video can have multiple action labels.
- Figure 2: Number of samples per action label and per resolution. Numbers on the y-axis are shown in log scale.
Approach

The proposed method focuses on learning to enhance the quality of low-resolution videos to improve action classification performance. The action classifier network is trained on super-resolved videos instead of raw low-resolution video clips. Our approach consists of two main parts: (i) a super-resolution network and (ii) an action classifier network, as shown in Figure 3.
- Figure 3: Overview of the progressive video generation and action classification approach. During training, we introduce new blocks to the Progressive DVSR network architecture at each stage. After video synthesis is complete, the action classifier processes the video to predict actions.
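The two-stage composition (super-resolve first, then classify) can be sketched as below. This is only a structural illustration under loud assumptions: nearest-neighbor upsampling stands in for the learned Progressive DVSR network, and a pooling-plus-linear stub stands in for the action classifier; neither reflects the paper's actual architectures.

```python
import numpy as np

rng = np.random.default_rng(0)

def super_resolve(video, scale=2):
    """Stand-in for the learned SR network: nearest-neighbor upsampling
    of the spatial axes of a (T, H, W, C) clip."""
    return video.repeat(scale, axis=1).repeat(scale, axis=2)

def classify(video, weights, bias):
    """Stand-in for the action classifier: global average pooling over
    space-time, a linear layer, and a sigmoid for multi-label output."""
    feat = video.mean(axis=(0, 1, 2))        # (channels,)
    logits = feat @ weights + bias           # (num_classes,)
    return 1.0 / (1.0 + np.exp(-logits))     # independent per-class probs

num_classes = 26                             # TinyVIRAT has 26 labels
video = rng.random((16, 56, 56, 3))          # toy low-resolution clip
W = rng.standard_normal((3, num_classes))
b = np.zeros(num_classes)

hi_res = super_resolve(video)                # (16, 112, 112, 3)
probs = classify(hi_res, W, b)
preds = probs > 0.5                          # multi-label thresholding
```

The key point is the composition: the classifier never sees the raw low-resolution clip, only the super-resolved one, and the sigmoid (rather than a softmax) reflects the multi-label nature of the dataset.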
- Table 2: Evaluation of baseline and other approaches on TinyVIRAT dataset.
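Because TinyVIRAT is multi-label, evaluation needs a per-sample metric rather than single-label accuracy. A sample-averaged F1 score is one common choice for such benchmarks; the sketch below is illustrative and not confirmed to be the exact metric behind Table 2.

```python
def f1_multilabel(y_true, y_pred):
    """Sample-averaged F1 over binary label-indicator vectors.
    Each element of y_true / y_pred is a 0/1 list over the classes."""
    scores = []
    for t, p in zip(y_true, y_pred):
        tp = sum(a and b for a, b in zip(t, p))   # true positives
        pred_pos = sum(p)
        true_pos = sum(t)
        if pred_pos == 0 and true_pos == 0:
            scores.append(1.0)                    # nothing to predict
            continue
        prec = tp / pred_pos if pred_pos else 0.0
        rec = tp / true_pos if true_pos else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)

# Toy example: first sample misses one of two true labels,
# second sample is predicted exactly.
score = f1_multilabel([[1, 0, 1], [0, 1, 0]],
                      [[1, 0, 0], [0, 1, 0]])  # (2/3 + 1) / 2 = 5/6
```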