The ActEV (Activities in Extended Video) Sequestered Data Leaderboard is an ongoing ranking of software systems that watch lengthy videos and detect activities of interest. Anyone can submit a system to NIST, which will then run it on sequestered data, score the results, and post the score to the leaderboard. The sequestered data comes from the MEVA dataset, which contains hours of video covering indoor and outdoor scenes, night and day, and crowds and individuals, captured by both EO (Electro-Optical) and IR (Infrared) sensors. Hours can go by with no activity, and then multiple activities may happen simultaneously. The data is multi-camera, in that multiple cameras may be pointed at the same scene at the same time. There are two separate leaderboards, one for EO videos and one for IR videos.
Activity detection in security videos is a difficult problem due to multiple factors, such as a large field of view, the presence of multiple simultaneous activities, varying scales and viewpoints, and the untrimmed nature of the footage. Existing research in activity detection focuses mainly on datasets such as UCF-101, JHMDB, THUMOS, and AVA, which only partially address these issues. The requirement to process security videos in real time makes the problem even more challenging. UCF won the NIST ActEV Challenge at CVPR 2020 using Gabriella, a real-time online system that performs activity detection on untrimmed security videos. The proposed method consists of three stages: tubelet extraction, activity classification, and online tubelet merging. For tubelet extraction, we propose a localization network that takes a video clip as input and spatio-temporally detects potential foreground regions at multiple scales to generate action tubelets. We propose a novel Patch-Dice loss to handle large variations in actor size. Processing videos online at the clip level drastically reduces the computation time needed to detect activities. The detected tubelets are assigned activity class scores by the classification network and merged using our proposed Tubelet-Merge Action-Split (TMAS) algorithm to form the final action detections. The TMAS algorithm efficiently connects tubelets in an online fashion to generate action detections that are robust to activities of varying length. We perform our experiments on the VIRAT and MEVA (Multiview Extended Video with Activities) datasets and demonstrate the effectiveness of the proposed approach in terms of both speed (~100 fps) and detection performance, with state-of-the-art results.
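To make the idea of online tubelet merging more concrete, the following is a minimal Python sketch. It is not the TMAS algorithm itself, whose details are not given here. It assumes a simple greedy rule: a tubelet from the newest clip extends an existing detection when the two share an activity label, the detection ended at the previous clip, and their boxes overlap by at least an IoU threshold. The names `Tubelet`, `Track`, `merge_online`, and `iou_threshold` are hypothetical and chosen only for illustration, and the action-split step is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), an assumed box format


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class Tubelet:
    """One classified detection from a single clip (hypothetical structure)."""
    clip_index: int
    box: Box
    label: str
    score: float


@dataclass
class Track:
    """A detection that may span several consecutive clips."""
    label: str
    start_clip: int
    end_clip: int
    last_box: Box
    scores: List[float] = field(default_factory=list)


def merge_online(tracks: List[Track], tubelets: List[Tubelet],
                 iou_threshold: float = 0.5) -> List[Track]:
    """Extend `tracks` in place with the tubelets of the newest clip."""
    for t in tubelets:
        best, best_iou = None, iou_threshold
        for tr in tracks:
            # Only tracks with the same label that ended one clip earlier qualify.
            if tr.label != t.label or tr.end_clip != t.clip_index - 1:
                continue
            overlap = iou(tr.last_box, t.box)
            if overlap >= best_iou:
                best, best_iou = tr, overlap
        if best is None:
            # No match: this tubelet starts a new detection.
            tracks.append(Track(t.label, t.clip_index, t.clip_index, t.box, [t.score]))
        else:
            best.end_clip = t.clip_index
            best.last_box = t.box
            best.scores.append(t.score)
    return tracks
```

Because each call only looks at the current clip's tubelets and the detections that are still open, the work per clip does not depend on the video length, which is the property an online method needs. A real system would also have to close stale detections and split merged ones, which this sketch does not attempt.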