Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos



Mamshad N. Rizve, Ugur Demir, Praveen Tirupattur, Aayush J. Rana, Kevin Duarte, Ishan Dave, Yogesh S. Rawat, and Mubarak Shah. "Gabriella: An Online System for Real-Time Activity Detection in Untrimmed Security Videos." In Proceedings of the 25th International Conference on Pattern Recognition (ICPR 2020), Italy, 10-15 January 2021.


Activity detection in security videos is a difficult problem due to factors such as the large field of view, the presence of multiple concurrent activities, varying scales and viewpoints, and the untrimmed nature of the videos. Existing research in activity detection mainly focuses on datasets such as UCF-101, JHMDB, THUMOS, and AVA, which only partially address these issues. The requirement of processing security videos in real time makes the problem even more challenging. In this work, we propose Gabriella, a real-time online system that performs activity detection on untrimmed security videos. Our method is composed of three modules: tubelet localization, tubelet classification, and tubelet merging. The action localization module generates pixel-level foreground-background segmentations that localize actions in short video clips. These pixel-level localizations are turned into short spatio-temporal action tubelets, which are passed to a classification network to obtain multi-label predictions. After classification, the tubelets must be linked together to obtain final detections of varying length; to this end, our novel online Tubelet-Merge Action-Split (TMAS) algorithm merges the short action tubelets into final action detections.
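The three-stage flow described above can be sketched as an online loop. The three callables below are hypothetical stand-ins for the modules (the actual system uses learned networks for localization and classification); the point is only the clip-by-clip, incremental structure:

```python
def gabriella(clips, localize, classify, merge):
    """Online pipeline sketch: each incoming clip is localized into short
    tubelets, each tubelet is classified, and the running set of detections
    is updated incrementally -- no need to wait for the full video."""
    detections = []
    for clip in clips:
        tubelets = localize(clip)                       # pixel-level localization -> tubelets
        labeled = [(t, classify(t)) for t in tubelets]  # multi-label classification
        detections = merge(detections, labeled)         # TMAS-style merging
        yield detections                                # results available after every clip
```

Because each stage consumes only a short clip, detections are emitted as the video streams in, which is what makes real-time operation possible.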

Our action localization module processes multiple frames simultaneously and produces only tubelets that correspond to possible actions within the video. This yields temporally consistent localizations and reduces the number of proposals, which drastically increases the speed of the overall system. To improve the accuracy of the localization network, we propose a novel Patch-Dice loss. The original global Dice loss [30] allows networks to account for large imbalances between foreground and background (which is the case in security videos with very small actors). However, it does not account for the variation in scale across foreground objects/actions, which leads networks to focus only on the largest actions. The Patch-Dice loss solves this by computing the loss on local neighborhoods of each frame, allowing our network to localize actions of any scale.
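A minimal sketch of the idea behind the Patch-Dice loss (the patch size and helper names are illustrative choices, not the paper's exact formulation): the frame is tiled into local patches and a soft Dice loss is averaged over them, so a small action missed inside one patch is penalized as heavily as a large one.

```python
def dice(pred, target, eps=1e-6):
    """Soft Dice coefficient between two flat lists of values in [0, 1]."""
    inter = sum(p * t for p, t in zip(pred, target))
    return (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def patch_dice_loss(pred, target, patch=16, eps=1e-6):
    """Patch-Dice loss: 1 - Dice, averaged over non-overlapping local patches.

    pred, target: 2-D lists (H x W) of foreground probabilities.
    The patch size is an illustrative choice."""
    h, w = len(pred), len(pred[0])
    losses = []
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            p = [pred[i][j] for i in range(y, min(y + patch, h))
                            for j in range(x, min(x + patch, w))]
            t = [target[i][j] for i in range(y, min(y + patch, h))
                              for j in range(x, min(x + patch, w))]
            losses.append(1.0 - dice(p, t, eps))
    return sum(losses) / len(losses)
```

With a global Dice loss, a tiny missed actor barely changes the overlap statistics of the whole frame; here, it drives the loss of its own patch toward 1 regardless of how well the large actions are segmented.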

Since activities in untrimmed videos can vary in length, it is necessary to handle both short, atomic activities, such as ‘opening a door’ or ‘exiting a vehicle’, and long, repetitive actions, such as ‘walking’ or ‘riding’. To this end, our system processes videos in an online fashion. Once the short tubelets have been localized and classified, our Tubelet-Merge Action-Split (TMAS) algorithm merges them into final action tubes of varying length. By classifying short tubelets and merging them into action tubes, our system successfully detects both atomic and repetitive actions. Moreover, since multiple activities can co-occur within a single tube, the TMAS algorithm splits such tubes to isolate individual activities. Due to the online nature of the TMAS algorithm and the efficiency of the localization network, our system generates action detections at over 100 fps, greatly exceeding the speed of contemporary action detection methods.
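The merge-then-split logic can be illustrated with a simplified sketch. This is not the paper's exact TMAS algorithm; the data layout, IoU threshold, and greedy single-box matching are assumptions made for brevity. Temporally adjacent, spatially overlapping tubelets are merged into one tube, and the tube is then split into one detection per co-occurring activity class:

```python
from dataclasses import dataclass

@dataclass
class Tubelet:
    start: int    # first frame index
    end: int      # last frame index (inclusive)
    box: tuple    # (x1, y1, x2, y2) spatial extent
    labels: set   # predicted activity classes

def iou(a, b):
    """Spatial intersection-over-union of two boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def tmas(tubelets, iou_thresh=0.5):
    """Simplified Tubelet-Merge Action-Split: greedily merge adjacent,
    overlapping tubelets into tubes (Tubelet-Merge), then emit one
    detection per activity class in each tube (Action-Split)."""
    tubes = []
    for t in sorted(tubelets, key=lambda t: t.start):
        for tube in tubes:
            if t.start <= tube.end + 1 and iou(t.box, tube.box) >= iou_thresh:
                tube.end = max(tube.end, t.end)   # extend the tube in time
                tube.labels |= t.labels           # accumulate co-occurring classes
                break
        else:
            tubes.append(Tubelet(t.start, t.end, t.box, set(t.labels)))
    return [(tube.start, tube.end, tube.box, label)
            for tube in tubes for label in sorted(tube.labels)]
```

The split step is what lets a single person tube yield separate ‘walking’ and ‘talking’ detections with their own temporal extents in the full system; this sketch assigns both the merged extent for simplicity.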


  • Quantitative Results
    • Temporal localization results on the VIRAT test set from the TRECVID 2019 leaderboard. All metrics relate to the miss rate, so lower values indicate better performance.
    • Runtime vs. AUDC score of different systems on the MEVA test set.
    • Temporal localization results on the MEVA sequestered test set. All metrics relate to the miss rate, so lower values indicate better performance. These results are from the publicly available leaderboard.
  • Qualitative Results
    • Qualitative results from the localization network overlaid on the input frames; the three rows show action masks obtained from the ground truth, and masks generated using the BCE loss and the Patch-Dice loss, respectively. The first two columns demonstrate that the network trained with the Patch-Dice loss can detect small actions that are missed or only partially detected when the BCE loss is used. The third column demonstrates that localization masks generated with the Patch-Dice loss have better action boundaries.
    • Qualitative results of our system on sample local evaluation videos. Each row is a sample output from our system, showing spatio-temporal localization and classification of actions in specific frames of the input video. Each action type is shown with a differently colored bounding box. The example activities shown here are vehicle turns left, vehicle reverses, vehicle starts, person talks to person, and person opens vehicle door. These results demonstrate the ability of our system to handle variations in object scale and to detect multiple action classes.