A representational gap exists between low-level measurements (segmentation, object classification, tracking) and high-level understanding of video sequences. We present a novel representation of events in videos that bridges this gap by extending the CASE representation of natural languages. The proposed representation makes three significant contributions over existing frameworks. First, it can represent temporal structure and causality between sub-events. Second, it can represent multi-agent and multi-threaded events. Last, it can detect events in an automated manner by subtree pattern matching. To validate our representation and automatically annotate the videos, we report experiments on several events occurring in two different domains, namely meeting videos and railroad crossing videos.
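To illustrate the detection step, the following is a minimal sketch of subtree pattern matching over an event tree. The tree structure, node labels, and greedy child matching here are illustrative assumptions, not the paper's actual CASE-based representation or matching algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class EventNode:
    """A node in a (hypothetical) hierarchical event tree."""
    label: str
    children: list = field(default_factory=list)

def matches(pattern, node):
    # The pattern matches at this node if the labels agree and each
    # pattern child matches some distinct child of the node.
    # (Greedy assignment for brevity; a full matcher would backtrack.)
    if pattern.label != node.label:
        return False
    used = set()
    for pc in pattern.children:
        found = False
        for i, nc in enumerate(node.children):
            if i not in used and matches(pc, nc):
                used.add(i)
                found = True
                break
        if not found:
            return False
    return True

def detect(pattern, tree):
    # Subtree pattern matching: search every node of the parsed video's
    # event tree for an occurrence of the query pattern.
    if matches(pattern, tree):
        return True
    return any(detect(pattern, c) for c in tree.children)

# Toy event tree for a meeting sequence (labels are invented for this sketch).
video = EventNode("meeting", [
    EventNode("enter-room"),
    EventNode("discussion", [EventNode("hand-raise"), EventNode("vote")]),
])
query = EventNode("discussion", [EventNode("hand-raise")])
print(detect(query, video))  # True
```

Detection then reduces to running `detect` with each query event pattern against the event tree parsed from a video sequence.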

Results for different sequences
Hand Raise (PETS video)
Stealing (USC video)
Railroad Monitoring
Sequence of frames for Railroad Monitoring videos
Sequence of frames for Meeting videos


Related Publications

Asaad Hakeem and Mubarak Shah, "Learning, Detection and Representation of Multi-Agent Events in Videos," Artificial Intelligence, vol. 171 (2007), pp. 586-605.

Asaad Hakeem, Yaser Sheikh, and Mubarak Shah, "CASEE: A Hierarchical Event Representation for the Analysis of Videos," The Nineteenth National Conference on Artificial Intelligence (AAAI), 2004.