
Final Oral Examination for Doctor of Philosophy (Computer Science)
Akash Kumar
Monday, October 13, 2025
2:00PM – 3:00PM
Dissertation
Deep learning has significantly advanced visual understanding tasks like object detection, tracking, and spatio-temporal grounding, benefiting fields such as autonomous vehicles, surveillance systems, and robotics. While large-scale labeled datasets have been crucial to this success, spatio-temporally labeling video requires immense human effort, creating a major bottleneck for scaling this technology. Furthermore, progress has largely been confined to closed-world settings, hindering models' ability to handle free-form queries and novel concepts in unconstrained, real-world environments.
This dissertation addresses these fundamental challenges through several key contributions. First, to reduce annotation dependency, we introduce semi-supervised learning frameworks for dense tasks like video action detection. By incorporating spatio-temporal coherence constraints and student-teacher refinement, our methods achieve performance competitive with fully supervised models while using only a fraction of the labeled data. Second, to bridge the gap to open-world understanding, we propose novel methods for weakly-supervised learning. We introduce foundation models that ground natural language queries by employing contextual and progressive learning paradigms. These models learn to interpret compositional actions and navigate complex scenes without bounding box annotations, significantly enhancing their ability to generalize to unconstrained, real-world scenarios.
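The student-teacher refinement described above can be sketched in a few lines. This is a minimal, illustrative sketch in plain Python, not the dissertation's actual implementation: all function names are hypothetical, and the scalar "actionness" scores stand in for real per-frame detection outputs. It assumes the common pattern in which the teacher's weights track an exponential moving average (EMA) of the student's, the student is trained toward the teacher's pseudo-labels on unlabeled clips, and a temporal-coherence term penalizes abrupt changes between adjacent frames.

```python
# Hypothetical sketch of student-teacher semi-supervised training with a
# temporal-coherence constraint; names and scores are illustrative only.

def ema_update(teacher, student, momentum=0.99):
    """Teacher weights track an exponential moving average of the student's."""
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]

def temporal_coherence_loss(preds):
    """Penalize large jumps between predictions on adjacent frames."""
    if len(preds) < 2:
        return 0.0
    return sum((a - b) ** 2 for a, b in zip(preds, preds[1:])) / (len(preds) - 1)

def consistency_loss(student_preds, teacher_preds):
    """Push student predictions toward the teacher's pseudo-labels."""
    return sum((s - t) ** 2 for s, t in zip(student_preds, teacher_preds)) / len(student_preds)

# Toy usage on per-frame "actionness" scores for one unlabeled clip.
student_w = [0.5, -0.2]
teacher_w = ema_update([0.4, -0.1], student_w)  # slow-moving teacher update

student_scores = [0.20, 0.80, 0.70, 0.75]
teacher_scores = [0.25, 0.70, 0.72, 0.74]
loss = consistency_loss(student_scores, teacher_scores) \
       + temporal_coherence_loss(student_scores)
```

Under this kind of objective, unlabeled video contributes a training signal through the consistency term, while the coherence term encodes the prior that action predictions should vary smoothly over time.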