
Final Oral Examination for Doctor of Philosophy (Computer Science)
Jyoti Kini
Friday, October 31, 2025
2:00 PM – 3:00 PM
Research I, 101A
Dissertation
Autonomous agents rely on robust segmentation, detection, and tracking to perceive, reason about, and act in dynamic environments. These perception capabilities form the core of intelligent systems, from self-driving vehicles navigating urban roads to aerial robots mapping complex terrain and embodied AI agents interacting seamlessly with people and objects. Yet current perception models remain constrained by fragmented architectures, heavy annotation dependence, and poor generalization to unseen viewpoints, all of which hinder scalability and reliability.
This dissertation advances the frontier of 2D and 3D scene understanding through cohesive, label-efficient, and generalizable end-to-end perception frameworks. The contributions span spatio-temporal video instance segmentation, self-supervised video object segmentation, end-to-end 3D detection and tracking in LiDAR, and cross-view open-vocabulary object detection in aerial imagery. Together, these advances address key challenges such as dense perception without region proposals, unsupervised object discovery, unified perception in 3D, and cross-view object understanding in remote sensing. Collectively, they lay the foundation for adaptive, label-efficient, and deployable perception systems capable of operating reliably in complex, real-world settings.