
Final Oral Examination for Doctor of Philosophy (Computer Science)

Rohit Gupta

Thursday, October 30, 2025
3:00PM – 4:00PM
Global 229

Dissertation

Video is now a key medium for learning, communication, and autonomy, so perception systems must recognize fine-grained activities, adapt to new concepts, stay robust under change, and support multiple capabilities. Current methods fall short: they depend on costly labels, use closed vocabularies, inherit brittle representations, and are mostly point solutions that lack multi-task capabilities.

This dissertation closes those gaps as one approach toward general video understanding. It builds label-efficient, fine-grained recognition on educational content; expands to open-vocabulary multi-label recognition; strengthens the robustness of contrastive representations by identifying and removing false-negative pairs; and unifies captioning, question answering, retrieval, and localization in a single Video LLM that also produces strong embeddings. Together these contributions cut annotation costs, recognize unseen actions and entities, improve robustness, and enable multi-task video understanding, yielding a unified model stack for dependable and generalizable visual perception in the physical world.