Final Oral Examination for Doctor of Philosophy (Computer Science)

Sijie Zhu

Thursday, November 14, 2022
11:00AM – 12:00PM
Trevor Colbourn Hall 351

Dissertation

Given an image or a video, visual question answering (VQA) addresses the challenging task of answering a question about the contents of the visual input. VQA has practical applications such as assisting people with visual impairments and helping radiologists with the early diagnosis of fatal diseases. As VQA systems increasingly find real-world applications, there is a compelling need to equip them with the capability to explain their decisions; such capabilities are imperative for improving a system’s reliability and trustworthiness. Attention is a mechanism VQA methods use to link the text (question and answer) to specific visual regions, a process also referred to as grounding. VQA grounding is thus a means of verifying that the correct visual content is being inspected when determining the answer. Beyond its significance for critical applications, VQA also serves as a foundation for further research areas such as embodied AI, language-guided navigation, and visual dialogue.
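To make the grounding idea concrete, the sketch below shows a minimal attention step of the kind many VQA models use: question features score each image region, and the resulting weights form an inspectable grounding signal. This is an illustrative example only, not the method proposed in the dissertation; the function name, shapes, and scaling are assumptions.

```python
# Minimal sketch (assumed, not the dissertation's method): a question
# embedding attends over precomputed region features; the attention
# weights reveal which regions the model inspected for its answer.
import torch
import torch.nn.functional as F


def grounding_attention(question_emb, region_feats):
    """Score each image region against the question embedding.

    question_emb: (d,) pooled question representation.
    region_feats: (num_regions, d) features, e.g., from an object detector.
    Returns per-region attention weights (summing to 1) and the
    question-conditioned visual summary they produce.
    """
    d = question_emb.shape[0]
    scores = region_feats @ question_emb / d ** 0.5   # scaled dot-product
    weights = F.softmax(scores, dim=0)                # one weight per region
    attended = weights @ region_feats                 # weighted visual summary
    return weights, attended


# Toy usage: 5 candidate regions with 8-dim features.
q = torch.randn(8)
regions = torch.randn(5, 8)
w, ctx = grounding_attention(q, regions)
print("region attention:", w)   # inspectable grounding signal
```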

This dissertation makes several contributions to the field of VQA and grounding by presenting: 1) a new algorithm for multimodal question answering that processes each input modality individually and collectively as needed, allowing it to overcome language biases and hence improving performance on truly vision-based questions; 2) a mechanism to measure the reliability of VQA methods by verifying that the correct visual information is being inspected; 3) techniques to improve the interpretability of such methods in two types of neural networks, CNNs and transformers; and 4) an efficient approach to learning a compact video representation, i.e., its underlying spatiotemporal scene graph, and utilizing it to solve video question answering.
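As an illustration of the spatiotemporal scene graph mentioned in contribution 4, the sketch below models objects as nodes and pairwise relations as frame-stamped edges, which is the general shape such representations take. All class and field names are hypothetical assumptions for exposition, not the dissertation's actual schema.

```python
# Hypothetical sketch of a spatiotemporal scene graph: objects are nodes,
# relations are edges, and both carry the frame interval where they hold.
from dataclasses import dataclass, field


@dataclass
class Node:
    obj_id: int
    label: str                # e.g., "person", "cup"
    frames: tuple[int, int]   # (first, last) frame the object is visible


@dataclass
class Edge:
    subj: int                 # obj_id of the subject node
    obj: int                  # obj_id of the object node
    predicate: str            # e.g., "holding", "left_of"
    frames: tuple[int, int]   # interval over which the relation holds


@dataclass
class SpatioTemporalSceneGraph:
    nodes: list[Node] = field(default_factory=list)
    edges: list[Edge] = field(default_factory=list)

    def relations_at(self, t: int) -> list[Edge]:
        """Edges active at frame t -- the basis for answering
        'what is the person holding at time t?'-style questions."""
        return [e for e in self.edges if e.frames[0] <= t <= e.frames[1]]


# Toy graph: a person holds a cup during frames 10..40.
g = SpatioTemporalSceneGraph(
    nodes=[Node(0, "person", (0, 100)), Node(1, "cup", (5, 60))],
    edges=[Edge(0, 1, "holding", (10, 40))],
)
print([e.predicate for e in g.relations_at(25)])   # ['holding']
```

Because the graph stores only labeled objects and timed relations rather than raw frames, it is far more compact than the video itself, which is what makes it attractive as an intermediate representation for video question answering.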