
Abstract
We lay out a framework, grounded in human learning, for studying computational multimodal comprehension. Current visual question answering (VQA) systems treat all questions as equal and have no notion of comprehension. Elementary school (K-5) reading instruction, by contrast, takes a graded approach based on a hierarchy of skills that ranges from memorization to content creation. We take inspiration from such hierarchies to investigate both dataset creation and question answering techniques. First, we are creating a new VQA dataset that tests the comprehension of VQA systems in a graded manner, using hierarchical question answering over picture stories. Second, we investigate large language models such as GPT-Neo, an open-source counterpart to GPT-3. Current pre-trained language models hold a great deal of knowledge but have a more limited ability to use it. Bloom’s Taxonomy helps educators teach children how to use knowledge by categorizing comprehension skills, so we use it to analyze and improve the comprehension skills of large pre-trained language models. Our experiments focus on zero-shot question answering, using the taxonomy to supply proximal context, that is, context directly relevant to the question at hand, which helps the model answer it. We show that targeting context in this manner improves performance across four popular commonsense question answering datasets. Finally, we present work on detecting and removing bias in common multimodal machine comprehension datasets. We hypothesize that the bias naturally present in these datasets affects even the best-performing models. We verify this hypothesis and propose an algorithm that modifies a given dataset to remove the biased elements.