Center for Research in Computer Vision



Video Fill In the Blank using LR/RL LSTMs with Spatial-Temporal Attentions (ICCV 2017)

Introduction

In computer vision, Deep Convolutional Neural Networks (CNNs) have achieved dramatic success in detection (e.g. object detection) and classification (e.g. action classification). Likewise, Recurrent Neural Networks (RNNs) have proven very useful in Natural Language Processing (NLP), for example in language translation. Recently, new problems such as Visual Captioning (VC) and Visual Question Answering (VQA) have drawn a lot of interest, as they are very challenging and extremely valuable for both computer vision and natural language processing. Both Visual Captioning and Visual Question Answering are related to the Video-Fill-in-the-Blank (VFIB) problem, which is addressed in this paper.


Given a video and a descriptive sentence about it with one or more blanks, can we find the missing word?


Our Approach


To find the missing word, we incorporate three modules: 1) a textual encoder, 2) spatial attention, and 3) a temporal encoder.



Our sentence encoder learns to represent a sentence with a blank by encoding each fragment (the words before and after the blank) twice. It encodes each fragment independently the first time, and with respect to the opposite fragment the second time.
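The two-pass fragment encoding above can be sketched as follows. This is a minimal illustration, not the paper's implementation: a simple tanh recurrence stands in for the LR/RL LSTMs, the toy embeddings and dimensions are made up, and conditioning on the opposite side is modeled by initializing the second pass with the other fragment's first-pass summary.

```python
import numpy as np

def encode(seq, h0, Wh, Wx):
    # simple tanh-RNN stand-in for an LSTM; returns the final hidden state
    h = h0
    for x in seq:
        h = np.tanh(Wh @ h + Wx @ x)
    return h

rng = np.random.default_rng(0)
d = 8  # hypothetical embedding / hidden size
# toy word embeddings for the fragments left and right of the blank
left  = [rng.normal(size=d) for _ in range(3)]
right = [rng.normal(size=d) for _ in range(4)]
Wh = rng.normal(size=(d, d)) * 0.1
Wx = rng.normal(size=(d, d)) * 0.1
zero = np.zeros(d)

# pass 1: each fragment encoded independently
u_left  = encode(left, zero, Wh, Wx)           # LR pass over the left fragment
u_right = encode(right[::-1], zero, Wh, Wx)    # RL pass reads the right fragment backward

# pass 2: each fragment re-encoded with respect to the opposite side's summary
v_left  = encode(left, u_right, Wh, Wx)
v_right = encode(right[::-1], u_left, Wh, Wx)

sentence_vec = np.concatenate([v_left, v_right])  # blank representation
```

The second pass lets each side of the blank see a summary of the other side, so the final representation of the blank depends on the whole sentence.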


The spatial (left) and temporal (right) attention models find the most important spatial regions and frames of the video with respect to the input text.
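The text-conditioned attention idea can be sketched with a single scoring-and-pooling step. This is a generic bilinear attention sketch under assumed shapes, not the paper's exact model: `attend` scores each video part (a spatial region, or a frame for the temporal case) against the text encoding and pools the parts with the resulting weights.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(text_vec, video_feats, W):
    # one relevance score per video part (region or frame), via a bilinear form
    scores = video_feats @ (W @ text_vec)
    weights = softmax(scores)          # attention distribution over parts
    pooled = weights @ video_feats     # attention-weighted video summary
    return pooled, weights

rng = np.random.default_rng(1)
d = 8                                  # hypothetical feature size
text_vec = rng.normal(size=d)          # encoded sentence with the blank
regions = rng.normal(size=(6, d))      # e.g. 6 spatial regions of a frame
W = rng.normal(size=(d, d)) * 0.1
pooled, weights = attend(text_vec, regions, W)
```

Applying the same step over frame features instead of region features gives the temporal counterpart, so one mechanism covers both attentions.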



Downloads

The PDF file of the paper can be downloaded here.
The LSMDC dataset and more information about the challenge are available here.
Slides are available here: Slides

Related Publications

Amir Mazaheri, Dong Zhang, and Mubarak Shah, "Video Fill In the Blank using LR/RL LSTMs with Spatial-Temporal Attentions", in Proceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017 [PDF]

Back to Computer Vision and NLU Projects