Video data is growing explosively due to ubiquitous video acquisition devices. Recent years have witnessed a surge of video volumes from surveillance, health care, and personal mobile phones, to name a few sources. Meanwhile, thanks to social media and the on-demand video streaming industry, videos and their corresponding textual descriptions are being produced and stored continuously. This abundance of video and text “big data” makes now an opportune time for Computer Vision and Machine Learning (ML) to introduce and solve tasks that require a joint understanding of videos and text. Moreover, manual analysis of such volumes of video and text data is infeasible for humans. Joint text-and-video understanding approaches offer solutions to various real-world use cases, such as social media analysis and video search engines. These practical solutions change the way we utilize the available data and have an impact on industry, public safety, and future research directions.
This dissertation makes contributions to the above tasks by proposing:
(1) A novel framework for multi-concept video retrieval that exploits inter- and intra-shot correlations. (2) A spatio-temporal attention model that solves a novel form of Visual Question Answering. (3) A new research problem, Visual Text Correction, which aims to detect and correct inaccuracies in video descriptions. (4) A generative model that produces videos from natural language sentences on in-the-wild datasets by constructing a latent path.