This material is based upon work supported by the National Science Foundation under Grant No. 1741431.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Award Number: 1741431
Duration (expected): 3 years
Award Amount: $662,431.00
Award Title: BIGDATA: IA: Distributed Semi-Supervised Training of Deep Models and Its Applications in Video Understanding
PI: Mubarak Shah
Co-PI: Liqiang Wang
Student(s): (Supported) Ehsan Kazemi Foroushani, Siavash Khodadadeh, Jyoti Kini, Yandong Li, Zixia Liu, Amir Mazaheri, Dongdong Wang, Yang Zhang; (Co-Authors) Pooya Abolghasemi, Deliang Fan, Yunhui Guo, Mahdi M. Kalayeh, Nasim Souly, Xiang Wei
We propose to design new algorithms for distributed Learning with Less Labels (LWL), including unsupervised, semi-supervised, active learning, and meta-learning training strategies for deep models, together with their applications to computer vision and machine learning problems such as semantic segmentation, video retrieval, image and video synthesis, video human action recognition/localization/segmentation, and visuomotor object manipulation. We also conduct fundamental research on unsupervised learning techniques such as structure-preserving data selection and multi-norm batch normalization.
To leverage labeled and unlabeled data at scale, we are designing novel hybrid synchronous and asynchronous frameworks for efficient distributed training. Although we expect to make novel discoveries in the above-mentioned domains, our integrated solution to the distributed semi-supervised training of deep models could also be applied to other scientific domains. The proposed research will enrich the family of deep learning methods by modeling the inherent characteristics of data to enable effective learning from distributed and heterogeneous image and video data. To further facilitate faster and more scalable training of deep models, we are also developing a novel parallel computing framework tailored to deep neural networks. The carefully chosen application domains enable us to additionally explore temporal structures and models.
The idiosyncrasies of deep models and of distributed unlabeled data prevent us from directly applying existing semi-supervised algorithms. First, the existing methods mostly rely on fixed data representations (e.g., handcrafted features of images and acoustic signals), whereas a key distinction of deep models from other machine learning methods is that they automatically learn data representations from raw data (e.g., images). Second, the computational challenge becomes more profound in semi-supervised learning: the unlabeled data are often unstructured and distributed, making it infeasible to transmit the data to, or store them on, a single machine. How should we design the architectures of deep models to facilitate distributed training? How should we use unlabeled data to facilitate semi-supervised learning? How can we analyze video effectively? Our team endeavours to work on these challenging research questions.
Current/Final Results (summary)
 Unsupervised Meta-Learning For Few-Shot Image and Video Classification (NeurIPS 2019)
Few-shot classification refers to classifying N different concepts based on just a few examples of each. Few-shot learning refers to methods or techniques that enable deep neural networks to learn a few-shot classification task from just a few samples. Few-shot or one-shot learning of classifiers requires a significant inductive bias towards the type of task to be learned. One way to acquire this bias is by meta-learning on tasks similar to the target task. To achieve this, meta-learning requires access to many different few-shot learning tasks and aims to learn how to learn those tasks from scratch with just a few samples for each. When we face a new classification task (the target task), we hope to perform it with the network that has meta-learned how to learn classification tasks from just a few samples. Note that the target classification task does not share any of its classes with the meta-learning tasks.
Result: In this paper, we propose UMTRA, an algorithm that performs unsupervised, model-agnostic meta-learning for classification tasks. UMTRA does not require label information during meta-learning. In other words, UMTRA trades off some classification accuracy for a reduction of several orders of magnitude in the number of required labels.
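The task setup described above can be made concrete with a minimal sketch that builds one N-way, K-shot task and keeps the target classes disjoint from the meta-learning classes. It illustrates only the few-shot task structure, not UMTRA itself; the function `sample_episode`, the toy data pool, and the class split are hypothetical.

```python
import random

def sample_episode(data_by_class, n_way, k_shot, rng):
    """Sample one N-way, K-shot task: pick N classes, then K examples per class."""
    classes = rng.sample(sorted(data_by_class), n_way)
    episode = []
    for task_label, cls in enumerate(classes):
        for example in rng.sample(data_by_class[cls], k_shot):
            episode.append((example, task_label))
    return classes, episode

# Hypothetical toy pool: 10 classes with 20 examples each.
pool = {f"class_{c}": [f"img_{c}_{i}" for i in range(20)] for c in range(10)}

# Split the classes so that target tasks never share classes with meta-learning tasks.
all_classes = sorted(pool)
meta_pool = {c: pool[c] for c in all_classes[:7]}
target_pool = {c: pool[c] for c in all_classes[7:]}

rng = random.Random(0)
meta_task_classes, meta_task = sample_episode(meta_pool, n_way=5, k_shot=1, rng=rng)
target_task_classes, target_task = sample_episode(target_pool, n_way=3, k_shot=1, rng=rng)
```

Each episode is a list of `(example, task_label)` pairs, where `task_label` is local to the task, so the same network output can be reused across tasks.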
We have developed a new algorithm for finding structure-preserving representatives and show its superiority on several tasks: active learning for video action recognition on UCF-101; learning using representatives on ImageNet; training a generative adversarial network (GAN) to generate multi-view images from a single-view input on the CMU Multi-PIE dataset; and video summarization on the UTE Egocentric dataset.
Result: Our Iterative Projection and Matching (IPM) method achieves 30% using only 50 samples, compared to 48% when employing 1.2 million samples.
We have developed a new algorithm for training deep networks by separating modes of variation in batch-normalized models. Through an extensive set of experiments on CIFAR-10 and CIFAR-100, using both a 5-layer deep CNN and the modern Inception-V3 architecture, we show that mixture normalization reduces the number of gradient updates required to reach the maximum test accuracy of the batch-normalized model by ∼31%-47% across a variety of training scenarios.
Result: Our mixture normalization algorithm reduces the number of gradient updates required to reach the maximum test accuracy of the batch-normalized model by ∼31%-47% across a variety of training scenarios.
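As a rough intuition for why separating modes can matter, the sketch below contrasts normalizing a batch with one global mean and variance against normalizing each mode separately. It is a deliberate simplification and not the published method: it assumes scalar activations and hard, already-known mode assignments, and the helper names `normalize` and `normalize_per_mode` are hypothetical.

```python
import statistics

def normalize(values, eps=1e-5):
    """Standardize values with a single mean and variance (batch-norm style)."""
    mean = statistics.fmean(values)
    var = statistics.pvariance(values, mu=mean)
    return [(v - mean) / (var + eps) ** 0.5 for v in values]

def normalize_per_mode(values, modes, eps=1e-5):
    """Standardize each mode separately (assumes hard, known mode assignments)."""
    out = [0.0] * len(values)
    for mode in set(modes):
        idx = [i for i, m in enumerate(modes) if m == mode]
        for i, z in zip(idx, normalize([values[i] for i in idx], eps)):
            out[i] = z
    return out

# Hypothetical bimodal activations: one mode near -5, another near +5.
activations = [-5.2, -4.9, -5.1, -4.8, 5.0, 5.3, 4.7, 5.1]
modes = [0, 0, 0, 0, 1, 1, 1, 1]

batch_normed = normalize(activations)
mode_normed = normalize_per_mode(activations, modes)

# Mean of the first mode after each kind of normalization.
mode0_mean_bn = statistics.fmean(batch_normed[:4])
mode0_mean_mn = statistics.fmean(mode_normed[:4])
```

With a single set of statistics, each mode stays far from zero mean after normalization, while per-mode statistics center each mode individually.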
 An Efficient 3D CNN for Action/Object Segmentation in Video (BMVC 2019)
We have developed a new algorithm for video object segmentation in the unsupervised setting. We propose an end-to-end encoder-decoder style 3D CNN based method to solve the video object segmentation problem efficiently. Extensive experiments on several video object segmentation and action segmentation benchmarks demonstrate the superior performance of the proposed approach compared to the state-of-the-art.
Result: Our Efficient 3D CNN model for video object segmentation achieves an accuracy of 77.4% with 6 Billion parameters and 11 MB using separable convolutions, compared to standard 3D convolution, which achieves 77.1% accuracy with 136 Billion parameters and 255 MB.
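The savings from separable convolutions can be illustrated with a small weight count for a single layer. The sketch assumes the common depthwise-plus-pointwise factorization and ignores biases; the source does not state which factorization the model uses, and the layer sizes below are hypothetical.

```python
def standard_conv3d_params(c_in, c_out, kernel):
    """Weights in a standard 3D convolution: every output channel sees every input channel."""
    kt, kh, kw = kernel
    return c_in * c_out * kt * kh * kw

def separable_conv3d_params(c_in, c_out, kernel):
    """Weights in a depthwise 3D convolution followed by a 1x1x1 pointwise convolution."""
    kt, kh, kw = kernel
    depthwise = c_in * kt * kh * kw   # one spatio-temporal filter per input channel
    pointwise = c_in * c_out          # 1x1x1 convolution that mixes channels
    return depthwise + pointwise

# Hypothetical layer: 256 input channels, 256 output channels, 3x3x3 kernel.
c_in, c_out, kernel = 256, 256, (3, 3, 3)
standard = standard_conv3d_params(c_in, c_out, kernel)
separable = separable_conv3d_params(c_in, c_out, kernel)
ratio = standard / separable
```

For this layer the standard convolution has 1,769,472 weights and the separable version 72,448, roughly a 24x reduction.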
 Robustifying Visuomotor Policy by Task Focused Visual Attention (CVPR 2019)
We have developed an approach for augmenting a deep visuomotor policy trained through demonstrations with Task Focused visual Attention (TFA). The manipulation task is specified with a natural language text such as “move the red bowl to the left”. Averaged over all the objects, the recovery rate is only 5% for the baseline policy in pickup and push tasks, while it is 57% for the policy with the TFA.
Result: Our Task Focused visual Attention (TFA) visuomotor policy achieves a recovery rate of 57% for pickup and push tasks, compared to 5% for the baseline policy.
 Photography and Exploration of Tourist Locations Based on Optimal Foraging Theory (IEEE TCSVT 2019)
We have developed a new location recommendation algorithm for exploring a particular tourist attraction from the photography perspective. The recommendation is learned from the YFCC100M dataset, which has around 98M user-generated images from around 300K users. We make use of social media images and associated meta-data to understand past tourist behavior and the location environment.
Result: Our Foraging Theory based method utilizes big multimedia data shared on social media platforms and can generate effective personalized tour recommendations in real-time with a 42% improvement in similarity performance.
In this project, we propose a novel probabilistic model, built upon SeqDPP, to dynamically control the time span of a video segment upon which the local diversity is imposed. In particular, we enable SeqDPP to learn to automatically infer from the input video how local the local diversity is supposed to be. The resulting model is difficult to train with standard maximum likelihood estimation (MLE), which also suffers from exposure bias and non-differentiable evaluation metrics. To tackle these problems, we instead devise a reinforcement learning algorithm for training the proposed model. Extensive experiments verify the advantages of our model and the new learning algorithm over MLE-based methods.
Result: Our video summarization algorithms utilize local diversity as a guideline to tackle the dynamic diverse selection problem. Both our SeqDPP and DySeqDPP algorithms significantly outperform the state-of-the-art methods, and DySeqDPP ranks first by a large margin (around 4%) on the SumMe benchmark.
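To show how a determinantal point process (DPP) rewards diversity, the sketch below scores a subset of frames by the determinant of its similarity-kernel submatrix, which is the standard DPP scoring rule. It does not implement SeqDPP or DySeqDPP; the 2-D frame features and helper names are hypothetical.

```python
def dot(u, v):
    """Inner product of two feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def kernel_submatrix(features, subset):
    """Similarity kernel restricted to the selected items."""
    return [[dot(features[i], features[j]) for j in subset] for i in subset]

def det(matrix):
    """Determinant by Laplace expansion; adequate for the tiny subsets used here."""
    n = len(matrix)
    if n == 1:
        return matrix[0][0]
    total = 0.0
    for col in range(n):
        minor = [row[:col] + row[col + 1:] for row in matrix[1:]]
        total += (-1) ** col * matrix[0][col] * det(minor)
    return total

# Hypothetical frame features: frames 0 and 1 are near-duplicates, frame 2 differs.
frames = [(1.0, 0.0), (0.995, 0.1), (0.0, 1.0)]

redundant_score = det(kernel_submatrix(frames, [0, 1]))  # similar pair
diverse_score = det(kernel_submatrix(frames, [0, 2]))    # dissimilar pair
```

The near-duplicate pair gets a score close to zero, while the dissimilar pair gets a much larger one, so a DPP favors summaries made of mutually different frames.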
In this project, we propose a multi-domain learning architecture based on depthwise separable convolution. The proposed approach is based on the assumption that images from different domains share cross-channel correlations but have domain-specific spatial correlations. The proposed model is compact and has minimal overhead when being applied to new domains. Additionally, we introduce a gating mechanism to promote soft sharing between different domains.
Result: In the project investigating reusable structure in models, our depthwise separable convolution based architecture achieves the highest score while requiring only 50% of the parameters of the state-of-the-art approaches, according to an experiment on the Visual Decathlon Challenge.
In this project, we propose a black-box adversarial attack algorithm that can defeat both vanilla DNNs and those generated by various defense techniques developed recently. Instead of searching for an “optimal” adversarial example for a benign input to a targeted DNN, our algorithm finds a probability density distribution over a small region centered around the input, such that a sample drawn from this distribution is likely an adversarial example, without the need of accessing the DNN’s internal layers or weights. Our approach is universal as it can successfully attack different neural networks by a single algorithm. It is also strong; according to the testing against 2 vanilla DNNs and 13 defended ones, it outperforms state-of-the-art black-box or white-box attack methods for most test cases.
Result: In the black-box attack project, our algorithm defeats 13 defended DNNs, performing better than or on par with state-of-the-art white-box attack methods.
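The idea of attacking with a distribution rather than a single example can be sketched as follows: draw samples from a Gaussian over a small region around the input, query only the model's predicted label, and measure how often a sample is misclassified. This is an illustrative toy, not the project's algorithm, and it compares two fixed distributions instead of learning one; the stand-in classifier `black_box_model` and all parameters are hypothetical.

```python
import random

def black_box_model(x):
    """Hypothetical stand-in classifier; only its predicted label is observed."""
    score = 2.0 * x[0] - 1.0 * x[1] + 0.05
    return 1 if score > 0 else 0

def sample_near(x, mean_shift, sigma, epsilon, rng):
    """Draw one point from a Gaussian around x, clipped to an L-infinity ball of radius epsilon."""
    out = []
    for xi, mi in zip(x, mean_shift):
        delta = rng.gauss(mi, sigma)
        delta = max(-epsilon, min(epsilon, delta))
        out.append(xi + delta)
    return out

def attack_success_rate(x, true_label, mean_shift, sigma, epsilon, n_samples, rng):
    """Fraction of sampled points that the model labels differently from true_label."""
    hits = 0
    for _ in range(n_samples):
        candidate = sample_near(x, mean_shift, sigma, epsilon, rng)
        if black_box_model(candidate) != true_label:
            hits += 1
    return hits / n_samples

rng = random.Random(0)
x = [0.0, 0.0]
true_label = black_box_model(x)

# A distribution centered on the input versus one shifted toward the decision boundary.
centered = attack_success_rate(x, true_label, [0.0, 0.0], 0.02, 0.1, 2000, rng)
shifted = attack_success_rate(x, true_label, [-0.05, 0.05], 0.02, 0.1, 2000, rng)
```

A well-placed distribution yields adversarial samples far more often than one centered on the input, and finding such a distribution requires only label queries, never the model's weights.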
Listed below are the major accomplishments:
- IPM was found to be an efficient and robust method for structure-preserving representative selection which can be generalized to a wide range of problem domains.
- A training cost optimization based on separating modes of variation was observed to be robust across networks of varying sizes.
- The use of separable convolutions for video object segmentation is effective and leads to drastic reduction in memory consumption.
- The text based visual attention was found to be effective for visuomotor policy learning and robust in the presence of visual and physical disturbances.
- The integration of behavior science from foraging theory was found to be useful in providing effective personalized recommendation which considers both content as well as user context.
- In the black-box attack project, we investigated the transferability of adversarial examples across the defended DNNs. Interestingly, we observe that, unlike the high transferability across vanilla DNNs, it is difficult to transfer the attacks on the defended DNNs.
- In the project of distributed machine learning, we observe that asynchronous proximal algorithms can be highly efficient when being used to solve large scale non-convex non-smooth problems.
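For readers unfamiliar with proximal algorithms, the sketch below shows the basic building block on a toy problem: a gradient step on a smooth term followed by the proximal operator of a non-smooth L1 penalty (soft-thresholding). It is a synchronous, single-machine proximal gradient loop on a convex example, not the asynchronous delay-aware accelerated coordinate descent method from the project; the objective and all names are hypothetical.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|: shrink v toward zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def proximal_gradient(grad, x0, step, lam, n_iters):
    """Alternate a gradient step on the smooth term with the L1 proximal step."""
    x = list(x0)
    for _ in range(n_iters):
        g = grad(x)
        x = [soft_threshold(xi - step * gi, step * lam) for xi, gi in zip(x, g)]
    return x

# Hypothetical smooth term: f(x) = 0.5 * ||x - target||^2, with penalty lam * ||x||_1.
target = [3.0, 0.2, -1.5]

def grad(x):
    return [xi - ti for xi, ti in zip(x, target)]

solution = proximal_gradient(grad, [0.0, 0.0, 0.0], step=1.0, lam=0.5, n_iters=5)
```

The small coordinate is set exactly to zero while the larger ones are shrunk by the penalty, which is the sparsity-inducing effect of the non-smooth term.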
 Khodadadeh, Siavash, Ladislau Bölöni, and Mubarak Shah. Unsupervised Meta-Learning For Few-Shot Image and Video Classification. 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada. [Project Page] [Code]
 Zaeemzadeh, Alireza, Joneidi, Mohsen, Rahnavard, Nazanin, Shah, Mubarak: Iterative Projection and Matching: Finding Structure-preserving Representatives and Its Application to Computer Vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, June 16-20, 2019. [Project Page] [Code]
 M. M. Kalayeh and M. Shah, “Training Faster by Separating Modes of Variation in Batch-normalized Models,” in IEEE Transactions on Pattern Analysis and Machine Intelligence. doi: 10.1109/TPAMI.2019.2895781
 Rui Hou, Chen Chen, Rahul Sukthankar, Mubarak Shah, An Efficient 3D CNN for Action/Object Segmentation in Video, British Machine Vision Conference (BMVC 2019), UK, Sep 9-10, 2019.
 Abolghasemi, Pooya, Amir Mazaheri, Mubarak Shah, and Ladislau Bölöni, Pay attention!-Robustifying a Deep Visuomotor Policy through Task-Focused Attention, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), Long Beach, CA, June 16-20, 2019. [Project Page] [Code]
 Rawat, Yogesh Singh, Mubarak Shah, and Mohan S. Kankanhalli. Photography and Exploration of Tourist Locations Based on Optimal Foraging Theory. IEEE Transactions on Circuits and Systems for Video Technology (2019). [Code]
 Yandong Li, Liqiang Wang, Tianbao Yang, and Boqing Gong. How Local is the Local Diversity? Reinforcing Sequential Determinantal Point Processes with Dynamic Ground Sets for Supervised Video Summarization. In the European Conference on Computer Vision (ECCV) 2018. Munich, Germany. Sept. 8-14, 2018. [Code]
 Yandong Li#, Yunhui Guo#, Liqiang Wang, and Tajana Rosing. Depthwise Convolution is All You Need for Learning Multiple Visual Domains. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI). Honolulu, Hawaii, USA. 2019. (# Equal contribution.)
 Ehsan Kazemi* and Liqiang Wang. Asynchronous Delay-Aware Accelerated Proximal Coordinate Descent for Nonconvex Nonsmooth Problems. In The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI). Honolulu, Hawaii, USA. 2019.
 Yandong Li, Lijun Li, Liqiang Wang, Tong Zhang, Boqing Gong. NATTACK: Improved Black-Box Adversarial Attack with Normal Distributions. In 36th International Conference on Machine Learning (ICML). Long Beach, CA, USA. 2019. [Code]
 Souly, Nasim and Spampinato, Concetto and Shah, Mubarak. (2017). Semi Supervised Semantic Segmentation Using Generative Adversarial Network. IEEE International Conference on Computer Vision (ICCV). 5689 to 5697. doi:10.1109/ICCV.2017.606
 Mazaheri, Amir and Gong, Boqing and Shah, Mubarak. (2018). Learning a Multi-Concept Video Retrieval Model with Multiple Latent Variables. ACM Transactions on Multimedia Computing, Communications, and Applications. 14 (2) 1 to 21. doi:10.1145/3176647 [Code]
 Wei, Xiang and Gong, Boqing and Liu, Zixia and Lu, Wei and Wang, Liqiang. (2018). Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect. International Conference on Learning Representations (ICLR). [Code]
 Ding, Yifan and Wang, Liqiang and Fan, Deliang and Gong, Boqing. (2018). A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels. 2018 IEEE Winter Conference on Applications of Computer Vision (WACV). 1215 to 1224. doi:10.1109/WACV.2018.00138. [Code]
 “View Invariant and Few Shot Human Action Recognition”, Invited Talk, Simula Metropolitan Center for Digital Engineering, Oslo, March 14, 2019.
 “View Invariant and Few Shot Human Action Recognition”, Invited Talk, Workshop on Human Activity Detection in Multi-Camera Video Streams at the IEEE Winter Conf. on Applications of Computer Vision (WACV), Hawaii, January 7, 2019.
 “Self Flying Drones: Wide Area Aerial Video Analysis”, Vision Meets Drone: A Challenge, ECCV 2018 workshop, September 8, 2018, Munich Germany.
Data (with documentation)
 Iterative Projection and Matching: Finding Structure-preserving Representatives and Its Application to Computer Vision [Data]
 Pay attention!-Robustifying a Deep Visuomotor Policy through Task-Focused Attention [Data]
 Visual Text Correction [Data]
Software Downloads (with documentation)
 Unsupervised Meta-Learning For Few-Shot Image and Video Classification [Code]
 Iterative Projection and Matching: Finding Structure-preserving Representatives and Its Application to Computer Vision [Code]
 Pay attention!-Robustifying a Deep Visuomotor Policy through Task-Focused Attention [Code]
 Photography and Exploration of Tourist Locations Based on Optimal Foraging Theory [Code]
 How Local is the Local Diversity? Reinforcing Sequential Determinantal Point Processes with Dynamic Ground Sets for Supervised Video Summarization [Code]
 NATTACK: Improved Black-Box Adversarial Attack with Normal Distributions [Code]
 Learning a Multi-Concept Video Retrieval Model with Multiple Latent Variables [Code]
 Improving the Improved Training of Wasserstein GANs: A Consistency Term and Its Dual Effect [Code]
 A Semi-Supervised Two-Stage Approach to Learning from Noisy Labels [Code]
 Visual Text Correction [Code]
Impact on the development of the principal disciplines of the project: Our research on active learning, multi-norm batch normalization, and distributed semi-supervised training of deep neural networks will benefit many applications in machine learning and computer vision. These algorithms can serve as standard techniques to facilitate the training procedure, particularly in big-data problems where training time and data selection are important factors. Our contributions in this project will bring new opportunities to advance the state of the art in various machine learning and computer vision tasks.
Impact on other disciplines: The proposed techniques can potentially benefit any field of research which directly or indirectly leverages machine learning and big data.
Impact on the development of human resources: The proposed work contributes to the training and education of students by enriching course syllabi, and it encourages students to become familiar with the latest advances in the field through course projects. This work also forms the subject of many Ph.D. dissertations written by graduate students working under the supervision of the PIs. The PI will also continue leading the REU and RET programs, with the goals of increasing diversity among STEM doctoral degree recipients and including women and minority graduate students.
Impact on technology transfer: All of our research material including the codes and datasets are available to the public. Since our research has direct impact on deep learning and machine learning applications, it may help others to develop many related algorithms, software, and applications.
Impact on society beyond science and technology: Our research is useful for multimedia applications on big data, which may have value for social scientists analysing digitised records of human behavior such as big video data from social media and textual feeds from news agencies and blogs.
The PI, Mubarak Shah, taught “Advanced Computer Vision” in Spring 2018 and Spring 2019, in which many recent papers on Big Video data analysis problems in computer vision and on semi-supervised learning were discussed. Graduate students implemented an interest point detection/matching deep neural network (related to self-supervised learning), and also implemented a semi-supervised video object segmentation deep neural network on one of the largest available datasets (YouTube-VOS) as the final project.
The Co-PI, Liqiang Wang, taught “Parallel and Cloud Computing” in Fall 2019, which covered topics in distributed machine learning and big data processing. Graduate students were trained in distributed machine learning and implemented projects on cutting-edge multi-GPU/CPU platforms.
In addition, during summer 2019, 10 undergraduate students and 6 high school teachers participated in the Research Experience for Undergraduates (REU) and Research Experience for Teachers (RET) programs in Computer Vision which were both funded by NSF. REU students worked on computer vision projects related to big video data applying semi-supervised and weakly supervised methods. RET participants were exposed to several applications of big data.
Several Ph.D. students have been involved in the research funded by this grant. We highlight a few here:
- Amir Mazaheri worked on the big video data problem of visual text correction, and on semi-supervised robot manipulation using visual attention and textual information.
- Mahdi Kalayeh worked on a novel formulation of multi-norm batch normalization for faster training of deep networks. He graduated in May 2019 and is currently working at Netflix as a senior research scientist.
- Rui Hou worked on video action localization and segmentation using deep neural networks trained on big video datasets. Rui finished his PhD in June 2019 and is currently working at Niantic, Inc. in California.
- Ehsan Kazemi and Yandong Li were also involved in the proposed research. Yandong published three papers in ECCV, AAAI, and ICML, and Ehsan published one paper in AAAI during the last academic year.
 Trustee chair professor Dr. Mubarak Shah is the 2019 winner of the prestigious ACM SIGMM Award for Outstanding Technical Contributions to Multimedia Computing, Communications and Applications. Dr. Shah was selected for this award for his “outstanding and pioneering and continued research contributions in the areas of multimedia content analysis and multimedia applications, for leadership in education, and for outstanding and continued service to the community.” The award includes a $2000 honorarium and a 30-minute keynote talk at ACMMM 2019 in Nice, France.
 Amir Mazaheri won a $5000 doctoral research fellowship award from the University of Central Florida for the “Visual Text Correction” work.
Highlights, Press Release
ACM SIGMM Award for Outstanding Technical Contributions to Multimedia Computing, Communications and Applications
Point of Contact: Mubarak Shah
Date of Last Update: September 23, 2019