This dissertation addresses the problem of semantically describing images, a fundamental task that narrows the semantic gap between human and machine visual reasoning. Visual attributes and textual tags allow us to naturally characterize objects (e.g., “person” and “makeup”), actions (e.g., “wearing”), and relationships (e.g., “person wearing makeup”), both individually, grounded in local properties, and in the global context of the entire scene.
Automatic image annotation assigns relevant textual tags to images. We propose a mathematical framework based on Non-negative Matrix Factorization (NMF) to perform automatic image annotation in a way that seamlessly adapts to the continuous growth of datasets. Our query-specific approach is built on the features and tags of nearest neighbors. It naturally solves the problem of feature fusion and handles the challenge of rare tags by introducing weight matrices that penalize incorrect modeling of less frequent tags and of the images associated with them.
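As a rough illustration of the weighted-factorization idea (not the exact query-specific formulation proposed in this dissertation), the following sketch minimizes a weighted Frobenius reconstruction error with multiplicative updates; here the weight matrix `W` plays the role of up-weighting rare tags and the images associated with them:

```python
import numpy as np

def weighted_nmf(V, W, rank=5, n_iter=200, eps=1e-9, seed=0):
    """Weighted NMF via multiplicative updates:
    minimize || W * (V - A @ B) ||_F^2  over non-negative A, B.

    V: (m, n) non-negative data, e.g., an image-by-tag relevance matrix.
    W: (m, n) non-negative weights, e.g., larger for rare tags.
    """
    rng = np.random.default_rng(seed)
    m, n = V.shape
    A = rng.random((m, rank)) + eps
    B = rng.random((rank, n)) + eps
    WV = W * V
    for _ in range(n_iter):
        # Standard multiplicative updates for the weighted objective;
        # eps guards against division by zero.
        WAB = W * (A @ B)
        A *= (WV @ B.T) / (WAB @ B.T + eps)
        WAB = W * (A @ B)
        B *= (A.T @ WV) / (A.T @ WAB + eps)
    return A, B
```

The multiplicative form keeps both factors non-negative throughout, which is what makes the decomposition interpretable as additive parts (tag topics).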
Despite their effectiveness, the descriptive power of tags scales only linearly with the number of unique training labels. In contrast, attributes, being category-agnostic, allow an exponential number of semantic classes to be modeled. Given this advantage, we next focus on visual attributes. We hypothesize that integrating pixel-level semantic parsing of the face and human body should improve person-related attribute prediction. To this end, we propose Semantic Segmentation-based Pooling (SSP) and Gating (SSG). In SSP, the estimated segmentation masks pool the output activations of the last convolutional layer (before the classifier) over multiple semantically homogeneous regions, unlike global average pooling, which is spatially agnostic. In SSG, we create multiple copies of the feature map, each of which preserves the activations within a single semantic region and suppresses the rest. This mechanism prevents max-pooling from mixing semantically inconsistent regions. While effective, SSP and SSG impose heavy memory utilization. To address this, we propose Symbiotic Augmentation (SA), in which we learn to generate only one mask per activation channel.
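A minimal sketch of the two mechanisms, assuming precomputed binary segmentation masks and a generic `(C, H, W)` feature map; the shapes and region count are illustrative, not the dissertation's exact architecture:

```python
import numpy as np

def semantic_segmentation_pooling(features, masks, eps=1e-6):
    """Sketch of SSP: average-pool a conv feature map within each
    semantic region, instead of over the whole spatial extent.

    features: (C, H, W) activations of the last conv layer.
    masks:    (K, H, W) binary masks, one per semantic region
              (hypothetical regions such as hair, face, upper body).
    Returns:  (K, C) — one pooled descriptor per region, unlike
              global average pooling, which is spatially agnostic.
    """
    K, C = masks.shape[0], features.shape[0]
    pooled = np.zeros((K, C))
    for k in range(K):
        area = masks[k].sum()
        pooled[k] = (features * masks[k]).sum(axis=(1, 2)) / (area + eps)
    return pooled

def semantic_segmentation_gating(features, masks):
    """Sketch of SSG: one masked copy of the feature map per region,
    so subsequent max-pooling cannot mix inconsistent regions.
    Returns a (K, C, H, W) tensor of gated copies."""
    return features[None] * masks[:, None]
```

The `(K, C, H, W)` output of the gating variant makes the memory cost visible: activations are replicated once per region, which motivates the single-mask-per-channel design of Symbiotic Augmentation.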
The massive number of self-portrait images shared on social media is revolutionizing the way people introduce themselves to the world. Due to the Big Data nature of Selfies, it is nearly impossible to analyze them manually. We therefore use both textual tags and visual attributes to analyze Selfies. We collect the first Selfie dataset, with more than 46K images, and annotate it with 36 visual attributes covering characteristics such as gender, age, race, and hairstyle. We predict the attributes of Selfies using SIFT and HOG features, an AlexNet pre-trained on ImageNet, and the Adjective Noun Pairs (e.g., “smiling boy”) of SentiBank. We train an ℓ2-regularized SVR on log2-normalized view counts in order to assess the impact of different visual concepts and various Instagram filters on the popularity of Selfies.
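The popularity-regression setup can be sketched as follows, using synthetic stand-in features (real inputs would be the visual-concept and filter descriptors described above) and scikit-learn's `LinearSVR` as the ℓ2-regularized model:

```python
import numpy as np
from sklearn.svm import LinearSVR

# Synthetic stand-in for visual-concept features and raw view counts;
# the feature dimensionality and generative model are illustrative only.
rng = np.random.default_rng(0)
X = rng.random((200, 10))                    # e.g., attribute/ANP scores
w_true = rng.random(10)
views = np.exp(X @ w_true + rng.normal(0, 0.1, 200) + 3)

# log2-normalize the heavy-tailed view counts, as in the text.
y = np.log2(1.0 + views)

model = LinearSVR(C=1.0, epsilon=0.0, max_iter=10000, random_state=0)
model.fit(X, y)
```

The learned coefficients (`model.coef_`) can then be inspected per visual concept or filter to judge its impact on popularity; the log2 transform keeps a few viral Selfies from dominating the squared regularized loss.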
Almost all of today’s deep convolutional neural architectures, including those that we propose in this dissertation, use Batch Normalization (BN), yet the characteristics of BN are not sufficiently studied in the literature. We conclude this dissertation by showing that, assuming samples within a mini-batch are drawn from the same probability density function, BN is identical to the Fisher vector of a Gaussian distribution. This means the batch normalizing transform can be explained in terms of kernels that naturally emerge from the probability density function modeling the generative process of the underlying data distribution. Specifically, we theoretically demonstrate how BN can be improved by disentangling the modes of variation in the underlying distribution of layer outputs. An extensive set of experiments confirms that our proposed alternative to BN, Mixture Normalization, not only effectively accelerates the training of different batch-normalized architectures, including Inception-V3, DenseNet, and DCGAN, but also achieves better generalization error.
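The contrast between BN and Mixture Normalization can be sketched for a 1-D batch of activations. In this simplified sketch the GMM parameters are assumed given (in practice they would be estimated, e.g., by EM on the batch), and with a single component the transform reduces to BN's whitening step:

```python
import numpy as np

def mixture_normalize(x, means, variances, weights, eps=1e-5):
    """Sketch of Mixture Normalization for a 1-D activation batch.

    Instead of whitening with a single batch mean/variance as BN does,
    each sample is normalized w.r.t. every Gaussian component of a
    K-component GMM and the results are aggregated with the posterior
    (responsibility) of that component.

    x: (N,) activations; means, variances, weights: (K,) GMM parameters.
    """
    diff = x[:, None] - means[None, :]            # (N, K)
    var = variances[None, :] + eps
    # Component densities and posteriors r_k(x) for each sample.
    dens = weights * np.exp(-0.5 * diff**2 / var) / np.sqrt(2 * np.pi * var)
    resp = dens / (dens.sum(axis=1, keepdims=True) + eps)
    # Normalize w.r.t. each component's statistics, weight by posterior.
    return (resp * diff / np.sqrt(var)).sum(axis=1)
```

With `K = 1` the responsibilities are all one and the expression collapses to `(x - mu) / sqrt(var + eps)`, i.e., standard BN; with `K > 1`, each mode of variation is whitened by its own statistics, which is the disentangling the paragraph above refers to.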