
Unsupervised Meta-Learning For Few-Shot Image and Video Classification



Siavash Khodadadeh, Ladislau Bölöni, and Mubarak Shah. “Unsupervised Meta-Learning For Few-Shot Image and Video Classification.” 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.


Few-shot classification means classifying N different concepts based on just a few examples of each. Few-shot learning refers to methods that enable deep neural networks to learn such a classification task from only a few samples. Few-shot or one-shot learning of classifiers requires a significant inductive bias towards the type of task to be learned. One way to acquire this bias is by meta-learning on tasks similar to the target task. To do so, meta-learning requires access to many different few-shot learning tasks and aims to learn how to learn each of them from scratch with just a few samples. When we face a new classification task (the target task), we hope to solve it with the network that has meta-learned how to learn classification tasks from just a few samples. Note that the target classification task does not share any of its classes with the meta-learning tasks: the classes we are asked to recognize must not overlap with the classes used during meta-learning. In this paper, we propose UMTRA, an algorithm that performs unsupervised, model-agnostic meta-learning for classification tasks. UMTRA does not require label information during meta-learning. In other words, UMTRA trades off some classification accuracy for a reduction in the required labels of several orders of magnitude.

Experimental Setup

We start with an unlabeled dataset D of samples. We assume that a large number of classes C is present, but we do not know the class of each sample. We generate a task by sampling some data points from D. These points might or might not belong to the same class. Note, however, that class information is not an inherent attribute of a dataset. For example, one can divide the whole CelebA dataset into two categories, male and female, or into many categories based on the identity of the people in it. In other words, when you sample N data points from D, you can assume that they belong to N different classes. Even so, we do not want samples from the same category to be treated as different classes within a task, because that might have negative effects on meta-learning. In our paper, we show statistically that the probability of these instances belonging to the same class is very low.
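To make the "very low probability" argument concrete, here is a minimal sketch that computes the chance that N points drawn without replacement all fall into different classes. It assumes, purely for illustration, that every class holds the same number of samples; the function name `prob_all_distinct` and the numbers in the usage line are hypothetical and are not taken from the paper.

```python
from fractions import Fraction


def prob_all_distinct(num_classes: int, per_class: int, n: int) -> float:
    """Probability that n samples drawn without replacement
    belong to n different classes.

    Assumption (for illustration only): every class contains
    exactly `per_class` samples.
    """
    if n > num_classes:
        return 0.0  # more samples than classes: a collision is certain
    total = num_classes * per_class
    p = Fraction(1)
    for i in range(n):
        # After i draws from i distinct classes, (num_classes - i) * per_class
        # samples belong to unused classes, out of (total - i) remaining samples.
        p *= Fraction((num_classes - i) * per_class, total - i)
    return float(p)


# Illustrative values only: 1000 classes, 20 samples each, 5-way tasks.
print(prob_all_distinct(1000, 20, 5))
```

The probability approaches 1 as the number of classes grows relative to N, which is why treating randomly sampled points as distinct classes is reasonable for datasets with many classes.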

The next step is to generate a validation set. This is the hardest part: how can we pick other samples from the same classes to build the validation set? If we had label information, we could do this easily, but we do not. An alternative is to apply augmentation to the sampled points in order to generate validation samples.

This figure shows how supervised meta-learning algorithms generate their tasks:

As shown in this figure, for each iteration of meta-learning, we sample two different data points from each class and create two sets for that task: the task’s train set and the task’s validation set. We do not go into the details of the meta-learning algorithm itself; these two sets are what it needs at each iteration. As you can see, without supervision we cannot sample instances from the same class.
This is where our approach comes in. Take a look at this figure:

This time we do not have class information. Instead, we start by sampling data points from our dataset. We do not know their classes, but we use them as the task’s train set. For the task’s validation set, we augment those images. The goal of the task’s validation set is to estimate the generalization error so that the network can be updated to minimize it. Without class information, we need an augmentation that generates samples covering the attributes of the class.
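The task construction described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' implementation: `make_unsupervised_task` and `jitter` are hypothetical names, images are represented as flat lists of values in [0, 1], and the noise-based augmentation is only a stand-in for whatever augmentation suits the data.

```python
import random
from typing import Callable, List, Sequence, Tuple

Image = List[float]


def jitter(pixels: Image, rng: random.Random) -> Image:
    """Toy augmentation (an assumption): add small uniform noise, clamp to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.uniform(-0.05, 0.05))) for p in pixels]


def make_unsupervised_task(
    unlabeled: Sequence[Image],
    n_way: int,
    augment_fn: Callable[[Image, random.Random], Image],
    rng: random.Random,
) -> Tuple[Tuple[List[Image], List[int]], Tuple[List[Image], List[int]]]:
    """Build one N-way, one-shot task from unlabeled data."""
    # Sample N distinct points and treat each one as its own class.
    train_x = rng.sample(list(unlabeled), n_way)
    labels = list(range(n_way))
    # The validation set is an augmented copy that keeps the same pseudo-labels.
    val_x = [augment_fn(x, rng) for x in train_x]
    return (train_x, labels), (val_x, list(labels))


rng = random.Random(0)
dataset = [[i / 100.0] * 4 for i in range(50)]  # 50 tiny fake "images"
(train_x, train_y), (val_x, val_y) = make_unsupervised_task(dataset, 5, jitter, rng)
```

The resulting train and validation pairs have the same shape as those produced with labels, so they can be passed to a meta-learning algorithm without changing it.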

After meta-learning, our algorithm is used in the same way as a network trained by a supervised meta-learning algorithm. In other words, given a target task with few samples, we learn that task and then evaluate it to see how well our method works.

We call our method UMTRA (Algorithm):

We compare the convergence speed of our method with that of supervised meta-learning algorithms. The following figures show that our algorithm learns to learn the given task as quickly as supervised meta-learning algorithms do.

The accuracy curves during the target training task on the Omniglot dataset for K = 1. The band around each line denotes a 95% confidence interval.

The accuracy curves during the target training task on the Mini-Imagenet dataset. Accuracy curves are shown for K = 1, 5, 20. The band around each line denotes a 95% confidence interval.


  • Few-Shot Learning Benchmarks
  • We evaluate our method on different datasets. Two well-known benchmarks for few-shot learning are the Mini-Imagenet and Omniglot datasets. This table shows that our method is effective on these benchmarks. We compare it with other unsupervised learning and unsupervised meta-learning approaches. (Results on Mini-Imagenet and Omniglot)

  • Video Domain
  • In this section, we show how UMTRA can be applied to video action recognition, a domain significantly more complex and data-intensive than the one used in few-shot learning benchmarks such as Omniglot and Mini-Imagenet. We perform our comparisons using one of the standard video action recognition datasets, UCF-101. UCF-101 includes 101 action classes divided into five types: Human-Object Interaction, Body-Motion Only, Human-Human Interaction, Playing Musical Instruments, and Sports. The dataset is composed of snippets of YouTube videos. Many videos have poor lighting, cluttered backgrounds, and severe camera motion.

    As the classifier on which to apply the meta-learning process, we use a 3D convolutional network, C3D. For our experiments, we build this on top of the original MAML code. We will add this architecture to our TensorFlow 2.0 code version as well.

    Performing unsupervised meta-learning on video data requires several adjustments to the UMTRA workflow, with regard to the initialization of the classifier, the split between meta-learning data and testing data, and the augmentation function. First, networks of the complexity of C3D cannot be learned from scratch using the limited amount of data available in few-shot learning. In video action recognition research, it is common practice to start with a network that has been pre-trained on a large dataset, such as Sports-1M, an approach we also use in all our experiments. Second, we choose to use two different datasets: Kinetics for the meta-learning phase and UCF-101 for few-shot learning and evaluation. This gives us a larger dataset for training, since Kinetics contains 400 action classes. The network is thus pre-trained on Sports-1M, meta-trained on Kinetics, and few-shot trained on UCF-101. When using the Kinetics dataset, we limit it to 20 instances per class. Third, video offers a new possibility for augmentation: we simply choose temporally shifted fragments from the same video. This figure shows some examples of this augmentation (Augmentation on video clips):

    In our evaluation, we perform 30 different experiments. In each experiment, we sample 5 classes from UCF-101, perform one-shot learning, and evaluate the classifier on all the examples of those 5 classes from UCF-101. As the number of samples per class is not the same for all classes, we report both the accuracy and the F1-score (Results on video action recognition).
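The temporal-shift augmentation used for video can be illustrated with a short sketch. It assumes a video is a sequence of frames and that the training clip and its augmented validation clip are fixed-length windows starting at different offsets. The names `temporally_shifted_pair` and `extract_clip`, as well as the `min_shift` parameter and the clip length in the usage lines, are hypothetical choices for this example and are not specified in the article.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

Frame = TypeVar("Frame")


def temporally_shifted_pair(
    num_frames: int, clip_len: int, rng: random.Random, min_shift: int = 1
) -> Tuple[int, int]:
    """Pick start indices of two clips from the same video that are
    at least `min_shift` frames apart (min_shift is an assumed parameter)."""
    last_start = num_frames - clip_len
    if min_shift < 1 or last_start < min_shift:
        raise ValueError("video too short for two shifted clips")
    # Choose the earlier clip so that a later clip still fits in the video.
    first = rng.randint(0, last_start - min_shift)
    second = rng.randint(first + min_shift, last_start)
    # Randomly decide which clip serves as train and which as validation.
    return (first, second) if rng.random() < 0.5 else (second, first)


def extract_clip(frames: Sequence[Frame], start: int, clip_len: int) -> List[Frame]:
    """Return clip_len consecutive frames beginning at `start`."""
    return list(frames[start:start + clip_len])


rng = random.Random(0)
video = list(range(40))  # stand-in for 40 decoded frames
train_start, val_start = temporally_shifted_pair(len(video), 16, rng, min_shift=4)
train_clip = extract_clip(video, train_start, 16)
val_clip = extract_clip(video, val_start, 16)
```

Because both clips come from the same video, they show the same action, so the augmented clip plays the role that a second labeled example would play in the supervised setting.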


We have presented unsupervised meta-learning and shown that it can be effective in different scenarios. However, one limitation is the choice of augmentation function, which our experiments show to be an important decision. An interesting direction for future work is to develop more automatic augmentation functions.

Thank you for reading this article. Please feel free to contact us if you have any questions with regards to this work.