Deep Learning Human Mind for Automated Visual Classification
“Learning never exhausts the mind”
What if we could effectively read the mind and transfer human visual capabilities to computer vision methods? In this work, we address this question by developing the first visual object classifier driven by human brain signals. In particular, we employ EEG data evoked by visual object stimuli, combined with Recurrent Neural Networks (RNN), to learn a discriminative brain activity manifold of visual categories in an effort to read the mind. Afterward, we transfer the learned capabilities to machines by training a Convolutional Neural Network (CNN)-based regressor to project images onto the learned manifold, thus allowing machines to employ human brain-based features for automated visual classification. We use a 128-channel EEG with active electrodes to record the brain activity of several subjects while they look at images of 40 ImageNet object classes. The proposed RNN-based approach for discriminating object classes using brain signals reaches an average accuracy of about 83%, which greatly outperforms existing methods attempting to learn EEG visual object representations. As for automated object categorization, our human brain-driven approach obtains competitive performance, comparable to that achieved by powerful CNN models, and it is also able to generalize over different visual datasets. This gives us real hope that the human mind can indeed be read and transferred to machines.
- Reading the mind phase: A low-dimensional representation for temporal EEG signals recorded while users looked at images is learned by the encoder module. Then, the computed EEG features are employed to train an image classifier.
- Transferring human visual capabilities to machines phase: A CNN is trained to estimate EEG features directly from images; then, the classifier trained in the previous stage can be used for automated classification without the need of EEG data for new images.
The multi-channel temporal EEG signals are provided as input to the encoder module, which processes the whole time sequence and outputs an EEG feature vector as a compact representation of the input. Ideally, if an input sequence consists of the EEG signals recorded while looking at an image, the resulting output vector should encode the brain activity information that is relevant for discriminating between image classes. The encoder network is trained by adding a classification module at its output (a softmax layer in all our experiments) and using gradient descent to learn the whole model's parameters end-to-end. In our experiments, we tested several configurations of the encoder network:
- Common LSTM: the encoder network is made up of a stack of LSTM layers. At each time step t, the first layer takes the input s(·, t) (in this sense, “common” means that all EEG channels are initially fed into the same LSTM layer); if other LSTM layers are present, the output of the first layer (which may have a different size than the original input) is provided as input to the second layer and so on. The output of the deepest LSTM layer at the last time step is used as the EEG feature representation for the whole input sequence.
- Channel LSTM and Common LSTM: the first encoding layer consists of several LSTMs, each connected to only one input channel. In this way, the output of each “channel LSTM” is a summary of a single channel’s data. The second encoding layer then performs inter-channel analysis, by receiving as input the concatenated output vectors of all channel LSTMs. As above, the output of the deepest LSTM at the last time step is used as the encoder’s output vector.
- Common LSTM and Output layer: similar to the common LSTM architecture, but an additional output layer (a linear combination of its inputs, followed by a ReLU nonlinearity) is added after the LSTM, in order to increase model capacity at little computational cost (compared to the two-layer common LSTM architecture). In this case, the encoded feature vector is the output of the final layer.
Encoder and classifier training is performed through gradient descent by providing the class label associated with the image shown while each EEG sequence was recorded. After training, the encoder can be used to generate EEG features from an input EEG sequence, while the classification network is used to predict the image class for an input EEG feature representation, which can be computed from either EEG signals or images, as described in the next section.
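As an illustration of the "Common LSTM and Output layer" configuration, the following is a minimal forward-pass sketch in NumPy. It is not the original implementation: the weights are random, gradient-descent training is omitted, the function and parameter names (`init_params`, `encode`, `classify`) are hypothetical, and the sequence length of 50 time steps is an arbitrary choice for the demo.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_params(n_channels, lstm_size, out_size, n_classes, seed=0):
    """Randomly initialised weights (hypothetical; training is not shown)."""
    rng = np.random.default_rng(seed)
    s = 0.1
    return {
        # Weights of the four LSTM gates (input, forget, candidate, output), stacked.
        "W": rng.normal(0, s, (4 * lstm_size, n_channels + lstm_size)),
        "b": np.zeros(4 * lstm_size),
        # Output layer: linear combination followed by ReLU.
        "W_out": rng.normal(0, s, (out_size, lstm_size)),
        "b_out": np.zeros(out_size),
        # Softmax classification module attached to the encoder output.
        "W_cls": rng.normal(0, s, (n_classes, out_size)),
        "b_cls": np.zeros(n_classes),
    }

def encode(eeg, p):
    """Map an EEG sequence of shape (n_channels, n_steps) to a feature vector."""
    lstm_size = p["W_out"].shape[1]
    h = np.zeros(lstm_size)
    c = np.zeros(lstm_size)
    for t in range(eeg.shape[1]):
        # All channels at time t enter the same ("common") LSTM layer.
        z = p["W"] @ np.concatenate([eeg[:, t], h]) + p["b"]
        i, f, g, o = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
    # The LSTM output at the last time step feeds the ReLU output layer.
    return np.maximum(0.0, p["W_out"] @ h + p["b_out"])

def classify(features, p):
    """Softmax over image classes for a given EEG feature vector."""
    logits = p["W_cls"] @ features + p["b_cls"]
    e = np.exp(logits - logits.max())
    return e / e.sum()

params = init_params(n_channels=128, lstm_size=128, out_size=128, n_classes=40)
eeg = np.random.default_rng(1).normal(size=(128, 50))  # synthetic stand-in signal
features = encode(eeg, params)
probs = classify(features, params)
```

Because `classify` only needs a feature vector, the same function could be applied to features predicted from images instead of features encoded from EEG, which is the idea behind the transfer phase.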
Regressing Images to EEG features
We employed two CNN-based approaches to extract EEG features (or, at least, a close approximation) from an input image:
- Approach 1: End-to-end training. The first approach is to train a CNN to map images to the corresponding EEG feature vectors. Typically, the first layers of a CNN learn general features of the images, which are shared across many tasks; we therefore initialize the weights of these layers from pre-trained models and learn the weights of the last layers from scratch in an end-to-end setting. In particular, we used the pre-trained AlexNet CNN and modified it by replacing the softmax classification layer with a regression layer (containing as many neurons as the dimensionality of the EEG feature vectors), using Euclidean loss as the objective function.
- Approach 2: Deep feature extraction followed by regressor training. The second approach consists of extracting image features using pre-trained CNN models and then employing regression methods to map image features to EEG feature vectors. We used our fine-tuned AlexNet, GoogleNet and VGG as feature extractors by reading the output of the last fully connected layer, and then applied several regression methods (namely, k-NN regression, ridge regression and random forest regression) to obtain the predicted feature vectors.
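The second approach can be sketched with one of the listed regressors, ridge regression, written here in closed form with NumPy. This is only an illustrative sketch: the data are synthetic stand-ins for CNN image features and EEG feature vectors, the dimensions (16 and 8) and the regularisation strength `alpha` are arbitrary assumptions, and the helper names are hypothetical. The `mse` helper averages over all vector elements, which may differ from the normalisation used for the reported results.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression with an unpenalised intercept.

    X: (n_samples, n_image_features), Y: (n_samples, n_eeg_features).
    Solves (Xc^T Xc + alpha * I) W = Xc^T Yc on mean-centred data.
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    d = X.shape[1]
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(d), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b

def predict(X, W, b):
    return X @ W + b

def mse(Y_pred, Y_true):
    """Mean squared error averaged over all elements."""
    return float(np.mean((Y_pred - Y_true) ** 2))

# Synthetic data: image features X and EEG feature targets Y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
W_true = rng.normal(size=(16, 8))
Y = X @ W_true + 0.5 + 0.01 * rng.normal(size=(200, 8))

# Fit on the first 150 samples, evaluate on the remaining 50.
W, b = fit_ridge(X[:150], Y[:150], alpha=1e-3)
test_mse = mse(predict(X[150:], W, b), Y[150:])
```

In practice, `X` would hold the activations read from the last fully connected layer of the feature-extractor CNN, and `Y` the EEG feature vectors produced by the trained encoder.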
The EEG Dataset
Six subjects (five male and one female) were shown visual stimuli of objects while EEG data was recorded. All subjects were homogeneous in terms of age, education level and cultural background. The dataset used for visual stimuli was a subset of ImageNet, containing 40 classes of easily recognizable objects. During the experiment, 2,000 images (50 from each class) were shown in bursts, each image for 0.5 seconds. A burst lasted for 25 seconds and was followed by a 10-second pause during which a black image was shown, for a total running time of 1,400 seconds (23 minutes and 20 seconds). The experiments were conducted using a 128-channel cap with active, low-impedance electrodes (actiCAP128Ch). Brainvision DAQs and amplifiers were used for the EEG data acquisition. Sampling frequency and data resolution were set to 1000 Hz and 16 bits, respectively.
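The timing figures above can be checked with a few lines of arithmetic. The sketch below assumes that every burst, including the last one, is followed by a pause, which is the reading that reproduces the stated total of 1,400 seconds. The per-image sample count is the nominal raw value implied by the sampling frequency, before any preprocessing.

```python
n_classes = 40
images_per_class = 50
seconds_per_image = 0.5
burst_seconds = 25
pause_seconds = 10
sampling_hz = 1000

n_images = n_classes * images_per_class                     # 2000 images
images_per_burst = int(burst_seconds / seconds_per_image)   # 50 images per burst
n_bursts = n_images // images_per_burst                     # 40 bursts
total_seconds = n_bursts * (burst_seconds + pause_seconds)  # 1400 s
minutes, seconds = divmod(total_seconds, 60)                # 23 min 20 s
samples_per_image = int(seconds_per_image * sampling_hz)    # 500 samples per channel

print(n_images, n_bursts, total_seconds, (minutes, seconds), samples_per_image)
# prints: 2000 40 1400 (23, 20) 500
```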
- Encoder performance: The three LSTM encoding architectures were tested on the dataset described in the section above, achieving the following performance:
|Model||Detail||Max VA||TA at max VA|
|Common||64, 64 common||75.9%||72.5%|
|Common||128, 64 common||79.1%||76.8%|
|Common||128, 128 common||79.7%||78.0%|
|Channel plus Common||5 channel, 32 common||75.7%||72.9%|
|Channel plus Common||5 channel, 64 common||74.3%||71.2%|
|Common plus Output||128 common, 64 output||81.6%||78.7%|
|Common plus Output||128 common, 128 output||85.4%||82.9%|
- Regressor performance: According to the results shown in the previous section, the best encoding performance is obtained by the common 128-neuron LSTM followed by the 128-neuron output layer. This implies that our regressor takes single images as input and outputs a 128-dimensional feature vector, which should ideally resemble the one learned by the encoder. To test the regressor's performance, we used the same ImageNet subset and the same image splits employed for the RNN encoder. However, unlike the encoder's training stage, where different subjects generated different EEG signal tracks even when looking at the same image, for CNN-based regression we require that each image be associated with only one EEG feature vector, in order to avoid "confusing" the network by providing different target outputs for the same input. We tested two different approaches for selecting the single feature vector associated with each image:
- Average: the EEG feature vector associated to an image is computed as the average over all subjects when viewing that image.
- Best: for each image, the associated EEG feature vector is the one having the smallest classification loss over all subjects during RNN encoder training.
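The two selection strategies can be sketched as follows, assuming the per-subject EEG features are stored in an array of shape (subjects, images, feature dimension) and the per-subject classification losses in an array of shape (subjects, images). The array layout and the function names are illustrative assumptions, and the numbers are toy values.

```python
import numpy as np

def average_targets(features):
    """Average the EEG feature vectors of all subjects for each image."""
    return features.mean(axis=0)

def best_targets(features, losses):
    """For each image, keep the vector of the subject with the smallest loss."""
    best_subject = losses.argmin(axis=0)       # shape: (n_images,)
    image_idx = np.arange(features.shape[1])
    return features[best_subject, image_idx]   # shape: (n_images, dim)

# Toy example: 2 subjects, 2 images, 2-dimensional features.
features = np.array([[[1.0, 1.0], [2.0, 2.0]],
                     [[3.0, 3.0], [4.0, 4.0]]])
losses = np.array([[0.5, 0.9],
                   [0.7, 0.1]])

avg = average_targets(features)        # [[2., 2.], [3., 3.]]
best = best_targets(features, losses)  # [[1., 1.], [4., 4.]]
```

Either result provides exactly one target vector per image, which is what the CNN regressor needs.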
The results can be seen in the table below (MSE = Mean Square Error, FT = Fine Tuned, FE = Feature Extractor):
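|Feature extractor||Regressor||MSE (Average)||MSE (Best)|
|AlexNet FT||none (end-to-end)||1.86||2.12|
|AlexNet FE||k-NN||1.64||1.94|
|AlexNet FE||Ridge||1.53||1.62|
|AlexNet FE||RF||1.52||1.56|
|GoogleNet||k-NN||0.62||3.54|
|GoogleNet||Ridge||1.88||7.06|
|GoogleNet||RF||0.93||4.01|
|VGG||k-NN||0.73||3.26|
|VGG||Ridge||1.53||7.63|
|VGG||RF||0.94||4.45|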