Scene Labeling Using Sparse Precision Matrix
Introduction
Semantic image segmentation, assigning a label from many classes to each pixel of an image, is a challenging task in computer vision, because the image must be segmented and recognized simultaneously. Commonly used approaches fail to incorporate long-range connections and do not model the contextual relationships among labels. In our method we model the interactions between labels and segments as an energy minimization over a graph whose structure is captured by the inverse of the covariance matrix (the precision matrix) and encodes only the significant interactions. We use both local and global information of a scene to improve the results.
(a) shows a query image, (b) the human-annotated image (ground truth), (c) labels obtained by the classifiers, (d) labels via spatial smoothing and (e) our results.
Method
Our approach consists of two main steps. The first step includes off-the-shelf components: feature extraction and classifier training based on local features of the training images. In this phase we also use the training data to capture the structure of the semantic label interaction graph, which is later employed in the pair-wise cost computation. The second step is inference: for a given query image, using the scores computed by the classifiers for each possible label and the pair-wise costs obtained from label correlations and appearance features of the image, MAP inference in a CRF framework is applied and each super-pixel is assigned a label. An overview of the proposed approach is depicted in the following figure.
We begin by extracting the feature matrix and segmenting the image into super-pixels. Then classifiers (random forests) are trained. We detect the relations between labels using the sparse estimated partial correlation matrix of the training data. In the inference part, for a given image the label scores are obtained via the classifiers (random forest and nearest neighbor), then the energy function of a sparse graphical model on super-pixels is optimized to label each super-pixel.
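To make the inference step concrete, the sketch below shows one simple way such an energy could be minimized over super-pixels. It is only an illustrative approximation, not the exact solver of our pipeline: the function name (icm_label_superpixels), the use of iterated conditional modes, and the specific form of the pairwise term derived from a label-affinity matrix are assumptions made for the example.

```python
import numpy as np

def icm_label_superpixels(unary, edges, label_affinity, pairwise_weight=1.0, n_iters=10):
    """Approximate MAP labeling of super-pixels by iterated conditional modes (ICM).

    unary          : (S, L) array of per-super-pixel label costs
                     (e.g. negative log classifier scores)
    edges          : list of (i, j) pairs of adjacent super-pixels
    label_affinity : (L, L) matrix of pairwise label costs derived from the
                     sparse partial-correlation (precision) structure
    """
    S, L = unary.shape
    labels = unary.argmin(axis=1)          # initialize from the unary terms alone

    # build an adjacency list for quick neighbor lookup
    neighbors = [[] for _ in range(S)]
    for i, j in edges:
        neighbors[i].append(j)
        neighbors[j].append(i)

    for _ in range(n_iters):
        changed = False
        for s in range(S):
            # total cost of each candidate label for super-pixel s, neighbors fixed
            cost = unary[s].copy()
            for t in neighbors[s]:
                cost += pairwise_weight * label_affinity[:, labels[t]]
            new_label = cost.argmin()
            if new_label != labels[s]:
                labels[s] = new_label
                changed = True
        if not changed:                     # stop once no super-pixel changes label
            break
    return labels
```

In this sketch the unary costs would come from the classifier scores and the label-affinity matrix from the sparse partial-correlation estimate described below.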
In training, we first segment images using efficient graph-based segmentation. Next, for each super-pixel, local features, including SIFT, color histogram, mean and standard deviation of color, area and texture, are extracted. Given these local features, classifiers (random forests) are trained to label super-pixels. Also, in the training phase we build the sparse precision matrix from the sample data to highlight the important relations (positive or negative correlations) between labels. In testing, for a query image we compute the unary terms for its segments using the scores from the local classifiers, refined with the probabilities obtained from a retrieval set based on global features. Then we use a fast implementation of graphical lasso to find the structure of the dependency graph between super-pixels and assign weights to the edges based on the correlation values.
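As an illustration of the precision-matrix step, the following minimal sketch estimates a sparse label dependency graph with scikit-learn's GraphicalLasso; the choice of library, the per-image label-frequency input, and the regularization value alpha are assumptions rather than the exact implementation used in our experiments.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

def label_dependency_graph(label_freqs, alpha=0.05):
    """Estimate a sparse label-interaction graph from training annotations.

    label_freqs : (N_images, L) array where each row holds the fraction of
                  pixels (or super-pixels) of each label in one training image
    alpha       : l1 regularization strength; larger values give a sparser graph
    """
    model = GraphicalLasso(alpha=alpha)
    model.fit(label_freqs)
    precision = model.precision_            # sparse inverse covariance estimate

    # convert precision entries to partial correlations; an off-diagonal zero
    # means the two labels are conditionally independent given the others
    d = np.sqrt(np.diag(precision))
    partial_corr = -precision / np.outer(d, d)
    np.fill_diagonal(partial_corr, 1.0)
    return precision, partial_corr
```

Off-diagonal entries of the resulting partial-correlation matrix that survive the l1 penalty indicate label pairs (or, applied analogously at test time, super-pixel pairs) with a significant positive or negative conditional dependence, and these values can be used to weight the corresponding graph edges.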
The following figure depicts the label graph for the SIFTFlow dataset.
Results
We evaluate our method on three benchmark datasets. The first is Stanford Background, which has 8 classes and 715 images. The second is the SIFTflow dataset, which consists of 2,488 training images and 200 testing images from 33 classes collected from LabelMe. We also apply our method to a third dataset, MSRCV2, which has 591 images of 23 classes; we use the provided split, with 276 images for training and 255 for testing. Some qualitative results are shown in the following figure.
From left to right: examples of images, ground truth, spatial smoothing results and our results.
The following tables present the accuracy of our method and of state-of-the-art methods on the benchmark datasets.