Visual Saliency Detection Using Group Lasso Regularization in Videos of Natural Scenes
Introduction
Visual saliency is the ability of a vision system to promptly select the most relevant data in a scene and thereby reduce the amount of visual data that needs to be processed. Its applications to complex tasks such as object detection, object recognition, and video compression have therefore attracted interest in computer vision research. In this paper, we introduce a novel unsupervised method for detecting visual saliency in videos of natural scenes.
High-saliency areas of a natural scene are the small portions that hold the most important information and can be identified easily by the human visual system.
Method
An overview of the proposed approach is depicted in the following figure.
We begin by extracting the feature matrix, X, of a video and segmenting the video into super-voxels. A dictionary, D, is learned online, and the video is then represented by F in terms of coefficients Y obtained from group lasso regularization over the dictionary. Afterward, the salient parts, represented by the sparse matrix S, and the non-salient parts, represented by the low-rank matrix L, are obtained via a low-rank minimization technique (Robust PCA). Finally, a saliency map is generated from the L1 norms of the columns of S belonging to each super-voxel.
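As a concrete illustration of the final step of this pipeline, the sketch below (Python/NumPy; the function name and the variables `S` and `svx_labels` are our own placeholders, not from the paper) aggregates the L1 norms of the columns of the sparse matrix S over the cuboids belonging to each super-voxel and normalizes the scores so they can be rendered as a saliency map:

```python
import numpy as np

def supervoxel_saliency(S, svx_labels):
    """Turn the sparse component S (features x cuboids) into per-super-voxel
    saliency scores.  svx_labels[j] is the super-voxel id of cuboid j."""
    col_energy = np.abs(S).sum(axis=0)      # L1 norm of each column of S
    ids = np.unique(svx_labels)
    saliency = {v: col_energy[svx_labels == v].mean() for v in ids}
    # Normalize scores to [0, 1] so they can be rendered as a saliency map.
    lo, hi = min(saliency.values()), max(saliency.values())
    return {v: (s - lo) / (hi - lo + 1e-12) for v, s in saliency.items()}
```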
We divide a video into non-overlapping cuboids and create a matrix whose columns correspond to the intensity values of these cuboids. Simultaneously, we segment the video using a hierarchical segmentation method to obtain super-voxels.
The video is then represented as coefficients of atoms from a dictionary learned from its feature matrix, and decomposed into salient and non-salient parts.
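A minimal sketch of the cuboid step, assuming the video is already loaded as a grayscale NumPy array of shape (frames, height, width); the cuboid size 5x8x8 is an illustrative choice of ours, not a value given in the text:

```python
import numpy as np

def video_to_cuboid_matrix(video, t=5, h=8, w=8):
    """Divide a (frames, height, width) video into non-overlapping t x h x w
    cuboids and stack their intensity values as columns of a feature matrix X."""
    T, H, W = video.shape
    # Crop so the video tiles exactly into whole cuboids.
    video = video[: T - T % t, : H - H % h, : W - W % w]
    cols = []
    for ti in range(0, video.shape[0], t):
        for hi in range(0, video.shape[1], h):
            for wi in range(0, video.shape[2], w):
                cols.append(video[ti:ti + t, hi:hi + h, wi:wi + w].ravel())
    return np.stack(cols, axis=1)   # X: (t*h*w) x (number of cuboids)
```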
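The text says the dictionary is learned online; one way to sketch that step is scikit-learn's MiniBatchDictionaryLearning, which we substitute here for whatever online solver the authors actually used (the number of atoms and the sparsity weight are illustrative):

```python
from sklearn.decomposition import MiniBatchDictionaryLearning

# X is the (feature_dim x n_cuboids) matrix from the previous step.
# scikit-learn expects samples as rows, so we fit on X.T.
dl = MiniBatchDictionaryLearning(n_components=256, alpha=1.0, batch_size=64)
dl.fit(X.T)
D = dl.components_.T    # D: feature_dim x n_atoms
```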
We propose to use group lasso regularization to find the sparse representation of a video, which benefits from the grouping information provided by the super-voxels and the features extracted from the cuboids. We then find salient regions by decomposing the feature matrix into low-rank and sparse matrices using the Robust Principal Component Analysis (RPCA) matrix recovery method.
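The summary does not spell out the exact optimization, so the sketch below is a generic reading of these two steps rather than the authors' formulation. It solves group lasso sparse coding over the fixed dictionary D by proximal gradient descent, where each group collects the coefficient columns of the cuboids in one super-voxel (one plausible interpretation of the grouping), and then runs the standard inexact-ALM form of RPCA for the low-rank plus sparse decomposition. Step sizes, iteration counts, and the choice of lambda are our assumptions:

```python
import numpy as np

def group_soft_threshold(Y, groups, thresh):
    """Prox of thresh * sum_g ||Y[:, g]||_F, where each g indexes the columns
    (cuboids) belonging to one super-voxel: block soft-thresholding."""
    out = Y.copy()
    for g in groups:
        nrm = np.linalg.norm(out[:, g])
        out[:, g] *= max(0.0, 1.0 - thresh / (nrm + 1e-12))
    return out

def group_lasso_coding(X, D, groups, lam=0.1, n_iter=200):
    """Proximal gradient (ISTA) for 0.5 * ||X - D Y||_F^2 + lam * group penalty."""
    L = np.linalg.norm(D, 2) ** 2       # Lipschitz constant of the gradient
    Y = np.zeros((D.shape[1], X.shape[1]))
    for _ in range(n_iter):
        grad = D.T @ (D @ Y - X)
        Y = group_soft_threshold(Y - grad / L, groups, lam / L)
    return Y

def rpca_ialm(F, lam=None, n_iter=500, tol=1e-7):
    """Inexact-ALM Robust PCA: F = L + S with L low-rank and S sparse."""
    m, n = F.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    norm_F = np.linalg.norm(F, 'fro')
    mu = 1.25 / np.linalg.norm(F, 2)
    S = np.zeros_like(F)
    Yd = F / max(np.linalg.norm(F, 2), np.abs(F).max() / lam)  # dual variable
    for _ in range(n_iter):
        # Singular value thresholding gives the low-rank update.
        U, sig, Vt = np.linalg.svd(F - S + Yd / mu, full_matrices=False)
        Lr = U @ np.diag(np.maximum(sig - 1.0 / mu, 0.0)) @ Vt
        # Elementwise soft-thresholding gives the sparse update.
        R = F - Lr + Yd / mu
        S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        Yd += mu * (F - Lr - S)
        mu *= 1.5
        if np.linalg.norm(F - Lr - S, 'fro') < tol * norm_F:
            break
    return Lr, S
```

The columns of S returned by `rpca_ialm` are what the saliency-map step above aggregates per super-voxel.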
Results
We have evaluated our method on four data sets: the INB data set, which consists of 18 high-resolution movie clips of natural outdoor scenes; the UCF Sports Action data set; the UCF Saliency data set; and the Hollywood2 Actions data set, a large-scale data set with camera motion and clutter. Some qualitative results are shown in the following figure.
Examples of frames from (a) UCF Sports data set videos, (b) super-voxels, and (c) our results showing the most salient regions, with gaze points shown in red (accounting for calibration errors).
We have used the same experimental setup as described above for the INB data set. On all data sets, we show that our method gives better results in terms of AUC (area under the ROC curve).
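For reference, AUC here scores the saliency map as a predictor of human gaze locations. A minimal evaluation sketch using scikit-learn's roc_auc_score; treating individually fixated pixels as the positive class is our assumption about the usual protocol, not a detail given in the text:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def frame_auc(saliency_map, gaze_points):
    """AUC of a saliency map against gaze fixations for one frame.
    gaze_points: list of (row, col) fixation coordinates (must be non-empty)."""
    labels = np.zeros(saliency_map.shape, dtype=int)
    for r, c in gaze_points:
        labels[r, c] = 1                 # positives at fixated pixels
    return roc_auc_score(labels.ravel(), saliency_map.ravel())
```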
The following chart presents the AUC scores of our method and of state-of-the-art methods on the UCF Sports data set.
AUC scores for videos in the UCF Sports data set under the Default-Labeling configuration.