Computational Model for Aesthetic Assessment of Videos



In this paper we propose a novel aesthetic model emphasizing psychovisual statistics extracted at multiple levels, in contrast to earlier approaches that rely only on descriptors suited for image recognition or based on photographic principles. At the lowest level, we compute dark-channel, sharpness, and eye-sensitivity statistics over rectangular cells within a frame. At the next level, we extract SentiBank features (1,200 pre-trained visual classifiers) on a given frame, which detect specific sentiments such as “colorful clouds” or “smiling face”, and collect the classifier responses as frame-level statistics. At the topmost level, we extract trajectories from video shots. Using viewers’ fixation priors, the trajectories are labeled as foreground or background/camera, on which statistics are computed. Additionally, spatio-temporal local binary patterns are computed to capture texture variations within a shot. Classifiers are trained independently on the individual feature representations. After a thorough evaluation of 9 feature types, we select the best feature from each level: dark channel, affect, and camera motion statistics. The corresponding classifier scores are then integrated in a low-rank fusion framework to improve the final prediction. Our approach demonstrates strong correlation with human judgment on 1,000 broadcast-quality videos released by NHK as an aesthetic evaluation dataset.
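The lowest-level features can be sketched as follows: a dark-channel map (per-pixel minimum over RGB followed by a local minimum filter) is computed for a frame, and simple statistics are pooled over a regular grid of rectangular cells. The grid size, patch size, and the choice of mean/standard-deviation statistics below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.ndimage import minimum_filter

def dark_channel(frame, patch=15):
    """Dark-channel map of an H x W x 3 frame: per-pixel minimum over the
    RGB channels, followed by a local minimum filter over a patch."""
    return minimum_filter(frame.min(axis=2), size=patch)

def cell_statistics(channel, grid=(8, 8)):
    """Pool a scalar map into per-cell (mean, std) statistics over a regular
    grid of rectangular cells, capturing spatial variation within the frame."""
    h, w = channel.shape
    gh, gw = grid
    stats = []
    for i in range(gh):
        for j in range(gw):
            cell = channel[i * h // gh:(i + 1) * h // gh,
                           j * w // gw:(j + 1) * w // gw]
            stats.extend([cell.mean(), cell.std()])
    return np.asarray(stats)  # length = 2 * gh * gw
```

Analogous pooling would apply to the sharpness and eye-sensitivity maps; only the underlying scalar map changes.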

Method Summary and Results

Automatic aesthetic ranking of images or videos is an extremely challenging problem, as it is very difficult to quantify beauty. That said, computational video aesthetics has received significant attention in recent years. With the deluge of multimedia sharing websites, research in this direction is expected to gain further impetus in the future, apart from the obvious intellectual challenge of scientifically formulating a concept as abstract as beauty.

We propose a hierarchical framework that encapsulates aesthetics at multiple levels, which can be used independently or jointly to model beauty in videos. Our contributions are: (1) We extract motion statistics, specific to the foreground and the background/camera, that latently encode cinematographic principles and outperform previously proposed approaches. (2) We introduce the application of human sentiment classifiers to image frames; these capture vital affective cues that are directly correlated with visual aesthetics and are semantically interpretable. (3) We employ a relatively small set of low-level psychovisual features, in contrast to earlier approaches, and encode them efficiently into descriptors that capture spatial variations within video frames. (4) We exploit a more sophisticated fusion scheme that shows consistent improvement in overall ranking compared to earlier methods.

Selecting an optimal algorithm for fusing knowledge from the individual models boosts overall performance. A more detailed analysis of fusion is provided in Fig. 4. Models built on shot-level features (camera/background motion) perform well with smaller vocabulary sizes, whereas those trained on texture descriptors and SentiBank features perform better with larger vocabularies.

We also observe that shot-level features, i.e., camera/background and foreground motion statistics, outperform all other features in most cases. Interestingly, frame-level SentiBank features perform equally well. Among cell-level features, dark-channel statistics demonstrate the best performance. This is further supported by the scatter plot in Fig. 3(a). Ideally, the dots in the figure should lie along the line connecting (1, 1) to (1000, 1000), indicating no discordance between ground-truth rank and predicted rank. We notice that the magenta dots, corresponding to dark-channel features, are more closely aligned with this line than those belonging to sharpness or eye-sensitivity.
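The degree of agreement between ground-truth and predicted rankings visualized in these scatter plots can be quantified with a rank correlation such as Kendall's tau. A minimal plain-Python sketch (for untied rankings; `scipy.stats.kendalltau` provides the same with tie corrections):

```python
def kendall_tau(rank_a, rank_b):
    """Kendall's tau: (#concordant - #discordant) pairs over all pairs.
    +1 means identical orderings, -1 a fully reversed ordering."""
    n = len(rank_a)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
            if s > 0:
                concordant += 1
            elif s < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)
```

A feature whose dots hug the diagonal, such as the dark-channel statistics here, would yield a tau close to +1.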

In the case of shot-level features, however, we observe a larger degree of agreement between ground truth and prediction than for the cell-level features, as reflected in Fig. 3(b). Finally, in Fig. 4, we provide some results using classifiers trained on frame-level affect-based features (orange). Except for a few cases (less than 5%), we see strong label agreement between ground truth and prediction, similar to the shot-level features. This encourages us to fuse the individual classifier outcomes in three different settings; these results are also shown in Fig. 4. Fusion of classifiers trained on affect and dark-channel features is indicated in turquoise. These results improve further after adding classifiers trained on camera/background motion (lime green). Finally, fusion results from classifiers trained on the top 5 features are shown in pink; we notice a strong level of concordance in this setting.
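The paper's fusion step combines per-model classifier scores via a low-rank framework; as a simplified stand-in, late fusion can be sketched as a z-normalized weighted average of the individual model outputs. Everything below (uniform weights, z-normalization) is an illustrative baseline, not the low-rank method itself.

```python
import numpy as np

def fuse_scores(score_matrix, weights=None):
    """Late-fuse classifier scores for n videos from k models.

    score_matrix: (n, k) array, one column of prediction scores per model.
    Each column is z-normalized so differently scaled models become
    comparable, then columns are combined by a (uniform by default)
    weighted average into one fused score per video."""
    s = np.asarray(score_matrix, dtype=float)
    s = (s - s.mean(axis=0)) / (s.std(axis=0) + 1e-8)
    k = s.shape[1]
    w = np.full(k, 1.0 / k) if weights is None else np.asarray(weights, float)
    return s @ w
```

Adding a model, as in the affect + dark-channel + camera-motion setting above, simply means appending a column of scores before fusing.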


MATLAB code for computing the aesthetic representation of a frame is available here. The dataset is available from NHK upon request.

Related Publication

Subhabrata Bhattacharya, Behnaz Nojavanasghari, Tao Chen, Dong Liu, Shih-Fu Chang, and Mubarak Shah, "Towards a Comprehensive Computational Model for Aesthetic Assessment of Videos," In Proc. of ACM International Conference on Multimedia (MM), Barcelona, Spain, pp. 361–364, 2013. [ACM MM 2013 Grand Challenge 2nd Prize]