Learning Semantic Visual Vocabularies Using Diffusion Distance


Bag of features (BOF) has received increasing attention due to its simplicity and surprisingly good performance on object, scene, and action recognition problems. The underlying idea is that images and videos contain a variety of statistical cues, such as color or edge patterns and local spatiotemporal patterns, which can be effectively exploited for recognition. However, BOF has two drawbacks: a larger codebook achieves better performance but produces sparse, high-dimensional vectors, and the visual words are not semantically meaningful. We propose a novel approach that further clusters visual words to generate a semantic vocabulary for visual recognition. We use diffusion maps to automatically learn a semantic visual vocabulary from abundant quantized midlevel features, where each midlevel feature is represented by its vector of pointwise mutual information (PMI). In this midlevel feature space, we believe that features produced by similar sources must lie on a common manifold. To capture the intrinsic geometric relations between features, we measure their dissimilarity using the diffusion distance; the underlying idea is to embed the midlevel features into a semantic lower-dimensional space. Although the conventional k-means approach works well for vocabulary construction, its performance is sensitive to the size of the visual vocabulary, and the learnt visual words are not semantically meaningful because the clustering criterion is based on appearance similarity alone. Our proposed approach overcomes these problems by capturing the semantic and geometric relations of the feature space using diffusion maps. Unlike supervised vocabulary construction approaches, and unsupervised methods such as pLSA and LDA, diffusion maps capture the local intrinsic geometric relations between the midlevel feature points on the manifold.
We have tested our approach on the KTH action dataset, our own YouTube action dataset and the fifteen scene dataset, and have obtained very promising results.

Flowchart of Learning Semantic Visual Vocabulary

1. Major steps for constructing a semantic visual vocabulary using diffusion maps.


2. Raw feature extraction

  • Use spatiotemporal interest point detector for action recognition.
  • Use SIFT features for scene classification.
  • Use k-means clustering to quantize low-level features.
  • Use PMI to represent the midlevel features.
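The quantization step above can be sketched as follows; `quantize` is a hypothetical helper name, and the idea is simply nearest-center assignment of each low-level descriptor (e.g. a SIFT vector) to a k-means center, yielding the midlevel visual words:

```python
import numpy as np

def quantize(descriptors, centers):
    """Assign each low-level descriptor (row of `descriptors`) to its
    nearest k-means center (visual word) -- the BOF quantization step.

    descriptors: (n, d) array of low-level features
    centers:     (k, d) array of k-means cluster centers
    returns:     (n,) array of visual-word indices
    """
    # Squared Euclidean distance from every descriptor to every center
    d2 = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)
```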


f_xy = c_xy / N_t, where c_xy is the number of times feature y appears in image or video x.
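A minimal sketch of the PMI representation, assuming a count matrix `counts[x, y] = c_xy` (occurrences of visual word y in image or video x); the function name and the smoothing constant `eps` are illustrative, not from the paper:

```python
import numpy as np

def pmi_vectors(counts, eps=1e-12):
    """Represent each midlevel feature (column of `counts`) by its
    vector of pointwise mutual information with every image/video.

    counts[x, y] = c_xy, the number of times visual word y appears
    in image or video x.
    """
    f_xy = counts / counts.sum()           # joint frequency f_xy = c_xy / N_t
    f_x = f_xy.sum(axis=1, keepdims=True)  # marginal over features
    f_y = f_xy.sum(axis=0, keepdims=True)  # marginal over images/videos
    # PMI(x, y) = log( f_xy / (f_x * f_y) ); eps avoids log(0)
    pmi = np.log((f_xy + eps) / (f_x @ f_y + eps))
    return pmi.T                           # one PMI row vector per midlevel feature
```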

Diffusion Maps

Construct weighted graph


Markov Transition Matrix

Diffusion distance

  • Goal: relate spectral properties of Markov chain to the geometry of the data.

Diffusion maps embedding

  • Diffusion distances can be computed using the eigenvectors and eigenvalues of P. The distance may be approximated using the first α eigenvalues.


  • Diffusion map embedding
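The pipeline above (weighted graph → Markov transition matrix → spectral decomposition → embedding) can be sketched in a few lines. The function names and default parameter values here are illustrative, not the paper's implementation (the paper reports sigma = 3, t = 5 on KTH):

```python
import numpy as np

def diffusion_map(X, sigma=3.0, t=5, alpha=2):
    """Embed the points X (n x d) with a diffusion map.

    sigma: Gaussian kernel bandwidth; t: diffusion time;
    alpha: number of nontrivial eigenpairs kept.
    """
    # 1. Weighted graph: Gaussian kernel on pairwise squared distances
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2 * sigma ** 2))
    # 2. Markov transition matrix: row-normalize W
    P = W / W.sum(axis=1, keepdims=True)
    # 3. Spectral decomposition of P (P is similar to a symmetric
    #    matrix, so its eigenvalues are real)
    vals, vecs = np.linalg.eig(P)
    order = np.argsort(-vals.real)
    vals, vecs = vals.real[order], vecs.real[:, order]
    # 4. Embedding: drop the trivial eigenpair (lambda_0 = 1) and
    #    weight each eigenvector by lambda_i^t
    return vecs[:, 1:alpha + 1] * vals[1:alpha + 1] ** t

def diffusion_distance(Psi, i, j):
    # Euclidean distance in the embedding approximates the
    # diffusion distance at time t
    return np.linalg.norm(Psi[i] - Psi[j])
```

In this embedded space, points connected by many short high-probability paths end up close together, so a standard clustering step (e.g. k-means) on the embedding groups midlevel features into the semantic vocabulary.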


Advantages Over Other Manifold Approaches

  • Diffusion maps preserve local data structure, whereas PCA and ISOMAP are global techniques.
  • Diffusion maps are a nonlinear approach, in contrast to PCA, which is linear.
  • Diffusion maps have an explicit distance metric in the embedding space and can perform multiscale analysis, unlike Laplacian eigenmaps.
  • Diffusion distance is more robust than the geodesic distance used in ISOMAP.

Experimental Results

  1. KTH Dataset

    a. To verify that our learnt semantic visual vocabulary (high-level features) is more discriminative than the midlevel features.

    b. Comparison of the recognition rates using high-level (after embedding) and midlevel (original) features. (k-means is used for clustering in both cases.)

    c. The influence of t and sigma on the action recognition rates (sigma = 3, t = 5).

    d. A comparison of recognition rates using different manifold learning schemes.

    e. PMI captures the relationship between a particular midlevel feature and videos (images), as well as other midlevel features.

    f. Confusion table (KTH): running is easily misclassified as jogging.

  2. YouTube Dataset

    a. The dataset has the following properties: a mix of still and moving cameras, cluttered backgrounds, variation in object scale, varied viewpoints, varied illumination, and low resolution.

    b. A comparison of recognition rates using different manifold learning schemes on the YouTube dataset.

    c. The decay of the eigenvalues of P(t) on the YouTube dataset when sigma is 14, and the confusion table.
  3. Fifteen Scene Dataset

    a. Some examples of midlevel and high-level features with their corresponding real image patches.

    b. A comparison of recognition rates between different manifold learning schemes.


  1. PowerPoint Presentation
  2. Poster

Related Publication

Jingen Liu, Yang Yang and Mubarak Shah, Learning Semantic Visual Vocabularies Using Diffusion Distance, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2009.