
Holistic Object Detection and Image Understanding



Gonzalo Vaca-Castano, Niels DaVitoria Lobo, Mubarak Shah, "Holistic Object Detection and Image Understanding," Computer Vision and Image Understanding, vol. 181, pp. 1-13, 2019.


This paper proposes a new representation of the visual content of an image that allows learning about what elements are part of an image and the hierarchical structure that they form. Our representation is a Top-Down Visual-Tree, where every node represents a bounding box, label, and visual feature of an object existing in the image. Each image and its object annotations from a training dataset are parsed to obtain the proposed visual representation.
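The node contents described above (bounding box, label, visual feature, plus the two dependency types shown in the figure) can be sketched as a simple tree data structure. This is an illustrative sketch only; `TDVTNode` and its field names are assumptions, not the authors' actual implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Hypothetical TDVT node: every node carries a bounding box, a category
# label, and a visual feature vector, and points to its Part and Weak
# dependency children (names are illustrative, not from the paper's code).
@dataclass
class TDVTNode:
    label: str                                   # object category, e.g. 'man'
    bbox: Tuple[int, int, int, int]              # (x, y, w, h) bounding box
    feature: List[float] = field(default_factory=list)    # visual feature
    part_children: List["TDVTNode"] = field(default_factory=list)  # Part deps
    weak_children: List["TDVTNode"] = field(default_factory=list)  # Weak deps

def count_nodes(root: TDVTNode) -> int:
    """Count all objects represented in the tree."""
    children = root.part_children + root.weak_children
    return 1 + sum(count_nodes(c) for c in children)

# Toy tree: a 'man' node with a 'sneakers' part.
man = TDVTNode("man", (10, 20, 100, 200))
man.part_children.append(TDVTNode("sneakers", (30, 200, 20, 10)))
```

Here `count_nodes(man)` returns 2, one node per annotated object in the toy tree.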

Figure: (a) Input image; (b) most confident detector outputs; (c) Top-Down Visual-Tree (TDVT) image representation. Part dependencies are depicted in green, and Weak dependencies in red. Subsection 3.1 gives details about the representation and explains the types of dependencies.

These images and their parsed tree representations are used to train a Top-Down Tree LSTM (Long Short-Term Memory) network.

Proposed training model used to predict the associated object according to a hierarchical structure. The input image passes once through a Convolutional Neural Network, and visual features are extracted for each node of the tree structure representation. The network is trained with the goal of predicting the label of the node that the examined edge connects to. In the figure, the edge whose starting node is the man in the red shirt is being examined, and the edge must predict the correct label 'sneakers'. The predicted label, 𝑊7, is encoded as a one-hot vector. Every newly processed edge updates the value of the hidden state ℎ7 (the '7' refers to the sequence ordering). The hidden state keeps track of the tree structure processed so far.
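The per-edge prediction step above can be illustrated with a toy sketch: labels are one-hot encoded over a small vocabulary, and a stand-in update carries the hidden state forward after each edge. This is a minimal sketch under assumed names; the real model uses a Top-Down Tree LSTM cell over CNN features, not the placeholder update below.

```python
import math

# Illustrative label vocabulary (assumption; the paper uses 2000 classes).
VOCAB = ["man", "sneakers", "shirt"]

def one_hot(label):
    """Encode a predicted label as a one-hot vector over VOCAB."""
    return [1.0 if w == label else 0.0 for w in VOCAB]

def step(hidden, edge_input):
    # Stand-in for the LSTM cell: a squashed elementwise update that
    # carries the previously processed tree structure in the hidden state.
    return [math.tanh(h + x) for h, x in zip(hidden, edge_input)]

# Processing the edge from 'man' that should predict 'sneakers'.
h = [0.0] * len(VOCAB)
h = step(h, one_hot("sneakers"))
```

After the step, the hidden state is no longer zero in the `sneakers` position, mimicking how each processed edge updates the state that conditions later predictions.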

The encoded information allows integrating object detection and image understanding into a single process. The presented holistic object detection is not agnostic to the overall content of the image; it is influenced by the image composition and the parts discovered. At test time, from an image, we can infer the most prominent objects and their locations, and the parts of these objects, and obtain a proper understanding of the image content through the resulting Top-Down Visual-Tree representation. The accuracy of our object detection process increases notably with respect to the baseline Fast R-CNN method on the Visual Genome test dataset.

Experimental setup

We perform experiments on the Visual Genome dataset (Krishna et al., 2016). The Visual Genome dataset has 108,077 images from the intersection of the YFCC100M and MS-COCO datasets. The annotation includes 5.4 million region descriptions and 2.3 million pair-wise relationships, which are used as possible nodes of the TDVT image representation.

Using the training data, the number of instances of each object category after obtaining TDVT representations is counted and sorted in descending order to establish the most common object categories. We selected the two thousand most popular labels as the classes to be used in the experiments. The most common class of the training dataset is 'man' with 37,292 training samples, while the class ranked 2000 is 'desert', with only 52 samples. Fine-tuning of a Fast R-CNN object detection model is performed using the full annotations of the selected two thousand classes. The network used to train the object detector is based on the VGG-16 network, and EdgeBoxes is used as the object proposal method.
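The class-selection step above (count category instances, sort in descending order, keep the top labels) can be sketched as follows; the sample annotation list is illustrative toy data, not drawn from Visual Genome.

```python
from collections import Counter

# Toy stand-in for the labels of all parsed TDVT nodes in the training set.
annotations = ["man"] * 5 + ["shirt"] * 3 + ["desert"] * 1

# Count instances per category and keep the K most common labels
# (the paper uses K = 2000; here K = 2 for illustration).
counts = Counter(annotations)
top_k = [label for label, _ in counts.most_common(2)]
```

`top_k` here is `['man', 'shirt']`: the two categories with the most training samples, mirroring how 'man' tops the real ranking.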

The figure below shows the results obtained for two randomly selected test images. While traditional object detection produces a set of bounding boxes and scores for the most likely objects, our method produces a reduced set of objects that are essential to describe the visual content, together with the possible relations between these objects.

Results and evaluation

We present quantitative results on the 5000 images from the original test split of the Visual Genome dataset. We compare our method against the object detection results of the Fast R-CNN object detector.

Examples of images showing the most confident detections using Fast R-CNN object detection (left side) and detections obtained with the proposed method (right side).

Table 1 shows the Average Precision-Recall (APR) for the forty-four most common categories in the training dataset. Our method improves detection in 38 of these 44 classes. The average improvement across this large set of categories is +4.89 APR per category, while the loss in performance on the remaining six categories is considerably smaller (−1.75 APR per category). The last column of Table 1 is the mean Average Precision (mAP) computed over all objects in the two thousand object categories. The mAP for the Fast R-CNN detector with two thousand categories was 11.12, and it increased to 22.53 using the proposed approach.
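The mAP figure reported above is the mean of per-class Average Precision values over all evaluated categories. A minimal sketch of that aggregation, with made-up AP values for illustration (not the paper's actual per-class numbers):

```python
# Minimal sketch: mAP is the unweighted mean of per-class AP values.
def mean_average_precision(ap_per_class):
    """Average the per-class AP scores into a single mAP number."""
    return sum(ap_per_class.values()) / len(ap_per_class)

# Illustrative per-class AP values (hypothetical, not from Table 1).
ap = {"man": 0.31, "shirt": 0.22, "desert": 0.05}
map_score = mean_average_precision(ap)
```

With these toy values, `map_score` is about 0.193; in the paper the same aggregation over two thousand classes yields 11.12 mAP for the baseline and 22.53 for the proposed method.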

Related Publications

[1] Gonzalo Vaca-Castano, Samarjit Das, Joao P. Sousa, Niels D. Lobo, Mubarak Shah, "Improved scene identification and object detection on egocentric vision of daily activities," Computer Vision and Image Understanding, vol. 156, pp. 92-103, 2017.

[2] Gonzalo Vaca-Castano, Samarjit Das, Joao P. Sousa, "Improving Egocentric Vision of Daily Activities," IEEE International Conference on Image Processing (ICIP), 2015.