Improving an Object Detector and Extracting Regions using Superpixel
We propose an approach to improve the detection performance of a generic detector when it is applied to a particular video. The performance of offline-trained object detectors usually degrades in unconstrained video environments because of varying illumination, backgrounds and camera viewpoints. Moreover, most object detectors are trained using Haar-like features or gradient features but ignore video-specific features such as consistent color patterns. In our approach, we apply a superpixel-based Bag-of-Words (BoW) model to iteratively refine the output of a generic detector. Compared to related work, our method builds a video-specific detector using superpixels, so it can handle appearance variation. Most importantly, using a Conditional Random Field (CRF) along with our superpixel-based BoW model, we develop an algorithm to segment the object from the background. Our method therefore outputs exact object regions instead of the bounding boxes generated by most detectors. In general, our method takes the detection bounding boxes of a generic detector as input and generates detection output with higher average precision and precise object regions. Experiments on four recent datasets demonstrate the effectiveness of our approach, which significantly improves on the state-of-the-art detector by 5-16% in average precision.
An overview of our approach is illustrated in the figure below. First, we apply the original detector with a low detection threshold to every frame of a video and obtain a substantial number of detection examples. Those examples are initially labeled as positive or hard according to their confidence scores. Negative examples are collected automatically from the background. Second, we extract superpixel features from all examples and build a Bag-of-Words representation for each example. In the last step, we train an SVM model with the positive and negative examples and label the hard examples iteratively. In each iteration, a small number of hard examples are conservatively added to the training set, and this repeats until the iterations converge.
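The last two steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: a nearest-centroid score stands in for the SVM so that the sketch has no dependencies, and the names `bow_histogram`, `refine`, `per_round`, `margin` and `max_rounds` are hypothetical. Superpixel feature extraction and the codebook are assumed to be given.

```python
import math

def bow_histogram(features, codebook):
    """Assign each superpixel feature to its nearest codeword and
    return an L1-normalized histogram (the BoW vector of one example)."""
    hist = [0.0] * len(codebook)
    for f in features:
        nearest = min(range(len(codebook)), key=lambda k: math.dist(f, codebook[k]))
        hist[nearest] += 1.0
    total = sum(hist) or 1.0
    return [h / total for h in hist]

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def score(x, pos_c, neg_c):
    """Stand-in for an SVM decision value (assumption): positive when x
    is closer to the positive centroid than to the negative one."""
    return math.dist(x, neg_c) - math.dist(x, pos_c)

def refine(positives, negatives, hard, per_round=2, margin=0.1, max_rounds=50):
    """Iteratively move the most confidently classified hard examples
    into the training set; stop when none clears the margin."""
    positives, negatives, hard = list(positives), list(negatives), list(hard)
    for _ in range(max_rounds):
        if not hard:
            break
        # "Retrain" the stand-in classifier on the current training set.
        pos_c, neg_c = centroid(positives), centroid(negatives)
        ranked = sorted(hard, key=lambda x: abs(score(x, pos_c, neg_c)), reverse=True)
        confident = [x for x in ranked[:per_round] if abs(score(x, pos_c, neg_c)) > margin]
        if not confident:
            break  # nothing confident enough: treat as converged
        for x in confident:
            target = positives if score(x, pos_c, neg_c) > 0 else negatives
            target.append(x)
            hard.remove(x)
    return positives, negatives, hard
```

Adding only the top few examples per round, and only those beyond a margin, is one way to make the updates conservative: ambiguous detections stay in the hard set instead of contaminating the training data.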
We evaluated the proposed method extensively on four datasets: PETS2009, Oxford Town Center, PNNL-Parking Lot and our own Skateboarding sequences. These datasets pose a wide range of significant challenges, including occlusion, camera motion, crowded scenes and cluttered backgrounds. In all the sequences, we use only the visual information and do not use any scene knowledge such as the camera calibration or the static obstacles. We compare our method with the original DPM detector. We also compare the superpixel-based appearance model (SP) with HOG within our online-learning framework. The precision-recall curves for all six sequences, as well as the average precisions, are shown below.
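As an aside on the metric, the following sketch shows one common way to derive a precision-recall curve and an average precision from scored detections. It is an illustration, not the authors' evaluation code. It assumes each detection has already been marked as a true or false positive (for example by bounding-box overlap with the ground truth), and it uses the interpolated area under the curve; the page does not state which variant of average precision was used. The function names are hypothetical.

```python
def precision_recall(scores, is_true_positive, num_ground_truth):
    """Sweep the detection threshold from high to low and record
    precision and recall after each detection is accepted."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    precisions, recalls = [], []
    for i in order:
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    return precisions, recalls

def average_precision(precisions, recalls):
    """Area under the interpolated precision-recall curve (assumed variant)."""
    # Interpolate: precision at a recall level is the best precision
    # achieved at that recall or any higher recall.
    interp = list(precisions)
    for i in range(len(interp) - 2, -1, -1):
        interp[i] = max(interp[i], interp[i + 1])
    ap, prev_recall = 0.0, 0.0
    for p, r in zip(interp, recalls):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap
```

For example, `precision_recall([0.9, 0.8, 0.7, 0.6], [True, False, True, False], 2)` yields the curve that `average_precision` then integrates.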
The PNNL Parking Lot 2 sequence is a moderately crowded scene that includes groups of pedestrians walking in queues. The challenges in this data set include long-term inter-object occlusions, camera jitter, and similar appearance among the people in the scene. The sequence consists of 1,000 frames with up to 13 pedestrians. The frame resolution is 1920 × 1080, and the frame rate is 29 fps.
To download the data set click here.
To download the code and detection outputs click here.
Guang Shu, Afshin Dehghan and Mubarak Shah, Improving an Object Detector and Extracting Regions using Superpixel, Computer Vision and Pattern Recognition 2013, Portland, OR, July 16-21, 2013.