Visual Tracking: An Experimental Survey

The Amsterdam Library of Ordinary Videos for tracking, ALOV++, aims to cover as diverse a set of circumstances as possible: illumination, transparency, specularity, confusion with similar objects, clutter, occlusion, zoom, severe shape changes, different motion patterns, low contrast, and so on. In composing the ALOV++ dataset, preference was given to many assorted short videos over a few longer ones. For each of these aspects, we collected video sequences ranging from easy to difficult, with the emphasis on difficult videos. ALOV++ is also composed to be upward compatible with other tracking benchmarks by including 11 standard video sequences from existing datasets, frequently used in recent tracking papers, covering the aspects of light, albedo, transparency, motion smoothness, confusion, occlusion, and a shaking camera. 65 sequences have been reported earlier in the PETS workshop, and 250 are new, for a total of 315 video sequences.

The main source of the data is real-life videos from YouTube, with 64 different types of targets ranging from a human face, a person, a ball, an octopus, microscopic cells, and a plastic bag to a can. The collection is categorized into thirteen aspects of difficulty, with many hard to very hard videos: a dancer, a rock singer in a concert, completely transparent glass, an octopus, a flock of birds, a soldier in camouflage, completely occluded objects, and videos with extreme zooming introducing abrupt motion of the targets.

To maximize diversity, most of the sequences are short. The average length of the short videos is 9.2 seconds, with a maximum of 35 seconds. One additional category contains ten long videos with a duration between one and two minutes. The total number of frames in ALOV++ is 89,364. The data in ALOV++ are annotated every fifth frame with an axis-aligned rectangular bounding box of flexible size. In rare cases, when motion is rapid, the annotation is more frequent. The ground truth for the intermediate frames has been acquired by linear interpolation. The ground-truth bounding box in the first frame is provided to the trackers.
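
The interpolation step is straightforward to reproduce. Below is a minimal sketch in Python, assuming the annotations are available as (frame, x, y, w, h) keyframe tuples (a hypothetical layout; the actual ALOV++ annotation files may be organized differently), which recovers a box for every intermediate frame by linearly interpolating the box coordinates:

    def interpolate_boxes(keyframes):
        # `keyframes` is a list of (frame, x, y, w, h) tuples sorted by
        # frame index -- a hypothetical layout; the real ALOV++ ground-truth
        # files may differ. Returns a dict mapping every frame index in the
        # annotated range to an (x, y, w, h) box.
        boxes = {}
        for (f0, *b0), (f1, *b1) in zip(keyframes, keyframes[1:]):
            span = f1 - f0
            for f in range(f0, f1):
                t = (f - f0) / span  # interpolation weight in [0, 1]
                boxes[f] = tuple(a + t * (b - a) for a, b in zip(b0, b1))
        last_frame, *last_box = keyframes[-1]
        boxes[last_frame] = tuple(last_box)
        return boxes

    # Example: keyframes annotated every fifth frame, as in ALOV++.
    keys = [(0, 10.0, 20.0, 50.0, 40.0), (5, 20.0, 25.0, 50.0, 40.0)]
    print(interpolate_boxes(keys)[2])  # -> (14.0, 22.0, 50.0, 40.0)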

Related Links

If you use the dataset, please cite the following paper:

Arnold W. M. Smeulders, Dung M. Chu, Rita Cucchiara, Simone Calderara, Afshin Dehghan, and Mubarak Shah, "Visual Tracking: An Experimental Survey," IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2013.