All data are for research purposes only, unless stated otherwise. Please cite the authors properly when using the data.

Video Anomaly Detection Dataset

  The UCF-Crime dataset is a large-scale dataset, the first of its kind, comprising 128 hours of video. It consists of 1900 long, untrimmed real-world surveillance videos covering 13 realistic anomalies: Abuse, Arrest, Arson, Assault, Road Accident, Burglary, Explosion, Fighting, Robbery, Shooting, Stealing, Shoplifting, and Vandalism. These anomalies were selected because they have a significant impact on public safety. The dataset can be used for two tasks: first, general anomaly detection, with all anomalies in one group and all normal activities in another; second, recognition of each of the 13 anomalous activities.
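
The two tasks differ only in how a video's category is turned into a label. The following is a minimal sketch of that mapping, not the authors' code. The category names come from the description above; the "Normal" label string and the function names are assumptions made for illustration.

```python
# The 13 anomaly categories listed in the dataset description.
ANOMALY_CLASSES = [
    "Abuse", "Arrest", "Arson", "Assault", "Road Accident", "Burglary",
    "Explosion", "Fighting", "Robbery", "Shooting", "Stealing",
    "Shoplifting", "Vandalism",
]

# Assumed label for normal videos; the actual annotation string may differ.
NORMAL = "Normal"


def binary_label(category: str) -> int:
    """Task 1: all anomalies form one group (1), normal activity the other (0)."""
    if category == NORMAL:
        return 0
    if category in ANOMALY_CLASSES:
        return 1
    raise ValueError(f"unknown category: {category!r}")


def class_index(category: str) -> int:
    """Task 2: index of one of the 13 anomalous activities."""
    return ANOMALY_CLASSES.index(category)
```

For example, `binary_label("Arson")` puts an arson video in the anomalous group, while `class_index("Arson")` gives its position among the 13 classes.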

Real-world Anomaly Detection in Surveillance Videos
Waqas Sultani, Chen Chen, Mubarak Shah
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018
[Paper] [Video Presentation] [Project Website] [Note] [Code]
[Download the dataset: copy this URL:]
(Note: The "Anomaly_Train.txt" file in the zip file is corrupted; please download it here: Anomaly_Train.txt)
Option 2: Download the dataset from Dropbox (multiple files): Link

Satellite Smoke Scene Detection Dataset

  An important challenge in detecting fire smoke in satellite imagery is the presence of similar-looking disasters and multiple land covers. Commonly used smoke detection methods mainly focus on discriminating smoke from a few specific classes, which limits their applicability in regions containing other classes. In addition, no satellite remote sensing smoke detection dataset has been available so far. To this end, we construct the USTC_SmokeRS dataset and integrate more smoke-like aerosol classes and land covers into it, for example, cloud, dust, haze, bright surfaces, lakes, seaside, and vegetation. The USTC_SmokeRS dataset contains a total of 6225 RGB images from six classes: cloud, dust, haze, land, seaside, and smoke. Each image is saved in ".tif" format at a size of 256 × 256.

SmokeNet: Satellite Smoke Scene Detection Using Convolutional Neural Network with Spatial and Channel-Wise Attention
Rui Ba, Chen Chen, Jiang Yuan, Weiguo Song, Siuming Lo
Remote Sensing, 2019
[Paper] [Project Website] [Download Dataset from Google Drive] [Download Dataset from OneDrive] [Download Dataset from Baidu Pan (download password: 5dlk)]

Malaria Life-Cycle Classification in Thin Blood Smear Images

  Malaria microscopy, the microscopic examination of stained blood slides to detect the Plasmodium parasite, is considered the gold standard for detecting malaria, a life-threatening disease. Detecting the Plasmodium parasite requires a skilled examiner and may take 10 to 15 minutes to go through a whole slide. Due to a lack of skilled medical professionals in underdeveloped or resource-deficient regions, many cases are misdiagnosed, resulting in unavoidable complications and/or undue medication. We propose to complement medical professionals with a deep learning-based method that automatically detects (localizes) Plasmodium parasites in photographs of stained film. To handle the unbalanced nature of the dataset, we adopt a two-stage approach: the first stage is trained to detect blood cells and classify them as either healthy or infected, and the second stage is trained to classify each detected cell further into its life-cycle stage. To facilitate research in machine learning-based malaria microscopy, we introduce a new large-scale microscopic image malaria dataset. Thirty-eight thousand cells are tagged in 345 microscopic images of different Giemsa-stained slides of blood samples. Extensive experimentation is performed on this dataset using different CNN backbones, including VGG, DenseNet, and ResNet. Our experiments and analysis reveal that the two-stage approach works better than the one-stage approach for malaria detection.
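
The control flow of such a two-stage approach can be sketched as follows. This is an illustrative outline only, not the authors' implementation: `detect_cells` and `classify_stage` are hypothetical callables standing in for the trained models, and sending only infected cells to the second stage is an assumption of this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), assumed layout


@dataclass
class CellResult:
    box: Box
    infected: bool
    life_stage: Optional[str] = None  # filled in by stage 2 only


def two_stage_predict(
    image: object,
    detect_cells: Callable[[object], List[Tuple[Box, bool]]],
    classify_stage: Callable[[object, Box], str],
) -> List[CellResult]:
    """Stage 1 detects cells as healthy/infected; stage 2 assigns a life-cycle stage."""
    results = []
    for box, infected in detect_cells(image):
        # Assumption: healthy cells skip the life-cycle classifier.
        stage = classify_stage(image, box) if infected else None
        results.append(CellResult(box, infected, stage))
    return results
```

Separating the stages lets the first model learn the dominant healthy/infected split, while the second sees only the much rarer infected cells.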

A Dataset and Benchmark for Malaria Life-Cycle Classification in Thin Blood Smear Images
Qazi Ammar Arshad, Mohsen Ali, Saeed-ul Hassan, Chen Chen, Ayisha Imran, Ghulam Rasul, Waqas Sultani
Neural Computing and Applications, 2021
[Paper] [Dataset]

VIGOR: Cross-View Image Geo-localization beyond One-to-one Retrieval

  Cross-view image geo-localization aims to determine the locations of street-view query images by matching them with GPS-tagged reference images from the aerial view. Recent works have achieved surprisingly high retrieval accuracy on city-scale datasets. However, these results rely on the assumption that there exists a reference image exactly centered at the location of any query image, which does not hold in practical scenarios. In this paper, we redefine the problem with a more realistic assumption: the query image can be anywhere in the area of interest, and the reference images are captured before the queries emerge. This assumption breaks the one-to-one retrieval setting of existing datasets, as the queries and reference images are not perfectly aligned pairs and there may be multiple reference images covering one query location. To bridge the gap between this realistic setting and existing datasets, we propose a new large-scale benchmark, VIGOR, for cross-View Image Geo-localization beyond One-to-one Retrieval. We benchmark existing state-of-the-art methods and propose a novel end-to-end framework to localize the query in a coarse-to-fine manner. Apart from image-level retrieval accuracy, we also evaluate localization accuracy in terms of actual distance (meters) using the raw GPS data.
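
A distance in meters between a predicted and a ground-truth GPS position can be computed with the haversine formula. The sketch below is one common way to do this and is not necessarily the computation used in the paper; the function name and the mean Earth radius constant are assumptions of this example.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius, an approximation


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to guard against rounding slightly above 1 before asin.
    return 2 * EARTH_RADIUS_M * math.asin(min(1.0, math.sqrt(a)))
```

A localization error is then `haversine_m(pred_lat, pred_lon, true_lat, true_lon)`, which can be averaged or thresholded across queries.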

VIGOR: Cross-View Image Geo-localization beyond One-to-one Retrieval
Sijie Zhu, Taojiannan Yang, Chen Chen
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021
[Paper] [Dataset and Code]

Cross-View Geolocalization Dataset

  The UCF cross-view geolocalization dataset was created for the geo-localization task using cross-view image matching. The dataset has street view and bird's eye view image pairs around downtown Pittsburgh, Orlando and part of Manhattan. There are 1,586, 1,324 and 5,941 GPS locations in Pittsburgh, Orlando and Manhattan, respectively. We utilize DualMaps to generate side-by-side street view and bird's eye view images at each GPS location with the same heading direction. The street view images are from Google and the overhead 45-degree bird's eye view images are from Bing. For each GPS location, four image pairs are generated with camera heading directions of 0, 90, 180 and 270 degrees. To train the deep network for building matching, we annotate corresponding buildings in every street view and bird's eye view image pair.
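
The figures above imply the overall dataset size. As a quick check, the sketch below enumerates one (city, location, heading) entry per image pair, assuming every GPS location yields all four pairs; the variable and function names are illustrative and do not reflect the dataset's actual file layout.

```python
from itertools import product

# GPS location counts per city, as stated in the description.
LOCATIONS_PER_CITY = {"Pittsburgh": 1586, "Orlando": 1324, "Manhattan": 5941}
HEADINGS_DEG = (0, 90, 180, 270)  # camera heading directions


def iter_pairs():
    """Yield one (city, location_index, heading) tuple per image pair."""
    for city, n_locations in LOCATIONS_PER_CITY.items():
        for loc, heading in product(range(n_locations), HEADINGS_DEG):
            yield city, loc, heading


total_locations = sum(LOCATIONS_PER_CITY.values())  # 8851
total_pairs = sum(1 for _ in iter_pairs())          # 35404
```

Under that assumption, the 8,851 locations give 35,404 street view and bird's eye view image pairs.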

Cross-View Image Matching for Geo-localization in Urban Environments
Yicong Tian, Chen Chen, Mubarak Shah
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
[Paper] [Project (Download Cross-view dataset and code)]

UTD-MHAD Dataset

  The UTD-MHAD dataset was collected as part of our research on human action recognition using fusion of depth and inertial sensor data. The objective of this research has been to develop algorithms for more robust human action recognition by fusing data from sensors of different modalities. The UTD-MHAD dataset consists of 27 different actions: (1) right arm swipe to the left, (2) right arm swipe to the right, (3) right hand wave, (4) two hand front clap, (5) right arm throw, (6) cross arms in the chest, (7) basketball shoot, (8) right hand draw x, (9) right hand draw circle (clockwise), (10) right hand draw circle (counter clockwise), (11) draw triangle, (12) bowling (right hand), (13) front boxing, (14) baseball swing from right, (15) tennis right hand forehand swing, (16) arm curl (two arms), (17) tennis serve, (18) two hand push, (19) right hand knock on door, (20) right hand catch an object, (21) right hand pick up and throw, (22) jogging in place, (23) walking in place, (24) sit to stand, (25) stand to sit, (26) forward lunge (left foot forward), (27) squat (two arms stretch out).

UTD-MHAD: A Multimodal Dataset for Human Action Recognition Utilizing a Depth Camera and a Wearable Inertial Sensor
Chen Chen, Roozbeh Jafari, Nasser Kehtarnavaz
IEEE International Conference on Image Processing (ICIP), 2015
[Paper] [UTD Multimodal Human Action Dataset Website]