
Self Supervised Learning for Multiple Object Tracking in 3D Point Clouds



Aakash Kumar, Jyoti Kini, Ajmal Mian, Mubarak Shah, Self Supervised Learning for Multiple Object Tracking in 3D Point Clouds, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2022.


Multiple object tracking in 3D point clouds has applications in mobile robots and autonomous driving. This is a challenging problem due to the sparse nature of the point clouds and the added difficulty of annotation in 3D for supervised learning. To overcome these challenges, we propose a neural network architecture that learns effective object features and their affinities in a self supervised fashion for multiple object tracking in 3D point clouds captured with LiDAR sensors. For self supervision, we use two approaches. First, we generate two augmented LiDAR frames from a single real frame by applying translation, rotation and cutout to the objects. Second, we synthesize a LiDAR frame using CAD models or primitive geometric shapes and then apply the above three augmentations to them. Hence, the ground truth object locations and associations are known in both frames for self supervision. This removes the need to annotate object associations in real data, and additionally the need for training data collection and annotation for object detection in synthetic data. To the best of our knowledge, this is the first self supervised multiple object tracking method for 3D data. Our model achieves state of the art results.
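The first self supervision approach (two augmented frames built from one real frame) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names `augment_object` and `make_training_pair`, the parameter ranges (`max_shift`, `max_angle`, `cutout_frac`) and the specific cutout rule (dropping the points nearest to a random seed point) are all assumptions made for the example. It uses only the Python standard library so that it runs as-is.

```python
import math
import random


def augment_object(points, rng, max_shift=0.5, max_angle=math.radians(10), cutout_frac=0.2):
    """Return a rotated, translated copy of one object's (x, y, z) points with a patch removed.

    All default ranges are illustrative assumptions, not values from the paper.
    """
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n

    # Rotation about the vertical axis through the object's centroid.
    angle = rng.uniform(-max_angle, max_angle)
    c, s = math.cos(angle), math.sin(angle)

    # Translation in the ground plane only; z is left unchanged.
    dx = rng.uniform(-max_shift, max_shift)
    dy = rng.uniform(-max_shift, max_shift)

    moved = []
    for x, y, z in points:
        rx, ry = x - cx, y - cy
        moved.append((c * rx - s * ry + cx + dx, s * rx + c * ry + cy + dy, z))

    # Cutout (assumed form): drop the points closest to a randomly chosen seed point.
    n_drop = int(cutout_frac * n)
    seed_point = rng.choice(moved)
    by_distance = sorted(range(n), key=lambda i: math.dist(moved[i], seed_point))
    dropped = set(by_distance[:n_drop])
    return [p for i, p in enumerate(moved) if i not in dropped]


def make_training_pair(objects, seed=0):
    """Build two augmented frames from the objects cropped out of a single frame."""
    rng = random.Random(seed)
    frame_a = [augment_object(obj, rng) for obj in objects]
    frame_b = [augment_object(obj, rng) for obj in objects]
    return frame_a, frame_b
```

Because both frames are derived from the same list of cropped objects, object `i` in `frame_a` and object `i` in `frame_b` are the same instance by construction, which is what makes the association labels free.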


The proposed self supervised 3D multiple object tracking (SS3D-MOT) model (see Fig. 1) adopts the tracking-by-detection paradigm and learns to predict the affinity matrix between two frames. Ground-truth detections are used to crop the objects to be tracked from a single LiDAR frame. The cropped objects are then augmented to generate a pair of augmented frames. Since the objects come from a single frame, the ground-truth correspondences (the affinity matrix) are known. SS3D-MOT uses object and box features together to generate the affinity matrix. We use real LiDAR frames, one at a time, for self supervision, so we do not require annotations for object associations between frames. However, bounding-box annotations of the object detections in individual frames are still required. Although these can be obtained from an external off-the-shelf object detector, doing so makes network training sensitive to that detector's errors. Therefore, we also create synthetic LiDAR frames from CAD models or primitive geometric shapes, where both the ground-truth detections and the object associations are known a priori. This not only eliminates the need for ground-truth associations, but also saves the cost of collecting real-world LiDAR data of moving objects and removes the need for an external detector during training. Interestingly, our model learns generic 3D object features: when trained on one object type, it still performs quite well at predicting the tracks of another object type. We show that SS3D-MOT trained on synthetic LiDAR frames generated from primitive geometric shapes performs well at tracking cars and pedestrians on the validation sets of the real KITTI and JRDB datasets.
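To illustrate why the ground-truth affinity matrix comes for free, the sketch below builds the target for a pair of frames generated from the same set of objects, with the second frame's object order shuffled. It is a hedged example rather than the paper's code: `make_affinity_target` and `affinity_accuracy` are hypothetical helper names, the sketch assumes every object appears in both frames, and the row-wise best-match check is only a simple stand-in for whatever association step the model actually uses.

```python
import random


def make_affinity_target(num_objects, seed=0):
    """Shuffle the second frame's object order and return (order, affinity).

    order[j] is the index of the source object placed at position j of the
    second frame. affinity[i][j] is 1 when object i of the first frame and
    object j of the second frame are the same source object, else 0.
    """
    rng = random.Random(seed)
    order = list(range(num_objects))
    rng.shuffle(order)
    affinity = [
        [1 if order[j] == i else 0 for j in range(num_objects)]
        for i in range(num_objects)
    ]
    return order, affinity


def affinity_accuracy(predicted, target):
    """Fraction of first-frame objects whose highest-scoring match is the true one."""
    hits = 0
    for pred_row, target_row in zip(predicted, target):
        best = max(range(len(pred_row)), key=pred_row.__getitem__)
        hits += target_row[best]
    return hits / len(target)
```

With all objects present in both frames, the target is a permutation matrix: each row and each column contains exactly one 1. A network's predicted affinity scores can then be compared against it directly, with no human-labelled associations involved.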


Results of our model on the validation sets of the KITTI and JRDB datasets, when trained in supervised vs. self supervised settings.

Ablation study for self supervision: The effect of different augmentations when our model is trained on the real JRDB dataset or synthetic LiDAR frames.

Our model achieves the best MOTA, IDF1 and HOTA scores on the leaderboard using self supervision with synthetic frames generated from CAD objects. The leaderboard can be accessed using the following link.