
3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds



Jyoti Kini, Ajmal Mian, Mubarak Shah, 3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D Point Clouds, IEEE International Conference on Robotics and Automation (ICRA) 2023.


We propose a method for joint detection and tracking of multiple objects in 3D point clouds, a task conventionally treated as a two-step process comprising object detection followed by data association. Our method embeds both steps into a single end-to-end trainable network, eliminating the dependency on external object detectors. The model exploits temporal information by using multiple frames to detect and track objects in a single network, which makes it a practical formulation for real-world scenarios. Computing an affinity matrix from feature similarity across consecutive point cloud scans is an integral part of visual tracking. We propose an attention-based refinement module that refines the affinity matrix by suppressing erroneous correspondences. The module is designed to capture global context in the affinity matrix by applying self-attention within each affinity matrix and cross-attention across a pair of affinity matrices. Unlike competing approaches, our network does not require complex post-processing algorithms and processes raw LiDAR frames to directly output tracking results. We demonstrate the effectiveness of our method on three tracking benchmarks: JRDB, Waymo, and KITTI. Experimental evaluations indicate that our model generalizes well across datasets.
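As a rough illustration of what an affinity matrix between two scans looks like, the sketch below compares every feature token from one scan with every token from the other. The similarity measure (cosine similarity), the row-wise softmax, and the `temperature` parameter are assumptions made for this example; the text here does not specify how 3DMODT computes or normalizes its affinities, and the attention-based refinement is not shown.

```python
import numpy as np

def affinity_matrix(f_t, f_prev, temperature=1.0):
    """Row-normalized similarity between the tokens of two scans.

    f_t:    (N, D) feature tokens at time t
    f_prev: (M, D) feature tokens at the earlier scan
    Returns an (N, M) matrix whose rows sum to 1.
    """
    # L2-normalize so the dot product is cosine similarity (an assumption).
    a = f_t / np.linalg.norm(f_t, axis=1, keepdims=True)
    b = f_prev / np.linalg.norm(f_prev, axis=1, keepdims=True)
    sim = a @ b.T / temperature
    # Numerically stable row-wise softmax.
    sim = sim - sim.max(axis=1, keepdims=True)
    exp = np.exp(sim)
    return exp / exp.sum(axis=1, keepdims=True)

# Toy data: the tokens at time t are slightly perturbed copies of
# tokens 2, 0 and 4 from the earlier scan.
rng = np.random.default_rng(0)
f_prev = rng.normal(size=(5, 8))
f_t = f_prev[[2, 0, 4]] + 0.01 * rng.normal(size=(3, 8))

A = affinity_matrix(f_t, f_prev, temperature=0.1)
print(A.shape)           # (3, 5)
print(A.argmax(axis=1))  # index of the best-matching earlier token per row
```

In this toy setup each row peaks at the token it was copied from. With real features, spurious peaks can appear, and those erroneous correspondences are what the refinement module described above is meant to suppress.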


Our proposed 3DMODT has four main building blocks: (1) a transformer encoder for feature extraction, (2) affinity computation and attention-based refinement, (3) tracking offset and 3D detection prediction, and (4) data association and tracklet generation. We extract feature tokens f_t and f_{t-τ} from consecutive point cloud scans P_t and P_{t-τ}. We then use these feature tokens to construct an affinity matrix A_t that stores dense similarity matches between the tokens at t and t-τ. The affinity matrix is passed through an attention mechanism with a global receptive field to obtain the refined affinity matrix Â_t. Network heads then (a) compute tracking offsets O_t^{i,j,k}, which store the spatio-temporal displacement of each point, indexed by its 3D location (i, j, k), from time t to the corresponding point at t-τ, and (b) regress 3D object information (center, bounding box, rotation). The 3D bounding box information is combined with the tracking offsets to generate tracklets. Our model processes three point cloud scans simultaneously in a single pass so that it can apply self-attention and cross-attention across affinity matrices, thereby improving the temporal context.
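To make the last step concrete, the following sketch shows one way detections and tracking offsets could be turned into track IDs. It is not the authors' association procedure: the greedy nearest-neighbor matching, the `max_dist` threshold, the function name `associate`, and the sign convention (the offset is added to a center at time t to estimate its location at t-τ) are all assumptions made for illustration.

```python
import numpy as np

def associate(centers_t, offsets_t, prev_centers, prev_ids, next_id, max_dist=1.0):
    """Greedily link detections at time t to tracks from the earlier scan.

    centers_t:    (N, 3) detected object centers at time t
    offsets_t:    (N, 3) predicted displacement from t back to the earlier scan
    prev_centers: (M, 3) centers of tracked objects in the earlier scan
    prev_ids:     list of M track IDs
    next_id:      first unused track ID
    Returns (ids, next_id), where ids[n] is the track ID of detection n.
    """
    # Assumed convention: center + offset estimates the earlier position.
    projected = centers_t + offsets_t
    ids = [-1] * len(centers_t)
    if len(centers_t) > 0 and len(prev_centers) > 0:
        dist = np.linalg.norm(
            projected[:, None, :] - prev_centers[None, :, :], axis=2
        )
        taken = set()
        # Visit candidate pairs from closest to farthest.
        for flat in np.argsort(dist, axis=None):
            n, m = (int(v) for v in np.unravel_index(flat, dist.shape))
            if dist[n, m] > max_dist:
                break  # all remaining pairs are farther apart
            if ids[n] == -1 and m not in taken:
                ids[n] = prev_ids[m]
                taken.add(m)
    # Unmatched detections start new tracks.
    for n in range(len(ids)):
        if ids[n] == -1:
            ids[n] = next_id
            next_id += 1
    return ids, next_id

prev_centers = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
prev_ids = [7, 8]
centers_t = np.array([[10.5, 0.0, 0.0], [0.4, 0.0, 0.0], [30.0, 0.0, 0.0]])
offsets_t = np.array([[-0.5, 0.0, 0.0], [-0.4, 0.0, 0.0], [0.0, 0.0, 0.0]])

ids, next_id = associate(centers_t, offsets_t, prev_centers, prev_ids, next_id=9)
print(ids)  # [8, 7, 9]
```

The first two detections inherit the IDs of the tracks they project onto, and the third, which has no earlier track within `max_dist`, starts a new track. A one-to-one matcher such as the Hungarian algorithm could replace the greedy loop without changing the interface.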


  • Quantitative Results
    Our method exhibits competitive results on the JRDB test set without using an additional input modality, an external detector, or a complex post-processing algorithm.

    We report the best tracking results on the Waymo vehicle validation set; values are given in the format LEVEL_1 / LEVEL_2. Here, L denotes LiDAR data and I denotes camera images.

    Our tracker demonstrates the best 3D tracking performance on the KITTI cars validation set. Here, L denotes LiDAR data and I denotes camera images.

  • Qualitative Results