Multi-object tracking (MOT) requires detecting objects and associating them across frames. Rather than tracking objects via detected bounding boxes or center points, we propose tracking them as pixel-wise distributions. We instantiate this idea on a transformer-based architecture named P3AFormer, with pixel-wise propagation, prediction, and association. P3AFormer propagates pixel-wise features guided by flow information to pass messages between frames. Further, P3AFormer adopts a meta-architecture to produce multi-scale object feature maps. During inference, a pixel-wise association procedure recovers object connections across frames based on the pixel-wise predictions. P3AFormer achieves 81.2% MOTA on the MOT17 benchmark, the highest among all transformer networks reported to reach 80% MOTA in the literature. P3AFormer also outperforms state-of-the-art methods on the MOT20 and KITTI benchmarks. The code is at https://github.com/dvlab-research/ECCV22-P3AFormer-Tracking-Objects-as-Pixel-wise-Distributions.
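For readers unfamiliar with the MOTA metric quoted above, the following is a minimal sketch of its standard definition, MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over the frames of a sequence. The function name and the counts in the example are hypothetical and are not taken from the paper or its code.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Multi-Object Tracking Accuracy.

    All arguments are totals accumulated over every frame:
    missed objects (FN), spurious detections (FP), identity
    switches (IDSW), and ground-truth objects (GT).
    """
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_gt


# Hypothetical counts for illustration only (not results from the paper).
score = mota(false_negatives=100, false_positives=60,
             id_switches=40, num_gt=1000)
print(f"MOTA = {score:.1%}")
```

Because the three error counts are summed without an upper bound, MOTA can be negative when the total number of errors exceeds the number of ground-truth objects.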
CITATION STYLE
Zhao, Z., Wu, Z., Zhuang, Y., Li, B., & Jia, J. (2022). Tracking Objects as Pixel-Wise Distributions. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 13682 LNCS, pp. 76–94). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-031-20047-2_5