Mainstream multi-object tracking methods exploit appearance and/or motion information to achieve inter-frame association. However, appearance cues struggle with similar-looking objects and occlusion, while motion cues typically rest on linear assumptions and are prone to failure under nonlinear motion patterns. In this work, we discard appearance cues and propose a pure motion tracker to address these issues. It leverages the Transformer to estimate complex motion and achieves high-performance tracking with low computational cost. Furthermore, contrastive learning is introduced to optimize the feature representation for robust association. Specifically, we first exploit the long-range modeling capability of the Transformer to mine intention information from temporal motion and decision information from spatial interaction, and we introduce prior detections to constrain the range of motion estimation. Then, we introduce contrastive learning as an auxiliary task to extract reliable motion features for affinity computation and adopt bidirectional matching to improve the affinity distribution. In addition, since both tasks aim to narrow the embedding distance between the motion features of a tracked object and the corresponding detection features, we design a joint motion-and-association framework that unifies the two tasks for optimization. Experimental results on three benchmark datasets, MOT17, MOT20 and DanceTrack, verify the effectiveness of the proposed method. Compared with state-of-the-art methods, the proposed STDFormer sets a new state of the art on DanceTrack and achieves competitive performance on MOT17 and MOT20, demonstrating the advantage of our method in handling association under similar appearance, occlusion or nonlinear motion. Moreover, its significant advantages over Transformer-based and contrastive-learning-based methods suggest a new direction for applying Transformers and contrastive learning to MOT. Finally, to verify the generalization of STDFormer to unmanned aerial vehicle (UAV) videos, we also evaluate it on VisDrone2019, where it achieves state-of-the-art performance, showing that it handles the association of small-scale objects in UAV videos well. The code is available at https://github.com/Xiaotong-Zhu/STDFormer.
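To make the association step more concrete, the sketch below illustrates one plausible reading of "bidirectional matching over an affinity computed from motion features" and of a contrastive auxiliary loss. It is a minimal illustration, not the authors' implementation: the function names, tensor shapes, the mutual-softmax fusion, and the assumption that the i-th track matches the i-th detection in the loss are all ours.

```python
import torch
import torch.nn.functional as F

def bidirectional_affinity(track_feats: torch.Tensor, det_feats: torch.Tensor) -> torch.Tensor:
    """Illustrative affinity between tracked-object motion embeddings ([M, D])
    and detection embeddings ([N, D]); names and shapes are hypothetical.

    Cosine similarity gives a raw M x N affinity. Softmax over rows
    (track -> detection) and over columns (detection -> track) are averaged,
    so a pair scores highly only when both directions agree -- one simple way
    to realize a bidirectional matching that sharpens the affinity distribution.
    """
    track_feats = F.normalize(track_feats, dim=-1)
    det_feats = F.normalize(det_feats, dim=-1)
    sim = track_feats @ det_feats.t()      # raw cosine affinity, M x N
    t2d = sim.softmax(dim=1)               # each track distributes mass over detections
    d2t = sim.softmax(dim=0)               # each detection distributes mass over tracks
    return 0.5 * (t2d + d2t)               # fused affinity used for assignment

def contrastive_aux_loss(track_feats: torch.Tensor, det_feats: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """A generic InfoNCE-style auxiliary loss in the spirit described in the
    abstract: embeddings of a tracked object and its matched detection are
    pulled together, all other pairs pushed apart.
    Assumes (hypothetically) that the i-th track matches the i-th detection."""
    sim = F.normalize(track_feats, dim=-1) @ F.normalize(det_feats, dim=-1).t()
    targets = torch.arange(sim.size(0), device=sim.device)
    return F.cross_entropy(sim / temperature, targets)
```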
Hu, M., Zhu, X., Wang, H., Cao, S., Liu, C., & Song, Q. (2023). STDFormer: Spatial-Temporal Motion Transformer for Multiple Object Tracking. IEEE Transactions on Circuits and Systems for Video Technology, 33(11), 6571–6594. https://doi.org/10.1109/TCSVT.2023.3263884