To exploit the rich information in unlabeled data, we propose a novel self-supervised framework for visual tracking that can easily adapt state-of-the-art supervised Siamese trackers into unsupervised ones by utilizing the fact that an image and any cropped region of it form a natural pair for self-training. Besides common geometric-transformation-based data augmentation and hard negative mining, we also propose adversarial masking, which helps the tracker learn additional context information by adaptively blacking out salient regions of the target. The proposed approach can be trained offline using images only, without any manual annotations or temporal information from multiple consecutive frames, so it can be applied to any kind of unlabeled data, including still images and video frames. For evaluation, we take SiamFC as the base tracker and name the proposed self-supervised method S2SiamFC. Extensive experiments and ablation studies on the challenging VOT2016 and VOT2018 datasets demonstrate the effectiveness of the proposed method, which achieves comparable performance to its supervised counterpart and to other unsupervised methods that require multiple frames.
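To make the pair-construction and adversarial-masking ideas concrete, below is a minimal PyTorch-style sketch of one possible implementation. It is not the authors' released code: the function names make_training_pair and adversarial_mask are hypothetical, the 127/255 crop sizes follow SiamFC's convention, and the saliency criterion (channel-summed backbone activations) and the stride/patch values are assumptions.

    import torch

    def make_training_pair(image, exemplar_size=127, search_size=255):
        """Form a natural self-training pair: a random crop of `image` is the
        search region, and its center crop is the exemplar (pseudo-target).
        Assumes `image` is a (C, H, W) tensor larger than `search_size`."""
        _, h, w = image.shape
        # pick a random center far enough from the border for the search crop
        cy = torch.randint(search_size // 2, h - search_size // 2, (1,)).item()
        cx = torch.randint(search_size // 2, w - search_size // 2, (1,)).item()
        search = image[:, cy - search_size // 2: cy + search_size // 2 + 1,
                          cx - search_size // 2: cx + search_size // 2 + 1]
        half, c = exemplar_size // 2, search_size // 2
        exemplar = search[:, c - half: c + half + 1, c - half: c + half + 1]
        return exemplar, search  # the target is known to sit at the search center

    def adversarial_mask(exemplar, backbone, patch=24, stride=8):
        """Black out the most salient patch of the exemplar so the tracker must
        rely on surrounding context. Saliency is approximated here by
        channel-summed backbone activations; the paper's criterion may differ."""
        with torch.no_grad():
            feats = backbone(exemplar.unsqueeze(0))   # (1, C, Hf, Wf)
            saliency = feats.abs().sum(dim=1)[0]      # (Hf, Wf) activation map
        idx = saliency.flatten().argmax().item()
        fy, fx = divmod(idx, saliency.shape[-1])
        # map the feature-map peak back to image coordinates (stride assumed 8)
        y0 = min(fy * stride, exemplar.shape[-2] - patch)
        x0 = min(fx * stride, exemplar.shape[-1] - patch)
        masked = exemplar.clone()
        masked[:, y0:y0 + patch, x0:x0 + patch] = 0.0
        return masked

    # usage: exemplar, search = make_training_pair(image)
    #        masked_exemplar = adversarial_mask(exemplar, backbone)

Geometric augmentations and hard negative mining would be applied on top of such pairs before computing the usual Siamese matching loss.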
Sio, C. H., Ma, Y. J., Shuai, H. H., Chen, J. C., & Cheng, W. H. (2020). S2SiamFC: Self-supervised Fully Convolutional Siamese Network for Visual Tracking. In MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia (pp. 1948–1957). Association for Computing Machinery, Inc. https://doi.org/10.1145/3394171.3413611