Video-based person re-identification aims to associate video clips of the same identity by designing discriminative and representative features. Existing approaches typically compute clip representations via frame-level or region-level feature aggregation, leaving fine-grained local information inaccessible. To address this issue, we propose a novel module called fine-grained fusion with distractor suppression (FFDS for short) that fully exploits local features to better represent a given video clip. Concretely, in the proposed FFDS module, the importance of each local feature of an anchor image is computed by mining pixel-wise correlations with the other frames of the same sequence. In this way, 'good' local features that co-exist across video frames are enhanced in the attention map, while sparse 'distractors' are suppressed. Moreover, to retain the high-level semantic information of deep CNN features while still benefiting from fine-grained local information, we adopt a feature-mimicking scheme during training. Extensive experiments on two challenging large-scale datasets demonstrate the effectiveness of the proposed method.
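To make the pixel-wise correlation mining concrete, below is a minimal PyTorch sketch of the attention idea the abstract describes. It is an illustration under assumed details, not the authors' implementation: the tensor shapes, the cosine-normalized correlation, the max-then-mean support score, and the L2 form of the mimicking loss are all assumptions of this sketch.

```python
# A minimal sketch of correlation-based attention with distractor
# suppression, NOT the paper's released code. Shapes, normalization,
# the support score, and the mimicking loss form are assumptions.
import torch
import torch.nn.functional as F


def ffds_attention(anchor, others):
    """Weight each local (pixel) feature of the anchor frame by how
    strongly it correlates with local features in the other frames
    of the same clip.

    anchor: (C, H, W) feature map of the anchor frame.
    others: (T, C, H, W) feature maps of the remaining frames.
    Returns the attention-reweighted anchor map, shape (C, H, W).
    """
    C, H, W = anchor.shape
    a = F.normalize(anchor.reshape(C, H * W), dim=0)      # (C, N)
    o = F.normalize(others.reshape(-1, C, H * W), dim=1)  # (T, C, M)

    # Pixel-wise cosine correlation between every anchor location and
    # every location in every other frame: (T, N, M).
    corr = torch.einsum('cn,tcm->tnm', a, o)

    # A location supported by many frames gets a high score; sparse
    # "distractors" that match few frames score low (assumed scoring).
    support = corr.max(dim=-1).values.mean(dim=0)         # (N,)
    attn = torch.softmax(support, dim=0).reshape(1, H, W)
    return anchor * attn


def mimicking_loss(student_feat, teacher_feat):
    # One plausible form of feature mimicking: an L2 penalty pulling
    # the fine-grained branch toward the high-level CNN feature.
    return F.mse_loss(student_feat, teacher_feat.detach())


# Example: a clip of 4 frames with 2048-channel, 16x8 feature maps.
feats = torch.randn(4, 2048, 16, 8)
fused = ffds_attention(feats[0], feats[1:])
```

The key design point the abstract implies is that the weighting is computed across frames of the same sequence, so consistency over time, rather than saliency within a single frame, decides which local features survive.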
Citation: Xi, J., Zhou, Q., Zhao, Y., & Zheng, S. (2019). Fine-Grained Fusion with Distractor Suppression for Video-Based Person Re-Identification. IEEE Access, 7, 114310–114319. https://doi.org/10.1109/ACCESS.2019.2932102