Most existing text-based person search methods depend heavily on exploring correspondences between image regions and words in the sentence. However, these methods correlate image regions and words at a single semantic granularity, which 1) produces irrelevant correspondences between image and text and 2) causes an embedding ambiguity problem. In this study, we propose a novel multi-granularity embedding learning model for text-based person search. It generates multi-granularity embeddings of partial person bodies in a coarse-to-fine manner by revisiting the person image at different spatial scales. Specifically, we distill partial knowledge from image strips to guide the model in selecting semantically relevant words from the text description, allowing it to learn discriminative and modality-invariant visual-textual embeddings. In addition, we integrate the partial embeddings at each granularity and perform multi-granularity image-text matching. Extensive experiments validate the effectiveness of our method, which achieves new state-of-the-art performance through the learned discriminative partial embeddings.
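To make the coarse-to-fine idea concrete, below is a minimal sketch (not the authors' code) of how partial-body embeddings can be produced at several granularities: a CNN feature map of a person crop is split into 1, 2, and 4 horizontal strips, each strip is average-pooled and projected into a shared visual-textual embedding space. All names (MultiGranularityPooling, embed_dim, the choice of granularities) and the ResNet-style feature shape are illustrative assumptions.

import torch
import torch.nn as nn

class MultiGranularityPooling(nn.Module):
    def __init__(self, in_channels=2048, embed_dim=512, granularities=(1, 2, 4)):
        super().__init__()
        self.granularities = granularities
        # one linear projection per granularity into the shared visual-textual space
        self.projections = nn.ModuleList(
            nn.Linear(in_channels, embed_dim) for _ in granularities
        )

    def forward(self, feat_map):
        # feat_map: (batch, channels, height, width) from an image backbone
        embeddings = []
        for k, proj in zip(self.granularities, self.projections):
            # split the map into k horizontal strips and average-pool each strip
            strips = nn.functional.adaptive_avg_pool2d(feat_map, (k, 1))  # (B, C, k, 1)
            strips = strips.squeeze(-1).transpose(1, 2)                   # (B, k, C)
            embeddings.append(proj(strips))                               # (B, k, embed_dim)
        return embeddings  # one tensor of strip embeddings per granularity

feat = torch.randn(8, 2048, 24, 8)            # e.g. backbone features of a person crop
parts = MultiGranularityPooling()(feat)
print([tuple(e.shape) for e in parts])        # [(8, 1, 512), (8, 2, 512), (8, 4, 512)]

In the paper's pipeline these strip embeddings would then guide the selection of semantically relevant words and be aggregated for multi-granularity image-text matching; the sketch only covers the visual-side pooling step.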
Wang, C., Luo, Z., Lin, Y., & Li, S. (2021). Text-based Person Search via Multi-Granularity Embedding Learning. In IJCAI International Joint Conference on Artificial Intelligence (pp. 1068–1074). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2021/148