ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language

Abstract

Person search by natural language aims to retrieve a specific person from a large-scale image pool given a textual description. While most current methods treat the task as holistic matching between visual and textual features, we approach it from an attribute-alignment perspective that grounds specific attribute phrases to their corresponding visual regions. This grounding yields more robust feature learning and a performance boost, since the referred identity can be accurately pinned down by multiple attribute cues. Concretely, our Visual-Textual Attribute Alignment model (dubbed ViTAA) learns to disentangle the feature space of a person into attribute-specific sub-spaces using a lightweight auxiliary attribute segmentation layer. It then aligns these visual features with the textual attributes parsed from the sentences via a novel contrastive learning loss. We validate the ViTAA framework through extensive experiments on person search by natural language and by attribute-phrase queries, on which our system achieves state-of-the-art performance. Code and models are available at https://github.com/Jarr0d/ViTAA.
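For intuition, the alignment step can be sketched as a cross-modal contrastive loss between visual and textual attribute features. The snippet below is a minimal PyTorch sketch, not the paper's implementation: the function name, tensor shapes, and temperature value are illustrative assumptions, and ViTAA's actual loss differs in its sampling and formulation details.

```python
import torch
import torch.nn.functional as F

def attribute_alignment_loss(visual_feats, text_feats, labels, temperature=0.1):
    """InfoNCE-style cross-modal contrastive loss (illustrative sketch only).

    visual_feats: (N, D) visual features from one attribute sub-space.
    text_feats:   (N, D) textual features for the matching attribute phrases.
    labels:       (N,)   person-identity labels; cross-modal pairs sharing
                         an identity are treated as positives.
    """
    v = F.normalize(visual_feats, dim=1)   # unit-normalize both modalities
    t = F.normalize(text_feats, dim=1)
    sim = v @ t.T / temperature            # (N, N) scaled cosine similarities
    # Positive mask: visual anchor i matches textual feature j if same identity.
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()
    # Row-wise log-softmax over all textual candidates for each visual anchor.
    log_prob = sim - sim.logsumexp(dim=1, keepdim=True)
    # Average log-probability over each anchor's positives, then negate.
    loss = -(log_prob * pos).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    return loss.mean()
```

In a full pipeline, one such term would plausibly be computed per attribute sub-space (as the abstract's disentangling step suggests) and summed with the identity and segmentation objectives; consult the released code for the authors' exact formulation.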

Citation (APA)

Wang, Z., Fang, Z., Wang, J., & Yang, Y. (2020). ViTAA: Visual-Textual Attributes Alignment in Person Search by Natural Language. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 12357 LNCS, pp. 402–420). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-3-030-58610-2_24
