Abstract
Text-based person retrieval aims to search for a pedestrian image among multiple candidates given a textual description. The task is challenging because large intra-class variations lead to uncertain cross-modal alignments. Most existing approaches rely on various attention mechanisms and auxiliary information, yet they still struggle with this alignment uncertainty and produce coarse retrieval results. To this end, we propose a novel framework termed Deep Cross-modal Evidential Learning (DCEL), which deploys evidential deep learning to model cross-modal alignment uncertainty. DCEL comprises three components: (1) Bidirectional Evidential Learning, which models alignment uncertainty to measure and mitigate the influence of large intra-class variation; (2) Multi-level Semantic Alignment, which leverages a proposed Semantic Filtration module and an image-text similarity distribution to facilitate cross-modal alignments; (3) Cross-modal Relation Learning, which reasons about latent correspondences between multi-level image and text tokens. Finally, we integrate the advantages of the three components to achieve reliable cross-modal alignments. DCEL consistently outperforms more than ten state-of-the-art methods in supervised, weakly supervised, and domain generalization settings on three benchmarks: CUHK-PEDES, ICFG-PEDES, and RSTPReid.
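The "alignment uncertainty" mentioned above builds on the generic evidential deep learning formulation, in which non-negative evidence parameterizes a Dirichlet distribution whose total strength yields an explicit uncertainty mass alongside per-candidate belief. The PyTorch sketch below illustrates only this general idea; it is not the authors' DCEL implementation, and helper names such as `similarity_to_evidence` are illustrative assumptions.

```python
# Minimal sketch of the generic evidential-learning idea (Dirichlet-based
# belief and uncertainty) that DCEL builds on; NOT the authors' code.
import torch
import torch.nn.functional as F


def similarity_to_evidence(sim: torch.Tensor) -> torch.Tensor:
    """Map raw image-text similarity logits to non-negative evidence (assumed mapping)."""
    return F.softplus(sim)


def dirichlet_uncertainty(evidence: torch.Tensor):
    """Subjective-logic view of a Dirichlet over K candidates.

    evidence: (batch, K) non-negative evidence.
    Returns belief (batch, K) and uncertainty (batch, 1); together they sum to 1.
    """
    alpha = evidence + 1.0                       # Dirichlet concentration parameters
    strength = alpha.sum(dim=-1, keepdim=True)   # total Dirichlet strength S
    belief = evidence / strength                 # per-candidate belief mass
    uncertainty = evidence.size(-1) / strength   # leftover mass = alignment uncertainty
    return belief, uncertainty


if __name__ == "__main__":
    sim = torch.randn(2, 5)                      # toy image-to-text similarity logits
    belief, u = dirichlet_uncertainty(similarity_to_evidence(sim))
    print(belief.sum(-1) + u.squeeze(-1))        # ~1.0 for each sample
```

Under this view, a query whose evidence is spread thinly over many candidates (small total strength) receives high uncertainty, which is the quantity DCEL uses to measure and mitigate the effect of large intra-class variation.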
Citation
Li, S., Xu, X., Yang, Y., Shen, F., Mo, Y., Li, Y., & Shen, H. T. (2023). DCEL: Deep Cross-modal Evidential Learning for Text-Based Person Retrieval. In MM 2023 - Proceedings of the 31st ACM International Conference on Multimedia (pp. 6292–6300). Association for Computing Machinery, Inc. https://doi.org/10.1145/3581783.3612244