Abstract
Multi-modal cues present in videos are often beneficial for the challenging task of video-text retrieval on internet-scale datasets. Recent video retrieval methods exploit these cues by aggregating them into holistic high-level semantics that are matched against text representations in a global view. In contrast to this global alignment, the local alignment between the detailed semantics encoded in individual multi-modal cues and those in distinct phrases remains largely unexplored. In this paper, we therefore leverage hierarchical video-text alignment to fully exploit the diverse, detailed characteristics of multi-modal cues for fine-grained alignment with local phrase semantics, while also capturing high-level semantic correspondence. Specifically, multi-step attention is learned for progressively more comprehensive local alignment, and a holistic transformer summarizes the multi-modal cues for global alignment. With this hierarchical alignment, our model outperforms state-of-the-art methods on three public video retrieval datasets.
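The two-level matching described above can be sketched in a minimal NumPy example: a global score from pooled video and text representations, plus a local score where each phrase attends over the multi-modal cues. All names, shapes, and design choices here (mean pooling, a single softmax attention step, the mixing weight `alpha`) are illustrative assumptions, not the paper's actual modules.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon for numerical safety.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def hierarchical_similarity(cue_feats, phrase_feats, alpha=0.5):
    """Hypothetical hierarchical video-text score.

    cue_feats:    (num_cues, d) features of multi-modal cues
    phrase_feats: (num_phrases, d) features of text phrases
    alpha:        illustrative weight between global and local terms
    """
    # Global alignment: pool cues and phrases into holistic vectors.
    s_global = cosine(cue_feats.mean(axis=0), phrase_feats.mean(axis=0))

    # Local alignment: each phrase softly attends over the cues
    # (one attention step stands in for the paper's multi-step attention).
    sims = phrase_feats @ cue_feats.T                      # (phrases, cues)
    attn = np.exp(sims - sims.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    attended = attn @ cue_feats                            # phrase-conditioned summaries
    s_local = np.mean([cosine(p, a) for p, a in zip(phrase_feats, attended)])

    return alpha * s_global + (1 - alpha) * s_local
```

At retrieval time, such a combined score would be computed for every candidate video and the candidates ranked by it; the actual model learns the attention and pooling end to end rather than using these fixed operations.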
CITATION STYLE
Wang, W., Zhang, M., Chen, R., Cai, G., Zhou, P., Peng, P., … Sun, X. (2021). Dig into Multi-modal Cues for Video Retrieval with Hierarchical Alignment. In IJCAI International Joint Conference on Artificial Intelligence (pp. 1113–1121). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2021/154