Cross-Lingual Cross-Modal Consolidation for Effective Multilingual Video Corpus Moment Retrieval

Abstract

Existing multilingual video corpus moment retrieval (mVCMR) methods are mainly based on a two-stream structure: the visual stream uses the visual content of the video to estimate query-visual similarity, while the subtitle stream exploits query-subtitle similarity. The final query-video similarity is an ensemble of the similarities from the two streams. In this work, we propose a simple and effective strategy, termed Cross-lingual Cross-modal Consolidation (C3), to improve mVCMR accuracy. We adopt the ensemble similarity as a teacher to guide the training of each individual stream, leading to a more powerful ensemble similarity. Meanwhile, we use the teacher for one language to guide the student for another language, exploiting complementary knowledge across languages. Extensive experiments on the mTVR dataset demonstrate the effectiveness of our C3 method.
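The consolidation idea described above can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation: it assumes the ensemble of the two streams' similarity scores, converted to a softmax distribution, serves as a soft teacher, and a KL-divergence loss pulls each stream's own distribution toward that teacher.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax over similarity scores."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kl_div(p, q, eps=1e-8):
    """KL(p || q): how far the student distribution q is from teacher p."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def consolidation_loss(sim_visual, sim_subtitle):
    """Distill the ensemble (teacher) into each stream (student).

    `sim_visual` and `sim_subtitle` are per-candidate similarity scores
    from the two streams; their sum plays the role of the ensemble.
    (Names and the additive ensemble are assumptions for illustration.)
    """
    sim_ensemble = sim_visual + sim_subtitle
    teacher = softmax(sim_ensemble)           # treated as fixed (no gradient)
    loss_visual = kl_div(teacher, softmax(sim_visual))
    loss_subtitle = kl_div(teacher, softmax(sim_subtitle))
    return loss_visual + loss_subtitle
```

In a training framework the teacher distribution would be detached from the gradient graph; the cross-lingual variant would analogously use one language's ensemble as the teacher for the other language's streams.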

Citation (APA)

Liu, J., Yu, T., Peng, H., Sun, M., & Li, P. (2022). Cross-Lingual Cross-Modal Consolidation for Effective Multilingual Video Corpus Moment Retrieval. In Findings of the Association for Computational Linguistics: NAACL 2022 (pp. 1854–1862). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2022.findings-naacl.142
