Existing multilingual video corpus moment retrieval (mVCMR) methods are mainly based on a two-stream structure: the visual stream uses the visual content of the video to estimate the query-visual similarity, while the subtitle stream exploits the query-subtitle similarity. The final query-video similarity is an ensemble of the similarities from the two streams. In our work, we propose a simple and effective strategy, termed Cross-lingual Cross-modal Consolidation (C3), to improve mVCMR accuracy. We adopt the ensemble similarity as a teacher to guide the training of each stream, yielding a more powerful ensemble similarity. Meanwhile, we use the teacher for one language to guide the student for another language, exploiting complementary knowledge across languages. Extensive experiments on the mTVR dataset demonstrate the effectiveness of our C3 method.
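The consolidation idea above can be sketched as a simple self-distillation loss: ensemble the two streams' query-moment scores, then treat the ensemble distribution as a fixed teacher for each individual stream. This is a minimal illustrative sketch, not the authors' implementation; the averaging ensemble, the KL-divergence objective, and all function names here are assumptions for illustration.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of similarity scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def kl_div(p, q):
    # KL(p || q): teacher distribution p, student distribution q.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def c3_distill_losses(visual_scores, subtitle_scores):
    """Hypothetical sketch of the consolidation step: average the two
    streams' query-moment scores into an ensemble, then use the ensemble
    distribution as a (gradient-detached) teacher for each stream."""
    ensemble = [(v + s) / 2 for v, s in zip(visual_scores, subtitle_scores)]
    teacher = softmax(ensemble)  # treated as a constant target in practice
    loss_visual = kl_div(teacher, softmax(visual_scores))
    loss_subtitle = kl_div(teacher, softmax(subtitle_scores))
    return loss_visual, loss_subtitle
```

When the two streams already agree, both distillation losses vanish; the larger the disagreement, the more strongly each stream is pulled toward the ensemble. The cross-lingual variant described in the abstract would analogously use one language's ensemble as the teacher for the other language's streams.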
Citation
Liu, J., Yu, T., Peng, H., Sun, M., & Li, P. (2022). Cross-Lingual Cross-Modal Consolidation for Effective Multilingual Video Corpus Moment Retrieval. In Findings of the Association for Computational Linguistics: NAACL 2022 (pp. 1854–1862). Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.findings-naacl.142