Cross-modality video segment retrieval with ensemble learning

Abstract

Jointly modeling vision and language is a new research area with many applications, such as video segment retrieval and dense video captioning. Compared with video-language retrieval, video segment retrieval is a novel task that uses natural language to retrieve a specific segment from a whole video. A common approach is to learn a similarity metric between video and language features. In this chapter, we utilize an ensemble learning method to learn a video segment retrieval model. Our ensemble model combines single-stream models to learn a better similarity metric. We evaluate our method on the video clip retrieval task with the newly proposed Distinct Describable Moments dataset. Extensive experiments show that our approach achieves an improvement over the state of the art.
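To make the abstract's core idea concrete, the sketch below shows one plausible reading of an ensemble of single-stream cross-modal models: each stream projects video-segment and sentence features into a shared space and scores them by cosine similarity, and the ensemble combines the streams' scores with a learned weighted average. This is a minimal illustration, not the chapter's actual architecture; the framework (PyTorch), feature dimensions, number of streams, and combination rule are all assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleStreamRetrieval(nn.Module):
    # One cross-modal stream: embeds video-segment and sentence features
    # into a shared space and scores them with cosine similarity.
    # (Hypothetical dimensions; the chapter does not specify them here.)
    def __init__(self, video_dim=500, text_dim=300, joint_dim=256):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, joint_dim)
        self.text_proj = nn.Linear(text_dim, joint_dim)

    def forward(self, video_feat, text_feat):
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        t = F.normalize(self.text_proj(text_feat), dim=-1)
        return (v * t).sum(dim=-1)  # cosine similarity per segment/query pair

class EnsembleRetrieval(nn.Module):
    # Combines the similarity scores of several single-stream models via a
    # learned weighted average (the chapter's combination rule may differ).
    def __init__(self, streams):
        super().__init__()
        self.streams = nn.ModuleList(streams)
        self.weights = nn.Parameter(torch.ones(len(streams)) / len(streams))

    def forward(self, video_feat, text_feat):
        scores = torch.stack([s(video_feat, text_feat) for s in self.streams])
        w = F.softmax(self.weights, dim=0)[:, None]
        return (w * scores).sum(dim=0)

# Usage: score every candidate segment against one query sentence and rank them.
streams = [SingleStreamRetrieval() for _ in range(3)]
model = EnsembleRetrieval(streams)
video_segments = torch.randn(21, 500)        # candidate segment features
query = torch.randn(1, 300).expand(21, -1)   # one sentence feature, broadcast
ranking = model(video_segments, query).argsort(descending=True)

Ranking candidate segments of a video by the ensemble's similarity score is the retrieval step the abstract describes; the ensemble's motivation is that the combined metric is stronger than any single stream's.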

Citation (APA)

Yu, X., Zhang, Y., & Zhang, R. (2020). Cross-modality video segment retrieval with ensemble learning. In Domain Adaptation for Visual Understanding (pp. 65–79). Springer International Publishing. https://doi.org/10.1007/978-3-030-30671-7_5
