Jointly modeling vision and language is an emerging research area with many applications, such as video segment retrieval and dense video captioning. In contrast to whole-video retrieval, video segment retrieval uses a natural language query to retrieve a specific segment from within a video. A common approach is to learn a similarity metric between video and language features. In this chapter, we use ensemble learning to train a video segment retrieval model: the ensemble combines several single-stream models to learn a better similarity metric. We evaluate our method on the task of video clip retrieval with the recently proposed Distinct Describable Moments dataset. Extensive experiments show that our approach improves over the state of the art.
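The abstract does not specify how the single-stream scores are combined, so the following is only a minimal illustrative sketch, assuming each single-stream model projects the video and text features into a shared space, similarity is measured with cosine similarity, and the ensemble takes a (possibly weighted) average of the per-model scores. All function and variable names here are hypothetical.

```python
import numpy as np

def cosine_similarity(v, t):
    # Cosine similarity between a video-segment embedding and a text embedding.
    return float(np.dot(v, t) / (np.linalg.norm(v) * np.linalg.norm(t)))

def ensemble_similarity(video_feat, text_feat, models, weights=None):
    """Combine similarity scores from several single-stream models.

    Each model maps (video_feat, text_feat) to a pair of embeddings in a
    shared space; the per-model cosine similarities are averaged
    (optionally with per-model weights). This is an assumed combination
    rule, not necessarily the one used in the chapter.
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    score = 0.0
    for model, w in zip(models, weights):
        v, t = model(video_feat, text_feat)
        score += w * cosine_similarity(v, t)
    return score

# Toy "single-stream models": fixed linear projections into a shared space.
rng = np.random.default_rng(0)
projections = [(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
               for _ in range(3)]
models = [lambda v, t, Pv=Pv, Pt=Pt: (Pv @ v, Pt @ t) for Pv, Pt in projections]

video_feat = rng.standard_normal(8)
text_feat = rng.standard_normal(8)
print(ensemble_similarity(video_feat, text_feat, models))
```

At retrieval time, a query would be scored against every candidate segment of the video with `ensemble_similarity`, and the highest-scoring segment returned.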
Yu, X., Zhang, Y., & Zhang, R. (2020). Cross-modality video segment retrieval with ensemble learning. In Domain Adaptation for Visual Understanding (pp. 65–79). Springer International Publishing. https://doi.org/10.1007/978-3-030-30671-7_5