The aim of video summarization is to distill a video into a concise form without losing its gist. In general, a summary whose semantics are similar to those of the original video can represent it well. Unfortunately, most existing methods focus on the diversity and representativeness of the video content, and few take the video's semantics into consideration. Moreover, most semantics-aware methods rely on manually annotated descriptions of the video, which leads to a biased model. To address these issues, we propose a novel semantic-consistent unsupervised framework, termed ScSUM, which extracts the essence of the video by maximizing semantic similarity and requires no manual descriptions. In particular, ScSUM consists of a frame selector and a video descriptor; the descriptor not only predicts the description of the summary but also produces the description of the original video, which serves as the target. The main goal of our approach is to minimize the distance between the summary and the original video in the semantic space. Finally, experiments on two benchmark datasets validate the effectiveness of the proposed method and demonstrate that it achieves competitive performance.
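The core objective above, minimizing the distance between the summary and the original video in a shared semantic space, can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the function name `semantic_consistency_loss` is hypothetical, and we assume the frame selector and video descriptor have already produced fixed-length semantic embeddings for the summary and the full video. Cosine distance is used here as one plausible choice of distance; the paper does not specify its metric in this abstract.

```python
import numpy as np


def semantic_consistency_loss(summary_emb: np.ndarray, video_emb: np.ndarray) -> float:
    """Cosine distance between the summary embedding and the original-video
    embedding. A loss of 0 means the two are semantically identical, so
    minimizing this loss pushes the summary toward the video's semantics.
    (Hypothetical helper; embeddings are assumed to come from a descriptor
    network as described in the abstract.)
    """
    s = summary_emb / np.linalg.norm(summary_emb)
    v = video_emb / np.linalg.norm(video_emb)
    # 1 - cosine similarity: ranges from 0 (same direction) to 2 (opposite).
    return 1.0 - float(np.dot(s, v))


# Toy usage: a summary embedding aligned with the video embedding yields
# (near-)zero loss, while an orthogonal one is penalized.
video = np.array([0.6, 0.8, 0.0])
good_summary = np.array([0.3, 0.4, 0.0])   # same direction as `video`
bad_summary = np.array([0.0, 0.0, 1.0])    # orthogonal to `video`

print(semantic_consistency_loss(good_summary, video))  # close to 0.0
print(semantic_consistency_loss(bad_summary, video))   # 1.0
```

In a training loop, this quantity would be computed on descriptor outputs and backpropagated through the frame selector, so the selector learns to keep frames that preserve the video's semantics without any human-written captions as targets.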
Zhao, Y., Hu, X., Liu, X., & Fan, C. (2020). Learning unsupervised video summarization with semantic-consistent network. In Communications in Computer and Information Science (Vol. 1265 CCIS, pp. 207–219). Springer. https://doi.org/10.1007/978-981-15-7670-6_18