SBAT: Video captioning with sparse boundary-aware transformer


Abstract

In this paper, we focus on the problem of applying the transformer structure to video captioning effectively. The vanilla transformer was proposed for unimodal language generation tasks such as machine translation. Video captioning, however, is a multimodal learning problem, and video features contain considerable redundancy across time steps. Motivated by these observations, we propose a novel method called sparse boundary-aware transformer (SBAT) to reduce the redundancy in the video representation. SBAT applies a boundary-aware pooling operation to the multi-head attention scores and selects diverse features from different scenarios. SBAT also includes a local correlation scheme to compensate for the local information loss caused by the sparse operation. Building on SBAT, we further propose an aligned cross-modal encoding scheme to strengthen the multimodal interaction. Experimental results on two benchmark datasets show that SBAT outperforms state-of-the-art methods on most metrics.
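To make the core idea concrete, the sketch below illustrates one plausible reading of boundary-aware sparse attention: attention scores are biased toward positions where the video features change sharply (a heuristic proxy for scene boundaries), and only a small top-k subset of keys is kept per query. This is a minimal sketch under stated assumptions; the function names, the boundary heuristic, and the `top_k` parameter are hypothetical and not taken from the authors' implementation.

```python
# Hypothetical sketch of sparse boundary-aware attention, NOT the official
# SBAT code. Assumes a simple feature-difference heuristic for boundaries.
import torch
import torch.nn.functional as F

def boundary_scores(feats):
    # feats: (B, T, D) video features.
    # Large change between adjacent time steps is treated as a scene
    # boundary signal (an assumption about "boundary-aware").
    diff = feats[:, 1:] - feats[:, :-1]          # (B, T-1, D)
    change = diff.norm(dim=-1)                   # (B, T-1)
    return F.pad(change, (1, 0))                 # (B, T); first step gets 0

def sparse_boundary_aware_attention(q, k, v, feats, top_k=8):
    # q: (B, Tq, D); k, v, feats: (B, Tk, D).
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d ** 0.5  # (B, Tq, Tk)
    # Bias scores toward boundary positions so the selected keys are
    # spread across scene changes rather than one redundant segment.
    scores = scores + boundary_scores(feats).unsqueeze(1)
    # Keep only the top-k keys per query and mask out the rest (sparsity).
    idx = scores.topk(min(top_k, scores.size(-1)), dim=-1).indices
    mask = torch.full_like(scores, float('-inf'))
    mask.scatter_(-1, idx, 0.0)
    attn = torch.softmax(scores + mask, dim=-1)
    return attn @ v                              # (B, Tq, D)
```

As a usage example, calling `sparse_boundary_aware_attention(q, k, v, feats)` with tensors of shape `(2, 20, 64)` returns a `(2, 20, 64)` output in which each query attends to at most 8 keys. The paper's local correlation scheme would then compensate for local context lost by this masking.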

Cite

APA

Jin, T., Huang, S., Chen, M., Li, Y., & Zhang, Z. (2020). SBAT: Video captioning with sparse boundary-aware transformer. In IJCAI International Joint Conference on Artificial Intelligence (Vol. 2021-January, pp. 630–636). International Joint Conferences on Artificial Intelligence. https://doi.org/10.24963/ijcai.2020/88
