Visual storytelling, which aims at automatically producing a narrative paragraph for photo album, remains quite challenging due to the complexity and diversity of photo album content. In addition, open-domain photo albums cover a broad range of topics and this results in highly variable vocabularies and expression styles to describe photo albums. In this work, a novel teacher-student visual storytelling framework with hierarchical BERT semantic guidance (HBSG) is proposed to address the above-mentioned challenges. The proposed teacher module consists of two joint tasks, namely, word-level latent topic generation and semantic-guided sentence generation. The first task aims to predict the latent topic of the story. As there is no ground-truth topic information, a pre-trained BERT model based on visual contents and annotated stories is utilized to mine topics. Then the topic vector is distilled to a designed image-topic prediction model. In the semantic-guided sentence generation task, HBSG is introduced for two purposes. The first is to narrow down the language complexity across topics, where the co-attention decoder with vision and semantic is designed to leverage the latent topics to induce topic-related language models. The second is to employ sentence semantic as an online external linguistic knowledge teacher module. Finally, an auxiliary loss is devised to transform linguistic knowledge into the language generation model. Extensive experiments are performed to demonstrate the effectiveness of HBSG framework, which surpasses the state-of-the-art approaches evaluated on the VIST test set.
CITATION STYLE
Fan, R., Wang, H., Gu, J., & Liu, X. (2021). Visual Storytelling with Hierarchical BERT Semantic Guidance. In ACM International Conference Proceeding Series. Association for Computing Machinery. https://doi.org/10.1145/3469877.3490604
Mendeley helps you to discover research relevant for your work.