Hierarchical3D Adapters for Long Video-to-text Summarization

Abstract

In this paper, we focus on video-to-text summarization and investigate how to best utilize multimodal information for summarizing long inputs (e.g., an hour-long TV show) into long outputs (e.g., a multi-sentence summary). We extend SummScreen (Chen et al., 2022), a dialogue summarization dataset consisting of transcripts of TV episodes with reference summaries, and create a multimodal variant by collecting corresponding full-length videos. We incorporate multimodal information into a pretrained textual summarizer efficiently using adapter modules augmented with a hierarchical structure while tuning only 3.8% of model parameters. Our experiments demonstrate that multimodal adapters outperform more memory-heavy and fully fine-tuned textual summarization methods.
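To make the adapter idea concrete, the following is a minimal sketch of a generic bottleneck adapter (down-projection, nonlinearity, up-projection, residual connection) in plain Python, together with the arithmetic for the share of parameters that would be trained when the backbone stays frozen. It is not the hierarchical 3D adapter described in the paper: the function names, dimensions, and parameter counts are all hypothetical and chosen only for illustration, and the sketch makes no attempt to reproduce the 3.8% figure reported above.

```python
import random


def make_adapter(d_model, d_bottleneck, seed=0):
    """Create a hypothetical bottleneck adapter (bias terms omitted for brevity)."""
    rng = random.Random(seed)
    # Down-projection: small random weights, shape d_model x d_bottleneck.
    down = [[rng.gauss(0.0, 0.02) for _ in range(d_bottleneck)]
            for _ in range(d_model)]
    # Up-projection: zeros, so the adapter starts out as the identity function.
    up = [[0.0] * d_model for _ in range(d_bottleneck)]
    return {"down": down, "up": up}


def matvec(x, w):
    """Multiply a vector of length n by an n x m matrix (list of rows)."""
    return [sum(x[i] * w[i][j] for i in range(len(x)))
            for j in range(len(w[0]))]


def adapter_forward(x, adapter):
    """Apply down-projection, ReLU, up-projection, then add the residual."""
    hidden = [max(0.0, v) for v in matvec(x, adapter["down"])]
    delta = matvec(hidden, adapter["up"])
    return [xi + di for xi, di in zip(x, delta)]


def adapter_param_count(d_model, d_bottleneck):
    """Number of weights in one adapter (two projection matrices, no biases)."""
    return 2 * d_model * d_bottleneck


def trainable_fraction(frozen_params, d_model, d_bottleneck, n_adapters):
    """Share of all parameters that is trained when only adapters are tuned."""
    trainable = n_adapters * adapter_param_count(d_model, d_bottleneck)
    return trainable / (frozen_params + trainable)


if __name__ == "__main__":
    adapter = make_adapter(d_model=4, d_bottleneck=2)
    print(adapter_forward([1.0, -2.0, 0.5, 3.0], adapter))
    # Hypothetical sizes only; these are not the paper's settings.
    print(trainable_fraction(frozen_params=1000, d_model=4,
                             d_bottleneck=2, n_adapters=5))
```

Because the up-projection starts at zero, the adapter initially passes its input through unchanged, so the frozen summarizer's behavior is preserved until training updates the adapter weights.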

Cite

Papalampidi, P., & Lapata, M. (2023). Hierarchical3D Adapters for Long Video-to-text Summarization. In EACL 2023 - 17th Conference of the European Chapter of the Association for Computational Linguistics, Findings of EACL 2023 (pp. 1267–1290). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.findings-eacl.96
