To fulfil their purpose of conveying information, subtitles need to be easily readable. The segmentation of subtitles into phrases or linguistic units is key to their readability and comprehension. However, automatically segmenting a sentence into subtitles is a challenging task, and data containing reliable human segmentation decisions are often scarce. In this paper, we leverage data with noisy segmentation from large subtitle corpora and combine them with smaller amounts of high-quality data in order to train models that perform automatic segmentation of a sentence into subtitles. We show that even a minimal amount of reliable data can lead to readable subtitles and that quality is more important than quantity for the task of subtitle segmentation.
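As an illustration of the task, the sketch below (not the authors' implementation) shows one common way subtitle segmentation is represented: special break symbols such as <eol> (line break inside a subtitle) and <eob> (end of a subtitle block) are inserted into the sentence, so that a model can be trained to predict where the breaks belong. The greedy length-based segmenter and the 42-character line limit are illustrative assumptions only, meant to stand in for a simple baseline against which learned segmentation models would be compared.

```python
# Minimal sketch of subtitle segmentation with break symbols.
# Assumptions (not from the paper): a greedy length-based baseline,
# a 42-character line limit, and two lines per subtitle block.

MAX_CHARS_PER_LINE = 42


def greedy_segment(words, max_chars=MAX_CHARS_PER_LINE, lines_per_block=2):
    """Fill lines greedily by character length, then group
    `lines_per_block` lines into one subtitle block."""
    lines, current = [], []
    for word in words:
        candidate = " ".join(current + [word])
        if current and len(candidate) > max_chars:
            lines.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        lines.append(" ".join(current))

    # Join lines with break symbols: <eol> inside a block, <eob> after a block.
    out = []
    for i, line in enumerate(lines, start=1):
        out.append(line)
        out.append("<eob>" if i % lines_per_block == 0 else "<eol>")
    return " ".join(out)


if __name__ == "__main__":
    sentence = ("However automatically segmenting a sentence into subtitles "
                "is a challenging task").split()
    print(greedy_segment(sentence))
```

A learned segmenter would replace this greedy rule with a model trained on corpora annotated with such break symbols, which is where the trade-off between large noisy data and small high-quality data discussed in the paper comes into play.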
Karakanta, A., Negri, M., & Turchi, M. (2020). Point break: Surfing heterogeneous data for subtitle segmentation. In CEUR Workshop Proceedings (Vol. 2769). CEUR-WS. https://doi.org/10.4000/books.aaccademia.8620