Point break: Surfing heterogeneous data for subtitle segmentation

Abstract

Subtitles, in order to achieve their purpose of transmitting information, need to be easily readable. The segmentation of subtitles into phrases or linguistic units is key to their readability and comprehension. However, automatically segmenting a sentence into subtitles is a challenging task and data containing reliable human segmentation decisions are often scarce. In this paper, we leverage data with noisy segmentation from large subtitle corpora and combine them with smaller amounts of high-quality data in order to train models which perform automatic segmentation of a sentence into subtitles. We show that even a minimum amount of reliable data can lead to readable subtitles and that quality is more important than quantity for the task of subtitle segmentation.

Cite

Karakanta, A., Negri, M., & Turchi, M. (2020). Point break: Surfing heterogeneous data for subtitle segmentation. In CEUR Workshop Proceedings (Vol. 2769). CEUR-WS. https://doi.org/10.4000/books.aaccademia.8620
