To fulfil their purpose of conveying information, subtitles need to be easily readable. The segmentation of subtitles into phrases or linguistic units is key to their readability and comprehension. However, automatically segmenting a sentence into subtitles is a challenging task, and data containing reliable human segmentation decisions are often scarce. In this paper, we leverage data with noisy segmentation from large subtitle corpora and combine them with smaller amounts of high-quality data in order to train models that perform automatic segmentation of a sentence into subtitles. We show that even a minimal amount of reliable data can lead to readable subtitles and that quality is more important than quantity for the task of subtitle segmentation.
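As an illustration of the task, the sketch below (not the authors' implementation) shows one common way subtitle segmentation is represented: special break symbols such as <eol> (line break inside a subtitle) and <eob> (end of a subtitle block) are inserted into the sentence, so that a model can be trained to predict where the breaks belong. The greedy length-based segmenter and the 42-character line limit are illustrative assumptions only, meant to stand in for a simple baseline against which learned segmentation models would be compared.

```python
# Minimal sketch of subtitle segmentation with break symbols.
# Assumptions (not from the paper): a greedy length-based baseline,
# a 42-character line limit, and two lines per subtitle block.

MAX_CHARS_PER_LINE = 42


def greedy_segment(words, max_chars=MAX_CHARS_PER_LINE, lines_per_block=2):
    """Fill lines greedily by character length, then group
    `lines_per_block` lines into one subtitle block."""
    lines, current = [], []
    for word in words:
        candidate = " ".join(current + [word])
        if current and len(candidate) > max_chars:
            lines.append(" ".join(current))
            current = [word]
        else:
            current.append(word)
    if current:
        lines.append(" ".join(current))

    # Join lines with break symbols: <eol> inside a block, <eob> after a block.
    out = []
    for i, line in enumerate(lines, start=1):
        out.append(line)
        out.append("<eob>" if i % lines_per_block == 0 else "<eol>")
    return " ".join(out)


if __name__ == "__main__":
    sentence = ("However automatically segmenting a sentence into subtitles "
                "is a challenging task").split()
    print(greedy_segment(sentence))
```

A learned segmenter would replace this greedy rule with a model trained on corpora annotated with such break symbols, which is where the trade-off between large noisy data and small high-quality data discussed in the paper comes into play.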
Karakanta, A., Negri, M., & Turchi, M. (2020). Point break: Surfing heterogeneous data for subtitle segmentation. In CEUR Workshop Proceedings (Vol. 2769). CEUR-WS. https://doi.org/10.4000/books.aaccademia.8620