We develop and probe a model for detecting the boundaries of prosodic chunks in untranscribed conversational English speech. The model is obtained by fine-tuning a Transformer-based speech-to-text (STT) model to integrate the identification of Intonation Unit (IU) boundaries with the STT task. The model shows robust performance, both on held-out data and on out-of-distribution data representing different dialects and transcription protocols. By evaluating the model on degraded speech data, and comparing it with alternatives, we establish that it relies heavily on lexico-syntactic information inferred from audio, and not solely on acoustic information typically understood to cue prosodic structure. We release our model as both a transcription tool and a baseline for further improvements in prosodic segmentation.
CITATION STYLE
Roll, N., Graham, C., & Todd, S. (2023). PSST! Prosodic Speech Segmentation with Transformers. In CoNLL 2023 - 27th Conference on Computational Natural Language Learning, Proceedings (pp. 476–487). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2023.conll-1.31