Abstract
Inserting punctuation marks into the word-chain hypothesis produced by automatic speech recognition (ASR) has long been a neglected task. In several application domains of ASR, however, real-time punctuation is vital for human readability. This paper proposes and evaluates a prosody-inspired approach and a phrase sequence model, implemented as a recurrent neural network, to predict punctuation marks from the audio. In a very basic and lightweight modeling framework, we show that punctuation with state-of-the-art performance is possible based solely on the audio signal, for speech close to read quality. We also test the approach on more spontaneous speaking styles and on ASR transcripts, which may contain word errors. A subjective evaluation quantifies the benefit of punctuation for human readability, and we show that once a critical punctuation accuracy is reached, humans cannot distinguish automatic from human-produced punctuation, even though the former may contain punctuation errors.
Szaszák, G. (2019). An audio-based sequential punctuation model for ASR and its effect on human readability. Acta Polytechnica Hungarica, 16(2), 93–108. https://doi.org/10.12700/APH.16.2.2019.2.6