Although Automatic Speech Recognition (ASR) systems have improved significantly over the last decade, they still make many errors when transcribing speech to text. One of the most common causes of these errors is phonetic confusion between similar-sounding expressions. As a result, ASR transcriptions often contain "quasi-oronyms", i.e., words or phrases that sound similar to the intended ones but have completely different semantics (e.g., "win" instead of "when", or "accessible on defecting" instead of "accessible and affecting"). These errors significantly degrade the performance of downstream Natural Language Understanding (NLU) models (e.g., intent classification, slot filling, etc.) and impair the user experience. To make NLU models more robust to such errors, we propose novel phonetic-aware text representations. Specifically, we represent ASR transcriptions at the phoneme level, aiming to capture pronunciation similarities that are typically neglected in word-level representations (e.g., word embeddings). To train and evaluate our phoneme representations, we generate noisy ASR transcriptions of four existing datasets (Stanford Sentiment Treebank, SQuAD, TREC Question Classification, and Subjectivity Analysis) and show that common neural network architectures exploiting the proposed phoneme representations can effectively handle noisy transcriptions and significantly outperform state-of-the-art baselines. Finally, we confirm these results by testing our models on real utterances spoken to the Alexa virtual assistant.
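The core idea, mapping text to phoneme sequences so that acoustically confusable expressions stay close in the model's input space, can be illustrated with a minimal sketch. The example below uses NLTK's interface to the CMU Pronouncing Dictionary as the grapheme-to-phoneme (G2P) step; the to_phonemes helper, the "<w>" word-boundary marker, and the stress-stripping choice are illustrative assumptions, not the paper's exact pipeline.

# Minimal sketch of a phoneme-level input representation, assuming a
# G2P step based on the CMU Pronouncing Dictionary via NLTK. The
# paper's actual G2P tooling and phoneme inventory may differ.
import nltk
from nltk.corpus import cmudict

nltk.download("cmudict", quiet=True)
PRON = cmudict.dict()  # word -> list of ARPAbet pronunciations

def to_phonemes(sentence):
    """Map each in-vocabulary word to its first CMUdict pronunciation,
    stripping stress digits (e.g., 'EH1' -> 'EH')."""
    phonemes = []
    for word in sentence.lower().split():
        prons = PRON.get(word.strip(".,?!"))
        if prons:  # keep only words the dictionary covers
            phonemes.extend(p.rstrip("012") for p in prons[0])
            phonemes.append("<w>")  # hypothetical word-boundary marker
    return phonemes

# 'win' and 'when' differ by a single phoneme at this level, while
# word embeddings would treat them as two unrelated tokens.
print(to_phonemes("when"))  # e.g. ['W', 'EH', 'N', '<w>']
print(to_phonemes("win"))   # e.g. ['W', 'IH', 'N', '<w>']

# Index the phoneme sequence for the embedding layer of a downstream
# NLU model (intent classifier, slot filler, etc.).
vocab = {ph: i for i, ph in enumerate(sorted(set(to_phonemes("when win"))))}
ids = [vocab[ph] for ph in to_phonemes("win when")]

A sequence encoder (e.g., a CNN or RNN) over such indexed phoneme sequences then sees a quasi-oronym as a small edit to the intended utterance rather than a completely different input, which is the robustness property the paper targets.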
CITATION
Fang, A., Filice, S., Limsopatham, N., & Rokhlenko, O. (2020). Using Phoneme Representations to Build Predictive Models Robust to ASR Errors. In SIGIR 2020 - Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 699–708). Association for Computing Machinery, Inc. https://doi.org/10.1145/3397271.3401050