Abstract
We propose to generalize language models for conversational speech recognition to allow them to operate across utterance boundaries and speaker changes, thereby capturing conversation-level phenomena such as adjacency pairs, lexical entrainment, and topical coherence. The model consists of a long short-term memory (LSTM) recurrent network that reads the entire word-level history of a conversation, as well as information about turn taking and speaker overlap, in order to predict each next word. The model is applied in a rescoring framework, where the word history prior to the current utterance is approximated with preliminary recognition results. In experiments in the conversational telephone speech domain (Switchboard), we find that such a model gives substantial perplexity reductions over a standard LSTM-LM with utterance scope, as well as improvements in word error rate.
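The abstract does not include code, but the core idea (an LSTM whose recurrent state is carried across utterance boundaries within a conversation, with the word input augmented by turn-taking and speaker-overlap signals) can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: the class name `SessionLSTMLM`, the dimensions, and the choice of two binary turn features are all assumptions.

```python
import torch
import torch.nn as nn


class SessionLSTMLM(nn.Module):
    """Illustrative sketch of a session-level LSTM language model.

    The LSTM state is carried across utterance boundaries within one
    conversation, so the model conditions on the full session history.
    Each word embedding is concatenated with simple turn-taking
    features (e.g. speaker change, speech overlap). All names and
    dimensions are hypothetical, chosen only for illustration.
    """

    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512,
                 n_turn_features=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim + n_turn_features, hidden_dim,
                            batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, turn_features, state=None):
        # word_ids: (batch, time) integer token ids.
        # turn_features: (batch, time, n_turn_features) floats, e.g.
        # binary indicators for "speaker changed" and "overlapped speech".
        # state: (h, c) from the previous utterance in the same
        # conversation, or None at a conversation boundary.
        x = torch.cat([self.embed(word_ids), turn_features], dim=-1)
        output, state = self.lstm(x, state)
        # Logits over the vocabulary for each next-word position,
        # plus the updated state to pass into the next utterance.
        return self.out(output), state
```

In a rescoring setup like the one the abstract describes, one would run such a model over the preliminary (e.g. 1-best) transcripts of the preceding utterances to build up `state`, then score each n-best hypothesis for the current utterance by summing its per-word log-probabilities, resetting `state` only at conversation boundaries.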
Citation
Xiong, W., Wu, L., Zhang, J., & Stolcke, A. (2018). Session-level language modeling for conversational speech. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 (pp. 2764–2768). Association for Computational Linguistics. https://doi.org/10.18653/v1/d18-1296