Recovery of rare words in lecture speech

Stefan Kombrink; Mirko Hannemann; Lukáš Burget; Hynek Heřmanský

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2010) 6231 LNAI 330-337

DOI: 10.1007/978-3-642-15760-8_42

6 Citations

13 Readers

Abstract

The vocabulary used in speech usually consists of two types of words: a limited set of common words, shared across multiple documents, and a virtually unlimited set of rare words, each of which may appear only a few times and only in particular documents. Most documents, however, do not contain a given rare word at all. Words of the first type are typically included in the language model of an automatic speech recognizer (ASR) and are thus widely referred to as in-vocabulary (IV). Words of the second type are missing from the language model and are thus called out-of-vocabulary (OOV), yet they usually carry important information. We use a hybrid word/sub-word recognizer to detect OOV words occurring in English talks and describe them as sequences of sub-words. We detected about one third of all OOV words and were able to recover the correct spelling for 26.2% of all detections by using a phoneme-to-grapheme (P2G) conversion trained on the recognition dictionary. By omitting detections that were recovered as IV words, we were able to increase the precision of the OOV detection substantially. © 2010 Springer-Verlag Berlin Heidelberg.
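The final filtering step can be illustrated with a minimal sketch. It follows one reading of the abstract: a detected OOV region whose recovered spelling turns out to be in the recognizer vocabulary is treated as a false alarm and dropped. The `Detection` type, the function names, and the toy data are all hypothetical and are not taken from the paper; the precision values the script computes describe only this toy example, not the reported results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected OOV region (hypothetical structure, for illustration only)."""
    recovered_spelling: str  # spelling produced by an assumed P2G step
    is_true_oov: bool        # reference label, used only for scoring


def filter_iv_detections(detections, vocabulary):
    """Drop detections whose recovered spelling is already in the vocabulary."""
    vocab = {word.lower() for word in vocabulary}
    return [d for d in detections if d.recovered_spelling.lower() not in vocab]


def precision(detections):
    """Fraction of detections that are true OOV words (0.0 for an empty list)."""
    if not detections:
        return 0.0
    return sum(d.is_true_oov for d in detections) / len(detections)


# Toy data, invented for this sketch.
vocabulary = {"the", "speech", "model", "lecture"}
detections = [
    Detection("heteroscedastic", True),
    Detection("speech", False),      # recovered as an IV word: filtered out
    Detection("eigenvoice", True),
    Detection("model", False),       # recovered as an IV word: filtered out
    Detection("lekture", False),     # false alarm that the filter cannot catch
]

kept = filter_iv_detections(detections, vocabulary)
print(f"precision before filtering: {precision(detections):.2f}")
print(f"precision after filtering:  {precision(kept):.2f}")
```

The filter can only raise precision when the removed detections are mostly false alarms; a true OOV word whose recovered spelling happens to match an IV word would be lost, which is why this is a sketch of the idea rather than a tuned procedure.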

Citation (APA)

Kombrink, S., Hannemann, M., Burget, L., & Heřmanský, H. (2010). Recovery of rare words in lecture speech. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6231 LNAI, pp. 330–337). https://doi.org/10.1007/978-3-642-15760-8_42
