Class-based N-gram language model for new words using out-of-vocabulary to in-vocabulary similarity

Welly Naptali; Masatoshi Tsuchiya; Seiichi Nakagawa

Journal Article, Open Access

IEICE Transactions on Information and Systems (2012) E95-D(9) 2308-2317

DOI: 10.1587/transinf.E95.D.2308

14 Citations

8 Readers

Abstract

Out-of-vocabulary (OOV) words create serious problems for automatic speech recognition (ASR) systems. Not only are they misrecognized as in-vocabulary (IV) words with similar phonetics, but the error also causes further errors in nearby words. Language models (LMs) for most open-vocabulary ASR systems treat OOV words as a single entity, ignoring the linguistic information. In this paper we present a class-based n-gram LM that is able to deal with OOV words by treating each of them individually without retraining all the LM parameters. OOV words are assigned to IV classes that group IV words with similar semantic meanings. The World Wide Web is used to acquire additional data for finding the relation between the OOV and IV words. An evaluation based on adjusted perplexity and word error rate was carried out on the Wall Street Journal corpus. The results suggest that using multiple classes for OOV words is preferable to using a single unknown class. Copyright © 2012 The Institute of Electronics, Information and Communication Engineers.
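
As a rough illustration of the idea in the abstract, the sketch below uses the standard class-based bigram factorization P(w | w_prev) = P(c(w) | c(w_prev)) * P(w | c(w)) and places an OOV word into the class of its most similar IV word. It is not the authors' implementation. The names `ClassBigramLM`, `add_oov`, and `oov_emission` are hypothetical, the similarity scores are assumed to be given (the paper derives them from web data), and the fixed OOV emission probability is a placeholder that leaves the model unsmoothed and not renormalized.

```python
from collections import Counter


class ClassBigramLM:
    """Toy class-based bigram LM: P(w | prev) = P(c_w | c_prev) * P(w | c_w)."""

    def __init__(self, sentences, word2class):
        # word2class maps every in-vocabulary (IV) word to a class label.
        self.word2class = dict(word2class)
        self.word_count = Counter()
        self.class_count = Counter()
        self.class_bigram = Counter()
        self.class_history = Counter()
        for sent in sentences:
            classes = [self.word2class[w] for w in sent]
            self.word_count.update(sent)
            self.class_count.update(classes)
            for prev, cur in zip(classes, classes[1:]):
                self.class_bigram[(prev, cur)] += 1
                self.class_history[prev] += 1

    def add_oov(self, word, similarity):
        # Assumption: similarity maps IV words to scores for this OOV word.
        # The OOV word joins the class of its most similar IV word, so no
        # class n-gram parameters need to be retrained.
        best_iv = max(similarity, key=similarity.get)
        self.word2class[word] = self.word2class[best_iv]

    def prob(self, word, prev, oov_emission=1e-4):
        c_prev = self.word2class[prev]
        c_cur = self.word2class[word]
        if self.class_history[c_prev] == 0:
            return 0.0  # class never seen as a history; no smoothing here
        p_class = self.class_bigram[(c_prev, c_cur)] / self.class_history[c_prev]
        if word in self.word_count:
            p_word = self.word_count[word] / self.class_count[c_cur]
        else:
            # Placeholder emission probability for OOV words (an assumption).
            p_word = oov_emission
        return p_class * p_word


# Usage on a two-sentence toy corpus with hand-assigned classes.
corpus = [["stocks", "rose"], ["bonds", "fell"]]
classes = {"stocks": "N", "bonds": "N", "rose": "V", "fell": "V"}
lm = ClassBigramLM(corpus, classes)
lm.add_oov("equities", {"stocks": 0.9, "rose": 0.1})
```

Because the OOV word inherits the class of `stocks`, it reuses the class transition statistics already estimated for that class, which is the property the abstract contrasts with a single unknown class.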

Cite (APA)

Naptali, W., Tsuchiya, M., & Nakagawa, S. (2012). Class-based N-gram language model for new words using out-of-vocabulary to in-vocabulary similarity. IEICE Transactions on Information and Systems, E95-D(9), 2308–2317. https://doi.org/10.1587/transinf.E95.D.2308
