Improvement of language models using dual-source backoff

Abstract

Language models are essential for predicting the next word in a spoken sentence, thereby enhancing speech recognition accuracy, among other things. However, spoken language domains are numerous, and developers often suffer from a lack of corpora of sufficient size. This paper proposes a method of combining two n-gram language models, one constructed from a very small corpus in the target domain of interest and the other constructed from a large but less adequate corpus, resulting in a significantly enhanced language model. The method is based on the observation that a small corpus from the right domain yields high-quality n-grams but suffers from a serious sparseness problem, while a large corpus from a different domain provides richer n-gram statistics that are incorrectly biased. In our approach, the two sets of n-gram statistics are combined by extending the idea of Katz's backoff, and the method is therefore called dual-source backoff. We ran experiments with 3-gram language models constructed from newspaper corpora of several million to tens of millions of words, together with models built from smaller broadcast news corpora. The target domain was broadcast news. We obtained a significant improvement (30%) by incorporating a small corpus around one thirtieth the size of the newspaper corpus. © Springer-Verlag Berlin Heidelberg 2004.
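The abstract does not give the paper's exact formulation, but the toy sketch below illustrates one plausible reading of dual-source backoff for bigrams: prefer discounted counts from the small in-domain corpus when the n-gram was observed there, otherwise fall back to discounted counts from the large out-of-domain corpus, and finally to a pooled unigram estimate. The toy corpora, the absolute-discount constant, and the omission of proper backoff-weight normalization (which a full Katz-style model would require) are all simplifications for illustration, not the paper's method.

```python
from collections import Counter

# Hypothetical toy corpora; in the paper these roles are played by a small
# broadcast-news corpus (in-domain) and a large newspaper corpus (out-of-domain).
SMALL_CORPUS = "the news tonight covers the storm in the north".split()
LARGE_CORPUS = ("the newspaper reported the storm and the north wind "
                "covers the region tonight and the news follows").split()

DISCOUNT = 0.5  # absolute discount; a stand-in for Katz's Good-Turing discounts


def bigram_counts(tokens):
    """Unigram and bigram counts for one corpus."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return unigrams, bigrams


SMALL_UNI, SMALL_BI = bigram_counts(SMALL_CORPUS)
LARGE_UNI, LARGE_BI = bigram_counts(LARGE_CORPUS)
VOCAB = set(SMALL_CORPUS) | set(LARGE_CORPUS)


def discounted_prob(history, word, unigrams, bigrams):
    """Discounted bigram estimate from a single corpus, or None if unseen."""
    c = bigrams[(history, word)]
    if c == 0 or unigrams[history] == 0:
        return None
    return max(c - DISCOUNT, 0.0) / unigrams[history]


def leftover_mass(history, unigrams, bigrams):
    """Probability mass released by discounting, handed to the next backoff level."""
    if unigrams[history] == 0:
        return 1.0
    seen = sum(max(bigrams[(history, w)] - DISCOUNT, 0.0)
               for w in VOCAB if bigrams[(history, w)] > 0)
    return 1.0 - seen / unigrams[history]


def unigram_prob(word):
    """Lowest-order fallback: unigram estimate pooled over both corpora."""
    total = sum(SMALL_UNI.values()) + sum(LARGE_UNI.values())
    return (SMALL_UNI[word] + LARGE_UNI[word]) / total


def dual_source_backoff(history, word):
    """Sketch: in-domain counts first, then out-of-domain counts, then unigrams."""
    p = discounted_prob(history, word, SMALL_UNI, SMALL_BI)
    if p is not None:
        return p
    alpha_small = leftover_mass(history, SMALL_UNI, SMALL_BI)
    p = discounted_prob(history, word, LARGE_UNI, LARGE_BI)
    if p is not None:
        return alpha_small * p
    alpha_large = leftover_mass(history, LARGE_UNI, LARGE_BI)
    return alpha_small * alpha_large * unigram_prob(word)


print(dual_source_backoff("the", "news"))    # seen in the small in-domain corpus
print(dual_source_backoff("the", "region"))  # unseen in-domain, seen out-of-domain
```

In this reading, the in-domain model is trusted wherever it has evidence, and the large corpus only fills the gaps left by sparseness, weighted by the mass the in-domain discounting released; the actual paper presumably defines the discounts and backoff weights more carefully than this sketch does.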

Citation (APA)

Cho, S. (2004). Improvement of language models using dual-source backoff. In Lecture Notes in Artificial Intelligence (Subseries of Lecture Notes in Computer Science) (Vol. 3157, pp. 892–900). Springer Verlag. https://doi.org/10.1007/978-3-540-28633-2_94
