This paper proposes a method of combining two n-gram language models into a single language model. One of the models is built from a very small corpus in the domain of interest, and the other from a large but less adequate corpus. The method is based on the observation that a small corpus from the right domain yields high-quality n-grams but suffers from the sparseness problem, while a large corpus from another domain is biased toward its own domain but much easier to obtain at scale. The basic idea behind dual-source backoff is the same as Katz's backoff. We ran experiments with 3-gram language models built from newspaper corpora of several million to tens of millions of words, together with models from smaller broadcast-news corpora; the target domain was broadcast news. We obtained significant improvements by incorporating a small corpus about one thirtieth the size of the newspaper corpus. © Springer-Verlag 2004.
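The core idea, consulting the small in-domain model first and falling back to the large out-of-domain model for unseen n-grams, can be sketched as follows. This is an illustrative simplification, not the paper's exact algorithm: the function names are made up, and a fixed backoff weight `alpha` stands in for the properly discounted backoff mass that Katz-style backoff would compute.

```python
from collections import Counter

def train_ngrams(tokens, n):
    """Count n-grams and their (n-1)-gram contexts in a token list."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    contexts = Counter(tuple(tokens[i:i + n - 1]) for i in range(len(tokens) - n + 1))
    return counts, contexts

def dual_source_prob(ngram, small, large, alpha=0.4):
    """P(w | context): use the in-domain (small) model when the n-gram
    was observed there; otherwise back off to the large out-of-domain
    model, scaled by a fixed weight alpha (a stand-in for the discounted
    mass a full Katz-style backoff would redistribute)."""
    counts_s, contexts_s = small
    counts_l, contexts_l = large
    context = ngram[:-1]
    if counts_s[ngram] > 0:
        return counts_s[ngram] / contexts_s[context]
    if counts_l[ngram] > 0 and contexts_l[context] > 0:
        return alpha * counts_l[ngram] / contexts_l[context]
    return 0.0

# Tiny bigram example: the in-domain estimate wins when available;
# otherwise the large-corpus estimate is used, downweighted by alpha.
small = train_ngrams("the news anchor said".split(), 2)
large = train_ngrams("the news report said the media report".split(), 2)
p_in_domain = dual_source_prob(("the", "news"), small, large)   # from small model
p_backoff = dual_source_prob(("the", "media"), small, large)    # backs off to large
```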
CITATION STYLE
Cho, S. (2004). Dual-source backoff for enhancing language models. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3314, 407–412. https://doi.org/10.1007/978-3-540-30497-5_63