This paper proposes a method of combining two n-gram language models, one constructed from a very small corpus in the domain of interest and the other from a large corpus of a different domain, to produce a significantly enhanced language model. The method is based on the observation that a small in-domain corpus yields high-quality n-grams but suffers from a serious sparseness problem, while a large out-of-domain corpus provides richer n-gram statistics but is inadequately biased toward the target domain. The two n-gram models are combined by extending the idea of Katz's backoff. We ran experiments with 3-gram language models constructed from newspaper corpora of several million to tens of millions of words, together with models from smaller broadcast news corpora; the target domain was broadcast news. We obtained a significant improvement (30%) by incorporating a small corpus about one thirtieth the size of the newspaper corpus. © Springer-Verlag 2004.
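The core idea, an in-domain model consulted first with a backoff to an out-of-domain model, can be sketched as follows. This is a minimal illustration of backoff-style combination in general, not the paper's exact formulation; the class and function names, the bigram (rather than 3-gram) order, and the fixed discount constant are all illustrative assumptions.

```python
from collections import defaultdict

class BigramModel:
    """Toy bigram model with maximum-likelihood estimates."""

    def __init__(self, corpus):
        self.unigrams = defaultdict(int)
        self.bigrams = defaultdict(int)
        tokens = corpus.split()
        for w in tokens:
            self.unigrams[w] += 1
        for a, b in zip(tokens, tokens[1:]):
            self.bigrams[(a, b)] += 1
        self.total = len(tokens)

    def prob(self, a, b):
        # ML bigram probability, falling back to the unigram estimate
        # when the bigram (or its history) was never observed.
        if self.unigrams[a] and self.bigrams[(a, b)]:
            return self.bigrams[(a, b)] / self.unigrams[a]
        return self.unigrams[b] / self.total if self.total else 0.0

def combined_prob(small, large, a, b, discount=0.9):
    # Backoff-style combination: if the small in-domain model has seen
    # the bigram, use its (discounted) estimate; otherwise reserve the
    # remaining mass for the large out-of-domain model. The constant
    # discount is a stand-in for properly estimated backoff weights.
    if small.bigrams[(a, b)]:
        return discount * small.prob(a, b)
    return (1.0 - discount) * large.prob(a, b)
```

In a full implementation the discount would come from a smoothing scheme (e.g. Good-Turing counts, as in Katz's original backoff) rather than a fixed constant, so that the combined distribution normalizes properly.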
CITATION STYLE
Cho, S., Kim, S. H., Park, J., & Lee, Y. J. (2004). Overcoming the sparseness problem of spoken language corpora using other large corpora of distinct characteristics. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2945, 407–411. https://doi.org/10.1007/978-3-540-24630-5_48