Relevance weighting for combining multi-domain data for n-gram language modeling

Abstract

Standard statistical language modeling techniques suffer from sparse-data problems in tasks where large amounts of domain-specific text are not available. In this paper, we focus on improving the estimation of domain-dependent n-gram models by the selective use of out-of-domain text data. Previous approaches for estimating language models from multi-domain data have not accounted for the characteristic variations of style and content across domains. In contrast, this work aims at differentially weighting subsets of the out-of-domain data according to style and/or content similarity to the given task, where `style' is represented by part-of-speech statistics and `content' by the particular choice of vocabulary items. In addition to n-gram estimation, the differential weights can be used for lexicon design. Recognition experiments are based on the Switchboard corpus of spontaneous conversations, with out-of-domain text drawn from the Wall Street Journal and Broadcast News corpora. The similarity weighting approach gives a 3-5% reduction in word error rate over a domain-specific n-gram language model, providing some of the largest language modeling gains reported for the Switchboard task in recent years.
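The abstract does not reproduce the paper's actual weighting functions, which combine part-of-speech (style) and vocabulary (content) features. The sketch below is a minimal illustration of the general idea only: each out-of-domain subset's n-gram counts are scaled by a hypothetical content-similarity weight, here computed as cosine similarity between unigram distributions. All function names and the choice of cosine similarity are assumptions for illustration, not the authors' implementation.

```python
import math
from collections import Counter

def unigram_profile(sentences):
    """Relative-frequency unigram distribution for a text subset."""
    counts = Counter(w for sent in sentences for w in sent)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def cosine_similarity(p, q):
    """Cosine similarity between two sparse unigram profiles."""
    dot = sum(p[w] * q.get(w, 0.0) for w in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

def relevance_weighted_counts(in_domain, ood_subsets, n=2):
    """Pool n-gram counts, scaling each out-of-domain subset by its
    (assumed) content similarity to the in-domain data."""
    target = unigram_profile(in_domain)
    counts = Counter()

    def add(sentences, weight):
        for sent in sentences:
            padded = ["<s>"] * (n - 1) + sent + ["</s>"]
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += weight

    add(in_domain, 1.0)  # in-domain counts enter at full weight
    for subset in ood_subsets:
        # Out-of-domain counts are down-weighted by similarity to the task.
        add(subset, cosine_similarity(unigram_profile(subset), target))
    return counts

# Toy usage: conversational in-domain data, newswire out-of-domain data.
switchboard = [["uh", "i", "think", "so"], ["yeah", "right"]]
wsj = [["stocks", "fell", "sharply"], ["i", "think", "markets", "rose"]]
counts = relevance_weighted_counts(switchboard, [wsj])
```

The weighted counts would then feed a standard smoothed n-gram estimator; the paper additionally weights by style similarity via part-of-speech statistics, which this sketch omits.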

Cite

APA

Iyer, R., & Ostendorf, M. (1999). Relevance weighting for combining multi-domain data for n-gram language modeling. Computer Speech and Language, 13(3), 267–282. https://doi.org/10.1006/csla.1999.0124
