Abstract
Standard statistical language modeling techniques suffer from sparse-data problems in tasks where large amounts of domain-specific text are not available. In this paper, we focus on improving the estimation of domain-dependent n-gram models by the selective use of out-of-domain text data. Previous approaches for estimating language models from multi-domain data have not accounted for the characteristic variations of style and content across domains. In contrast, this work differentially weights subsets of the out-of-domain data according to their style and/or content similarity to the given task, where 'style' is represented by part-of-speech statistics and 'content' by the particular choice of vocabulary items. In addition to n-gram estimation, the differential weights can be used for lexicon design. Recognition experiments are based on the Switchboard corpus of spontaneous conversations, with out-of-domain text drawn from the Wall Street Journal and Broadcast News corpora. The similarity weighting approach gives a 3-5% reduction in word error rate over a domain-specific n-gram language model, providing some of the largest language modeling gains reported for the Switchboard task in recent years.
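The core idea, scaling each out-of-domain document's counts by a relevance weight before pooling them with in-domain counts, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' estimator: the paper derives weights from part-of-speech statistics (style) and vocabulary choice (content), whereas here a simple cosine similarity over unigram distributions stands in for the content weight, and all function names are hypothetical.

```python
from collections import Counter
import math

def unigram_dist(tokens):
    """Normalized unigram distribution of a token sequence."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def content_similarity(doc_tokens, domain_dist):
    """Cosine similarity between a document's unigram distribution and
    the in-domain distribution -- a stand-in for the paper's
    content-based relevance measure (an assumption, not their metric)."""
    d = unigram_dist(doc_tokens)
    dot = sum(p * domain_dist.get(w, 0.0) for w, p in d.items())
    norm = (math.sqrt(sum(p * p for p in d.values()))
            * math.sqrt(sum(p * p for p in domain_dist.values())))
    return dot / norm if norm else 0.0

def weighted_bigram_counts(in_domain_docs, out_domain_docs):
    """Pool bigram counts, counting in-domain data with weight 1 and
    scaling each out-of-domain document by its similarity weight."""
    domain_dist = unigram_dist([t for doc in in_domain_docs for t in doc])
    counts = Counter()
    for doc in in_domain_docs:
        for bg in zip(doc, doc[1:]):
            counts[bg] += 1.0
    for doc in out_domain_docs:
        w = content_similarity(doc, domain_dist)
        for bg in zip(doc, doc[1:]):
            counts[bg] += w
    return counts
```

From such relevance-weighted pooled counts, a standard smoothing scheme (e.g., back-off estimation) would then produce the final n-gram probabilities; the weighting simply biases the statistics toward out-of-domain material that resembles the target domain.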
Citation
Iyer, R., & Ostendorf, M. (1999). Relevance weighting for combining multi-domain data for n-gram language modeling. Computer Speech and Language, 13(3), 267–282. https://doi.org/10.1006/csla.1999.0124