Selecting relevant text subsets from web-data for building topic specific language models

Abhinav Sethy; Panayiotis G. Georgiou; Shrikanth Narayanan

Conference Proceedings

HLT-NAACL 2006 - Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, Short Papers (2006) 145-148

DOI: 10.3115/1614049.1614086

Abstract

In this paper we present a scheme for selecting relevant subsets of sentences from a large generic corpus, such as text acquired from the web. A relative entropy (R.E.) based criterion is used to incrementally select sentences whose distribution matches the domain of interest. Experimental results show that, using just 10% of the data, the proposed subset selection scheme yields significant improvements in both Word Error Rate (WER) and Perplexity (PPL) over models built from the entire web corpus. In addition, incremental data selection enables a significant reduction in the vocabulary size as well as in the number of n-grams in the adapted language model. To demonstrate the gains from our method, we provide a comparative analysis with a number of methods proposed in recent language modeling literature for cleaning up text.
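The incremental, relative-entropy-based selection described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the model or the acceptance rule, so everything below is an assumption made for illustration: a unigram word distribution, add-alpha smoothing, a single greedy pass over the pool, and the hypothetical function names `relative_entropy` and `select_sentences`. A sentence is kept only if adding it lowers the relative entropy between the in-domain distribution and the distribution of the sentences selected so far.

```python
import math
from collections import Counter


def relative_entropy(domain_probs, counts, total, vocab_size, alpha=1.0):
    """KL(P_domain || Q_selected), with Q smoothed by add-alpha (an assumption)."""
    denom = total + alpha * vocab_size
    return sum(
        p * math.log(p / ((counts.get(w, 0) + alpha) / denom))
        for w, p in domain_probs.items()
        if p > 0
    )


def select_sentences(domain_sentences, pool_sentences, alpha=1.0):
    """Greedy single pass: keep a pool sentence only if it reduces the divergence."""
    # Unigram distribution of the in-domain text (illustrative model choice).
    domain_counts = Counter(w for s in domain_sentences for w in s.split())
    n = sum(domain_counts.values())
    domain_probs = {w: c / n for w, c in domain_counts.items()}

    # Shared vocabulary size, used only for smoothing the selected-set model.
    vocab = set(domain_counts) | {w for s in pool_sentences for w in s.split()}
    vocab_size = len(vocab)

    selected, counts, total = [], Counter(), 0
    current = relative_entropy(domain_probs, counts, total, vocab_size, alpha)

    for sentence in pool_sentences:
        words = sentence.split()
        if not words:
            continue
        trial_counts = counts + Counter(words)
        trial_total = total + len(words)
        candidate = relative_entropy(
            domain_probs, trial_counts, trial_total, vocab_size, alpha
        )
        if candidate < current:  # the sentence moves Q closer to the domain
            selected.append(sentence)
            counts, total, current = trial_counts, trial_total, candidate
    return selected


if __name__ == "__main__":
    domain = ["book a flight to boston", "flight from denver to boston"]
    pool = [
        "cheap flight to boston",
        "stock markets fell sharply today",
        "book a flight from denver",
    ]
    for kept in select_sentences(domain, pool):
        print(kept)
```

Because the pass is greedy and sequential, the result depends on the order of the pool. This sketch does not address that; shuffling the pool or making several passes would be one way to reduce the effect.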

Cite

Sethy, A., Georgiou, P. G., & Narayanan, S. (2006). Selecting relevant text subsets from web-data for building topic specific language models. In HLT-NAACL 2006 - Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, Short Papers (pp. 145–148). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1614049.1614086
