The robustness of n-gram language models depends on the quality of text data on which they have been trained. The text corpora collected from various resources such as web pages or electronic documents are characterized by many possible topics. In order to build efficient and robust domain-specific language models, it is necessary to separate domain-oriented segments from the large amount of text data, and the remaining out-of-domain data can be used only for updating of existing in-domain n-gram probability estimates. In this paper, we describe the process of classification of heterogeneous text data into two classes, to the in-domain and out-of-domain data, mainly used for language modeling in the task-oriented speech recognition from judicial domain. The proposed algorithm for text classification is based on detection of theme in short text segments based on the most frequent key phrases. In the next step, each text segment is represented in vector space model as a feature vector with term weighting. For classification of these text segments to the in-domain and out-of domain area, document similarity with automatic thresholding are used. The experimental results of modeling the Slovak language and adaptation to the judicial domain show significant improvement in the model perplexity and increasing the performance of the Slovak transcription and dictation system. © 2014 Staš et al.; licensee Springer.
Mendeley helps you to discover research relevant for your work.
CITATION STYLE
Staš, J., Juhár, J., & Hládek, D. (2014). Classification of heterogeneous text data for robust domain-specific language modeling. Eurasip Journal on Audio, Speech, and Music Processing, 2014. https://doi.org/10.1186/1687-4722-2014-14