Classification of heterogeneous text data for robust domain-specific language modeling

Ján Staš; Jozef Juhár; Daniel Hládek

Journal Article, Open Access

Eurasip Journal on Audio, Speech, and Music Processing (2014) 2014

DOI: 10.1186/1687-4722-2014-14

Abstract

The robustness of n-gram language models depends on the quality of the text data on which they are trained. Text corpora collected from various sources, such as web pages or electronic documents, cover many possible topics. To build efficient and robust domain-specific language models, it is necessary to separate the domain-oriented segments from the large amount of text data; the remaining out-of-domain data can then be used only for updating the existing in-domain n-gram probability estimates. In this paper, we describe the classification of heterogeneous text data into two classes, in-domain and out-of-domain, used mainly for language modeling in task-oriented speech recognition for the judicial domain. The proposed text classification algorithm is based on detecting the theme of short text segments from their most frequent key phrases. In the next step, each text segment is represented in the vector space model as a feature vector with term weighting. To classify these text segments as in-domain or out-of-domain, document similarity with automatic thresholding is used. Experimental results on modeling the Slovak language and adapting to the judicial domain show a significant improvement in model perplexity and in the performance of the Slovak transcription and dictation system. © 2014 Staš et al.; licensee Springer.
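
The vector-space step in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract names only "term weighting", "document similarity", and "automatic thresholding", so the TF-IDF weights, the cosine similarity, the mean-similarity threshold, the seed text, and all function names below are assumptions chosen for illustration. The key-phrase theme detection step is omitted.

```python
import math
from collections import Counter


def tfidf_vectors(token_lists):
    """Build one sparse TF-IDF vector (a dict) per tokenized text.

    Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1. The abstract
    does not say which term-weighting scheme the paper uses.
    """
    n = len(token_lists)
    doc_freq = Counter()
    for tokens in token_lists:
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in token_lists:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_segments(seed_text, segments):
    """Label each segment as in-domain (True) or out-of-domain (False).

    Assumptions: a single hypothetical in-domain seed text stands in for
    the domain, and the "automatic" threshold is simply the mean
    similarity over all segments.
    """
    token_lists = [text.lower().split() for text in [seed_text] + segments]
    vectors = tfidf_vectors(token_lists)
    seed_vector, segment_vectors = vectors[0], vectors[1:]
    similarities = [cosine(seed_vector, vec) for vec in segment_vectors]
    threshold = sum(similarities) / len(similarities)
    labels = [sim >= threshold for sim in similarities]
    return labels, similarities, threshold


# Hypothetical toy data: a judicial seed text and four short segments.
seed = "court judge verdict appeal defendant law"
segments = [
    "the judge read the verdict in court",
    "the defendant filed an appeal against the verdict",
    "the team scored a late goal in the match",
    "new recipe for potato soup with cream",
]
labels, similarities, threshold = classify_segments(seed, segments)
for segment, label in zip(segments, labels):
    print("in-domain " if label else "out-of-domain", "|", segment)
```

In this sketch the segments labeled in-domain would go to training the domain-specific model, and the rest would be held back for updating existing n-gram estimates, as the abstract describes. A mean threshold is only one simple choice; any data-driven cut-off could be substituted.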

Cite

Staš, J., Juhár, J., & Hládek, D. (2014). Classification of heterogeneous text data for robust domain-specific language modeling. Eurasip Journal on Audio, Speech, and Music Processing, 2014. https://doi.org/10.1186/1687-4722-2014-14
