Classification of heterogeneous text data for robust domain-specific language modeling

Abstract

The robustness of n-gram language models depends on the quality of the text data on which they are trained. Text corpora collected from various sources, such as web pages or electronic documents, cover many possible topics. To build efficient and robust domain-specific language models, it is necessary to separate domain-oriented segments from the large amount of text data; the remaining out-of-domain data can then be used only for updating existing in-domain n-gram probability estimates. In this paper, we describe the classification of heterogeneous text data into two classes, in-domain and out-of-domain, used mainly for language modeling in task-oriented speech recognition for the judicial domain. The proposed text classification algorithm first detects the theme of short text segments from their most frequent key phrases. In the next step, each text segment is represented in the vector space model as a feature vector with term weighting. To classify these text segments as in-domain or out-of-domain, document similarity with automatic thresholding is used. The experimental results of modeling the Slovak language and adapting it to the judicial domain show a significant improvement in model perplexity and increased performance of the Slovak transcription and dictation system. © 2014 Staš et al.; licensee Springer.
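
The vector space step summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact term weighting or thresholding rule, so this example assumes a smoothed TF-IDF weighting, cosine similarity against a single in-domain reference text, and the mean similarity as the automatic threshold. The key-phrase theme detection step is omitted, and the function names `tf_idf_vectors`, `cosine`, and `classify_segments` are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(segments):
    """Return one sparse TF-IDF vector (term -> weight) per tokenized segment."""
    n = len(segments)
    # Document frequency: in how many segments each term occurs.
    df = Counter(term for seg in segments for term in set(seg))
    # Smoothed inverse document frequency (an assumed weighting scheme).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = []
    for seg in segments:
        tf = Counter(seg)
        vectors.append({t: (c / len(seg)) * idf[t] for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def classify_segments(segments, reference, threshold=None):
    """Label each segment by its similarity to an in-domain reference text.

    If no threshold is given, the mean similarity is used. This is an
    assumption standing in for the paper's automatic thresholding.
    """
    vectors = tf_idf_vectors(segments + [reference])
    ref_vec = vectors[-1]
    sims = [cosine(v, ref_vec) for v in vectors[:-1]]
    if threshold is None:
        threshold = sum(sims) / len(sims)
    return [
        ("in-domain" if s >= threshold else "out-of-domain", s)
        for s in sims
    ]


# Toy data, invented for illustration only.
segments = [
    "the court ruled on the appeal".split(),
    "the team won the football match".split(),
]
reference = "court appeal judge ruling verdict".split()

for label, sim in classify_segments(segments, reference):
    print(f"{label}: {sim:.3f}")
```

In this toy run the first segment shares the terms "court" and "appeal" with the reference and the second shares none, so only the first is labeled in-domain. A real system would use a much larger in-domain reference and a threshold tuned on held-out data.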

Citation

Staš, J., Juhár, J., & Hládek, D. (2014). Classification of heterogeneous text data for robust domain-specific language modeling. Eurasip Journal on Audio, Speech, and Music Processing, 2014. https://doi.org/10.1186/1687-4722-2014-14
