Unsupervised text normalization approach for morphological analysis of blog documents


Abstract

In this paper, we propose an algorithm for reducing the number of unknown words in blog documents by replacing peculiar expressions with formal expressions. Japanese blog documents contain many peculiar expressions that morphological analyzers treat as unknown sequences. Reducing these unknown sequences improves the accuracy of morphological analysis for blog documents. The conventional solution is to register peculiar expressions in the morphological dictionaries manually, which is costly and requires specialized knowledge. In our algorithm, substitution candidates for peculiar expressions are automatically retrieved from formally written documents such as newspapers and stored as substitution rules. To ensure correct replacement, a substitution rule is selected based on three criteria: its appearance frequency in the retrieval process, the edit distance between the substituted sequence and the original text, and the estimated improvement in word segmentation accuracy after the substitution. Experimental results show that our algorithm reduces the number of unknown words by 30.3%, twice the reduction rate of the conventional methods, while maintaining the same segmentation accuracy. © Springer-Verlag Berlin Heidelberg 2009.
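
The abstract names three selection criteria but does not say how they are combined. As a rough illustration only, the sketch below scores each applicable rule with a weighted sum: higher retrieval frequency and higher estimated segmentation gain raise the score, and a larger edit distance lowers it. The `Rule` fields, the weights, the linear combination, and the toy English rules are all assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    source: str      # peculiar expression as it appears in blog text
    target: str      # formal expression retrieved from formal documents
    frequency: int   # how often the rule appeared during retrieval
    seg_gain: float  # estimated segmentation improvement (hypothetical value)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def score(rule: Rule, text: str,
          w_freq: float = 1.0, w_dist: float = 1.0, w_gain: float = 1.0) -> float:
    """Assumed scoring: a weighted sum of the three criteria."""
    substituted = text.replace(rule.source, rule.target)
    return (w_freq * rule.frequency
            - w_dist * edit_distance(text, substituted)
            + w_gain * rule.seg_gain)


def select_rule(rules, text):
    """Return the best-scoring rule whose source occurs in text, or None."""
    applicable = [r for r in rules if r.source in text]
    if not applicable:
        return None
    return max(applicable, key=lambda r: score(r, text))


# Toy rules, invented for illustration.
rules = [
    Rule(source="u", target="you", frequency=5, seg_gain=0.1),
    Rule(source="u", target="your", frequency=2, seg_gain=0.0),
]
best = select_rule(rules, "see u")
```

In a real system the segmentation gain would come from running a morphological analyzer before and after the substitution; here it is just a number supplied with each rule.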

Cite

Ikeda, K., Yanagihara, T., Matsumoto, K., & Takishima, Y. (2009). Unsupervised text normalization approach for morphological analysis of blog documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5866 LNAI, pp. 401–411). https://doi.org/10.1007/978-3-642-10439-8_41
