In analyzing the informal Japanese found in social media and similar sources, many words that are absent from the morphological analysis dictionary appear, so analysis errors increase compared with text such as newspaper articles. Among such unknown words, for those derived from known dictionary words, joint analysis with orthographic normalization, which analyzes the text while taking the normalized form into account, has been shown to be effective. This study focuses on the acquisition of string normalization patterns, which has received little attention so far, and statistically extracts such patterns from annotated data. Expanding the dictionary word candidates with the statistically extracted string normalization patterns and with character-type normalization, and then performing morphological analysis, yielded higher recall and precision than the conventional method.

Social media texts are often written in a non-standard style and include many lexical variants, such as insertions, phonetic substitutions, and abbreviations, that mimic spoken language. Normalizing this variety of non-standard tokens is one promising way to handle noisy text. Normalization is particularly difficult in the morphological analysis of Japanese text because there are no explicit boundaries between words. To address this issue, we propose a novel method for normalizing and morphologically analyzing noisy Japanese text. First, we extract character-level transformation patterns from annotated data using a character alignment model. Next, we generate both character-level and word-level normalization candidates using the character transformation patterns and search for the optimal path with a discriminative model. Experimental results show that the proposed method outperforms a conventional rule-based system in both precision and recall for word segmentation and part-of-speech (POS) tagging.
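The first two steps described above, extracting character-level transformation patterns from annotated pairs and using them to propose normalization candidates, can be illustrated with a minimal Python sketch. It is not the authors' implementation. The standard-library `difflib.SequenceMatcher` stands in for the paper's character alignment model, the function names `extract_patterns` and `generate_candidates` are hypothetical, and the word pairs are invented for illustration rather than taken from the paper's data. The discriminative search over candidate paths is omitted.

```python
from collections import Counter
from difflib import SequenceMatcher


def extract_patterns(pairs):
    """Count character-level rewrites (noisy substring -> standard substring).

    `pairs` is an iterable of (noisy_form, standard_form) strings.
    The alignment here comes from difflib, which is an assumption of this
    sketch and not the alignment model used in the paper.
    """
    counts = Counter()
    for noisy, standard in pairs:
        matcher = SequenceMatcher(None, noisy, standard, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                counts[(noisy[i1:i2], standard[j1:j2])] += 1
    return counts


def generate_candidates(token, patterns, min_count=1):
    """Propose normalized forms by applying each frequent pattern once.

    Patterns with an empty source side (pure insertions) are skipped,
    because they do not identify a position in the token.
    """
    candidates = {token}
    for (src, dst), count in patterns.items():
        if count >= min_count and src and src in token:
            candidates.add(token.replace(src, dst, 1))
    return candidates


# Illustrative (hypothetical) annotated pairs: noisy form, standard form.
pairs = [
    ("すごーい", "すごい"),    # inserted long-vowel mark
    ("やばーい", "やばい"),    # inserted long-vowel mark
    ("かわいぃ", "かわいい"),  # small kana substituted for regular kana
]
patterns = extract_patterns(pairs)
candidates = generate_candidates("うれしーぃ", patterns)
```

In a full system along the lines the abstract describes, candidates like these would be added to the word lattice alongside dictionary entries, and a trained model would choose the best segmentation and normalization path jointly. This sketch only applies each pattern once to the original token, so it does not combine several rewrites in one candidate.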
Saito, I., Sadamitsu, K., Asano, H., & Matsuo, Y. (2017). Morphological Analysis for Japanese Noisy Text based on Extraction of Character Transformation Patterns and Lexical Normalization. Journal of Natural Language Processing, 24(2), 297–314. https://doi.org/10.5715/jnlp.24.297