Abstract
Morphological analysis (MA) and lexical normalization (LN) are both important tasks for Japanese user-generated text (UGT). To evaluate and compare different MA/LN systems, we have constructed a publicly available Japanese UGT corpus. Our corpus comprises 929 sentences annotated with morphological and normalization information, along with category information for frequent UGT-specific phenomena. Experiments on the corpus showed that existing MA/LN methods perform poorly on non-general words and non-standard forms, indicating that the corpus is a challenging benchmark for further research on UGT.
Cite
Higashiyama, S., Utiyama, M., Watanabe, T., & Sumita, E. (2021). User-Generated Text Corpus for Evaluating Japanese Morphological Analysis and Lexical Normalization. In NAACL-HLT 2021 - 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference (pp. 5532–5541). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2021.naacl-main.438