Abstract
User-generated texts contain many typos, which must be corrected for NLP systems to work properly. Although a large number of typo–correction pairs are needed to develop a data-driven typo correction system, no such dataset has been available for Japanese. In this paper, we extract over half a million Japanese typo–correction pairs from Wikipedia's revision history. Unlike other languages, Japanese poses unique challenges: (1) Japanese text is unsegmented, so we cannot simply apply a spelling checker, and (2) the way people input kanji logographs results in typos whose surface forms differ drastically from the correct ones. We address these challenges by combining character-based extraction rules, morphological analyzers that guess readings, and various filtering methods. We evaluate the dataset using crowdsourcing and run a baseline seq2seq model for typo correction.
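To make the extraction idea concrete, here is a minimal sketch of the general approach the abstract describes: align an old and a revised sentence from a revision pair at the character level, keep only short edits as typo candidates, and filter kanji-misconversion candidates by comparing guessed readings. The function names (extract_edit_pairs, guess_reading, looks_like_kanji_typo), the edit-length threshold, and the reading-equality filter are illustrative assumptions, not the paper's actual rules; only difflib from the Python standard library is a real dependency here.

```python
import difflib

def extract_edit_pairs(old_sent: str, new_sent: str, max_edit_len: int = 4):
    """Extract short character-level edits between an old and a revised sentence.

    Returns (typo_span, corrected_span) pairs; long rewrites are discarded
    because they are unlikely to be simple typos. (Illustrative heuristic,
    not the rule set used in the paper.)
    """
    pairs = []
    matcher = difflib.SequenceMatcher(None, old_sent, new_sent)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("replace", "insert", "delete"):
            if (i2 - i1) <= max_edit_len and (j2 - j1) <= max_edit_len:
                pairs.append((old_sent[i1:i2], new_sent[j1:j2]))
    return pairs

def guess_reading(text: str) -> str:
    """Placeholder: return the kana reading of `text`.

    A real implementation would call a morphological analyzer
    (e.g. MeCab or Juman++) and concatenate per-morpheme readings.
    """
    raise NotImplementedError

def looks_like_kanji_typo(typo: str, correction: str) -> bool:
    """Heuristic filter: kanji misconversions often share a reading
    even when the surface forms differ completely (assumed filter)."""
    return guess_reading(typo) == guess_reading(correction)

if __name__ == "__main__":
    old = "彼は会議に向った"
    new = "彼は会議に向かった"
    print(extract_edit_pairs(old, new))  # [('', 'か')]: a missing okurigana
```

In practice a reading-based filter of this kind would be applied on top of the character-level alignment, since misconverted kanji share almost no surface characters with the intended word but typically share its pronunciation.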
Citation
Tanaka, Y., Murawaki, Y., Kawahara, D., & Kurohashi, S. (2020). Building a Japanese typo dataset from Wikipedia's revision history. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 230–236). Association for Computational Linguistics (ACL). https://doi.org/10.18653/v1/2020.acl-srw.31