A novel word segmentation approach for written languages with word boundary markers


Abstract

Most NLP applications assume that user input is error-free; thus, word segmentation (WS) for written languages that use word boundary markers (WBMs), such as spaces, has been regarded as a trivial issue. However, noisy real-world texts, such as blogs, e-mails, and SMS messages, may contain spacing errors that must be corrected before further processing can take place. For Korean, many researchers have adopted a traditional WS approach that removes all spaces from the user input and then re-inserts word boundaries. Unfortunately, because no WS model is perfect, this approach often degrades the spacing quality of user input that has few or no spacing errors. In this paper, we propose a novel WS method that takes into account the initial word spacing information of the user input. Our method generates output that is better than the original user input, even when the input has few spacing errors. Moreover, the proposed method significantly outperforms a state-of-the-art Korean WS model when the user input initially contains less than 10% spacing errors, and performs comparably when it contains more. We believe that the proposed method will be a very practical pre-processing module. © 2009 ACL and AFNLP.
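
To make the notion of a spacing error rate concrete, the following is a minimal Python sketch. It is an illustration only, not the authors' method or evaluation code: the function names `boundary_tags` and `spacing_error_rate` are hypothetical, and the per-character boundary tagging used here is an assumption about how spacing errors could be counted. It also shows the information a traditional approach discards when it removes all spaces before re-segmenting.

```python
def boundary_tags(text: str) -> list[int]:
    """Return one tag per non-space character: 1 if a space follows it, else 0.

    Assumption for illustration: spacing is modelled as a binary decision
    after every character, which is a common framing but is not taken
    from the paper itself.
    """
    tags: list[int] = []
    for ch in text.strip():
        if ch.isspace():
            if tags:
                tags[-1] = 1  # mark the preceding character as word-final
        else:
            tags.append(0)
    return tags


def strip_spaces(text: str) -> str:
    """What a traditional approach starts from: the input with all spaces removed."""
    return "".join(text.split())


def spacing_error_rate(user_input: str, reference: str) -> float:
    """Fraction of boundary decisions on which user_input disagrees with reference."""
    if strip_spaces(user_input) != strip_spaces(reference):
        raise ValueError("inputs must contain the same non-space characters")
    user = boundary_tags(user_input)
    ref = boundary_tags(reference)
    # The last character has no following boundary to decide.
    positions = len(ref) - 1
    if positions <= 0:
        return 0.0
    errors = sum(u != r for u, r in zip(user[:-1], ref[:-1]))
    return errors / positions
```

Under these assumptions, an input whose spacing already matches the reference has an error rate of 0.0. The original tag sequence is exactly the "initial word spacing information" that is lost once `strip_spaces` has been applied.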


Citation (APA)

Cho, H. C., Lee, D. G., Lee, J. T., Stenetorp, P., Tsujii, J., & Rim, H. C. (2009). A novel word segmentation approach for written languages with word boundary markers. In ACL-IJCNLP 2009 - Joint Conf. of the 47th Annual Meeting of the Association for Computational Linguistics and 4th Int. Joint Conf. on Natural Language Processing of the AFNLP, Proceedings of the Conf. (pp. 29–32). Association for Computational Linguistics (ACL). https://doi.org/10.3115/1667583.1667594

