Improving Thai word and sentence segmentation using linguistic knowledge

Rungsiman Nararatwong; Natthawut Kertkeidkachorn; Nagul Cooharojananone; Hitoshi Okada

Journal Article, Open Access

IEICE Transactions on Information and Systems (2018) E101D(12) 3218-3225

DOI: 10.1587/transinf.2018EDP7016

Abstract

Word boundary ambiguity has long been a fundamental challenge in Thai language processing. The Conditional Random Fields (CRF) model is among the best-known methods for achieving highly accurate word segmentation. Nevertheless, current approaches appear to leave the problem of compound words unaddressed: a compound word loses its meaning or context once it is segmented into its parts. We therefore introduce a dictionary-based word-merging algorithm that merges all kinds of compound words. Our evaluation shows that the algorithm achieves high word-segmentation accuracy while preserving compound words, and that it can also restore some incorrectly segmented words. A second problem, which calls for a different word-chunking approach, is sentence boundary ambiguity. Previous work found that using the part of speech (POS) of each segmented word helps boost the accuracy of CRF-based sentence segmentation. However, not all segmented words can be tagged. We thus propose a POS-based word-splitting algorithm, which splits words to increase the number of POS tags that can be assigned. We found that, with more identifiable POS tags, the CRF model segments sentences more accurately. To demonstrate the contributions of both methods, we experimented with three applications. With the word-merging algorithm, compound words kept intact in the output of topic extraction preserve their intended meanings, offering more precise information for human interpretation. The word-merging algorithm, together with the POS-based word-splitting algorithm, can also be used to amend word-level Thai-English translations. In addition, the word-splitting algorithm improves sentence segmentation and thereby enhances text summarization.
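The two ideas in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not specify how merging or splitting is performed, so the greedy longest-match strategy, the function names `merge_compounds` and `split_untagged`, the `max_span` parameter, and the toy dictionaries below are all assumptions made for illustration only.

```python
from typing import Dict, List, Optional, Set


def merge_compounds(tokens: List[str], compounds: Set[str],
                    max_span: int = 4) -> List[str]:
    """Greedily merge adjacent tokens whose concatenation is a known compound.

    Assumption: longest-match-first over at most `max_span` tokens.
    """
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate span first, down to two tokens.
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            candidate = "".join(tokens[i:i + span])
            if candidate in compounds:
                merged.append(candidate)
                i += span
                break
        else:
            # No compound starts at this position; keep the token as-is.
            merged.append(tokens[i])
            i += 1
    return merged


def split_untagged(word: str,
                   pos_lexicon: Dict[str, str]) -> Optional[List[str]]:
    """Split a word that has no POS entry into two parts that both have one.

    Assumption: a single cut point, preferring the longest taggable prefix.
    Returns None when no such split exists.
    """
    if word in pos_lexicon:
        return [word]
    for cut in range(len(word) - 1, 0, -1):
        head, tail = word[:cut], word[cut:]
        if head in pos_lexicon and tail in pos_lexicon:
            return [head, tail]
    return None


# Toy data (hypothetical): "แม่" (mother) + "น้ำ" (water) form "แม่น้ำ" (river).
segmented = ["แม่", "น้ำ", "ไหล"]
compound_dictionary = {"แม่" + "น้ำ"}
print(merge_compounds(segmented, compound_dictionary))

toy_pos_lexicon = {"แม่": "NOUN", "น้ำ": "NOUN"}
print(split_untagged("แม่" + "น้ำ", toy_pos_lexicon))
```

The two functions deliberately pull in opposite directions, mirroring the abstract's point that word and sentence segmentation call for different word-chunking approaches: merging keeps compounds intact for tasks such as topic extraction, while splitting exposes parts that can receive POS tags as features for a CRF sentence segmenter.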

Citation (APA)

Nararatwong, R., Kertkeidkachorn, N., Cooharojananone, N., & Okada, H. (2018). Improving Thai word and sentence segmentation using linguistic knowledge. IEICE Transactions on Information and Systems, E101D(12), 3218–3225. https://doi.org/10.1587/transinf.2018EDP7016
