An efficient algorithm for unsupervised word segmentation with branching entropy and MDL

Valentin Zhikov; Hiroya Takamura; Manabu Okumura

Journal Article, Open Access

Medicinal Plants (2013) 5(1) 347-360

DOI: 10.1527/tjsai.28.347


Abstract

This paper proposes a fast and simple unsupervised word segmentation algorithm that exploits the local predictability of adjacent character sequences while searching for a least-effort representation of the data. The model uses branching entropy to constrain the hypothesis space, so that it can efficiently find a solution that minimizes the length of a two-part MDL code. An evaluation on Japanese, Thai, and English corpora, as well as the CHILDES corpus for research in language development, shows that the algorithm achieves an F-score comparable to that of state-of-the-art methods in unsupervised word segmentation, in significantly less computation time. Because it can induce the vocabulary of large-scale corpora of domain-specific text, the method has the potential to improve the coverage of morphological analyzers for languages without explicit word boundary markers. A semi-supervised word segmentation approach is also proposed, in which the word boundaries obtained through the unsupervised model are used as features for a state-of-the-art word segmentation method.
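
The two quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `branching_entropy` and `description_length` are hypothetical, and the two-part code is deliberately simplified (the lexicon is spelled out character by character and the corpus is coded as word tokens, both under empirical distributions, with no cost for word lengths or frequency parameters). Branching entropy measures how unpredictable the next character is after a given context, and the description length scores a candidate segmentation.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(text, n):
    """Map each n-character context to the entropy (in bits) of the next character."""
    followers = defaultdict(Counter)
    for i in range(len(text) - n):
        followers[text[i:i + n]][text[i + n]] += 1
    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def description_length(segments):
    """Simplified two-part code length in bits (an assumption of this sketch)."""
    token_counts = Counter(segments)
    total_tokens = len(segments)
    # Part 2: the corpus, coded as word tokens under their empirical distribution.
    corpus_bits = -sum(
        c * math.log2(c / total_tokens) for c in token_counts.values()
    )
    # Part 1: the lexicon, coded character by character under the empirical
    # distribution of characters across the distinct word types.
    char_counts = Counter(ch for word in token_counts for ch in word)
    total_chars = sum(char_counts.values())
    lexicon_bits = -sum(
        c * math.log2(c / total_chars) for c in char_counts.values()
    )
    return lexicon_bits + corpus_bits


# After "a" in "abac" the next character is "b" or "c" with equal frequency.
print(branching_entropy("abac", 1)["a"])
# Compare two candidate segmentations of the same string "abab".
print(description_length(["ab", "ab"]), description_length(["a", "b", "a", "b"]))
```

In this toy comparison the segmentation that reuses the word "ab" gets the shorter code, which is the kind of preference an MDL criterion expresses. A high branching entropy after a context suggests a plausible boundary position, which is one way such a score could narrow the set of segmentations that need to be evaluated.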

Cite

Zhikov, V., Takamura, H., & Okumura, M. (2013). An efficient algorithm for unsupervised word segmentation with branching entropy and MDL. Medicinal Plants, 5(1), 347–360. https://doi.org/10.1527/tjsai.28.347
