An efficient algorithm for unsupervised word segmentation with branching entropy and MDL

Abstract

This paper proposes a fast and simple unsupervised word segmentation algorithm that exploits the local predictability of adjacent character sequences while searching for a least-effort representation of the data. The model uses branching entropy to constrain the hypothesis space, so that it can efficiently find a solution that minimizes the length of a two-part MDL code. An evaluation on corpora in Japanese, Thai, and English, and on the "CHILDES" corpus for research in language development, shows that the algorithm achieves an F-score comparable to that of state-of-the-art methods in unsupervised word segmentation, in significantly less computational time. Because it can induce the vocabulary of large-scale corpora of domain-specific text, the method has the potential to improve the coverage of morphological analyzers for languages without explicit word boundary markers. A semi-supervised word segmentation approach is also proposed, in which the word boundaries obtained through the unsupervised model are used as features for a state-of-the-art word segmentation method.
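
The two quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the fixed context length `n`, and the simplified code-length formula (empirical entropy coding of the lexicon characters plus the word tokens) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(text, n=1):
    """Entropy (in bits) of the character that follows each length-n context."""
    followers = defaultdict(Counter)
    for i in range(len(text) - n):
        followers[text[i:i + n]][text[i + n]] += 1
    entropy = {}
    for context, counts in followers.items():
        total = sum(counts.values())
        entropy[context] = -sum(
            (c / total) * math.log2(c / total) for c in counts.values()
        )
    return entropy


def _code_length(counts):
    """Bits needed to encode the counted symbols under their empirical distribution."""
    total = sum(counts.values())
    return -sum(c * math.log2(c / total) for c in counts.values())


def description_length(words):
    """Simplified two-part code (an assumption, not the paper's exact formula):
    cost of spelling out the lexicon in characters plus cost of the corpus
    written as a sequence of word tokens."""
    lexicon_chars = Counter(ch for w in set(words) for ch in w)
    return _code_length(lexicon_chars) + _code_length(Counter(words))


# Hypothetical toy data: compare two candidate segmentations of "abab".
coarse = description_length(["ab", "ab"])
fine = description_length(["a", "b", "a", "b"])
print(coarse < fine)
```

In this toy case the segmentation that reuses the word "ab" yields the shorter code, which is the kind of preference an MDL criterion expresses. A high branching entropy after a context suggests a likely word boundary, which is how such a score could be used to narrow the set of segmentations that the MDL search has to consider.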

Cite

Zhikov, V., Takamura, H., & Okumura, M. (2013). An efficient algorithm for unsupervised word segmentation with branching entropy and MDL. Transactions of the Japanese Society for Artificial Intelligence, 28, 347–360. https://doi.org/10.1527/tjsai.28.347
