A Simple and Effective Unsupervised Word Segmentation Approach

Abstract

In this paper, we propose a new unsupervised approach for word segmentation. The core idea of our approach is a novel word induction criterion called WordRank, which estimates the goodness of word hypotheses (character or phoneme sequences). We devise a method to derive exterior word boundary information from the link structures of adjacent word hypotheses and incorporate interior word boundary information to complete the model. Given WordRank, word segmentation can be modeled as an optimization problem, and we develop a Viterbi-style algorithm to search for the optimal segmentation. Extensive experiments on phonetic transcripts as well as standard Chinese and Japanese data sets demonstrate the effectiveness of our approach. On the standard Brent version of the Bernstein-Ratner corpus, our approach outperforms the state-of-the-art Bayesian models by more than 3%. Moreover, our approach is simpler and more efficient than the Bayesian methods, which makes it more suitable for real-world applications.
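The search step mentioned in the abstract can be illustrated with a minimal dynamic-programming sketch. It is not the authors' implementation: it assumes the objective is a sum of per-word scores, and the names `segment`, `toy_score`, `TOY_SCORES`, and `max_len` are hypothetical. The toy scoring function is a stand-in only and does not compute WordRank.

```python
import math
from typing import Callable, List


def segment(text: str, score: Callable[[str], float], max_len: int = 4) -> List[str]:
    """Viterbi-style search for the segmentation with the highest total score.

    Assumption: the objective is the sum of score(word) over all words,
    and every single character receives a finite score, so that every
    position in the text is reachable.
    """
    n = len(text)
    best = [-math.inf] * (n + 1)  # best[i]: best total score for text[:i]
    back = [0] * (n + 1)          # back[i]: start index of the last word in text[:i]
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            cand = best[start] + score(text[start:end])
            if cand > best[end]:
                best[end] = cand
                back[end] = start
    # Follow the back-pointers from the end of the text to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    words.reverse()
    return words


# Hypothetical stand-in for a word-goodness measure (NOT WordRank).
TOY_SCORES = {"the": 3.0, "dog": 3.0, "do": 1.0, "he": 1.0}


def toy_score(word: str) -> float:
    if word in TOY_SCORES:
        return TOY_SCORES[word]
    # Unknown single characters get a small penalty; longer unknowns are ruled out.
    return -1.0 if len(word) == 1 else -math.inf


print(segment("thedog", toy_score))
```

Because each position only looks back at most `max_len` characters, the search runs in time linear in the text length for a fixed `max_len`, which is the usual reason for preferring this kind of exact search over enumerating all segmentations.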

Cite

Chen, S., Xu, Y., & Chang, H. (2011). A Simple and Effective Unsupervised Word Segmentation Approach. In Proceedings of the 25th AAAI Conference on Artificial Intelligence, AAAI 2011 (pp. 866–871). AAAI Press. https://doi.org/10.1609/aaai.v25i1.7970
