A hybrid approach to word segmentation of Vietnamese texts

Abstract

We present in this article a hybrid approach to automatically tokenizing Vietnamese text. The approach combines a finite-state automaton technique, regular expression parsing, and a maximal-matching strategy augmented with statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using pre-defined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a highly accurate tokenizer for Vietnamese texts. © 2008 Springer-Verlag Berlin Heidelberg.
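The pipeline the abstract describes — enumerate every candidate segmentation of a phrase, then let a smoothed bigram language model pick the most probable one — can be sketched as follows. The lexicon, syllable phrase, and n-gram counts below are hypothetical stand-ins (the paper represents a full Vietnamese lexicon as a minimal finite-state automaton and estimates bigrams from a corpus); only the maximal-matching enumeration and an add-one-smoothed bigram scorer are illustrated.

```python
import math

# Hypothetical toy data: vnTokenizer uses a minimal finite-state automaton
# over a real Vietnamese lexicon and corpus-estimated n-gram counts.
LEXICON = {"học", "sinh", "học sinh", "sinh học"}
MAX_LEN = 2  # longest lexicon word, in syllables

UNIGRAM = {"<s>": 10, "học sinh": 6, "sinh học": 4, "học": 5, "sinh": 2}
BIGRAM = {("<s>", "học sinh"): 5, ("học sinh", "học"): 3, ("học", "sinh học"): 3}
V = len(LEXICON) + 1  # vocabulary size for add-one smoothing


def segmentations(syllables):
    """All ways to cover the syllable sequence with lexicon words --
    the candidate paths through the phrase's linear segmentation graph."""
    n, results = len(syllables), []

    def walk(i, path):
        if i == n:
            results.append(path[:])
            return
        for j in range(i + 1, min(i + MAX_LEN, n) + 1):
            word = " ".join(syllables[i:j])
            if word in LEXICON:
                walk(j, path + [word])

    walk(0, [])
    return results


def log_prob(seg):
    """Add-one-smoothed bigram log-probability of one segmentation."""
    words = ["<s>"] + seg
    return sum(
        math.log((BIGRAM.get((p, w), 0) + 1) / (UNIGRAM.get(p, 0) + V))
        for p, w in zip(words, words[1:])
    )


# The classic ambiguous phrase "học sinh học sinh học" has several
# segmentations; the resolver picks the highest-scoring one.
phrase = "học sinh học sinh học".split()
best = max(segmentations(phrase), key=log_prob)
print(best)  # → ['học sinh', 'học', 'sinh học']
```

With this toy lexicon the phrase has eight candidate segmentations; the bigram scorer favors the reading "học sinh / học / sinh học" ("students study biology") because its transitions were seen in the (invented) counts.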

Citation (APA)

Phuong, L. H., Huyên, N. T. M., Roussanaly, A., & Vinh, H. T. (2008). A hybrid approach to word segmentation of Vietnamese texts. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5196 LNCS, pp. 240–249). https://doi.org/10.1007/978-3-540-88282-4_23
