A hybrid approach to word segmentation of Vietnamese texts

Abstract

We present in this article a hybrid approach to automatically tokenizing Vietnamese text. The approach combines a finite-state automaton technique, regular expression parsing, and a maximal-matching strategy augmented with statistical methods to resolve segmentation ambiguities. The Vietnamese lexicon in use is compactly represented by a minimal finite-state automaton. A text to be tokenized is first parsed into lexical phrases and other patterns using pre-defined regular expressions. The automaton is then deployed to build linear graphs corresponding to the phrases to be segmented. Applying the maximal-matching strategy to a graph yields all candidate segmentations of a phrase. An ambiguity resolver, which uses a smoothed bigram language model, then chooses the most probable segmentation of the phrase. The hybrid approach is implemented to create vnTokenizer, a highly accurate tokenizer for Vietnamese texts. © 2008 Springer-Verlag Berlin Heidelberg.
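The pipeline the abstract describes — enumerate every candidate segmentation of a phrase, then let a smoothed bigram language model pick the most probable one — can be sketched as follows. The lexicon, syllable phrase, and n-gram counts below are hypothetical stand-ins (the paper represents a full Vietnamese lexicon as a minimal finite-state automaton and estimates bigrams from a corpus); only the maximal-matching enumeration and an add-one-smoothed bigram scorer are illustrated.

```python
import math

# Hypothetical toy data: vnTokenizer uses a minimal finite-state automaton
# over a real Vietnamese lexicon and corpus-estimated n-gram counts.
LEXICON = {"học", "sinh", "học sinh", "sinh học"}
MAX_LEN = 2  # longest lexicon word, in syllables

UNIGRAM = {"<s>": 10, "học sinh": 6, "sinh học": 4, "học": 5, "sinh": 2}
BIGRAM = {("<s>", "học sinh"): 5, ("học sinh", "học"): 3, ("học", "sinh học"): 3}
V = len(LEXICON) + 1  # vocabulary size for add-one smoothing


def segmentations(syllables):
    """All ways to cover the syllable sequence with lexicon words --
    the candidate paths through the phrase's linear segmentation graph."""
    n, results = len(syllables), []

    def walk(i, path):
        if i == n:
            results.append(path[:])
            return
        for j in range(i + 1, min(i + MAX_LEN, n) + 1):
            word = " ".join(syllables[i:j])
            if word in LEXICON:
                walk(j, path + [word])

    walk(0, [])
    return results


def log_prob(seg):
    """Add-one-smoothed bigram log-probability of one segmentation."""
    words = ["<s>"] + seg
    return sum(
        math.log((BIGRAM.get((p, w), 0) + 1) / (UNIGRAM.get(p, 0) + V))
        for p, w in zip(words, words[1:])
    )


# The classic ambiguous phrase "học sinh học sinh học" has several
# segmentations; the resolver picks the highest-scoring one.
phrase = "học sinh học sinh học".split()
best = max(segmentations(phrase), key=log_prob)
print(best)  # → ['học sinh', 'học', 'sinh học']
```

With this toy lexicon the phrase has eight candidate segmentations; the bigram scorer favors the reading "học sinh / học / sinh học" ("students study biology") because its transitions were seen in the (invented) counts.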

Citation (APA)

Phuong, L. H., Huyên, N. T. M., Roussanaly, A., & Vinh, H. T. (2008). A hybrid approach to word segmentation of Vietnamese texts. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 5196 LNCS, pp. 240–249). https://doi.org/10.1007/978-3-540-88282-4_23
