Improving Chinese tokenization with linguistic filters on statistical lexical acquisition

Abstract

The first step in Chinese NLP is to tokenize or segment character sequences into words, since the text contains no word delimiters. Recent heavy activity in this area has shown the biggest stumbling block to be words that are absent from the lexicon, since successful tokenizers to date have been based on dictionary lookup (e.g., Chang & Chen 1993; Chiang et al. 1992; Lin et al. 1993; Wu & Tseng 1993; Sproat et al. 1994). We present empirical evidence for four points concerning tokenization of Chinese text: (1) More rigorous "blind" evaluation methodology is needed to avoid inflated accuracy measurements; we introduce the nk-blind method. (2) The extent of the unknown-word problem is far more serious than generally thought when tokenizing unrestricted texts in realistic domains. (3) Statistical lexical acquisition is a practical means to greatly improve tokenization accuracy on unknown words, reducing error rates by as much as 32.0%. (4) When augmenting the lexicon, linguistic constraints can provide simple, inexpensive filters yielding significantly better precision, reducing error rates by as much as 49.4%.
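To make the dictionary-lookup style of segmentation the abstract refers to concrete, here is a minimal sketch of greedy longest-match ("maximum matching") tokenization. This is an illustration only, not the authors' method; the toy lexicon, the `max_len` parameter, and the single-character fallback are assumptions chosen to show where out-of-lexicon words cause errors.

```python
# Minimal sketch of dictionary-based maximum-matching segmentation.
# Not the paper's algorithm; lexicon and max_len are illustrative.

def max_match(text, lexicon, max_len=4):
    """Greedily segment `text` into the longest lexicon entries, left to right.

    Spans not covered by any lexicon entry fall back to single-character
    tokens, which is exactly where unknown (out-of-lexicon) words degrade
    accuracy."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match or a
        # single character remains.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

if __name__ == "__main__":
    lexicon = {"中国", "人民", "银行"}            # toy lexicon
    print(max_match("中国人民银行", lexicon))     # ['中国', '人民', '银行']
```

A word missing from `lexicon` (say, a new proper name) would be split into single characters here, which is the failure mode the paper addresses by acquiring new lexicon entries statistically and filtering the candidates with linguistic constraints.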

Citation (APA)

Wu, D., & Fung, P. (1994). Improving Chinese tokenization with linguistic filters on statistical lexical acquisition. In 4th Conference on Applied Natural Language Processing, ANLP 1994 - Proceedings (pp. 180–181). Association for Computational Linguistics (ACL). https://doi.org/10.3115/974358.974399
