Improving Chinese tokenization with linguistic filters on statistical lexical acquisition

Abstract

The first step in Chinese NLP is to tokenize or segment character sequences into words, since the text contains no word delimiters. Recent heavy activity in this area has shown the biggest stumbling block to be words that are absent from the lexicon, since successful tokenizers to date have been based on dictionary lookup (e.g., Chang & Chen 1993; Chiang et al. 1992; Lin et al. 1993; Wu & Tseng 1993; Sproat et al. 1994). We present empirical evidence for four points concerning tokenization of Chinese text: (1) More rigorous "blind" evaluation methodology is needed to avoid inflated accuracy measurements; we introduce the nk-blind method. (2) The extent of the unknown-word problem is far more serious than generally thought when tokenizing unrestricted texts in realistic domains. (3) Statistical lexical acquisition is a practical means to greatly improve tokenization accuracy on unknown words, reducing error rates by as much as 32.0%. (4) When augmenting the lexicon, linguistic constraints can provide simple, inexpensive filters yielding significantly better precision, reducing error rates by as much as 49.4%.
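To make the dictionary-lookup style of segmentation the abstract refers to concrete, here is a minimal sketch of greedy longest-match ("maximum matching") tokenization. This is an illustration only, not the authors' method; the toy lexicon, the `max_len` parameter, and the single-character fallback are assumptions chosen to show where out-of-lexicon words cause errors.

```python
# Minimal sketch of dictionary-based maximum-matching segmentation.
# Not the paper's algorithm; lexicon and max_len are illustrative.

def max_match(text, lexicon, max_len=4):
    """Greedily segment `text` into the longest lexicon entries, left to right.

    Spans not covered by any lexicon entry fall back to single-character
    tokens, which is exactly where unknown (out-of-lexicon) words degrade
    accuracy."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match or a
        # single character remains.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens

if __name__ == "__main__":
    lexicon = {"中国", "人民", "银行"}            # toy lexicon
    print(max_match("中国人民银行", lexicon))     # ['中国', '人民', '银行']
```

A word missing from `lexicon` (say, a new proper name) would be split into single characters here, which is the failure mode the paper addresses by acquiring new lexicon entries statistically and filtering the candidates with linguistic constraints.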

Citation (APA)

Wu, D., & Fung, P. (1994). Improving Chinese tokenization with linguistic filters on statistical lexical acquisition. In 4th Conference on Applied Natural Language Processing, ANLP 1994 - Proceedings (pp. 180–181). Association for Computational Linguistics (ACL). https://doi.org/10.3115/974358.974399
