Comparing representations in Chinese information retrieval

K. L. Kwok

Journal Article, Open Access

SIGIR Forum (ACM Special Interest Group on Information Retrieval) (1997) 31(1 SPEC. ISS.) 34-39

DOI: 10.1145/278459.258531

Abstract

Three representation methods are empirically investigated for Chinese information retrieval: 1-gram (single character), bigram (two contiguous overlapping characters), and short-word indexing based on a simple segmentation of the text. The test collection is the approximately 170 MB TREC-5 Chinese corpus of news articles, together with 28 queries that are long and rich in wording. Evaluation shows that 1-gram indexing is good but not sufficiently competitive, while bigram indexing works surprisingly well. Bigram indexing leads to a large index term space, three times that of short-word indexing, but it matches short-word indexing in precision and retrieves about 5% more relevant documents. The best average non-interpolated precision is about 0.45, which is 17% better than 1-gram indexing and quite high for a mainly statistical approach. Copyright 1997 ACM.
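To make the difference between 1-gram and bigram indexing concrete, the following is a minimal Python sketch and not the system used in the paper. The helper names `char_ngrams` and `build_index` and the two toy documents are illustrative assumptions. The sketch also assumes the input is already stripped of punctuation, so that bigrams never span a sentence boundary.

```python
from collections import defaultdict


def char_ngrams(text, n):
    """Return the overlapping character n-grams of `text`, skipping whitespace.

    n=1 gives single-character (1-gram) terms; n=2 gives overlapping bigrams.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def build_index(docs, n):
    """Build a toy inverted index: n-gram term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in char_ngrams(text, n):
            index[term].add(doc_id)
    return index


# Hypothetical two-document collection, for illustration only.
docs = {
    "d1": "中文信息检索",
    "d2": "信息处理",
}

unigram_index = build_index(docs, 1)
bigram_index = build_index(docs, 2)

print(char_ngrams(docs["d1"], 1))
print(char_ngrams(docs["d1"], 2))
print(sorted(bigram_index["信息"]))
```

Every bigram other than the first starts with the second character of the one before it, which is the sense of "overlapping" in the abstract. Short-word indexing is not sketched here because the abstract does not describe the segmentation procedure.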

Cite

Kwok, K. L. (1997). Comparing representations in Chinese information retrieval. SIGIR Forum (ACM Special Interest Group on Information Retrieval), 31(1 SPEC. ISS.), 34–39. https://doi.org/10.1145/278459.258531
