Abstract
Three representation methods are empirically investigated for Chinese information retrieval: 1-gram (single-character) indexing, bigram (two contiguous, overlapping characters) indexing, and short-word indexing based on a simple segmentation of the text. The test collection is the TREC-5 Chinese corpus of news articles, approximately 170 MB, with 28 long, richly worded queries. Evaluation shows that 1-gram indexing is good but not sufficiently competitive, while bigram indexing works surprisingly well. Bigram indexing produces a large index term space, three times that of short-word indexing, but matches short-word indexing in precision and retrieves about 5% more relevant documents. The best average non-interpolated precision is about 0.45, 17% better than 1-gram indexing and quite high for a mainly statistical approach. Copyright 1997 ACM.
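To make the difference between 1-gram and bigram index terms concrete, the following minimal Python sketch generates both from a short string. It is an illustration only, not the paper's implementation: the function name `char_ngrams` and the sample text are assumptions, and it ignores details a real indexer would need to handle, such as punctuation, mixed-script text, and sentence boundaries. Short-word indexing is not shown because the abstract does not describe the segmentation procedure.

```python
def char_ngrams(text: str, n: int) -> list[str]:
    """Return the contiguous, overlapping character n-grams of `text`.

    Hypothetical helper for illustration; n=1 gives single-character
    terms and n=2 gives overlapping bigrams.
    """
    if n < 1:
        raise ValueError("n must be >= 1")
    # Slide a window of width n one character at a time.
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Sample text chosen for illustration ("information retrieval").
sample = "信息检索"

unigrams = char_ngrams(sample, 1)  # ['信', '息', '检', '索']
bigrams = char_ngrams(sample, 2)   # ['信息', '息检', '检索']
```

Note that the overlapping window yields the bigram 息检, which spans a word boundary and is not itself a word. Terms like this are one reason a bigram index term space can grow larger than a word-based one.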
Cite
Kwok, K. L. (1997). Comparing representations in Chinese information retrieval. SIGIR Forum (ACM Special Interest Group on Information Retrieval), 31(1 SPEC. ISS.), 34–39. https://doi.org/10.1145/278459.258531