On the construction of a large scale Chinese Web Test collection

Hongfei Yan; Chong Chen; Bo Peng; Xiaoming Li

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2008) 4993 LNCS 117-128

DOI: 10.1007/978-3-540-68636-1_12

Abstract

The lack of a large-scale Chinese test collection is an obstacle to the development of Chinese information retrieval. To address this issue, we built such a collection from millions of Chinese web pages: the Chinese Web Test collection with 100 gigabytes of data (CWT100g). It is the largest Chinese web test collection as of this writing, has been used by several dozen research groups, and was adopted in the evaluation of the SEWM-2004 Chinese Web Track [1] and the HTRDPE-2004 [2]. We present a complete solution for constructing a large-scale test collection like the CWT100g. Further, we found that: 1) the distribution of the number of pages within sites obeys a Zipf-like law rather than the power law proposed by Adamic and Huberman [3, 4]; and 2) an appropriate method for filtering host aliases saves about 25% of the resources used while crawling pages. The Zipf-like law and the host alias filtering method proposed in the paper should help both in modeling the Web and in improving search engines. Finally, we report on the results of the SEWM-2004 Chinese Web Track. © 2008 Springer-Verlag Berlin Heidelberg.
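
The abstract does not say how the Zipf-like law was fitted. As a minimal sketch of one common approach, not the authors' procedure, the exponent can be estimated by ranking sites by page count and fitting a straight line to log(count) against log(rank) by ordinary least squares. The function name `fit_zipf_exponent` and the model form count ≈ c / rank^alpha are assumptions made for illustration.

```python
import math


def fit_zipf_exponent(page_counts):
    """Fit count ~ c / rank**alpha by least squares in log-log space.

    page_counts: iterable of pages-per-site values (hypothetical input;
    the paper's own data format is not described in the abstract).
    Returns (alpha, c).
    """
    # Rank sites from largest to smallest, ignoring empty sites.
    ranked = sorted((n for n in page_counts if n > 0), reverse=True)
    if len(ranked) < 2:
        raise ValueError("need at least two sites with pages")

    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(n) for n in ranked]

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

    slope = sxy / sxx                      # slope of the log-log line
    alpha = -slope                         # Zipf exponent
    c = math.exp(mean_y - slope * mean_x)  # scale constant
    return alpha, c
```

A plain log-log regression is simple but can be biased by the noisy tail of small sites, so the result is best read as a rough estimate.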

Citation (APA)

Yan, H., Chen, C., Peng, B., & Li, X. (2008). On the construction of a large scale Chinese Web Test collection. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4993 LNCS, pp. 117–128). https://doi.org/10.1007/978-3-540-68636-1_12
