SizeSpotSigs: An effective deduplicate algorithm considering the size of page content

4Citations
Citations of this article
6Readers
Mendeley users who have this article in their library.
Get full text

Abstract

Detecting if two Web pages are near replicas, in terms of their contents rather than files, is of great importance in many web information based applications. As a result, many deduplicating algorithms have been proposed. Nevertheless, analysis and experiments show that existing algorithms usually don't work well for short Web pages, due to relatively large portion of noisy information, such as ads and templates for websites, existing in the corresponding files. In this paper, we analyze the critical issues in deduplicating short Web pages and present an algorithm (AF-SpotSigs) that incorporates them, which could work 15% better than the state-of-the-art method. Then we propose an algorithm (SizeSpotSigs), taking the size of page contents into account, which could handle both short and long Web pages. The contributions of SizeSpotSigs are three-fold: 1) Provide an analysis about the relation between noise-content ratio and similarity, and propose two rules of making the methods work better; 2) Based on the analysis, for Chinese, we propose 3 new features to improve the effectiveness for short Web pages; 3) We present an algorithm named SizeSpotSigs for near duplicate detection considering the size of the core content in Web page. Experiments confirm that SizeSpotSigs works better than state-of-the-art approaches such as SpotSigs, over a demonstrative Mixer of manually assessed near-duplicate news articles, which include both short and long Web pages. © 2011 Springer-Verlag.

Cite

CITATION STYLE

APA

Mao, X., Liu, X., Di, N., Li, X., & Yan, H. (2011). SizeSpotSigs: An effective deduplicate algorithm considering the size of page content. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 6634 LNAI, pp. 537–548). Springer Verlag. https://doi.org/10.1007/978-3-642-20841-6_44

Register to see more suggestions

Mendeley helps you to discover research relevant for your work.

Already have an account?

Save time finding and organizing research with Mendeley

Sign up for free