Weighted shingling: An adaptation of shingling for weighted shingles


Abstract

Broder's shingling is one of the state-of-the-art approaches for detecting near-duplicate documents. Prior evaluations of the method have shown that its main source of error is document pairs that differ in main content but share a large amount of similar, unimportant detail. Different web pages from the same site are a good example: such pages almost always share boilerplate text, which may be selected as part of a document's fingerprint and mislead the algorithm. The problem appears to stem from representing each document only by a sample of its shingles. This sample contains only some of the page's shingles and discards all other information. By including additional information in the sample, such as shingle frequencies, we can improve the algorithm's performance. This paper proposes a weighting of shingles and adapts shingling to operate on weighted shingles. Our results show an improvement in shingling's performance. ©2009 IEEE.
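The abstract does not spell out the weighting scheme, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes word-level shingles, uses raw shingle frequency as the weight, and compares two documents with the generalized (weighted) Jaccard measure alongside the usual set-based resemblance. The function names and the parameter `w` are hypothetical, and the sampling step of shingling is omitted for brevity.

```python
from collections import Counter


def shingles(text, w=3):
    """Return a Counter mapping each w-word shingle to its frequency."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1))


def resemblance(a, b):
    """Set-based resemblance: Jaccard similarity of the distinct shingles."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def weighted_resemblance(a, b):
    """Assumed weighting: generalized Jaccard over shingle frequencies,
    i.e. the sum of per-shingle minima divided by the sum of maxima."""
    keys = set(a) | set(b)
    num = sum(min(a[k], b[k]) for k in keys)  # Counter yields 0 for absent keys
    den = sum(max(a[k], b[k]) for k in keys)
    return num / den if den else 1.0


doc1 = shingles("a b c a b c", w=2)
doc2 = shingles("a b c d", w=2)
print(resemblance(doc1, doc2), weighted_resemblance(doc1, doc2))
```

In this toy pair the two documents share two of four distinct shingles, but the repeated shingles in the first document lower the frequency-weighted score, which shows how frequencies carry information that the plain set representation discards.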

Gharghe, Z. E., & Bidgoli, B. M. (2009). Weighted shingling: An adaptation of shingling for weighted shingles. In 2009 International Conference on Innovations in Information Technology, IIT ’09 (pp. 150–154). https://doi.org/10.1109/IIT.2009.5413370
