Web document duplicate detection using fuzzy hashing

Citations: 5 · Mendeley readers: 10

Abstract

The web is the largest repository of documents available, and to retrieve documents from it for various purposes we must use crawlers that navigate autonomously, selecting and processing pages according to the objectives pursued. However, it is easy to see, even intuitively, that a significant number of documents are replicated, more or less abundantly. Detecting these duplicates is important because it lightens databases and improves the efficiency of information retrieval engines, and it also improves the precision of cybermetric analyses, web mining studies, etc. The standard hash techniques used to detect such duplicates only find exact copies, at the bit level. However, many of the duplicates found in the real world are not exactly alike. For example, we can find web pages with the same content but different headers or meta tags, or rendered with different style sheets. A frequent case is the same document in different formats; in these cases we have completely different documents at the binary level. The obvious solution is to compare plain-text conversions of all these formats, but such conversions are never identical, because converters treat various formatting elements differently (textual characters, diacritics, spacing, paragraphs, ...). In this work we introduce the possibility of using what is known as fuzzy hashing. The idea is to produce fingerprints of files (or documents, etc.), so that a comparison between two fingerprints gives an estimate of the closeness or distance between the two files. Based on the concept of the rolling hash, fuzzy hashing has been used successfully in computer security tasks such as identifying malware, spam filtering, and virus scanning. We have added fuzzy hashing capabilities to a lightweight crawler and run several tests on a heterogeneous network domain consisting of multiple servers with different software, static and dynamic pages, etc. These tests allowed us to measure similarity thresholds and to obtain useful data about the quantity and distribution of duplicate documents on web servers. © 2011 Springer-Verlag Berlin Heidelberg.
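
The abstract describes fuzzy hashing only at a high level; as a rough illustration, the following is a minimal Python sketch of context-triggered piecewise hashing, the rolling-hash scheme popularized by tools such as ssdeep. Everything in it (the toy rolling hash, the block size, the window length, and the similarity scoring) is a simplifying assumption for illustration, not the authors' implementation.

import difflib
import hashlib

BASE64 = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def rolling_hash(window: bytes) -> int:
    # Toy polynomial hash over the current window. Real implementations
    # (e.g. ssdeep) update this incrementally as the window slides.
    h = 0
    for b in window:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h

def fuzzy_hash(data: bytes, block_size: int = 64, window: int = 7) -> str:
    # Context-triggered piecewise hashing: whenever the rolling hash of
    # the last `window` bytes hits a trigger value, close the current
    # chunk and emit one base64 character from a strong hash of it.
    signature = []
    chunk_start = 0
    for i in range(window, len(data) + 1):
        if rolling_hash(data[i - window:i]) % block_size == block_size - 1:
            digest = hashlib.md5(data[chunk_start:i]).digest()
            signature.append(BASE64[digest[0] % 64])
            chunk_start = i
    if chunk_start < len(data):  # hash the trailing chunk, if any
        digest = hashlib.md5(data[chunk_start:]).digest()
        signature.append(BASE64[digest[0] % 64])
    return str(block_size) + ":" + "".join(signature)

def similarity(h1: str, h2: str) -> int:
    # Score 0-100 from the edit similarity of the two signatures; local
    # edits in the input perturb only a few chunks, so near-duplicates
    # share most of their signature characters.
    s1, s2 = h1.split(":", 1)[1], h2.split(":", 1)[1]
    return round(100 * difflib.SequenceMatcher(None, s1, s2).ratio())

if __name__ == "__main__":
    doc = b"The web is the largest repository of documents available. " * 40
    near_dup = doc.replace(b"largest", b"biggest")  # small local edits
    h1, h2 = fuzzy_hash(doc), fuzzy_hash(near_dup)
    print(h1)
    print(h2)
    print("similarity:", similarity(h1, h2))  # high for near-duplicates

In practice one would rely on a mature implementation such as ssdeep, whose Python binding exposes ssdeep.hash() and ssdeep.compare(); production schemes also choose the block size adaptively from the input length and keep signatures at two block sizes, so hashes of files of different sizes remain comparable.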

Citation (APA)

Figuerola, C. G., Díaz, R. G., Alonso Berrocal, J. L., & Zazo Rodríguez, A. F. (2011). Web document duplicate detection using fuzzy hashing. In Advances in Intelligent and Soft Computing (Vol. 90, pp. 117–125). https://doi.org/10.1007/978-3-642-19931-8_15
