Understanding content reuse on the web: Static and dynamic analyses

Abstract

In this paper we present static and dynamic studies of duplicate and near-duplicate documents on the Web. The static study analyzes similar content among pages within a given snapshot of the Web; the dynamic study analyzes how pages in an old snapshot are reused to compose new documents in a more recent snapshot. We ran a series of experiments using four snapshots of the Chilean Web. In the static study, we identify duplicates in both parts of the Web graph, the reachable component (connected by links) and the unreachable component (unconnected), aiming to identify where duplicates occur more frequently. We show that the number of duplicates in the Web seems to be much higher than previously reported (about 50% higher), and that in our data the number of duplicates in the unreachable Web is 74.6% higher than the number of duplicates in the reachable component of the Web graph. In the dynamic study, we show that some of the old content is used to compose new pages. If a page in a newer snapshot has content of a page in an older snapshot, we say that the source is a parent of the new page. We state the hypothesis that people use search engines to find pages and republish their content as a new document. We present evidence that this happens for part of the pages that have parents. In this case, part of the Web content is biased by the ranking function of search engines. © Springer-Verlag Berlin Heidelberg 2007.
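The abstract does not say how shared content is detected, so the following is only an illustrative sketch of the "parent" relation, not the authors' method. It assumes word shingling with a containment score, a common near-duplicate technique. The function names, the shingle size `k`, and the `threshold` value are all hypothetical choices made for this example.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word shingles (overlapping word windows) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts yield a single shingle made of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(new_text, old_text, k=3):
    """Fraction of the new page's shingles that also occur in the old page."""
    new_s = shingles(new_text, k)
    old_s = shingles(old_text, k)
    return len(new_s & old_s) / len(new_s) if new_s else 0.0


def find_parents(new_page, old_snapshot, threshold=0.5, k=3):
    """Return URLs of old-snapshot pages treated as parents of new_page.

    old_snapshot maps URL -> page text. The threshold is an arbitrary
    assumption; the abstract gives no cutoff for "has content of".
    """
    return [url for url, text in old_snapshot.items()
            if containment(new_page, text, k) >= threshold]


# Hypothetical toy data: one old page is reused almost verbatim.
old_snapshot = {
    "a.cl": "the quick brown fox jumps over the lazy dog",
    "b.cl": "completely unrelated text about something else entirely",
}
new_page = "the quick brown fox jumps over the lazy dog today"
print(find_parents(new_page, old_snapshot))
```

Containment is used here instead of a symmetric measure because a new page may reuse only part of an old one and add its own text. Whether the paper measures reuse this way is not stated in the abstract.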

Baeza-Yates, R., Pereira, Á., & Ziviani, N. (2007). Understanding content reuse on the web: Static and dynamic analyses. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 4811 LNAI, pp. 227–246). Springer Verlag. https://doi.org/10.1007/978-3-540-77485-3_13
