Learning URL normalization rules using multiple alignment of sequences

Abstract

In this work, we present DUSTER, a new approach to detecting and eliminating redundant content when crawling the web. DUSTER uses a multiple sequence alignment strategy to learn rewriting rules that transform URLs into other URLs likely to have similar content, where such rules apply. We show that the alignment strategy can lead to a reduction in the number of duplicate URLs that is 54% larger than the one achieved by our best baseline. © Springer International Publishing 2013.
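
To make the idea of learning a rewriting rule from aligned URLs more concrete, the following is a minimal sketch and not the DUSTER algorithm itself. It assumes that a cluster of URLs known to return the same content is already available, and it approximates multiple sequence alignment by progressively folding pairwise alignments from Python's `difflib.SequenceMatcher` into a running consensus. The helper names (`tokenize`, `align_pair`, `consensus_pattern`, `to_regex`) and the `example.com` URLs are hypothetical and chosen only for illustration.

```python
import re
from difflib import SequenceMatcher


def tokenize(url):
    # Split a URL into alphanumeric runs and single punctuation characters.
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)


def align_pair(a, b):
    # Align two token lists. Tokens present in both are kept; every
    # mismatching region collapses into a single wildcard (None).
    consensus = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            consensus.extend(a[i1:i2])
        elif not consensus or consensus[-1] is not None:
            consensus.append(None)
    return consensus


def consensus_pattern(urls):
    # Progressive alignment (an assumption of this sketch): fold each
    # URL of the duplicate cluster into the running consensus.
    token_lists = [tokenize(u) for u in urls]
    consensus = token_lists[0]
    for tokens in token_lists[1:]:
        consensus = align_pair(consensus, tokens)
    return consensus


def to_regex(consensus):
    # Turn the consensus into a regular expression; wildcards match anything.
    parts = [".*?" if t is None else re.escape(t) for t in consensus]
    return re.compile("^" + "".join(parts) + "$")


# Hypothetical cluster of URLs assumed to serve identical content.
cluster = [
    "http://example.com/story?id=1&sid=aaa",
    "http://example.com/story?id=1&sid=bbb",
    "http://example.com/story?id=1&sid=ccc",
]
pattern = to_regex(consensus_pattern(cluster))
```

In this sketch the `sid` value is the only region that varies across the cluster, so it becomes a wildcard. A crawler could treat any URL matching `pattern` as a likely duplicate of one it has already fetched; how such a pattern is turned into an actual rewriting rule is left open here.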

Cite

Rodrigues, K. W. L., Cristo, M., De Moura, E. S., & Da Silva, A. S. (2013). Learning URL normalization rules using multiple alignment of sequences. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8214 LNCS, pp. 197–205). Springer Verlag. https://doi.org/10.1007/978-3-319-02432-5_23
