In this work, we present DUSTER, a new approach to detecting and eliminating redundant content when crawling the web. DUSTER uses a multi-sequence alignment strategy to learn rewriting rules that transform URLs into other URLs likely to have similar content, where such URLs exist. We show that the alignment strategy can lead to a reduction in the number of duplicate URLs that is 54% larger than the one achieved by our best baseline. © Springer International Publishing 2013.
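To make the idea of learning a rule from aligned URLs concrete, the following is a minimal toy sketch and not DUSTER's actual algorithm. It assumes a cluster of URLs already known to share content, tokenizes each one, and folds them pairwise with Python's `difflib.SequenceMatcher` as a simple stand-in for true multiple sequence alignment. The helper names (`tokenize`, `consensus`, `align_all`, `to_regex`), the `*` gap symbol, and the example URLs are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def tokenize(url):
    """Split a URL into alphanumeric runs and single punctuation characters."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9]", url)


def consensus(a, b, gap="*"):
    """Align two token lists; keep matching tokens, collapse differences to a gap."""
    out = []
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag == "equal":
            out.extend(a[i1:i2])
        elif not out or out[-1] != gap:
            # Any insert/delete/replace region becomes a single wildcard token.
            out.append(gap)
    return out


def align_all(urls):
    """Fold a cluster of duplicate URLs into one consensus pattern (progressive, pairwise)."""
    pattern = tokenize(urls[0])
    for url in urls[1:]:
        pattern = consensus(pattern, tokenize(url))
    return pattern


def to_regex(pattern, gap="*"):
    """Turn a consensus pattern into a regex that matches URLs of the same shape."""
    parts = [".*" if tok == gap else re.escape(tok) for tok in pattern]
    return re.compile("^" + "".join(parts) + "$")


# Hypothetical cluster: same page, differing only in a session identifier.
cluster = [
    "http://example.com/story?id=10&sid=abc",
    "http://example.com/story?id=10&sid=xyz",
    "http://example.com/story?id=10&sid=q7",
]
pattern = align_all(cluster)
print("".join(pattern))  # http://example.com/story?id=10&sid=*
```

In this sketch the wildcard marks the part of the URL that varies across duplicates, so a crawler could treat any URL matching `to_regex(pattern)` as a candidate duplicate of one it has already fetched. A real rule learner would also need to validate candidate rules against unseen URLs before applying them.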
Rodrigues, K. W. L., Cristo, M., De Moura, E. S., & Da Silva, A. S. (2013). Learning URL normalization rules using multiple alignment of sequences. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8214 LNCS, pp. 197–205). Springer Verlag. https://doi.org/10.1007/978-3-319-02432-5_23