Learning URL normalization rules using multiple alignment of sequences

Kaio Wagner Lima Rodrigues; Marco Cristo; Edleno Silva De Moura; Altigran Soares Da Silva

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2013) 8214 LNCS 197-205

DOI: 10.1007/978-3-319-02432-5_23

7 citations

1 reader

Abstract

In this work, we present DUSTER, a new approach to detecting and eliminating redundant content when crawling the web. DUSTER uses a multi-sequence alignment strategy to learn rewriting rules that transform URLs into other URLs likely to have similar content, where such URLs exist. We show that the alignment strategy can lead to a reduction in the number of duplicate URLs 54% larger than the one achieved by our best baseline. © Springer International Publishing 2013.

Cite

Rodrigues, K. W. L., Cristo, M., De Moura, E. S., & Da Silva, A. S. (2013). Learning URL normalization rules using multiple alignment of sequences. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 8214 LNCS, pp. 197–205). Springer Verlag. https://doi.org/10.1007/978-3-319-02432-5_23
