Site-level web template extraction based on DOM analysis

Juliàn Alarte; David Insa; Josep Silva; Salvador Tamarit

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2016) 9609 36-49

DOI: 10.1007/978-3-319-41579-6_4

Abstract

Web templates are one of the main development resources for website engineers. Templates allow them to increase productivity by plugging content into already formatted and prepared pagelets. For the final user, templates are also useful because they provide uniformity and a common look and feel across all webpages. However, from the point of view of crawlers and indexers, templates are an important problem because they usually contain irrelevant information such as advertisements, menus, and banners. Processing and storing this information wastes resources (storage space, bandwidth, etc.). It has been measured that templates represent between 40% and 50% of data on the Web. Therefore, identifying templates is essential for indexing tasks. In this work we propose a novel method for automatic web template extraction that is based on similarity analysis between the DOM trees of a collection of webpages that are detected using a hyperlink analysis. Our implementation and experiments demonstrate the usefulness of the technique.
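To make the idea of comparing DOM trees concrete, the following is a minimal sketch and not the authors' algorithm. It assumes that a template can be approximated as the part of the DOM that several pages of the same site share when their trees are compared top-down, with children matched by position. The names `Node`, `TreeBuilder`, `common_tree`, `extract_template`, and `outline` are hypothetical, text nodes are ignored, and the selection of candidate pages through hyperlink analysis is omitted.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A simplified DOM element: tag name, attributes, and child elements."""

    def __init__(self, tag, attrs=()):
        self.tag = tag
        # Normalise attributes so that two nodes can be compared directly.
        self.attrs = tuple(sorted((k, v or "") for k, v in attrs))
        self.children = []

    def label(self):
        return (self.tag, self.attrs)


class TreeBuilder(HTMLParser):
    """Builds a tree of element nodes from HTML (text content is ignored)."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def common_tree(a, b):
    """Return the top-down common part of two trees, or None if the roots differ.

    Assumption of this sketch: children are paired by position only.
    """
    if a.label() != b.label():
        return None
    shared = Node(a.tag, a.attrs)
    for child_a, child_b in zip(a.children, b.children):
        sub = common_tree(child_a, child_b)
        if sub is not None:
            shared.children.append(sub)
    return shared


def extract_template(pages):
    """Intersect the DOM trees of all pages; the result approximates the template."""
    trees = [parse(page) for page in pages]
    template = trees[0]
    for tree in trees[1:]:
        template = common_tree(template, tree)
    return template


def outline(node, depth=0):
    """Return the tree as a list of indented tag names, for inspection."""
    lines = ["  " * depth + node.tag]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


page_a = ('<html><body><div id="menu"><a href="/">Home</a></div>'
          '<div id="content"><p>First</p></div></body></html>')
page_b = ('<html><body><div id="menu"><a href="/">Home</a></div>'
          '<div id="content"><h1>Second</h1><p>Text</p></div></body></html>')

print("\n".join(outline(extract_template([page_a, page_b]))))
```

In this toy input the menu is identical on both pages and survives the intersection, while the children of the content block differ and are dropped. Positional matching is a deliberate simplification; a practical similarity measure would need to tolerate inserted or reordered nodes.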

Cite

Alarte, J., Insa, D., Silva, J., & Tamarit, S. (2016). Site-level web template extraction based on DOM analysis. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 9609, pp. 36–49). Springer Verlag. https://doi.org/10.1007/978-3-319-41579-6_4
