Large-scale holistic approach to Web block classification: assembling the jigsaws of a Web page puzzle

Abstract

Web blocks are ubiquitous across the Web: navigation menus, advertisements, headers, footers, and sidebars can be found on almost any website. Identifying these blocks is important for tasks such as wrapper induction, assistance to visually impaired people, Web page topic clustering, and Web search, to name a few. Several approaches to the problem of Web block classification have been proposed, but they focused on specific types of blocks, trying to classify all of them with a single set of features. In our approach, each classifier has its own extendable set of features; the features are extracted with the declarative BERyL language, and the classification itself is done by applying machine learning to these feature sets. We further propose taking a holistic view of the page, in which all block classifiers in the classification system interact with each other and the accuracy of individual classifiers improves through this interaction. The holistic approach to Web block classification is implemented through a system of constraints in our block classification system, BERyL. In our evaluation, the BERyL classification system with the holistic approach applied achieves higher F1 scores than individual non-connected classifiers, with an average F1 of 98%. We also consider the distinction between the classification of domain-independent and domain-dependent blocks and propose a large-scale solution to the classification problem for both of these block types.
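The idea of per-classifier feature sets combined through page-level constraints can be illustrated with a small sketch. This is not BERyL's implementation: the `Block` type, the three hand-written scoring functions (standing in for trained machine-learning classifiers), the 0.5 threshold, the specific constraints, and the greedy conflict resolution are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Block:
    """Hypothetical page block with a few layout features."""
    name: str
    top: float           # y-coordinate of the top edge, in pixels
    height: float        # block height, in pixels
    link_density: float  # fraction of the block's text that is link text


# Each classifier uses its own features (assumed scoring rules, not learned).
def header_score(b: Block, page_height: float) -> float:
    if b.height >= 0.2 * page_height:
        return 0.0
    return 1.0 - b.top / page_height


def footer_score(b: Block, page_height: float) -> float:
    if b.height >= 0.2 * page_height:
        return 0.0
    return (b.top + b.height) / page_height


def navigation_score(b: Block, page_height: float) -> float:
    return b.link_density


CLASSIFIERS: Dict[str, Callable[[Block, float], float]] = {
    "header": header_score,
    "footer": footer_score,
    "navigation": navigation_score,
}
UNIQUE_LABELS = {"header", "footer"}  # assumed: at most one of each per page


def violates(label: str, block: Block, assigned: Dict[str, str],
             by_name: Dict[str, Block]) -> bool:
    """Check the assumed page-level constraints for a candidate label."""
    if label in UNIQUE_LABELS and label in assigned.values():
        return True
    if label == "header":  # a header may not sit below the footer
        return any(by_name[n].top < block.top
                   for n, l in assigned.items() if l == "footer")
    if label == "footer":  # a footer may not sit above the header
        return any(by_name[n].top > block.top
                   for n, l in assigned.items() if l == "header")
    return False


def classify_page(blocks: List[Block], page_height: float,
                  threshold: float = 0.5) -> Dict[str, str]:
    """Greedily accept the highest-scoring labels that satisfy the constraints."""
    by_name = {b.name: b for b in blocks}
    candidates = [(score, b.name, label)
                  for b in blocks
                  for label, clf in CLASSIFIERS.items()
                  if (score := clf(b, page_height)) >= threshold]
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    assigned: Dict[str, str] = {}
    for _, name, label in candidates:
        if name not in assigned and not violates(label, by_name[name],
                                                 assigned, by_name):
            assigned[name] = label
    # Blocks that no classifier claims are labelled "other".
    return {b.name: assigned.get(b.name, "other") for b in blocks}


example_blocks = [
    Block("top_bar", top=0, height=80, link_density=0.9),
    Block("menu", top=80, height=40, link_density=0.95),
    Block("content", top=120, height=800, link_density=0.05),
    Block("bottom", top=920, height=80, link_density=0.6),
]
labels = classify_page(example_blocks, page_height=1000)
```

Taken independently, the header scorer in this sketch would accept both `top_bar` and `menu`. With the constraints applied, only one header is kept and `menu` falls to its next-best label, which is the kind of interaction between classifiers that the abstract describes.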

Cite

Kravchenko, A. (2019). Large-scale holistic approach to Web block classification: assembling the jigsaws of a Web page puzzle. World Wide Web, 22(5), 1999–2015. https://doi.org/10.1007/s11280-018-0634-6
