Web Page Segmentation Revisited: Evaluation Framework and Dataset

Abstract

Each web page can be segmented into semantically coherent units that fulfill specific purposes. Though the task of automatic web page segmentation was introduced two decades ago, along with several applications in web content analysis, its foundations are still lacking. Specifically, the developed evaluation methods and datasets presume a certain downstream task, which led to a variety of incompatible datasets and evaluation methods. To address this shortcoming, we contribute two resources: (1) An evaluation framework which can be adjusted to downstream tasks by measuring the segmentation similarity regarding visual, structural, and textual elements, and which includes measures for annotator agreement, segmentation quality, and an algorithm for segmentation fusion. (2) The Webis-WebSeg-20 dataset, comprising 42,450 crowdsourced segmentations for 8,490 web pages, exceeding existing sources by an order of magnitude. Our results help to better understand the "mental segmentation model" of human annotators: Among other things, we find that annotators mostly agree on segmentations for all kinds of web page elements (visual, structural, and textual). Disagreement exists mostly regarding the right level of granularity, indicating a general agreement on the visual structure of web pages.
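
To make the idea of "segmentation similarity over elements" concrete, the following is a minimal sketch, not the framework's actual implementation. It assumes that each segmentation assigns every atomic element (for example a text node) to exactly one segment, and it uses a BCubed-style precision, recall, and F1 as the similarity measure. That choice of measure is an assumption made for illustration; the abstract does not name one. The function names and the toy data are hypothetical.

```python
from collections import defaultdict


def bcubed_precision(candidate, reference):
    """Average, over all elements, of the share of an element's candidate
    segment whose members also share the element's reference segment.

    Both arguments map element ids to segment ids (illustrative format)."""
    if not candidate or candidate.keys() != reference.keys():
        raise ValueError("both segmentations must cover the same non-empty element set")

    # Group elements by their segment in the candidate segmentation.
    members_of = defaultdict(list)
    for element, segment in candidate.items():
        members_of[segment].append(element)

    total = 0.0
    for element, segment in candidate.items():
        members = members_of[segment]
        agreeing = sum(1 for other in members if reference[other] == reference[element])
        total += agreeing / len(members)
    return total / len(candidate)


def bcubed_f1(candidate, reference):
    """Harmonic mean of BCubed precision and recall (recall is precision
    with the roles of the two segmentations swapped)."""
    precision = bcubed_precision(candidate, reference)
    recall = bcubed_precision(reference, candidate)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: two annotators segment the same four elements.
coarse = {"a": 0, "b": 0, "c": 0, "d": 0}  # one large segment
fine = {"a": 0, "b": 0, "c": 1, "d": 1}    # the same region split in two

print(bcubed_precision(fine, coarse))  # precision of the fine segmentation
print(bcubed_precision(coarse, fine))  # its recall
print(bcubed_f1(fine, coarse))
```

In this toy case the two annotators differ only in granularity: the finer segmentation never merges elements that the coarser one separates, so its precision is perfect while its recall drops. This is the kind of disagreement the abstract reports as most common.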

Cite

Kiesel, J., Kneist, F., Meyer, L., Komlossy, K., Stein, B., & Potthast, M. (2020). Web Page Segmentation Revisited: Evaluation Framework and Dataset. In International Conference on Information and Knowledge Management, Proceedings (pp. 3047–3054). Association for Computing Machinery. https://doi.org/10.1145/3340531.3412782