WooIR: A New Open Page Stream Segmentation Dataset

Abstract

In this work we present WooIR, an open, realistic benchmark for Page Stream Segmentation (PSS), the task of recovering document boundaries from aggregated streams of pages. Our dataset consists of over 200 streams of scanned documents, comprising 7K documents, 45K pages and 10M words, originating from documents released by the Dutch government in response to requests made under the Freedom of Information Act. In addition to introducing the dataset, we perform several baseline experiments on it and compare six metrics for the PSS task, in an attempt to unify the field around evaluation metrics better suited to the task. Analysis of the six metrics on the WooIR dataset shows that the dataset contains a good balance of easy and hard samples. The Panoptic Quality metric from the image segmentation field seems the most appropriate evaluation metric for the PSS task.
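To make the task and the Panoptic Quality metric concrete, the following is a minimal sketch and not the authors' implementation. It assumes a stream is encoded as a binary vector in which 1 marks the first page of a document, and it uses the standard Panoptic Quality definition from image segmentation: the summed IoU of matched segments divided by |TP| + 0.5|FP| + 0.5|FN|, where a gold and a predicted document match when their IoU exceeds 0.5. The function names and the encoding are illustrative assumptions; the paper's exact formulation may differ.

```python
from typing import List, Set


def boundaries_to_documents(starts: List[int]) -> List[Set[int]]:
    """Turn a binary start-of-document vector into sets of page indices.

    Assumed encoding (hypothetical): starts[i] == 1 means page i opens a
    new document. The first page always opens a document.
    """
    docs: List[Set[int]] = []
    for page, is_start in enumerate(starts):
        if is_start or not docs:
            docs.append(set())
        docs[-1].add(page)
    return docs


def panoptic_quality(gold: List[int], pred: List[int]) -> float:
    """Panoptic Quality over documents viewed as segments of pages."""
    gold_docs = boundaries_to_documents(gold)
    pred_docs = boundaries_to_documents(pred)

    iou_sum = 0.0
    true_positives = 0
    for g in gold_docs:
        for p in pred_docs:
            iou = len(g & p) / len(g | p)
            # An IoU above 0.5 can hold for at most one predicted document,
            # so the first hit is the unique match.
            if iou > 0.5:
                iou_sum += iou
                true_positives += 1
                break

    false_positives = len(pred_docs) - true_positives
    false_negatives = len(gold_docs) - true_positives
    denom = true_positives + 0.5 * false_positives + 0.5 * false_negatives
    return iou_sum / denom if denom else 0.0


if __name__ == "__main__":
    gold = [1, 0, 0, 1, 0, 1]   # three documents: pages 0-2, 3-4, 5
    pred = [1, 0, 1, 1, 0, 1]   # the first document is wrongly split in two
    print(panoptic_quality(gold, pred))
```

Unlike a per-page boundary accuracy, this score penalizes the wrongly split document both through its reduced overlap and through the extra unmatched predicted document.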

Cite

Van Heusden, R., Kamps, J., & Marx, M. (2022). WooIR: A New Open Page Stream Segmentation Dataset. In ICTIR 2022 - Proceedings of the 2022 ACM SIGIR International Conference on the Theory of Information Retrieval (pp. 24–33). Association for Computing Machinery, Inc. https://doi.org/10.1145/3539813.3545150
