In this work we present WooIR, an open, realistic benchmark for Page Stream Segmentation (PSS), the task of recovering document boundaries from aggregated streams of pages. Our dataset consists of over 200 streams of scanned documents, comprising 7K documents, 45K pages, and 10M words, originating from documents released by the Dutch government in response to requests made under the Freedom of Information Act. Besides introducing the dataset, we run several baseline experiments on it and compare six metrics for the PSS task, in an attempt to unify the field around evaluation metrics better suited to the task. Analysis of the six metrics on the WooIR dataset shows that the dataset contains a good balance of easy and hard samples. The Panoptic Quality metric from the image segmentation field seems the most appropriate evaluation metric for the PSS task.
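To make the Panoptic Quality metric concrete, the following is a minimal sketch of how it could be computed for a single page stream. It uses the standard definition from image segmentation: a true and a predicted segment match when their intersection over union (IoU) exceeds 0.5, and the score is the sum of matched IoUs divided by TP + 0.5·FP + 0.5·FN. The stream encoding (a 0/1 list in which 1 marks a page that starts a new document) and the function names are assumptions made for illustration; they are not taken from the paper, and the authors' implementation may differ.

```python
def boundaries_to_segments(starts):
    """Convert a 0/1 list (1 = page starts a new document) into sets of page indices.

    The first page always opens a document, even if it is marked 0.
    """
    segments = []
    for page, is_start in enumerate(starts):
        if is_start or not segments:
            segments.append(set())
        segments[-1].add(page)
    return segments


def panoptic_quality(true_starts, pred_starts):
    """Panoptic Quality for one stream, treating each document as a segment of pages."""
    if len(true_starts) != len(pred_starts):
        raise ValueError("both streams must cover the same pages")

    truth = boundaries_to_segments(true_starts)
    pred = boundaries_to_segments(pred_starts)

    iou_sum = 0.0
    true_positives = 0
    for t in truth:
        for p in pred:
            iou = len(t & p) / len(t | p)
            # With a threshold above 0.5, a segment can match at most one other segment.
            if iou > 0.5:
                iou_sum += iou
                true_positives += 1
                break

    false_positives = len(pred) - true_positives   # predicted documents with no match
    false_negatives = len(truth) - true_positives  # true documents with no match
    denom = true_positives + 0.5 * false_positives + 0.5 * false_negatives
    return iou_sum / denom if denom else 0.0


# Hypothetical 5-page stream: the truth has documents {0,1,2} and {3,4};
# the prediction wrongly splits page 2 into its own document.
score = panoptic_quality([1, 0, 0, 1, 0], [1, 0, 1, 1, 0])
```

In this toy example the first true document is still matched (IoU 2/3) and the second is matched exactly, while the spurious one-page document counts as a false positive, so the score falls below 1 without dropping to 0. This partial credit for nearly correct documents is what separates a segment-level metric from exact-match scoring.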
CITATION STYLE
Van Heusden, R., Kamps, J., & Marx, M. (2022). WooIR: A New Open Page Stream Segmentation Dataset. In ICTIR 2022 - Proceedings of the 2022 ACM SIGIR International Conference on the Theory of Information Retrieval (pp. 24–33). Association for Computing Machinery, Inc. https://doi.org/10.1145/3539813.3545150