A comparison of two unsupervised table recognition methods from digital scientific articles

Stefan Klampfl; Kris Jack; Roman Kern

Journal Article

D-Lib Magazine (2014) 20(11-12)

DOI: 10.1045/november14-klampfl

14 Citations

35 Readers

Abstract

In digital scientific articles, tables are a common form of presenting information in a structured way. However, the large variability of table layouts and the lack of structural information in digital document formats pose significant challenges for information retrieval and related tasks. In this paper, we present two table recognition methods, based on unsupervised learning techniques and heuristics, which automatically detect both the location and the structure of tables within an article stored as PDF. For both algorithms, the table region detection first identifies the bounding boxes of individual tables from a set of labelled text blocks. In the second step, two different tabular structure detection methods extract a rectangular grid of table cells from the set of words contained in these table regions. We evaluate each stage of the algorithms separately and compare performance values on two data sets from different domains. We find that the table recognition performance is in line with state-of-the-art commercial systems and generalises to the non-scientific domain.

Cite

Klampfl, S., Jack, K., & Kern, R. (2014). A comparison of two unsupervised table recognition methods from digital scientific articles. D-Lib Magazine, 20(11–12). https://doi.org/10.1045/november14-klampfl
