Automatic extraction of figures from scientific publications in high-energy physics

Piotr Adam Praczyk; Javier Nogueras-Iso; Salvatore Mele

Journal Article (Open Access)

Information Technology and Libraries (2013) 32(4) 25-52

DOI: 10.6017/ital.v32i4.3670

Abstract

Plots and figures play an important role in understanding a scientific publication: they provide overviews of large amounts of data or convey ideas that are difficult to present intuitively using text alone. State-of-the-art digital libraries, which serve as gateways to the knowledge encoded in scholarly writings, do not yet take full advantage of the graphical content of documents. Enabling machines to automatically unlock the meaning of scientific illustrations would allow immense improvements in the way scientists work and the way knowledge is processed. In this paper, we present a novel solution to the first problem in processing graphical content: obtaining figures from scholarly publications stored in PDF. Our method relies on the vector properties of documents and, as such, does not introduce the additional errors that methods based on raster image processing do. Emphasis has been placed on correctly processing documents in high-energy physics. The described approach distinguishes different classes of objects appearing in PDF documents and uses spatial clustering techniques to group objects into larger logical entities. A number of heuristics allow the rejection of incorrect figure candidates and the extraction of different types of metadata.
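The abstract does not specify which spatial clustering technique is used, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes each graphical object has already been reduced to an axis-aligned bounding box, and it merges boxes that lie within a hypothetical `margin` of one another using a simple union-find pass. The names `boxes_are_close`, `cluster_boxes`, and `margin` are all invented for this example.

```python
from typing import Dict, List, Tuple

# A bounding box as (x0, y0, x1, y1) with x0 <= x1 and y0 <= y1.
Box = Tuple[float, float, float, float]


def boxes_are_close(a: Box, b: Box, margin: float) -> bool:
    """Return True if the gap between a and b is at most `margin` on both axes."""
    return (
        a[0] - b[2] <= margin and b[0] - a[2] <= margin
        and a[1] - b[3] <= margin and b[1] - a[3] <= margin
    )


def cluster_boxes(boxes: List[Box], margin: float) -> List[Box]:
    """Group nearby boxes and return one enclosing box per group (hypothetical helper)."""
    parent = list(range(len(boxes)))

    def find(i: int) -> int:
        # Follow parent links to the group representative, halving paths as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of boxes that are close enough to each other.
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            if boxes_are_close(boxes[i], boxes[j], margin):
                parent[find(i)] = find(j)

    # Collect the members of each group.
    groups: Dict[int, List[Box]] = {}
    for i, box in enumerate(boxes):
        groups.setdefault(find(i), []).append(box)

    # Replace each group with the smallest box that encloses all its members.
    merged = [
        (
            min(b[0] for b in members),
            min(b[1] for b in members),
            max(b[2] for b in members),
            max(b[3] for b in members),
        )
        for members in groups.values()
    ]
    return sorted(merged)


if __name__ == "__main__":
    objects = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)]
    print(cluster_boxes(objects, margin=5))
```

In this toy input the first two boxes are separated by a horizontal gap of 2 units and are merged into one candidate region, while the third stays separate. A real extractor would additionally need the object classification and rejection heuristics that the abstract mentions.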

Cite (APA)

Praczyk, P. A., Nogueras-Iso, J., & Mele, S. (2013). Automatic extraction of figures from scientific publications in high-energy physics. Information Technology and Libraries, 32(4), 25–52. https://doi.org/10.6017/ital.v32i4.3670
