Extraction of references using layout and formatting information from scientific articles

Roman Kern; Stefan Klampfl

Journal Article

D-Lib Magazine (2013) 19(9-10)

DOI: 10.1045/september2013-kern

13 Citations

25 Readers

Abstract

The automatic extraction of reference meta-data is an important requirement for the efficient management of collections of scientific literature. An existing powerful state-of-the-art system for extracting references from a scientific article is ParsCit; however, it requires the input document to be converted into plain text, thereby ignoring most of the formatting and layout information. In this paper, we quantify the contribution of this additional information to the reference extraction performance by an improved preprocessing using the information contained in PDF files and retraining sequence classifiers on an enhanced feature set. We found that the detection of columns, reading order, and decorations, as well as the inclusion of layout information improves the retrieval of reference strings, and the classification of reference token types can be improved using additional font information. These results emphasize the importance of layout and formatting information for the extraction of meta-data from scientific articles.

Kern, R., & Klampfl, S. (2013). Extraction of references using layout and formatting information from scientific articles. D-Lib Magazine, 19(9–10). https://doi.org/10.1045/september2013-kern
