Extraction of references using layout and formatting information from scientific articles

Abstract

The automatic extraction of reference meta-data is an important requirement for the efficient management of collections of scientific literature. ParsCit is a powerful state-of-the-art system for extracting references from a scientific article; however, it requires the input document to be converted into plain text, thereby ignoring most of the formatting and layout information. In this paper, we quantify the contribution of this additional information to reference extraction performance by improving the preprocessing with the information contained in PDF files and by retraining sequence classifiers on an enhanced feature set. We found that the detection of columns, reading order, and decorations, together with the inclusion of layout information, improves the retrieval of reference strings, and that additional font information improves the classification of reference token types. These results emphasize the importance of layout and formatting information for the extraction of meta-data from scientific articles.

Cite

Kern, R., & Klampfl, S. (2013). Extraction of references using layout and formatting information from scientific articles. D-Lib Magazine, 19(9–10). https://doi.org/10.1045/september2013-kern
