Recognition and classification of figures in PDF documents

Mingyan Shao; Robert P. Futrelle

Conference Proceedings

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3926 LNCS, pp. 231-242 (2006)

DOI: 10.1007/11767978_21

Abstract

Graphics recognition for raster-based input discovers primitives such as lines, arrowheads, and circles. This paper focuses on graphics recognition of figures in vector-based PDF documents. The first stage consists of extracting the graphic and text primitives corresponding to figures. An interpreter was constructed to translate PDF content into a set of self-contained graphics and text objects (in Java), freed from the intricacies of the PDF file. The second stage consists of discovering simple graphics entities which we call graphemes, e.g., a pair of primitive graphic objects satisfying certain geometric constraints. The third stage uses machine learning to classify figures using grapheme statistics as attributes. A boosting-based learner (LogitBoost in the Weka toolkit) was able to achieve 100% classification accuracy in hold-out-one training/testing using 16 grapheme types extracted from 36 figures from BioMed Central journal research papers. The approach can readily be adapted to raster graphics recognition. © Springer-Verlag Berlin Heidelberg 2006.
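The second stage described above, finding pairs of primitives that satisfy geometric constraints, can be illustrated with a minimal Java sketch. The abstract does not list the 16 grapheme types, so the "corner" grapheme used here (two line segments that share an endpoint and are perpendicular) is a hypothetical example, as are the names GraphemeSketch, Segment, countCorners, and the tolerance TOL. The resulting count per figure is the kind of grapheme statistic that could serve as one attribute for the classifier in the third stage.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: none of these names come from the paper.
class GraphemeSketch {

    /** A straight line segment standing in for one extracted graphic primitive. */
    static final class Segment {
        final double x1, y1, x2, y2;

        Segment(double x1, double y1, double x2, double y2) {
            this.x1 = x1; this.y1 = y1; this.x2 = x2; this.y2 = y2;
        }
    }

    /** Assumed tolerance for treating coordinates and angles as equal. */
    static final double TOL = 1e-6;

    static boolean near(double ax, double ay, double bx, double by) {
        return Math.hypot(ax - bx, ay - by) <= TOL;
    }

    /** True if the two segments have an endpoint in common. */
    static boolean shareEndpoint(Segment a, Segment b) {
        return near(a.x1, a.y1, b.x1, b.y1) || near(a.x1, a.y1, b.x2, b.y2)
            || near(a.x2, a.y2, b.x1, b.y1) || near(a.x2, a.y2, b.x2, b.y2);
    }

    /** True if the normalized dot product of the two directions is near zero. */
    static boolean perpendicular(Segment a, Segment b) {
        double ax = a.x2 - a.x1, ay = a.y2 - a.y1;
        double bx = b.x2 - b.x1, by = b.y2 - b.y1;
        double la = Math.hypot(ax, ay), lb = Math.hypot(bx, by);
        if (la == 0 || lb == 0) {
            return false; // a degenerate segment has no direction
        }
        return Math.abs((ax * bx + ay * by) / (la * lb)) <= TOL;
    }

    /** Counts unordered segment pairs that form the hypothetical "corner" grapheme. */
    static int countCorners(List<Segment> segments) {
        int count = 0;
        for (int i = 0; i < segments.size(); i++) {
            for (int j = i + 1; j < segments.size(); j++) {
                Segment a = segments.get(i), b = segments.get(j);
                if (shareEndpoint(a, b) && perpendicular(a, b)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Two chart axes meeting at the origin, plus an unrelated diagonal line.
        List<Segment> figure = Arrays.asList(
            new Segment(0, 0, 10, 0),
            new Segment(0, 0, 0, 10),
            new Segment(2, 2, 8, 8));
        System.out.println("corner graphemes: " + countCorners(figure)); // prints 1
    }
}
```

The pairwise scan is quadratic in the number of primitives, which is acceptable for a sketch; a figure with many primitives would likely call for spatial indexing before testing constraints.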

Citation (APA)

Shao, M., & Futrelle, R. P. (2006). Recognition and classification of figures in PDF documents. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 3926 LNCS, pp. 231–242). Springer-Verlag. https://doi.org/10.1007/11767978_21
