Information extraction from PDF sources based on rule-based system using integrated formats

Abstract

Information extraction from PDF sources is a tedious task. Most existing approaches use either a tag-based format, such as HTML or XML, or a plain-text format for extraction. In this paper, we present an information extraction technique for research papers that exploits both the XML and plain-text formats. Patterns and rules are prepared from the integrated formats. Furthermore, processing XML and plain text differently depending on the situation complements the approach and helps it achieve high accuracy. The proposed approach is heuristic-based and extracts information about the logical structure and supportive materials of research papers.
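To make the idea of rules built on integrated formats concrete, here is a minimal sketch, not the authors' implementation. It assumes a hypothetical XML layout in which each line of the paper is a `text` element with a `size` attribute, and it combines one plain-text rule (a numbered-heading pattern) with one XML rule (font size larger than the body text). The element names, the sample data, and the function `extract_headings` are all assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter
from typing import List

# Hypothetical XML, loosely modeled on what a PDF-to-XML converter might emit.
SAMPLE_XML = """<page>
  <text size="18">Information extraction from PDF sources</text>
  <text size="12">1 Introduction</text>
  <text size="10">Information extraction from PDF sources is tedious.</text>
  <text size="10">2 approaches dominate the existing literature.</text>
  <text size="12">2 Related Work</text>
  <text size="10">Most existing approaches use a single format.</text>
</page>"""

# Plain-text rule (assumed): a numbered heading such as "1 Introduction".
HEADING_PATTERN = re.compile(r"^\d+(\.\d+)*\.?\s+\S")


def extract_headings(xml_source: str) -> List[str]:
    """Return lines that satisfy both the text rule and the XML rule."""
    root = ET.fromstring(xml_source)
    lines = [
        ((el.text or "").strip(), float(el.get("size", "0")))
        for el in root.iter("text")
    ]
    # XML rule (assumed): the most frequent font size is the body text size,
    # and a heading must be set in a larger size than the body.
    body_size = Counter(size for _, size in lines).most_common(1)[0][0]
    return [
        text
        for text, size in lines
        if size > body_size and HEADING_PATTERN.match(text)
    ]


if __name__ == "__main__":
    print(extract_headings(SAMPLE_XML))
```

In this sample, the text rule alone would also accept the body sentence beginning "2 approaches", and the font rule alone would also accept the title. Requiring both rules to agree rejects both false positives, which is the kind of benefit the abstract attributes to using the two formats together.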

Citation (APA)

Ahmad, R., Afzal, M. T., & Qadir, M. A. (2016). Information extraction from PDF sources based on rule-based system using integrated formats. In Communications in Computer and Information Science (Vol. 641, pp. 293–308). Springer Verlag. https://doi.org/10.1007/978-3-319-46565-4_23
