Information extraction from PDF sources based on rule-based system using integrated formats

Riaz Ahmad; Muhammad Tanvir Afzal; Muhammad Abdul Qadir

Conference Proceedings

Communications in Computer and Information Science (2016) 641 293-308

DOI: 10.1007/978-3-319-46565-4_23

28 Citations

20 Readers

Abstract

Information extraction from PDF sources is a tedious task. Most existing approaches use either a tag-based format, such as HTML or XML, or a plain-text format for the extraction of information. In this paper, we present an information extraction technique for research papers that exploits both XML and plain-text formats intelligently. The patterns and rules are prepared from the integrated formats. Furthermore, the intelligent processing of XML and plain text for various situations complements the approach to achieve high accuracy. The proposed approach is heuristic-based and extracts information about the logical structure and supportive materials of research papers.
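The abstract does not spell out the rules themselves, so the following is only a minimal sketch of the general idea of combining a layout cue from an XML rendering with a pattern applied to the plain text. Everything in it is an assumption made for illustration and not the authors' method: the `<page>`/`<text size="...">` layout, the `HEADING_PATTERN` rule, the `extract_headings` helper, and the `body_size` threshold are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical XML rendering of a page (an assumption, not the paper's format):
# each <text> element carries the font size of one line.
SAMPLE_XML = """
<page>
  <text size="18">Information extraction from PDF sources</text>
  <text size="14">1 Introduction</text>
  <text size="10">Information extraction from PDF is a tedious task.</text>
  <text size="10">2 Figures</text>
  <text size="14">2 Related Work</text>
</page>
"""

# Plain-text rule (an assumption): a heading is a section number followed
# by one or more capitalized words and nothing else.
HEADING_PATTERN = re.compile(
    r"^\d+(\.\d+)*\s+[A-Z][A-Za-z]*(\s+[A-Z][A-Za-z]*)*$"
)


def extract_headings(xml_source, body_size=10.0):
    """Return lines that satisfy both the text rule and the layout rule."""
    root = ET.fromstring(xml_source)
    headings = []
    for node in root.iter("text"):
        line = (node.text or "").strip()
        size = float(node.get("size", "0"))
        # Text cue: the line matches the numbered-heading pattern.
        looks_like_heading = HEADING_PATTERN.match(line) is not None
        # Layout cue: the line is set in a font larger than body text.
        is_larger = size > body_size
        if looks_like_heading and is_larger:
            headings.append(line)
    return headings


if __name__ == "__main__":
    for heading in extract_headings(SAMPLE_XML):
        print(heading)
```

In this sketch neither cue is sufficient alone: the title is large but unnumbered, and "2 Figures" matches the text pattern but is set at body size, so only lines passing both checks are reported as section headings.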

Cite

Ahmad, R., Afzal, M. T., & Qadir, M. A. (2016). Information extraction from PDF sources based on rule-based system using integrated formats. In Communications in Computer and Information Science (Vol. 641, pp. 293–308). Springer Verlag. https://doi.org/10.1007/978-3-319-46565-4_23
