Abstract
Extracting information from PDF sources is a tedious task. Most existing approaches extract information from either a tag-based format, such as HTML or XML, or a plain-text format. In this paper, we present an information extraction technique for research papers that intelligently exploits both the XML and plain-text formats. Patterns and rules are prepared from the integrated formats. Furthermore, the intelligent processing of XML and plain text in different situations complements the approach and helps it achieve high accuracy. The proposed approach is heuristic-based and extracts information about the logical structure and supporting materials of research papers.
Cite
Ahmad, R., Afzal, M. T., & Qadir, M. A. (2016). Information extraction from PDF sources based on rule-based system using integrated formats. In Communications in Computer and Information Science (Vol. 641, pp. 293–308). Springer Verlag. https://doi.org/10.1007/978-3-319-46565-4_23