Application of NLP for Information Extraction from Unstructured Documents

Shushanta Pudasaini; Subarna Shakya; Sagar Lamichhane; Sajjan Adhikari; Aakash Tamang; Sujan Adhikari

Conference Proceedings

Lecture Notes in Networks and Systems (2022) 209 695-704

DOI: 10.1007/978-981-16-2126-0_54

Abstract

The world is intrigued by data, and large amounts of capital are invested in devising means to compute statistics and extract analytics from data sources. However, studies on applicant tracking systems, which retrieve valuable information from candidates’ CVs and job descriptions, show that these systems are mostly rule-based and rarely employ contemporary techniques. Although these documents vary in content, their structure is almost identical. Accordingly, in this paper, we implement an NLP pipeline for extracting such structured information from a wide variety of textual documents. As a reference, we consider textual documents used in applicant tracking systems, such as CVs (curricula vitae) and job vacancy information. The proposed pipeline combines several NLP techniques: document classification, document segmentation and text extraction. First, textual documents are classified using support vector machine (SVM) and XGBoost algorithms. The segments of each identified document are then categorized using techniques such as chunking, regex matching and part-of-speech (POS) tagging. Relevant information is further extracted from every segment using techniques such as Named Entity Recognition (NER), regex matching and pool parsing. Extracting such structured information from textual documents can yield insights that are useful for document maintenance, document scoring, matching and auto-filling forms.
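To make the regex-matching step concrete, the following is a minimal sketch of heading-based segmentation followed by field extraction on a toy CV. It is not the authors' implementation: the heading list, the email pattern, the sample text and the function names `segment` and `extract_emails` are all assumptions introduced for illustration, and the abstract does not say which patterns the paper actually uses.

```python
import re

# Hypothetical section headings; the abstract does not list the ones used.
SECTION_HEADINGS = ("education", "experience", "skills")

# A heading is assumed to be a line holding only a known section name,
# optionally followed by a colon.
HEADING_RE = re.compile(
    r"^[ \t]*(%s)[ \t]*:?[ \t]*$" % "|".join(SECTION_HEADINGS),
    re.IGNORECASE | re.MULTILINE,
)

# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def segment(text):
    """Split text into {heading: body}, using heading lines as boundaries."""
    matches = list(HEADING_RE.finditer(text))
    segments = {}
    for i, match in enumerate(matches):
        # A segment runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        segments[match.group(1).lower()] = text[match.end():end].strip()
    return segments


def extract_emails(text):
    """Return every substring that matches the email pattern."""
    return EMAIL_RE.findall(text)


# Toy CV text, invented for this example.
cv = """Jane Doe
jane.doe@example.com

Education
BSc Computer Science

Skills
Python, NLP
"""

sections = segment(cv)
emails = extract_emails(cv)
```

In a fuller pipeline of the kind the abstract describes, each segment body would then be passed to further extractors such as an NER model. Text that appears before the first heading (here, the name and email lines) is not assigned to any segment in this sketch.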

Cite

Pudasaini, S., Shakya, S., Lamichhane, S., Adhikari, S., Tamang, A., & Adhikari, S. (2022). Application of NLP for Information Extraction from Unstructured Documents. In Lecture Notes in Networks and Systems (Vol. 209, pp. 695–704). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-16-2126-0_54
