Automatically identify and label sections in scientific journals using conditional random fields

Sree Harsha Ramesh; Arnab Dhar; Raveena R. Kumar; V. Anjaly; K. S. Sarath; Jason Pearce; Krishna R. Sundaresan

Conference Proceedings

Communications in Computer and Information Science (2016) 641 269-280

DOI: 10.1007/978-3-319-46565-4_21

Abstract

In this paper, we describe a pipeline that automatically converts a journal article in PDF format to an XML document that conforms to the NLM JATS DTD. First, the text and typographical features are extracted from the document using character-level information. Then, we use a trickle-down, multi-level classifier based on conditional random fields (CRFs): at each level, a pre-trained CRF model classifies a given line of text into one of the DTD tags at a particular depth and feeds the resulting tag into the next-level model as a feature. After identifying tags up to level three, we use separate supervised models to parse authors, affiliations, references, and citations. We employ heuristic-based methods to match affiliations to authors and citations to references. The JATS XML thus generated is converted into an RDF document, and SPARQL queries are run on the RDF to answer the queries of Task 2 of the Semantic Publishing Challenge.
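The trickle-down step described in the abstract can be illustrated with a minimal Python sketch. It shows only the feature-passing mechanics: each level tags every line, and that tag is added to the line's features before the next level runs. Everything named here is an assumption made for illustration: `line_features`, `trickle_down`, the toy features, and the rule-based `level1_stub` and `level2_stub`, which merely stand in for the pre-trained CRF models and are not CRFs themselves. The tag names are JATS element names chosen as examples, not the paper's actual label set.

```python
from typing import Callable, Dict, List

Features = Dict[str, object]
# A level model maps per-line feature dicts to one tag per line.
LevelModel = Callable[[List[Features]], List[str]]


def line_features(text: str) -> Features:
    """Toy per-line features (hypothetical; not the paper's feature set)."""
    stripped = text.strip()
    return {
        "n_tokens": len(stripped.split()),
        "starts_with_digit": stripped[:1].isdigit(),
        "has_at": "@" in stripped,
    }


def trickle_down(lines: List[str], models: List[LevelModel]) -> List[List[str]]:
    """Run the level models in order, feeding each level's tags to the next."""
    feats = [line_features(text) for text in lines]
    tags_per_level: List[List[str]] = []
    for depth, model in enumerate(models, start=1):
        tags = model(feats)
        tags_per_level.append(tags)
        # The predicted tag becomes an extra feature for the next level.
        feats = [dict(f, **{f"level{depth}_tag": t}) for f, t in zip(feats, tags)]
    return tags_per_level


def level1_stub(feats: List[Features]) -> List[str]:
    """Stand-in for a trained level-1 model (rule-based, not a CRF)."""
    tags = []
    for f in feats:
        if f["starts_with_digit"]:
            tags.append("back")
        elif f["has_at"]:
            tags.append("front")
        else:
            tags.append("body")
    return tags


def level2_stub(feats: List[Features]) -> List[str]:
    """Stand-in for a level-2 model; it reads the level-1 tag as a feature."""
    refine = {"front": "article-meta", "body": "sec", "back": "ref-list"}
    return [refine[str(f["level1_tag"])] for f in feats]


lines = [
    "jane@example.org",
    "We describe a pipeline.",
    "1. Smith J. A paper. 2015.",
]
result = trickle_down(lines, [level1_stub, level2_stub])
```

In a real pipeline each stub would presumably be replaced by a trained sequence model that labels all lines of a document jointly; the loop that appends `level{depth}_tag` to the features would stay the same.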

Cite

Ramesh, S. H., Dhar, A., Kumar, R. R., Anjaly, V., Sarath, K. S., Pearce, J., & Sundaresan, K. R. (2016). Automatically identify and label sections in scientific journals using conditional random fields. In Communications in Computer and Information Science (Vol. 641, pp. 269–280). Springer Verlag. https://doi.org/10.1007/978-3-319-46565-4_21
