Automatically identify and label sections in scientific journals using conditional random fields

Abstract

In this paper, we describe a pipeline that automatically converts a journal article in PDF format into an XML document that conforms to the NLM JATS DTD. First, the text and typographical features are extracted from the document using character-level information. Then, we use a trickle-down, multi-level classifier based on conditional random fields (CRFs): at each level, a pre-trained CRF model classifies a given line of text into one of the DTD tags at a particular depth and feeds the resulting tag into the next-level model as a feature. After identifying tags up to level three, we use separate supervised models to parse authors, affiliations, references, and citations. We employ heuristic-based methods to match affiliations to authors and citations to references. The resulting JATS XML is converted into an RDF document, and SPARQL queries are run on the RDF to answer the queries of Task 2 of the Semantic Publishing Challenge.
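The trickle-down step can be illustrated with a minimal sketch. It is not the authors' implementation: the three level models below are hand-written rules standing in for the pre-trained CRF models, the features are toy substitutes for the typographical features, and every function and feature name is hypothetical. The sketch only shows how the tag predicted at one depth is added to each line's features before the next level runs.

```python
from typing import Callable, Dict, List

Features = Dict[str, object]
LevelModel = Callable[[List[Features]], List[str]]


def line_features(line: str) -> Features:
    """Toy stand-in for the text and typographical features of one line."""
    stripped = line.strip()
    return {
        "text": stripped.lower(),
        "n_words": len(stripped.split()),
        "starts_with_bracket": stripped.startswith("["),
    }


def level1(seq: List[Features]) -> List[str]:
    """Hypothetical level-1 model: front / body / back (rules, not a CRF)."""
    state, tags = "front", []
    for feats in seq:
        if feats["text"] == "introduction":
            state = "body"
        elif feats["text"] == "references":
            state = "back"
        tags.append(state)
    return tags


def level2(seq: List[Features]) -> List[str]:
    """Hypothetical level-2 model: depends only on the level-1 tag feature."""
    child = {"front": "article-meta", "body": "sec", "back": "ref-list"}
    return [child[feats["tag_l1"]] for feats in seq]


def level3(seq: List[Features]) -> List[str]:
    """Hypothetical level-3 model: uses the level-2 tag plus line features."""
    tags = []
    for feats in seq:
        parent = feats["tag_l2"]
        if parent == "article-meta":
            tags.append("title-group")
        elif parent == "ref-list":
            tags.append("ref" if feats["starts_with_bracket"] else "title")
        else:
            tags.append("title" if feats["n_words"] <= 2 else "p")
    return tags


def trickle_down(lines: List[str], models: List[LevelModel]) -> List[str]:
    """Run the level models in order, feeding each tag to the next level."""
    seq = [line_features(line) for line in lines]
    paths: List[List[str]] = [[] for _ in lines]
    for depth, model in enumerate(models, start=1):
        for feats, path, tag in zip(seq, paths, model(seq)):
            feats[f"tag_l{depth}"] = tag  # becomes a feature for the next level
            path.append(tag)
    return ["/".join(path) for path in paths]


LINES = [
    "A Study of Things",
    "Introduction",
    "We study things in detail here.",
    "References",
    "[1] A. Author. Some paper.",
]

for line, path in zip(LINES, trickle_down(LINES, [level1, level2, level3])):
    print(f"{path:32} {line}")
```

In a real pipeline each level function would be replaced by a trained sequence model that consumes the same feature dictionaries, so the cascade logic in `trickle_down` would stay unchanged.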

Cite

Ramesh, S. H., Dhar, A., Kumar, R. R., Anjaly, V., Sarath, K. S., Pearce, J., & Sundaresan, K. R. (2016). Automatically identify and label sections in scientific journals using conditional random fields. In Communications in Computer and Information Science (Vol. 641, pp. 269–280). Springer Verlag. https://doi.org/10.1007/978-3-319-46565-4_21
