This paper presents a novel two-tier machine learning approach for locating bibliographic references in HTML documents and separating them into fields. First, it demonstrates how Conditional Random Fields (CRFs) with constraints can be used to split bibliographic references into fields such as authors and title. To this end, a unique feature set, constraints, and a method for automatic keyword extraction are introduced. The output of this reference-tagging CRF, together with Part-of-Speech (POS) analysis and Named Entity Recognition (NER), forms the first tier, and the combined output is then used to locate the bibliographic reference section. For this purpose, documents are split into blocks, which are then classified. For this classification task, a Support Vector Machine (SVM) approach is compared with a CRF-based one. We demonstrate that the two-tier approach achieves very good results, while the reference tagging approach is competitive with other state-of-the-art approaches.
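To make the two-tier idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes that per-token features of the kind a CRF tagger consumes can be aggregated into per-block features for a second-tier classifier. The function names `token_features` and `block_features`, the specific features, and the `VENUE_KEYWORDS` list are all hypothetical; the paper extracts its keywords automatically and feeds real CRF, POS, and NER output into the second tier, whereas this sketch uses simple surface cues as a stand-in.

```python
import re

# Hypothetical hand-written keyword list; the paper derives keywords automatically.
VENUE_KEYWORDS = {"proceedings", "journal", "conference", "vol", "pp"}


def token_features(tokens, i):
    """Return a feature dict for token i, of the kind a CRF tagger could consume."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        # Four-digit year, optionally in parentheses and followed by punctuation.
        "is_year": bool(re.fullmatch(r"\(?(19|20)\d{2}\)?[.,]?", tok)),
        # Single capital letter followed by a period, e.g. an author initial.
        "is_initial": bool(re.fullmatch(r"[A-Z]\.,?", tok)),
        "is_keyword": tok.lower().strip(".,()") in VENUE_KEYWORDS,
        # Neighbouring tokens give the sequence model local context.
        "prev": tokens[i - 1].lower() if i > 0 else "<BOS>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>",
    }


def block_features(tokens):
    """Aggregate first-tier cues over one block for a second-tier classifier."""
    feats = [token_features(tokens, i) for i in range(len(tokens))]
    n = max(len(feats), 1)  # avoid division by zero for empty blocks
    return {
        "year_ratio": sum(f["is_year"] for f in feats) / n,
        "initial_ratio": sum(f["is_initial"] for f in feats) / n,
        "keyword_ratio": sum(f["is_keyword"] for f in feats) / n,
    }


reference = ("Lindner, S. (2015). Two-tier machine learning using "
             "conditional random fields with constraints.")
tokens = reference.split()
print(token_features(tokens, 2))
print(block_features(tokens))
```

In a full pipeline, the per-token dicts would go to a sequence labeller and the per-block dicts to a block classifier such as an SVM or a second CRF; both models are omitted here.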
CITATION STYLE
Lindner, S. (2015). Two-tier machine learning using conditional random fields with constraints. In Communications in Computer and Information Science (Vol. 454, pp. 80–95). Springer Verlag. https://doi.org/10.1007/978-3-662-46549-3_6