An unsupervised machine learning approach to body text and table of contents extraction from digital scientific articles

by Stefan Klampfl, Roman Kern
Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Abstract

Scientific articles are predominantly stored in digital document formats, which are optimised for presentation but lack structural information. This makes it difficult to access the documents’ content, for example for information retrieval. We have developed a processing pipeline that uses unsupervised machine learning techniques and heuristics to detect the logical structure of a PDF document. Our system uses only information available from the current document and does not require any pre-trained model. Starting from a set of contiguous text blocks extracted from the PDF file, we first determine geometrical relations between these blocks. These relations, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on this logical structure, we finally extract the body text and the table of contents of a scientific article. We evaluate our pipeline on a number of datasets and compare it with state-of-the-art document structure analysis approaches.
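
The general idea of classifying text blocks using only per-document statistics can be illustrated with a minimal sketch. This is not the authors' pipeline: the `Block` fields, the "most common font size is body text" rule, the `directly_above` relation, and the `max_gap` threshold are all assumptions made for illustration. The sketch takes the dominant font size of the document as the body font, and labels a larger-font block as a heading when it sits directly above a body block.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Block:
    """A contiguous text block (hypothetical structure, not from the paper)."""
    x0: float          # left edge
    y0: float          # top edge; y is assumed to grow downward
    x1: float          # right edge
    y1: float          # bottom edge
    font_size: float
    text: str


def dominant_font_size(blocks):
    """Return the font size that covers the most characters in this document."""
    weights = Counter()
    for b in blocks:
        weights[round(b.font_size, 1)] += len(b.text)
    return weights.most_common(1)[0][0]


def directly_above(a, b, max_gap=20.0):
    """True if block a ends shortly above block b and they overlap horizontally."""
    overlaps = min(a.x1, b.x1) > max(a.x0, b.x0)
    gap = b.y0 - a.y1
    return overlaps and 0 <= gap <= max_gap


def classify(blocks):
    """Label each block 'body', 'heading', or 'other' without any trained model."""
    body_size = dominant_font_size(blocks)
    body_blocks = [b for b in blocks if round(b.font_size, 1) == body_size]
    labels = []
    for b in blocks:
        if round(b.font_size, 1) == body_size:
            labels.append("body")
        elif b.font_size > body_size and any(directly_above(b, o) for o in body_blocks):
            labels.append("heading")
        else:
            labels.append("other")
    return labels


def extract(blocks):
    """Return (body_text, table_of_contents) from the labelled blocks."""
    labels = classify(blocks)
    body = " ".join(b.text for b, l in zip(blocks, labels) if l == "body")
    toc = [b.text for b, l in zip(blocks, labels) if l == "heading"]
    return body, toc
```

For a page with two section headings set in a larger font than the running text, `extract` would return the concatenated running text and the list of heading strings in reading order. A real system would need more robust features than font size and vertical adjacency; the sketch only shows how labels can be derived from the document itself.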
