Layout and content extraction for PDF documents

Hui Chao; Jian Fan

Journal Article (Open Access)

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (2004), vol. 3163, pp. 213–224

DOI: 10.1007/978-3-540-28640-0_20

Abstract

Portable Document Format (PDF) is a common output format for electronic documents. Most PDF documents are untagged and lack basic high-level logical structure information, which makes them difficult to reuse or modify. We developed techniques that identify the logical components on a PDF document page. The outlines, style attributes, and contents of those components are extracted and expressed in an XML format. These techniques could facilitate the reuse and modification of the layout and content of a PDF document page. © Springer-Verlag 2004.

Citation (APA)

Chao, H., & Fan, J. (2004). Layout and content extraction for PDF documents. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 3163, 213–224. https://doi.org/10.1007/978-3-540-28640-0_20
