Integration of text information and graphic composite for pdf document analysis

Canhui Xu; Zhi Tang; Xin Tao; Cao Shi

Conference Proceedings

Communications in Computer and Information Science (2012) 333 CCIS 13-22

DOI: 10.1007/978-3-642-34456-5_2

Abstract

Large-scale digitization has motivated research on processing PDF documents that carry little structural information. Problems such as segmenting graphics that are interleaved with text remain unsolved and limit the practical application of PDF layout analysis. To handle such documents, a hybrid method incorporating text information and graphic composites is proposed to segment pages that are difficult for traditional methods. Specifically, text information is extracted accurately from born-digital documents, in which low-level structure elements are embedded in explicit form. Page text elements are then clustered with a graph-based method according to proximity and feature similarity. Meanwhile, graphic components are extracted by means of texture and morphological analysis. By integrating the clustered text elements with the image-based graphic components, the graphics are segmented for layout analysis. Experimental results on pages of PDF books show satisfactory performance. © 2012 Springer-Verlag.
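
The abstract does not say how the graph over text elements is built, so the following is only a minimal sketch of one possible reading of "clustered by a graph-based method according to proximity and feature similarity". It assumes each text element is an axis-aligned bounding box with a font size, links two elements when their gap and font-size difference fall below thresholds, and takes connected components as clusters. The names `TextBox`, `box_gap`, `cluster_text_boxes`, and the threshold values are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextBox:
    """Hypothetical text element: bounding box plus one feature (font size)."""
    x0: float
    y0: float
    x1: float
    y1: float
    font_size: float


def box_gap(a: TextBox, b: TextBox) -> float:
    """Largest axis-wise gap between two boxes (0 if they touch or overlap)."""
    dx = max(a.x0 - b.x1, b.x0 - a.x1, 0.0)
    dy = max(a.y0 - b.y1, b.y0 - a.y1, 0.0)
    return max(dx, dy)


def cluster_text_boxes(boxes, max_gap=12.0, max_font_diff=0.5):
    """Group boxes into connected components of a proximity/similarity graph.

    An edge joins two boxes when they are close (gap <= max_gap) and
    similar (font sizes differ by at most max_font_diff). Both thresholds
    are illustrative assumptions. Returns sorted lists of box indices.
    """
    parent = list(range(len(boxes)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            a, b = boxes[i], boxes[j]
            close = box_gap(a, b) <= max_gap
            similar = abs(a.font_size - b.font_size) <= max_font_diff
            if close and similar:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    clusters = {}
    for i in range(len(boxes)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Two adjacent body lines, a nearby caption in a smaller font,
# and a distant body line further down the page.
boxes = [
    TextBox(0, 0, 100, 10, 10.0),
    TextBox(0, 12, 100, 22, 10.0),
    TextBox(0, 24, 100, 32, 8.0),
    TextBox(0, 200, 100, 210, 10.0),
]
print(cluster_text_boxes(boxes))
```

In this sketch the two body lines merge into one cluster, while the caption stays separate because of its font size and the last line stays separate because of its distance. The pairwise loop is quadratic in the number of elements, which is acceptable for a single page but would call for a spatial index on larger inputs.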

Xu, C., Tang, Z., Tao, X., & Shi, C. (2012). Integration of text information and graphic composite for pdf document analysis. In Communications in Computer and Information Science (Vol. 333 CCIS, pp. 13–22). https://doi.org/10.1007/978-3-642-34456-5_2
