This paper presents our research on extracting referential heading-entries from recognized table of contents (TOC) pages. The task faces two issues: the complexity of layouts (e.g., a referential heading-entry can span one or many lines and may include decorative text), and text errors introduced by OCR processing in the training data. Our approach uses several layout-based and content-based features to classify the textual lines of TOC pages in the datasets. We also propose synthesis rules that combine related, classified lines to identify referential heading-entries. The experiments are conducted on the ICDAR Book Structure Extraction Datasets 2009, 2011, and 2013. The experimental results show that the proposed approach is more efficient than previous methods of referential heading-entry extraction.
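The line-combination step can be illustrated with a minimal sketch. It assumes each TOC line has already been classified, and it uses hypothetical labels (`start`, `continuation`, `decoration`) and a hypothetical `merge_lines` helper; the abstract does not specify the paper's actual label set, features, or synthesis rules, so this is only one plausible reading of the idea.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Entry:
    title: str
    page: Optional[int]


# Trailing page number, optionally preceded by dot leaders or spaces.
PAGE_RE = re.compile(r"^(.*?)[\s.]*(\d+)$")


def merge_lines(lines: List[Tuple[str, str]]) -> List[Entry]:
    """Combine classified lines into heading entries.

    Each item is (text, label). The labels are assumptions made for
    this sketch, not the paper's label set:
      "start"        - first line of a heading entry
      "continuation" - further line of the current entry
      "decoration"   - decorative text to be ignored
    """
    groups: List[List[str]] = []
    for text, label in lines:
        if label == "decoration":
            continue
        if label == "start" or not groups:
            groups.append([text.strip()])
        else:
            groups[-1].append(text.strip())

    entries: List[Entry] = []
    for parts in groups:
        joined = " ".join(parts)
        match = PAGE_RE.match(joined)
        if match:
            entries.append(Entry(match.group(1).strip(), int(match.group(2))))
        else:
            # No trailing page number found; keep the text as the title.
            entries.append(Entry(joined, None))
    return entries


sample = [
    ("CONTENTS", "decoration"),
    ("Chapter I. The Early", "start"),
    ("Years . . . . 12", "continuation"),
    ("Chapter II. Travels 45", "start"),
]
result = merge_lines(sample)
```

In this sketch a multi-line entry is rebuilt by appending continuation lines to the most recent start line, and the page reference is taken from the trailing digits of the joined text. Handling OCR-damaged page numbers would need extra rules that are not shown here.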
CITATION STYLE
Nguyen, P. T., & Nguyen, D. T. (2015). Extraction of referential heading-entries in recognized table of contents pages. In Advances in Intelligent Systems and Computing (Vol. 348, pp. 1–9). Springer Verlag. https://doi.org/10.1007/978-3-319-18503-3_1