Text detection in document images by machine learning algorithms

Darko Zelenika; Janez Povh; Bernard Ženko

Conference Proceedings

Advances in Intelligent Systems and Computing (2016) 403 169-179

DOI: 10.1007/978-3-319-26227-7_16

4 Citations

7 Readers

Abstract

In this paper, we consider the problem of text detection in document images. This problem plays an important role in OCR systems and is a challenging task. In the first step of our proposed text detection approach, we use a self-adjusting bottom-up segmentation algorithm to segment a document image into a set of connected components (CCs). The segmentation algorithm is based on the Sobel edge detection method. In the second step, the CCs are described in terms of 27 features, and a machine learning algorithm is then used to classify the CCs as text or nontext. To test the approach, we have collected a dataset (ASTRoID), which contains 500 images of text blocks and 500 images of nontext blocks. We empirically compare the performance of the proposed text detection method when using seven different machine learning algorithms.
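
As a rough illustration of the first step described in the abstract, the following minimal sketch computes a Sobel edge map, groups edge pixels into connected components, and derives a few shape descriptors per component. It is not the authors' implementation: the fixed `threshold` stands in for the self-adjusting behaviour, which the abstract does not detail, and the four features shown are hypothetical placeholders, since the 27 features are not listed here. All function names are illustrative.

```python
from collections import deque

def sobel_magnitude(img):
    """Approximate gradient magnitude |Gx| + |Gy| with 3x3 Sobel kernels.
    img is a list of rows of grayscale values; border pixels stay 0."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y-1][x+1] + 2 * img[y][x+1] + img[y+1][x+1]
                  - img[y-1][x-1] - 2 * img[y][x-1] - img[y+1][x-1])
            gy = (img[y+1][x-1] + 2 * img[y+1][x] + img[y+1][x+1]
                  - img[y-1][x-1] - 2 * img[y-1][x] - img[y-1][x+1])
            out[y][x] = abs(gx) + abs(gy)
    return out

def connected_components(mask):
    """Group 8-connected True pixels; returns a list of [(y, x), ...] lists."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if not mask[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue, pixels = deque([(y, x)]), []
            while queue:  # breadth-first flood fill from the seed pixel
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = cy + dy, cx + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            comps.append(pixels)
    return comps

def describe(pixels):
    """Hypothetical per-component features (not the paper's 27)."""
    ys = [p[0] for p in pixels]
    xs = [p[1] for p in pixels]
    height = max(ys) - min(ys) + 1
    width = max(xs) - min(xs) + 1
    return {
        "width": width,
        "height": height,
        "aspect_ratio": width / height,
        "fill_ratio": len(pixels) / (width * height),
    }

def segment(img, threshold=200):
    """Edge map -> binary mask -> connected components -> feature dicts.
    The fixed threshold is an assumption made for this sketch."""
    mag = sobel_magnitude(img)
    mask = [[m >= threshold for m in row] for row in mag]
    return [describe(c) for c in connected_components(mask)]
```

Calling `segment` on a grayscale image given as a list of rows returns one feature dictionary per component. In the second step, such feature vectors would be passed to a trained classifier that labels each component as text or nontext; that step is left out here because the abstract does not name the seven algorithms compared.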

Cite

Zelenika, D., Povh, J., & Ženko, B. (2016). Text detection in document images by machine learning algorithms. In Advances in Intelligent Systems and Computing (Vol. 403, pp. 169–179). Springer Verlag. https://doi.org/10.1007/978-3-319-26227-7_16