Text segmentation for document recognition

Nicola Nobile; Ching Y. Suen

Abstract

Document segmentation is the process of dividing a document (handwritten or printed) into its base components (lines, words, characters). Once the zones (text and non-text) have been identified, the segmentation of the text elements can begin. Several challenges must be resolved in order to segment the elements correctly. For line segmentation, touching, broken, or overlapping text lines occur frequently. Handwritten documents have the additional challenge of curvilinear lines. Once a line has been segmented, it is processed further to segment it into characters. Similar problems of touching and broken elements exist for characters.

An added level of complexity exists since documents have a degree of noise, which can come from scanning, photocopying, or physical damage. Historical documents have some amount of degradation. In addition, variations in typefaces for printed text and in writing styles for handwritten text bring new difficulties for segmentation and recognition algorithms.

This chapter describes some methodologies from recent research that propose solutions to these obstacles. Line segmentation solutions include horizontal projection, region growth techniques, probability density, and the level set method as possible, albeit partial, solutions. A method of angle stepping to detect angles for slanted lines is presented. Locating the boundaries of characters in degraded historical documents employs multi-level classifiers and a level set active contour scheme as a possible solution. Mathematical expressions are generally more complex since their layout does not follow standard text blocks. Lines can be composed of split sections (numerator and denominator), can have symbols spanning and overlapping other elements, and contain a higher concentration of superscript and subscript characters than regular text lines. Template matching is described as a partial solution to segment these characters.

The methods described here apply to both printed and handwritten text. They have been tested on Latin-based scripts as well as Arabic, Dari, Farsi, Pashto, and Urdu.
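
To illustrate the simplest of the line segmentation techniques named in the abstract, horizontal projection, the following is a minimal sketch and not the chapter's own implementation. It assumes a binarized page stored as a list of rows in which 1 marks an ink pixel and 0 marks background; the function names and the `min_ink` threshold are hypothetical choices made for this example.

```python
def horizontal_projection(image):
    """Count ink pixels (value 1) in each row of a binary image."""
    return [sum(row) for row in image]


def segment_lines(image, min_ink=1):
    """Return (start_row, end_row) pairs, end exclusive, one per run of
    consecutive rows whose ink count is at least min_ink (assumed threshold)."""
    profile = horizontal_projection(image)
    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                   # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y))    # leaving a text line
            start = None
    if start is not None:               # text line reaches the bottom edge
        lines.append((start, len(profile)))
    return lines


# Toy page: two text lines separated by two blank rows.
page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 1],
    [0, 1, 1, 1, 0, 0],
]
print(segment_lines(page))  # [(1, 3), (5, 7)]
```

A sketch like this relies on blank rows between lines, so it breaks down on the touching, overlapping, or curvilinear lines mentioned above, which is consistent with the abstract describing these techniques as partial solutions.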

Cite

Nobile, N., & Suen, C. Y. (2014). Text segmentation for document recognition. In Handbook of Document Image Processing and Recognition (pp. 257–290). Springer London. https://doi.org/10.1007/978-0-85729-859-1_8
