In large-scale digitization processes, several common tasks are performed to provide an electronic version of a paper document. One of the first steps is thresholding the image, which is necessary for the subsequent procedures to work properly. Many binarization methods have been proposed to solve this problem, but they need to be tuned on the target document corpus to obtain the best results. In this paper, we introduce a fully automatic thresholding method for printed document analysis. The purpose is to obtain the most suitable binarizer for a given document image according to the quality of the output of an OCR system. Tuning can be done either on a full page or on sample text-lines extracted from a page image. As opposed to existing methods, the tuning is directly goal-directed and depends neither on subjective visual evaluation nor on non-representative performance criteria. We demonstrate the effectiveness of this approach on a subset of 740 pages from the Google 1000 Books dataset. Results show that, by choosing the right binarizer parameters with the Recognition Driven Thresholding (RDT) method, the words-in-dictionary error rate of an OCR system can be reduced by 6%.
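The selection loop described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a single global threshold as the only binarizer parameter, a caller-supplied `ocr` function (hypothetical, standing in for any OCR engine), and a plain word set as the dictionary. The names `binarize`, `words_in_dictionary_error`, and `tune_threshold` are illustrative only.

```python
from typing import Callable, Iterable, List, Set, Tuple

Image = List[List[int]]  # grayscale rows, pixel values in 0..255


def binarize(image: Image, threshold: int) -> Image:
    """Global threshold (an assumed, simplified binarizer):
    pixels darker than `threshold` become ink (1), the rest background (0)."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def words_in_dictionary_error(text: str, dictionary: Set[str]) -> float:
    """Fraction of recognized words that are not found in the dictionary."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    if not words:
        return 1.0  # no recognized words counts as the worst case
    misses = sum(1 for w in words if w not in dictionary)
    return misses / len(words)


def tune_threshold(
    image: Image,
    ocr: Callable[[Image], str],  # hypothetical OCR callable supplied by the caller
    dictionary: Set[str],
    candidates: Iterable[int],
) -> Tuple[int, float]:
    """Try each candidate threshold and keep the one whose OCR output
    has the lowest words-in-dictionary error rate."""
    best_t, best_err = None, float("inf")
    for t in candidates:
        err = words_in_dictionary_error(ocr(binarize(image, t)), dictionary)
        if err < best_err:
            best_t, best_err = t, err
    if best_t is None:
        raise ValueError("candidates must not be empty")
    return best_t, best_err
```

The same loop could be run on a full page or on a few sample text-lines by passing the corresponding image; the OCR output alone decides which parameter value is kept, so no ground-truth transcription is needed.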
Rangoni, Y., Shafait, F., & Breuel, T. M. (2009). OCR based thresholding. In Proceedings of the 11th IAPR Conference on Machine Vision Applications, MVA 2009 (pp. 98–101).