Document Image Analysis Using Imagemagick and Tesseract-ocr

  • M L P
  • P J D
  • D N S

Abstract

Document image analysis is the field of converting paper documents into an editable electronic representation by performing optical character recognition (OCR). Open source OCR systems have progressed substantially in recent years. The tesseract-ocr engine, which took part in the UNLV Fourth Annual Test of OCR Accuracy as the HP Research Prototype, is described in a comprehensive overview. Emphasis is placed on aspects that are novel or at least unusual in an OCR engine, in particular the line finding, the features/classification methods, and the adaptive classifier. OCRopus is one of the leading open source document analysis systems; it uses tesseract-ocr and has a modular, pluggable architecture. ImageMagick is an open source image processing tool. This paper presents an overview of the steps involved in a document image analysis system and illustrates them with examples from a combination of ImageMagick and OCRopus.
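
The abstract describes a two-stage flow: image clean-up with ImageMagick, then recognition with tesseract-ocr. The abstract gives no concrete commands, so the following Python sketch is only an illustration of how such a flow might be wired together. The function names, the file names, and the particular clean-up options (grayscale, deskew, normalize, threshold) are assumptions for this example and are not taken from the paper. It also assumes the `convert` and `tesseract` executables are installed; on ImageMagick 7 the binary is normally `magick`.

```python
import subprocess
from pathlib import Path


def preprocess_command(src, dst, binary="convert"):
    """Build an ImageMagick command that cleans up a scanned page.

    The option values are illustrative guesses, not tuned settings.
    """
    return [
        binary, str(src),
        "-colorspace", "Gray",   # drop colour information
        "-deskew", "40%",        # straighten a slightly rotated scan
        "-normalize",            # stretch the contrast range
        "-threshold", "50%",     # binarize to black and white
        str(dst),
    ]


def ocr_command(image, output_base, lang="eng"):
    """Build a tesseract command; text is written to <output_base>.txt."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def run_pipeline(src, workdir="."):
    """Run both stages and return the recognized text (hypothetical helper)."""
    workdir = Path(workdir)
    cleaned = workdir / "cleaned.png"
    output_base = workdir / "page"
    subprocess.run(preprocess_command(src, cleaned), check=True)
    subprocess.run(ocr_command(cleaned, output_base), check=True)
    return output_base.with_suffix(".txt").read_text(encoding="utf-8")
```

Calling `run_pipeline("scan.png")` would write `cleaned.png` and `page.txt` to the working directory and return the recognized text. Building the commands as argument lists, rather than shell strings, avoids quoting problems with file names that contain spaces.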

Cite

M L, Prof. S., P J, Dr. A., & D N, S. (2016). Document Image Analysis Using Imagemagick and Tesseract-ocr. IARJSET, 3(5), 108–112. https://doi.org/10.17148/iarjset.2016.3523