Recognize Meaningful Words and Idioms from the Images Based on OCR Tesseract Engine and NLTK

Partha Chakraborty; Md Rakib Mia; Humayun Kabir Sumon; Aditi Sarker; Al Imtiaz; Md Mahbubur Rahman; Mohammad Abu Yousuf; Tanupriya Choudhury

Book Chapter

Springer Science and Business Media Deutschland GmbH, (2022), 297-310

DOI: 10.1007/978-981-19-1520-8_23

Abstract

Optical character recognition (OCR) is a text extraction technology that works with photos, scanned data, and PDF documents. By extracting text data, OCR systems typically convert non-editable, non-searchable documents into editable, searchable files, which simplifies finding and identifying information in digitized files. Tesseract is a powerful OCR engine that supports over 100 languages, and the Tesseract package provides R bindings for it. The engine is highly customizable, so its detection algorithms can be fine-tuned to achieve the best possible results. With the help of Tesseract OCR technology, a method for extracting text from photos was developed. The proposed OCR system accepts any image as input and converts it into a searchable text document. The system can also search for words within the generated text and display their Bengali (Bangla) meanings. The engine first finds the lines and words and then recognizes the words; a static character classifier classifies the characters, an analysis step follows, and an adaptive classifier is applied last. In addition to OCR, the framework includes a natural language processing approach for identifying commonly used terms in the output text and presenting their Bengali meanings.
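
As a rough illustration of the workflow the abstract describes (OCR an image, then look up words and idioms in the recognized text), the following Python sketch may help. It assumes the `pytesseract` and Pillow packages for the OCR step. The glossary entries, the function names, and the regex tokenizer (a simple stand-in for NLTK tokenization) are hypothetical and are not taken from the chapter.

```python
import re

# Hypothetical glossary for illustration only; the chapter's actual
# word and idiom list is not given in the abstract.
GLOSSARY = {
    "book": "বই",
    "water": "পানি",
    "piece of cake": "খুব সহজ কাজ",
}


def extract_text(image_path, lang="eng"):
    """Run Tesseract OCR on an image file and return the recognized text.

    Assumes pytesseract, Pillow, and the Tesseract binary are installed.
    The imports are local so the lookup below works without them.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)


def find_meanings(text, glossary=GLOSSARY):
    """Return (term, meaning) pairs for glossary terms found in the text.

    Terms are matched on whole-word boundaries, case-insensitively, and
    returned in order of first appearance. Multi-word idioms are matched
    across line breaks because the text is re-joined with single spaces.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    padded = " " + " ".join(tokens) + " "
    hits = []
    for term, meaning in glossary.items():
        position = padded.find(" " + term + " ")
        if position != -1:
            hits.append((position, term, meaning))
    return [(term, meaning) for _, term, meaning in sorted(hits)]
```

A typical call would chain the two steps, for example `find_meanings(extract_text("page.png"))`, where `page.png` is a placeholder file name. Keeping the lookup separate from the OCR call makes it possible to check the matching logic on plain strings before any image is involved.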

Cite

Chakraborty, P., Rakib Mia, M., Sumon, H. K., Sarker, A., Imtiaz, A., Mahbubur Rahman, M., … Choudhury, T. (2022). Recognize Meaningful Words and Idioms from the Images Based on OCR Tesseract Engine and NLTK. In Lecture Notes in Electrical Engineering (Vol. 888, pp. 297–310). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-19-1520-8_23
