Optical character recognition (OCR) is a technology for extracting text from photos, scanned documents, and PDF files. By extracting the text, OCR systems convert non-editable, non-searchable documents into editable, searchable files, which makes it easier to find and identify information in digitized material. Tesseract is a powerful OCR engine that supports over 100 languages; R bindings are provided by the Tesseract package. The engine is highly customizable, so its detection algorithms can be tuned for the best possible results. A method for extracting text from photos was developed using Tesseract OCR technology. The proposed OCR system accepts any image as input and converts it into a searchable text document. It can also search for words within the generated text and display their Bengali meanings. Recognition proceeds in stages: the engine first finds the lines and words, then recognizes the words; a static character classifier classifies each character, an analysis step follows, and finally an adaptive classifier is applied. In addition to OCR, the framework includes a natural language processing component that classifies commonly used terms in the output text and pairs them with their Bangla meanings.
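The image-to-text and word-lookup steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the `pytesseract` wrapper and Pillow for the OCR call, uses a plain regular-expression tokenizer in place of NLTK so the text-processing helpers run without extra dependencies, and relies on a tiny hypothetical glossary (`BANGLA_GLOSSARY`), since the abstract does not give the actual word list. The function names are illustrative.

```python
import re

# Hypothetical mini glossary for illustration only; the word list used in the
# paper is not given in the abstract.
BANGLA_GLOSSARY = {
    "book": "বই",
    "water": "পানি",
    "school": "বিদ্যালয়",
}


def ocr_image(path, lang="eng"):
    """Run Tesseract on an image file and return the recognized text.

    Assumes pytesseract, Pillow, and the Tesseract binary are installed.
    Imports are local so the helpers below work without them.
    """
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(path), lang=lang)


def tokenize(text):
    """Split text into lowercase alphabetic tokens (a stand-in for NLTK)."""
    return re.findall(r"[a-z]+", text.lower())


def find_meanings(text, glossary=BANGLA_GLOSSARY):
    """Map each glossary word found in the text to its Bangla meaning,
    in order of first appearance."""
    found = {}
    for word in tokenize(text):
        if word in glossary and word not in found:
            found[word] = glossary[word]
    return found


def search_word(text, query):
    """Return the lines of the OCR output that contain the query word."""
    q = query.lower()
    return [line for line in text.splitlines() if q in tokenize(line)]
```

With an image on disk, one might call `find_meanings(ocr_image("page.png"))`, where `page.png` is a placeholder path; the helpers also work on any plain string, which makes them easy to check without running OCR.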
CITATION STYLE
Chakraborty, P., Rakib Mia, M., Sumon, H. K., Sarker, A., Imtiaz, A., Mahbubur Rahman, M., … Choudhury, T. (2022). Recognize Meaningful Words and Idioms from the Images Based on OCR Tesseract Engine and NLTK. In Lecture Notes in Electrical Engineering (Vol. 888, pp. 297–310). Springer Science and Business Media Deutschland GmbH. https://doi.org/10.1007/978-981-19-1520-8_23