Word-level thirteen official Indic languages database for script identification in multi-script documents

Abstract

Without a publicly available database, we can neither advance research nor compare fairly against state-of-the-art methods. To bridge this gap, we present a database of eleven Indic scripts covering thirteen official languages, intended for script identification in multi-script document images. The database comprises 39K words distributed equally across languages (i.e., 3K words per language). We also study three pertinent features, spatial energy (SE), wavelet energy (WE) and the Radon transform (RT), together with their possible combinations, using three classifiers: multilayer perceptron (MLP), fuzzy unordered rule induction algorithm (FURIA) and random forest (RF). In our tests, using all features, MLP performs best, achieving bi-script accuracies of 99.24% (keeping Roman common) and 98.38% (keeping Devanagari common), and a tri-script accuracy of 98.19% (keeping both Devanagari and Roman common).
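To make the idea of a wavelet energy feature concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which wavelet, how many decomposition levels, or which energy definition the paper uses, so the one-level orthonormal Haar transform, the sum-of-squares energy, and the names `haar_level1`, `band_energies`, `row_diff`, `col_diff` and `diag` are all assumptions chosen for illustration.

```python
from typing import Dict, List

Image = List[List[float]]


def haar_level1(img: Image) -> Dict[str, Image]:
    """One-level 2D orthonormal Haar decomposition (assumed wavelet choice).

    The image is split into 2x2 blocks; each block yields one coefficient
    per sub-band. Height and width must both be even.
    """
    rows, cols = len(img), len(img[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must both be even")
    bands: Dict[str, Image] = {"LL": [], "row_diff": [], "col_diff": [], "diag": []}
    for r in range(0, rows, 2):
        ll, rd, cd, dg = [], [], [], []
        for c in range(0, cols, 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            ll.append((a + b + d + e) / 2.0)  # local average (approximation)
            rd.append((a + b - d - e) / 2.0)  # top row minus bottom row
            cd.append((a - b + d - e) / 2.0)  # left column minus right column
            dg.append((a - b - d + e) / 2.0)  # diagonal difference
        bands["LL"].append(ll)
        bands["row_diff"].append(rd)
        bands["col_diff"].append(cd)
        bands["diag"].append(dg)
    return bands


def band_energies(img: Image) -> Dict[str, float]:
    """Sum of squared coefficients per sub-band (assumed energy definition)."""
    return {
        name: sum(v * v for row in band for v in row)
        for name, band in haar_level1(img).items()
    }


# Toy 2x2 "word image"; a real input would be a grayscale word crop.
print(band_energies([[1, 2], [3, 4]]))
# -> {'LL': 25.0, 'row_diff': 4.0, 'col_diff': 1.0, 'diag': 0.0}
```

Because the transform is orthonormal, the four band energies sum to the energy of the input image (here 25 + 4 + 1 + 0 = 30 = 1 + 4 + 9 + 16), so the per-band values describe how that energy is split between smooth and edge-like structure. A feature vector of this kind could then be passed to any of the classifiers named above.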

Cite

Obaidullah, S. M., Santosh, K. C., Halder, C., Das, N., & Roy, K. (2017). Word-level thirteen official Indic languages database for script identification in multi-script documents. In Communications in Computer and Information Science (Vol. 709, pp. 16–27). Springer Verlag. https://doi.org/10.1007/978-981-10-4859-3_2