Without a publicly available database, we can neither advance research nor make a fair comparison with state-of-the-art methods. To bridge this gap, we present a database of eleven Indic scripts from thirteen official languages for script identification in multi-script document images. The database comprises 39K words, equally distributed across languages (i.e., 3K words per language). We also study three pertinent features: spatial energy (SE), wavelet energy (WE) and the Radon transform (RT), including their possible combinations, using three classifiers: multilayer perceptron (MLP), fuzzy unordered rule induction algorithm (FURIA) and random forest (RF). In our tests, using all features, MLP is the best performer, achieving bi-script accuracies of 99.24% (keeping Roman common) and 98.38% (keeping Devanagari common), and a tri-script accuracy of 98.19% (keeping both Devanagari and Roman common).
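The size of the experimental space described above can be made concrete with a short sketch. It only enumerates the non-empty combinations of the three feature families and pairs them with the three classifiers, and checks the database size from the per-language count. The names `FEATURES`, `CLASSIFIERS`, `feature_combinations` and `experiment_grid` are illustrative, not taken from the paper, and it is an assumption that every combination was evaluated with every classifier.

```python
from itertools import combinations

# Abbreviations as given in the abstract; the grouping into tuples is illustrative.
FEATURES = ("SE", "WE", "RT")          # spatial energy, wavelet energy, Radon transform
CLASSIFIERS = ("MLP", "FURIA", "RF")   # the three classifiers named in the abstract

# Database size as stated in the abstract: 13 languages, 3K words each.
LANGUAGES = 13
WORDS_PER_LANGUAGE = 3000
TOTAL_WORDS = LANGUAGES * WORDS_PER_LANGUAGE


def feature_combinations(features=FEATURES):
    """Return every non-empty subset of the feature families, smallest first."""
    return [
        combo
        for size in range(1, len(features) + 1)
        for combo in combinations(features, size)
    ]


def experiment_grid(features=FEATURES, classifiers=CLASSIFIERS):
    """Pair each feature combination with each classifier (assumed full grid)."""
    return [
        (combo, clf)
        for combo in feature_combinations(features)
        for clf in classifiers
    ]


if __name__ == "__main__":
    for combo, clf in experiment_grid():
        print("+".join(combo), "with", clf)
    print("total words:", TOTAL_WORDS)
```

Under these assumptions there are seven feature combinations and 21 feature-classifier pairings; the reported best result corresponds to the pairing of all three features with MLP.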
CITATION STYLE
Obaidullah, S. M., Santosh, K. C., Halder, C., Das, N., & Roy, K. (2017). Word-level thirteen official Indic languages database for script identification in multi-script documents. In Communications in Computer and Information Science (Vol. 709, pp. 16–27). Springer Verlag. https://doi.org/10.1007/978-981-10-4859-3_2