Word-level thirteen official Indic languages database for script identification in multi-script documents

Sk Md Obaidullah; K. C. Santosh; Chayan Halder; Nibaran Das; Kaushik Roy

Conference Proceedings

Communications in Computer and Information Science (2017) 709 16-27

DOI: 10.1007/978-981-10-4859-3_2

Abstract

Without a publicly available database, research cannot advance, nor can methods be compared fairly with the state of the art. To bridge this gap, we present a database of eleven Indic scripts covering thirteen official languages, intended for script identification in multi-script document images. The database comprises 39K words distributed equally across languages (3K words per language). We also study three features, spatial energy (SE), wavelet energy (WE) and the Radon transform (RT), together with their possible combinations, using three classifiers: multilayer perceptron (MLP), fuzzy unordered rule induction algorithm (FURIA) and random forest (RF). In our tests, using all features, MLP performs best, with bi-script accuracies of 99.24% (keeping Roman common) and 98.38% (keeping Devanagari common), and a tri-script accuracy of 98.19% (keeping both Devanagari and Roman common).
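To make the feature families named in the abstract more concrete, the following is a minimal sketch of how energy-style and projection-style descriptors might be computed for a single word image. It is not the authors' implementation: the abstract does not specify the wavelet family, decomposition depth, or Radon angles, so the one-level Haar wavelet, the two axis-aligned projections (which correspond to Radon projections at 0 and 90 degrees), and the function names `haar_subband_energies`, `axis_projections`, and `word_features` are all assumptions made for illustration.

```python
def haar_subband_energies(img):
    """Energy (sum of squared coefficients) of each subband of a
    one-level orthonormal 2D Haar decomposition.

    `img` is a list of equal-length rows of numbers with even height
    and width. The Haar wavelet is an assumption; the abstract does
    not name the wavelet used.
    """
    rows, cols = len(img), len(img[0])
    if rows % 2 or cols % 2:
        raise ValueError("image height and width must be even")
    energies = {"LL": 0.0, "LH": 0.0, "HL": 0.0, "HH": 0.0}
    for r in range(0, rows, 2):
        for c in range(0, cols, 2):
            # 2x2 block: a b on the top row, p q on the bottom row
            a, b = img[r][c], img[r][c + 1]
            p, q = img[r + 1][c], img[r + 1][c + 1]
            ll = (a + b + p + q) / 2  # local average (approximation)
            lh = (a + b - p - q) / 2  # top-versus-bottom difference
            hl = (a - b + p - q) / 2  # left-versus-right difference
            hh = (a - b - p + q) / 2  # diagonal difference
            energies["LL"] += ll * ll
            energies["LH"] += lh * lh
            energies["HL"] += hl * hl
            energies["HH"] += hh * hh
    return energies


def axis_projections(img):
    """Row sums and column sums of the image, i.e. projections along
    the two image axes (an illustrative stand-in for a full Radon
    transform over many angles)."""
    row_sums = [sum(row) for row in img]
    col_sums = [sum(col) for col in zip(*img)]
    return row_sums, col_sums


def word_features(img):
    """Concatenate the hypothetical descriptors into one flat vector."""
    e = haar_subband_energies(img)
    row_sums, col_sums = axis_projections(img)
    return [e["LL"], e["LH"], e["HL"], e["HH"]] + row_sums + col_sums


# Tiny binary "word image" used only to exercise the functions.
sample = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
features = word_features(sample)
```

Because the Haar transform used here is orthonormal, the four subband energies add up to the sum of squared pixel values, which gives a quick sanity check. A vector like `features` would then be passed to a classifier such as an MLP; that training step is omitted because the abstract gives no configuration details.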

Cite (APA)

Obaidullah, S. M., Santosh, K. C., Halder, C., Das, N., & Roy, K. (2017). Word-level thirteen official Indic languages database for script identification in multi-script documents. In Communications in Computer and Information Science (Vol. 709, pp. 16–27). Springer Verlag. https://doi.org/10.1007/978-981-10-4859-3_2
