Abstract
This paper presents the design and validation of a novel dataset (SMHaroof) of printed alphabet characters in Shahmukhi, the context-sensitive Perso-Arabic script used to write Punjabi. The dataset contributes to research in computational linguistics, artificial intelligence, pattern recognition, and optical character recognition. The dataset, with its subcategories, variants, and versions, is publicly available for non-commercial research and academic use. SMHaroof is the first dataset of its kind, organized into multiple categories of isolated, context-specific character forms in two fonts, Nasta’leeq and Naskh. It is available in grayscale, bitonal, and RGB versions and comprises 66,728 (56,744 + 9,984) images. Multiple artificial neural networks (ANNs) and machine learning techniques were used to validate the dataset. A computer program was developed to generate the dataset automatically, with a user-controlled data augmentation feature. The auto-generation procedure described in this research is general and applicable to other language scripts. Validation results range from 74% to 92%, depending on the technique.
Citation
Rafique, H., & Javid, T. (2025). Design and Validation of the Novel Perso-Arabic Database in Shahmukhi Punjabi Script. Data Intelligence, 7(3), 745–775. https://doi.org/10.3724/2096-7004.di.2025.0035