Design and Validation of the Novel Perso-Arabic Database in Shahmukhi Punjabi Script

Humera Rafique; Tariq Javid

Journal Article, Open Access

Data Intelligence (2025) 7(3) 745-775

DOI: 10.3724/2096-7004.di.2025.0035

Abstract

This paper presents the design and validation of SMHaroof, a novel dataset of printed alphabet characters in the Shahmukhi script of Punjabi, a context-specific language of the Perso-Arabic family. The dataset is a new resource for research in computational linguistics, artificial intelligence, pattern recognition, and optical character recognition. The dataset, with its subcategories, variants, and versions, is publicly available for non-commercial research and academic use. SMHaroof is the first dataset of its kind: it covers multiple categories of isolated, context-specific character forms in two fonts, Nasta’leeq and Naskh. It is available in grayscale, bitonal, and RGB versions and comprises 66,728 (56,744 + 9,984) images. Multiple artificial neural networks (ANNs) and machine learning techniques were used to validate the dataset. A computer program was developed to generate the dataset automatically, with a user-controlled data augmentation feature. The auto-generation procedure described in this research is general and applicable to other language scripts. Validation results range from 74% to 92%, depending on the technique.

Cite

Rafique, H., & Javid, T. (2025). Design and Validation of the Novel Perso-Arabic Database in Shahmukhi Punjabi Script. Data Intelligence, 7(3), 745–775. https://doi.org/10.3724/2096-7004.di.2025.0035
