We present a study on modelling American Sign Language (ASL) with encoder-only transformers and human pose estimation keypoint data. Using an enhanced version of the publicly available Word-Level ASL (WLASL) dataset and a novel normalisation technique based on signer body size, we show the impact that model architecture has on accurately classifying sets of 10, 50, 100, and 300 isolated, dynamic signs from two-dimensional keypoint coordinates alone. We demonstrate the importance of running and reporting repeated experiments when describing and evaluating model performance, and we describe the algorithms used to normalise the data and to generate the train, validation, and test splits. We report top-1, top-5, and top-10 accuracies, evaluated with two separate model checkpoint criteria based on validation accuracy and validation loss. We find that models with fewer than 100k learnable parameters can achieve high accuracy on reduced-vocabulary datasets, paving the way for lightweight consumer hardware to perform tasks that have traditionally been resource-intensive, requiring expensive, high-end equipment. We achieve top-1, top-5, and top-10 accuracies of (Formula presented.), (Formula presented.), and (Formula presented.), respectively, on a vocabulary of 10 signs; (Formula presented.), (Formula presented.), and (Formula presented.) on 50 signs; (Formula presented.), (Formula presented.), and (Formula presented.) on 100 signs; and (Formula presented.), (Formula presented.), and (Formula presented.) on 300 signs, thereby setting a new benchmark for this task.
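The abstract mentions normalising keypoint data by signer body size before classification. The paper's exact procedure is not given here, but a minimal sketch of one plausible scheme is shown below: centring each frame on the shoulder midpoint and scaling by mean shoulder width, using COCO-style keypoint indices. The indices, function name, and choice of shoulder width as the body-size proxy are all assumptions for illustration, not the authors' method.

```python
import numpy as np

def normalise_keypoints(frames, left_shoulder=5, right_shoulder=6):
    """Normalise a sequence of 2D pose keypoints by signer body size.

    frames: array of shape (T, K, 2) -- T frames, K keypoints, (x, y).
    Shoulder indices follow a COCO-style layout (an assumption; the
    paper's exact keypoint scheme may differ).
    """
    frames = np.asarray(frames, dtype=float)
    # Centre each frame on the midpoint between the two shoulders,
    # removing the signer's position in the image.
    mid = (frames[:, left_shoulder] + frames[:, right_shoulder]) / 2.0
    centred = frames - mid[:, None, :]
    # Scale by mean shoulder width across the clip, a simple proxy
    # for body size, removing the effect of camera distance.
    widths = np.linalg.norm(
        frames[:, left_shoulder] - frames[:, right_shoulder], axis=-1)
    scale = widths.mean()
    return centred / scale if scale > 0 else centred
```

A normalisation of this form is invariant to translating or uniformly rescaling the raw coordinates, which is the property such a step is meant to provide.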
Woods, L. T., & Rana, Z. A. (2023). Modelling Sign Language with Encoder-Only Transformers and Human Pose Estimation Keypoint Data. Mathematics, 11(9). https://doi.org/10.3390/math11092129