Cross-attention multi branch for Vietnamese sign language recognition: CrossViViT

This article is free to access.

Abstract

Sign language serves as the primary communication medium for individuals who are deaf or hard of hearing. Despite its critical importance, barriers persist in communication between the deaf community and broader society, primarily due to limited sign language proficiency among the general population. While automated sign language recognition (ASLR) systems leveraging machine learning technologies offer a promising solution, existing approaches struggle to balance computational efficiency against recognition accuracy. This study presents CrossViViT, a novel architecture that integrates cross-attention mechanisms with video vision Transformer networks to address these limitations. Drawing inspiration from multi-branch network architectures that combine diverse feature perspectives for flexible image recognition, our approach achieves both computational efficiency and high accuracy. The proposed model demonstrates exceptional performance on the Vietnamese Sign Language (VSL) dataset, achieving 92.47% accuracy in recognizing 50 distinct gestures across 8510 videos while maintaining computational efficiency at approximately 629 FLOPS.
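
As a rough illustration of the cross-attention idea the abstract refers to, the following minimal sketch lets tokens from one branch attend to tokens from another using standard scaled dot-product attention. It is not the authors' implementation: the abstract does not specify the branch design, token counts, or dimensions, so the names `small_branch` and `large_branch`, the sizes, and the random projection matrices are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Scaled dot-product cross-attention.

    Queries come from one branch; keys and values come from the other,
    so each query token gathers information from the other branch.
    """
    q = query_tokens @ w_q            # (n_query, d)
    k = context_tokens @ w_k          # (n_context, d)
    v = context_tokens @ w_v          # (n_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ v, weights

# Hypothetical setup: two branches producing different numbers of video tokens.
rng = np.random.default_rng(0)
d_model = 16                                    # assumed embedding size
small_branch = rng.normal(size=(8, d_model))    # assumed: 8 fine-grained tokens
large_branch = rng.normal(size=(4, d_model))    # assumed: 4 coarse tokens

# Random projections stand in for learned weights.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

# Coarse-branch tokens query the fine-branch tokens.
fused, weights = cross_attention(large_branch, small_branch, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this sketch the output keeps the token count of the querying branch, which is one reason cross-attention can be a cheap way to exchange information between branches of different resolution.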

Cite

Chu, M. H., Nguyen, H. D., Nguyen, T. N. A., & Vu, H. N. (2025). Cross-attention multi branch for Vietnamese sign language recognition: CrossViViT. Discover Computing, 28(1). https://doi.org/10.1007/s10791-025-09669-0
