BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization

Abstract

In this work, we leverage the success of BERT pre-training and model domain-specific statistics to benefit the sign language recognition (SLR) model. Considering the dominant role of the hands and body in sign language expression, we organize them as pose triplet units and feed them into a Transformer backbone in a frame-wise manner. Pre-training is performed by reconstructing masked triplet units from the corrupted input sequence, which learns hierarchical correlation cues within and across triplet units. Notably, unlike the highly semantic word tokens in BERT, a pose unit is a low-level signal that lies in a continuous space, which prevents direct adoption of the BERT cross-entropy objective. To this end, we bridge this semantic gap via coupling tokenization of the triplet unit, which adaptively extracts a discrete pseudo-label from the pose triplet unit representing the semantic gesture/body state. After pre-training, we fine-tune the pre-trained encoder on the downstream SLR task, jointly with a newly added task-specific layer. Extensive experiments validate the effectiveness of the proposed method, which achieves new state-of-the-art performance on all four benchmarks with a notable gain.
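The abstract outlines a two-part recipe: a tokenizer that turns each continuous pose triplet unit (hands plus body for one frame) into a discrete pseudo-label, and a BERT-style Transformer pre-trained to predict those pseudo-labels at masked positions with a cross-entropy objective. The sketch below only illustrates this idea; the module names, dimensionalities, the simple nearest-codeword tokenizer, and the masking ratio are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of masked pose-triplet pre-training with a coupled tokenizer,
# assuming a VQ-style codebook and made-up dimensions (not from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CouplingTokenizer(nn.Module):
    """Maps a continuous pose triplet unit to a discrete pseudo-label via
    nearest-neighbour lookup in a learned codebook (illustrative assumption)."""
    def __init__(self, unit_dim: int, codebook_size: int = 1024):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, unit_dim)

    @torch.no_grad()
    def forward(self, units: torch.Tensor) -> torch.Tensor:
        # units: (B, T, unit_dim) -> pseudo-labels: (B, T)
        codes = self.codebook.weight.unsqueeze(0).expand(units.size(0), -1, -1)
        return torch.cdist(units, codes).argmin(dim=-1)


class MaskedTripletPretrainer(nn.Module):
    """BERT-style encoder over frame-wise pose triplet units."""
    def __init__(self, unit_dim: int, d_model: int = 256, codebook_size: int = 1024):
        super().__init__()
        self.embed = nn.Linear(unit_dim, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, codebook_size)  # predicts pseudo-labels

    def forward(self, units: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = self.embed(units)                                    # (B, T, d_model)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.head(self.encoder(x))                        # (B, T, codebook_size)


# Toy usage: each "triplet unit" concatenates left-hand, right-hand and body
# keypoints for one frame (the dimensions here are arbitrary).
B, T, unit_dim = 2, 16, 3 * 42
units = torch.randn(B, T, unit_dim)
mask = torch.rand(B, T) < 0.5                                    # corrupt ~50% of frames

tokenizer = CouplingTokenizer(unit_dim)
model = MaskedTripletPretrainer(unit_dim)

pseudo_labels = tokenizer(units)                                 # (B, T) discrete targets
logits = model(units, mask)                                      # (B, T, codebook_size)
loss = F.cross_entropy(logits[mask], pseudo_labels[mask])        # masked positions only
```

For the downstream SLR stage described in the abstract, the same encoder would then be fine-tuned with a task-specific classification head in place of the pseudo-label head.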

Citation (APA)

Zhao, W., Hu, H., Zhou, W., Shi, J., & Li, H. (2023). BEST: BERT Pre-training for Sign Language Recognition with Coupling Tokenization. In Proceedings of the 37th AAAI Conference on Artificial Intelligence, AAAI 2023 (Vol. 37, pp. 3597–3605). AAAI Press. https://doi.org/10.1609/aaai.v37i3.25470
