Abstract
Data for sign language research is often difficult and costly to acquire. We therefore present a novel pipeline that generates three-dimensional (3D) skeleton motion data from single-camera sign language videos alone. First, three recurrent neural networks are trained to infer the 3D positions of body, face, and finger joints, yielding a high-resolution representation of the signer's skeleton. Subsequently, the angular displacements of all joints over time are estimated using inverse kinematics and mapped to a virtual sign avatar for animation. Last, the generated data are evaluated in detail, including sign language recognition and sign language synthesis scenarios. Utilizing a neural word classifier trained on real motion capture data, we reliably classify word segments built from our newly generated position data with accuracy similar to that of motion capture data (absolute difference 3.8%). Furthermore, qualitative evaluation of sign animations shows that the avatar performs natural movements that are comprehensible and resemble animations created with original motion capture data.
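The first pipeline stage described above (recurrent networks lifting video keypoints to 3D joint positions) can be sketched as follows. This is a minimal illustrative example with assumed dimensions and randomly initialized weights; it is not the authors' actual architecture or trained model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration only (not from the paper).
N_KEYPOINTS_2D = 25   # 2D keypoints extracted per video frame
N_JOINTS_3D = 25      # 3D skeleton joints to predict
HIDDEN = 64           # recurrent hidden-state size

# Random weights stand in for trained parameters.
W_xh = rng.normal(0, 0.1, (HIDDEN, N_KEYPOINTS_2D * 2))
W_hh = rng.normal(0, 0.1, (HIDDEN, HIDDEN))
W_hy = rng.normal(0, 0.1, (N_JOINTS_3D * 3, HIDDEN))

def rnn_lift_to_3d(frames_2d):
    """Run a vanilla RNN over a sequence of flattened 2D keypoint
    frames and emit one (joints, 3) position estimate per frame."""
    h = np.zeros(HIDDEN)
    out = []
    for x in frames_2d:                    # x: flattened (x, y) keypoints
        h = np.tanh(W_xh @ x + W_hh @ h)   # recurrent state update
        out.append((W_hy @ h).reshape(N_JOINTS_3D, 3))
    return np.stack(out)                   # shape: (frames, joints, 3)

# Dummy 10-frame clip of 2D keypoints.
clip = rng.normal(size=(10, N_KEYPOINTS_2D * 2))
positions_3d = rnn_lift_to_3d(clip)
print(positions_3d.shape)  # → (10, 25, 3)
```

In the paper, three such networks handle body, face, and finger joints separately; the per-frame 3D positions they produce then feed the inverse-kinematics stage that drives the avatar.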
Brock, H., Law, F., Nakadai, K., & Nagashima, Y. (2020). Learning Three-dimensional Skeleton Data from Sign Language Video. ACM Transactions on Intelligent Systems and Technology, 11(3). https://doi.org/10.1145/3377552