HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation

Abstract

Because we use our hands constantly in daily activities, the analysis of hand-object interactions plays a critical role in many multimedia understanding and interaction applications. Compared with conventional 3D hand-only and object-only pose estimation, estimating 3D hand-object pose is more challenging because of the mutual occlusions between hand and object and the physical constraints between them. To overcome these issues, we propose to fully exploit the structural correlations among hand joints and object corners in order to obtain more reliable poses. Our work is inspired by structured output learning models from the sequence transduction field, such as the Transformer encoder-decoder framework. Besides modeling the inherent dependencies in the extracted 2D hand-object pose, our proposed Hand-Object Transformer Network (HOT-Net) also captures the structural correlations among 3D hand joints and object corners. As with the Transformer's autoregressive decoder, considering structured output patterns helps constrain the output space and leads to more robust pose estimation. Unlike the Transformer's sequential modeling mechanism, however, HOT-Net adopts a novel non-autoregressive decoding strategy for 3D hand-object pose estimation. Specifically, our model removes the Transformer's dependence on previously generated results and explicitly feeds a reference 3D hand-object pose into the decoding process, providing equivalent target pose patterns so that each 3D keypoint can be localized in parallel. To further improve the physical validity of the estimated hand pose, we propose, in addition to anatomical constraints, a cooperative pose constraint that enables the hand pose to cooperate with the hand shape to generate a hand mesh. We demonstrate real-time speed and state-of-the-art performance on benchmark hand-object datasets for both 3D hand and object poses.
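
The non-autoregressive decoding idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it uses plain scaled dot-product attention with no learned weights, and the function names, the 3-dimensional encoder features, and the sample values are all assumptions chosen for illustration. The point it shows is that the decoder queries come from a fixed reference pose rather than from previously generated keypoints, so every keypoint is produced in a single pass.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same vectors."""
    dim = len(keys_values[0])
    scale = math.sqrt(dim)
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(scores)
        out.append([sum(w * kv[d] for w, kv in zip(weights, keys_values))
                    for d in range(dim)])
    return out


def decode_non_autoregressive(reference_pose, memory):
    """Refine all keypoints at once (hypothetical helper, toy model).

    No output is fed back as input: the queries depend only on the
    fixed reference pose and on the encoder memory.
    """
    # Unmasked self-attention over the reference keypoints (no causal mask).
    queries = attend(reference_pose, reference_pose)
    # Cross-attention into the encoder memory (assumed 3-dimensional here).
    context = attend(queries, memory)
    # Residual update: reference keypoint plus attended context.
    return [[r + c for r, c in zip(ref, ctx)]
            for ref, ctx in zip(reference_pose, context)]


# Assumed toy data: 4 reference keypoints and 3 encoder feature vectors.
reference_pose = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
memory = [[0.1, 0.2, 0.3], [0.0, -0.1, 0.2], [0.3, 0.1, -0.2]]

pose = decode_non_autoregressive(reference_pose, memory)
```

An autoregressive decoder would instead loop over keypoints and feed each prediction back in as input for the next one, which is the sequential dependence the abstract says HOT-Net removes.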

Cite

Huang, L., Tan, J., Meng, J., Liu, J., & Yuan, J. (2020). HOT-Net: Non-Autoregressive Transformer for 3D Hand-Object Pose Estimation. In MM 2020 - Proceedings of the 28th ACM International Conference on Multimedia (pp. 3136–3145). Association for Computing Machinery, Inc. https://doi.org/10.1145/3394171.3413775