Recognizing human activities from unknown views is a challenging problem because the human body appears quite different from different viewpoints. In this paper, we learn a View-Invariant Pose (VIP) feature for depth-based cross-view action recognition. The proposed VIP feature encoder is a deep convolutional neural network that maps human poses from multiple viewpoints to a shared high-level feature space. Learning such a deep model requires a large corpus of paired multi-view data, which is very expensive to collect. Therefore, we generate a synthetic dataset by fitting physical human models to real motion capture data in a simulator and rendering depth images from various viewpoints. The VIP feature is learned from the synthetic data in an unsupervised way. To ensure transferability from synthetic data to real data, domain adaptation is employed to minimize the difference between the two domains. Moreover, since an action can be regarded as a sequence of poses, its temporal progression is modeled by a recurrent neural network. In the experiments, our method is applied to two benchmark datasets of multi-view 3D human actions and achieves promising results compared with state-of-the-art methods.
CITATION STYLE
Yang, Y. H., Liu, A. S., Liu, Y. H., Yeh, T. H., Li, Z. J., & Fu, L. C. (2019). Cross-View Action Recognition Using View-Invariant Pose Feature Learned from Synthetic Data with Domain Adaptation. In Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) (Vol. 11362 LNCS, pp. 431–446). Springer Verlag. https://doi.org/10.1007/978-3-030-20890-5_28