Real-Time Human Pose Recognition in Parts from Single Depth Images

Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, Andrew Blake
Microsoft Research Cambridge & Xbox Incubation

Abstract

We propose a new method to quickly and accurately predict 3D positions of body joints from a single depth image, using no temporal information. We take an object recognition approach, designing an intermediate body parts representation that maps the difficult pose estimation problem into a simpler per-pixel classification problem. Our large and highly varied training dataset allows the classifier to estimate body parts invariant to pose, body shape, clothing, etc. Finally we generate confidence-scored 3D proposals of several body joints by reprojecting the classification result and finding local modes.

The system runs at 200 frames per second on consumer hardware. Our evaluation shows high accuracy on both synthetic and real test sets, and investigates the effect of several training parameters. We achieve state of the art accuracy in our comparison with related work and demonstrate improved generalization over exact whole-skeleton nearest neighbor matching.

1. Introduction

Robust interactive human body tracking has applications including gaming, human-computer interaction, security, telepresence, and even health-care. The task has recently been greatly simplified by the introduction of real-time depth cameras [16, 19, 44, 37, 28, 13]. However, even the best existing systems still exhibit limitations. In particular, until the launch of Kinect [21], none ran at interactive rates on consumer hardware while handling a full range of human body shapes and sizes undergoing general body motions. Some systems achieve high speeds by tracking from frame to frame but struggle to re-initialize quickly and so are not robust.
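As an illustration of the last step named in the abstract (reprojecting the per-pixel classification result and finding local modes), the following is a minimal sketch rather than the implementation used in the system. It assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, a Gaussian kernel with an arbitrarily chosen bandwidth, and per-pixel weights equal to the classifier probability for one body part. The function names `reproject` and `mean_shift_mode` are invented for this example.

```python
import math

def reproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth (metres) to a 3D camera-space
    point, assuming a pinhole camera model (an assumption of this sketch)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def kernel(p, x, bandwidth):
    """Gaussian kernel value between 3D points p and x."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, x))
    return math.exp(-d2 / (2.0 * bandwidth ** 2))

def mean_shift_mode(points, weights, start, bandwidth=0.06, iters=50, tol=1e-6):
    """Weighted mean shift from `start`. Returns (mode, confidence), where
    confidence is the weighted kernel density at the mode."""
    x = start
    for _ in range(iters):
        ks = [w * kernel(p, x, bandwidth) for p, w in zip(points, weights)]
        total = sum(ks)
        if total == 0.0:
            break  # no support near x; give up at the current estimate
        new_x = tuple(sum(k * p[i] for k, p in zip(ks, points)) / total
                      for i in range(3))
        shift = math.dist(new_x, x)
        x = new_x
        if shift < tol:
            break
    confidence = sum(w * kernel(p, x, bandwidth) for p, w in zip(points, weights))
    return x, confidence

# Toy input: a 5x5 patch of pixels confidently labeled as one body part at
# 2 m depth, plus one weakly labeled outlier pixel far away.
fx = fy = 500.0
cx = cy = 12.0
pixels = [(u, v, 2.0, 0.9) for u in range(10, 15) for v in range(10, 15)]
pixels.append((100, 100, 3.0, 0.2))

points = [reproject(u, v, d, fx, fy, cx, cy) for u, v, d, _ in pixels]
weights = [p for _, _, _, p in pixels]

# Start the search from the first confidently labeled pixel.
mode, confidence = mean_shift_mode(points, weights, start=points[0])
print("joint proposal:", mode, "confidence:", confidence)
```

In this toy input the proposal settles on the dense, confidently labeled patch and the outlier contributes almost nothing, which is the behavior that motivates mode finding over a plain weighted average.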
In this paper, we focus on pose recognition in parts: detecting from a single depth image a small set of 3D position candidates for each skeletal joint. Our focus on per-frame initialization and recovery is designed to complement any appropriate tracking algorithm [7, 39, 16, 42, 13] that might further incorporate temporal and kinematic coherence. The algorithm presented here forms a core component of the Kinect gaming platform [21].

Illustrated in Fig. 1 and inspired by recent object recognition work that divides objects into parts (e.g. [12, 43]), our approach is driven by two key design goals: computational efficiency and robustness. A single input depth image is segmented into a dense probabilistic body part labeling, with the parts defined to be spatially localized near skeletal joints of interest. Reprojecting the inferred parts into world space, we localize spatial modes of each part distribution and thus generate (possibly several) confidence-weighted proposals for the 3D locations of each skeletal joint.

Figure 1. Overview (panels: depth image, body parts, 3D joint proposals). From a single input depth image, a per-pixel body part distribution is inferred. (Colors indicate the most likely part labels at each pixel, and correspond in the joint proposals.) Local modes of this signal are estimated to give high-quality proposals for the 3D locations of body joints, even for multiple users.

We treat the segmentation into body parts as a per-pixel classification task (no pairwise terms or CRF have proved necessary). Evaluating each pixel separately avoids a combinatorial search over the different body joints, although within a single part there are of course still dramatic differences in the contextual appearance. For training data, we generate realistic synthetic depth images of humans of many shapes and sizes in highly varied poses sampled from a large motion capture database. We train a deep randomized decision forest classifier which avoids overfitting by using hundreds of thousands of training images. Simple, discriminative depth comparison image features yield 3D translation invariance while maintaining high computational efficiency. For further speed, the classifier can be run in parallel on each pixel on a GPU [34]. Finally, spatial modes of the inferred per-pixel distributions are computed using mean shift [10] resulting in the 3D joint proposals.

An optimized implementation of our algorithm runs in under 5 ms per frame (200 frames per second) on the Xbox 360 GPU, at least one order of magnitude faster than existing approaches. It works frame-by-frame across dramatically differing body shapes and sizes, and the learned discriminative approach naturally handles self-occlusions and