Monocular 3D Pose Estimation and Tracking by Detection
Automatic recovery of 3D human pose from monocular image sequences is a challenging and important research topic with numerous applications. Although current methods are able to recover 3D pose for a single person in controlled environments, they are severely challenged by real-world scenarios, such as crowded street scenes. To address this problem, we propose a three-stage process building on a number of recent advances. The first stage obtains an initial estimate of the 2D articulation and viewpoint of the person from single frames. The second stage allows early data association across frames based on tracking-by-detection. These two stages successfully accumulate the available 2D image evidence into robust estimates of 2D limb positions over short image sequences (tracklets). The third and final stage uses those tracklet-based estimates as robust image observations to reliably recover 3D pose. We demonstrate state-of-the-art performance on the HumanEva II benchmark, and also show the applicability of our approach to articulated 3D tracking in realistic street conditions.