Abstract
Vision-based human pose estimation locates the human skeleton in an image or video. The resulting pose information can be used to estimate pose directly, or to analyze a specific pose or action from the positional relationships between key regions of the human body. Action recognition and pose tracking built on human pose estimation are currently developing rapidly. Conventional pose estimation methods can be divided into two categories: object detection and pose estimation. Object detection relies on segmentation, matching, or statistical learning. It struggles to separate targets from backgrounds in complex scenes, remains sensitive to prior information, and requires time-consuming, labor-intensive construction of training sample libraries and classifiers. Pose estimation uses model-based or model-free methods. It suffers from errors propagated from object detection, needs a large amount of hand-crafted constraint information, and its efficiency still needs to be improved. Deep learning has improved both the recognition precision and the speed of human pose estimation methods to a certain extent. Human pose estimation is generally divided into two-dimensional and three-dimensional human pose estimation. Compared with three-dimensional human pose estimation, two-dimensional human pose estimation models are better suited to crowded and occluded scenes. However, most network models are derived from convolutional neural network (CNN) models, and network speed suffers as depth increases. Lightweight two-dimensional human pose estimation networks are therefore attracting more attention for deployment on edge devices.
We review, through the literature, the development and optimization trends of deep learning based two-dimensional human pose estimation models. The models fall into three categories: single-person pose estimation, multi-person pose estimation, and lightweight human pose estimation. Single-person pose estimation is the basis of multi-person pose estimation. Its methods are based on either keypoint regression or heatmap detection, and there is a trend toward combining the two. Multi-person pose estimation network models can be divided into top-down, bottom-up, and other approaches. Top-down models are more precise, but their time efficiency is unsatisfactory, especially for crowded input: the more human bodies there are in the input data, the longer the estimation takes. Bottom-up models are slightly less precise but much more efficient, and their running time is independent of the number of human bodies in the input data. The two approaches are essentially dual to each other. A top-down method first uses a body detector to locate each human body in the input data and then performs pose estimation on each detected person. Some top-down methods must also crop each person accurately and move the crop to the center of the input before each estimation. A bottom-up method first detects all body keypoints in the input data and then assigns these keypoints to individual persons. Meanwhile, the emergence of single-stage networks means that researchers need to pay more attention to the computational cost of network models. A small number of networks combine top-down and bottom-up methods and have achieved good results.
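The heatmap detection approach named above can be illustrated with a minimal sketch, assuming each keypoint is predicted as a two-dimensional score map whose peak marks the keypoint location. The function name `decode_heatmaps` and the toy 3x3 maps are hypothetical and are not taken from any model covered by the review. Practical systems often refine the peak further, which is omitted here.

```python
from typing import List, Tuple

# A heatmap is a 2D grid of scores, indexed as heatmap[y][x].
Heatmap = List[List[float]]


def decode_heatmaps(heatmaps: List[Heatmap]) -> List[Tuple[int, int, float]]:
    """Return the (x, y, score) of the highest-scoring cell in each heatmap.

    One heatmap is assumed per keypoint (illustrative assumption).
    """
    keypoints = []
    for hm in heatmaps:
        best = (0, 0, float("-inf"))
        for y, row in enumerate(hm):
            for x, value in enumerate(row):
                if value > best[2]:
                    best = (x, y, value)
        keypoints.append(best)
    return keypoints


# Toy input: two keypoints, each with a 3x3 score map.
toy_heatmaps = [
    [[0.0, 0.1, 0.0],
     [0.2, 0.9, 0.1],
     [0.0, 0.1, 0.0]],
    [[0.0, 0.0, 0.7],
     [0.0, 0.1, 0.2],
     [0.0, 0.0, 0.0]],
]

for x, y, score in decode_heatmaps(toy_heatmaps):
    print(f"keypoint at x={x}, y={y}, score={score}")
```

A keypoint regression method would instead output the coordinates directly, without the intermediate score map.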
We summarize the CNN models used in human pose estimation, analyze the characteristics of these network models, and compare the performance of the pose estimation methods. The structural design of deep convolutional neural network models is becoming increasingly diverse, but deep learning network models still have certain limitations on human pose estimation tasks. We discuss the techniques adopted by two-dimensional human pose estimation models and their existing problems, and outline possible future research directions. We recommend improving existing two-dimensional pose estimation network models in the following respects. The clarity of the input data directly affects pose estimation results, so effective image or video pre-processing may be a new way to improve the precision and efficiency of pose estimation. Most existing pose estimation methods handle video by cutting it into static frames, so they remain essentially restricted to image-based pose estimation, whereas real-time pose estimation on video data is essential for pose tracking and action recognition. A few methods have been proposed that combine deep learning based pose estimation with temporal information, such as optical flow, pose flow, and long short-term memory. Images in real applications are often more crowded and more heavily occluded, and these cases still need to be resolved and optimized. Recent pose estimation network models have been improved through lightweight methods, which show potential and may become one of the key directions for pose estimation.
Kong, Y., Qin, Y., & Zhang, K. (2023). Deep learning based two-dimension human pose estimation: a critical analysis. Journal of Image and Graphics, 28(7), 1965–1969. https://doi.org/10.11834/jig.220436