Open Access
Wuhan Univ. J. Nat. Sci.
Volume 29, Number 4, August 2024
Page(s) 315 - 322
DOI https://doi.org/10.1051/wujns/2024294315
Published online 04 September 2024

© Wuhan University 2024

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

0 Introduction

In traditional studies of teacher lecture evaluation, assessment has relied predominantly on the subjective judgments of evaluating teachers. This approach has limitations such as time-consuming real-time observation and substantial manpower investment. In the early phases of human pose estimation, researchers primarily used images to identify key points of the human body, connecting the limbs with tree-structure models to obtain the locations of the key points[1-3]. This approach determines human pose quickly, but it relies heavily on hand-designed features (e.g., SIFT and HOG features) and is prone to erroneous estimates under occlusion.

In the past ten years, significant advancements have been made in deep learning by utilizing Convolutional Neural Networks (CNN), and numerous computer vision challenges have been tackled effectively with deep learning methods. Since the breakthrough of AlexNet in the 2012 ImageNet competition, deep learning has advanced rapidly[4]. Its applications in computer vision are extensive, encompassing image classification, object detection, face recognition, image generation, and other related areas. Tompson et al[5] were the first to regress key points with convolutional heat maps, modeling the structural relationships between human body key points with a Markov random field. The DeepPose method, introduced by Toshev et al[6], pioneered the use of deep learning for human pose estimation: it learns a mapping directly from the input image space to the Cartesian coordinate space of the key points, using a cascaded convolutional neural network to extract features from the input image and refine the predictions stage by stage. In 2016, the ResNet network, which had won object detection and classification competitions, was published[7], and several networks incorporating multi-scale information fusion were proposed. Among these is the Convolutional Pose Machines (CPM) network introduced by Wei et al[8], which achieves an expansive receptive field by using large convolution kernels, multi-stage convolution, and pooling to capture constraint information from other body parts; multi-stage training and intermediate supervision are proposed to prevent the vanishing-gradient problem. In the same year, the highly influential COCO dataset emerged. COCO is a large-scale dataset for object detection tasks containing more than 1.51 million labeled objects[9]. The OpenPose network proposed by Cao et al[10], which incorporates CPM as a crucial component, won the COCO keypoint detection competition that year. The network first detects all key points in the image and then models the human skeleton with the proposed Part Affinity Fields (PAF), which model a vector field between two neighboring key points.

In 2017, Chen et al[11] proposed the Cascaded Pyramid Network (CPN), which consists of two stages: the GlobalNet and RefineNet sub-networks. GlobalNet, built on a ResNet backbone, is responsible for capturing global features and semantic information, while RefineNet fuses feature maps from multiple levels and refines GlobalNet's predictions to improve the accuracy of key point localization.

In 2018, Li et al[12] proposed the Multi-Stage Networks for Human Pose Estimation (MSPN), which employs adjacent-stage feature aggregation to reduce the loss of feature information. At each stage, the network fuses the feature maps of the corresponding resolution from the downsampling and upsampling paths of the previous stage with the downsampled feature maps of the current stage, making information interaction more complete and reducing training difficulty. In the same year, Xiao et al[13] proposed simple baselines for human pose estimation and tracking, a simple but effective baseline network that uses transposed convolutions to increase the resolution of the feature maps. In 2019, Sun et al[14] proposed HRNet, an architecture that emphasizes the impact of spatial resolution on detection accuracy. Previous methods often increased the computational burden of the network to maintain rich feature-map information and precise key point localization, whereas HRNet maintains high-resolution representations throughout the network by exchanging information across parallel multi-resolution branches. Li et al[15] proposed another important dataset, CrowdPose, which fills the gap of crowded environments in human pose estimation datasets.

Based on the research mentioned above, we propose a deep learning-based approach for evaluating posture during lectures. Human body features are extracted using the Deep Dual Consecutive Network for Human Pose Estimation (DCPose). Because the number of features extracted this way is limited, object detection algorithms are also introduced to extract additional features: YOLOv5 is used to recognize hand gestures, and Face-CNN with OpenCV is used for facial expression recognition, yielding hand and head expression features. With the fusion of multiple models, the effectiveness of evaluating the teacher's lecture posture is further improved.

1 Dataset and Evaluation Criteria

1.1 Teachers’ Lecture Dataset

In evaluating a teacher's delivery standard, we typically employ diverse criteria encompassing the clarity and fluency of the presentation, the ability to convey emotions effectively, and engagement with students during lectures. Judges can assign scores based on these aspects to provide an overall assessment of the speaker's performance. Consequently, we can observe elements such as the teacher's body language, posture, and movement to gain deeper insights into their presentation style and technique. To investigate this matter, we collected video data of lectures delivered by numerous teachers from various professional backgrounds and age groups in different settings. The experimental dataset comes from university classroom teaching videos and speech competition videos. By analyzing this video data, we can enhance our understanding of a teacher's lecture performance and skills.

We also gathered ratings from numerous evaluating teachers who manually assessed the performance of the lecturing teachers. These evaluating teachers appraised teaching performance across 6 dimensions: overall relaxation, head relaxation, hand relaxation, torso relaxation, interaction, and hand expression. Table 1 presents selected manual scores for hand relaxation.

Table 1

Manual scores under hand relaxation

1.2 Evaluation Criteria

When evaluating a teacher's teaching level, there are multiple aspects to consider. This experiment primarily assesses the teacher's in-class performance by focusing on body posture. The evaluation criteria for this study are categorized into three areas: head, torso, and hands, as depicted in Fig. 1.

Fig. 1 Evaluation criteria for teachers' lecture body-gesture

2 Methodology

2.1 Feature Extraction Based on Human Pose Estimation

Human key points and human behavior are inextricably linked: extracting the key points of the teacher's body is a prerequisite for judging the teacher's behavior, which in turn is an essential basis for evaluating the level of the teacher's lecture and for measuring behavioral characteristics such as interaction and expressiveness. In this paper, we utilize the open-source DCPose algorithm to extract teacher posture features, detect key points in the captured frames, and compute human behavior features from these points as the basis for machine scoring. The process is outlined below:

(1) Extraction of key points from the human body

In the PoseTracking17 dataset for human pose estimation, each teacher instance is annotated with 17 key points representing distinct body parts or feature points, accompanied by corresponding category labels[16], as depicted in Fig. 2(a). The DCPose algorithm predicts each teacher's human body key points and stores them in a JSON file, including label information, image pixel coordinates, and confidence levels of each key point. Additionally, visualization of the key points' skeleton connections was performed during the prediction process, as illustrated in Fig. 2(b).
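As a concrete illustration, the per-frame key point records can be loaded from the exported JSON for subsequent feature computation. The sketch below is a minimal example assuming a simple export format whose "keypoints" entries hold a label, pixel coordinates, and a confidence score; the actual field names of the DCPose export may differ.

```python
import json
import numpy as np

# Hypothetical loader for the per-frame key point JSON described above.
# Field names ("keypoints", "label", "x", "y", "score") are assumptions;
# the actual DCPose export may use a different schema.
def load_keypoints(json_path):
    """Return an array of shape (num_frames, 17, 3) holding [x, y, confidence]."""
    with open(json_path, "r", encoding="utf-8") as f:
        frames = json.load(f)
    poses = []
    for frame in frames:
        kps = sorted(frame["keypoints"], key=lambda k: k["label"])  # fixed joint order
        poses.append([[k["x"], k["y"], k["score"]] for k in kps])
    return np.asarray(poses, dtype=np.float32)
```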

Fig. 2 (a) PoseTracking17 dataset key points; (b) Skeleton diagram visualization results

(2) Key points calculation features

By employing the DCPose algorithm, we extracted salient human key points from a large collection of frames sampled from numerous teachers' lectures. Based on the extracted key points, we compute features of human behavior using the information transferred between frames. The 13 features are: 1) left-hand trajectory, 2) right-hand trajectory, 3) left wrist-elbow-shoulder angle, 4) right wrist-elbow-shoulder angle, 5) left elbow-shoulder-hip angle, 6) right elbow-shoulder-hip angle, 7) distance between the two wrists, 8) number of changes in the distance between the two wrists, 9) difference between the changes in the distance between the two wrists, 10) sum of body forward-tilt angles, 11) sum of left and right body rotation angles, 12) nose trajectory, and 13) angle between the left and right ears. The generated feature database is partially displayed in Table 2.
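The angle and distance features above can be computed directly from the pixel coordinates of the detected key points. The following is a minimal sketch assuming a COCO-style 17-point index layout (shoulders 5/6, elbows 7/8, wrists 9/10); the exact index assignment used in the paper is not specified and is an assumption here.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by points a-b-c, e.g. wrist-elbow-shoulder."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def frame_features(kps):
    """Angle and distance features for one frame of (17, 3) key points [x, y, score]."""
    return {
        "left_wrist_elbow_shoulder": joint_angle(kps[9, :2], kps[7, :2], kps[5, :2]),
        "right_wrist_elbow_shoulder": joint_angle(kps[10, :2], kps[8, :2], kps[6, :2]),
        "wrist_distance": np.linalg.norm(kps[9, :2] - kps[10, :2]),
    }
```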

Consequently, it is crucial to calculate the correlation between each extracted feature and the manual rating. In statistical analysis, Pearson's correlation coefficient is employed to assess the degree of linear association between two variables, X and Y, ranging from -1 to 1. The formula for computing this coefficient is as follows:

   \rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y} = \frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-E^2(X)}\,\sqrt{E(Y^2)-E^2(Y)}}   (1)

where cov(X, Y) is the covariance of X and Y, σ_X is the standard deviation of X, and μ_X = E(X) is the expectation of X.
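For illustration, Eq. (1) can be evaluated directly from its moment form. The snippet below is a small sketch; feature_values and manual_scores are placeholder names standing for one feature column of Table 2 and one manually scored dimension.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two variables, as in Eq. (1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean(x * y) - np.mean(x) * np.mean(y)          # E(XY) - E(X)E(Y)
    std_x = np.sqrt(np.mean(x ** 2) - np.mean(x) ** 2)      # sqrt(E(X^2) - E^2(X))
    std_y = np.sqrt(np.mean(y ** 2) - np.mean(y) ** 2)
    return cov / (std_x * std_y)

# e.g. rho = pearson(feature_values, manual_scores)  # placeholder column names
```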

Figure 3 presents a heat map illustrating the correlation between each behavior characteristic in the teacher's lecture behavior database and the average score of 6 dimensions calculated manually. The horizontal coordinates correspond to the 13 features of human behavior and the vertical coordinates correspond to the 6 dimensions. One can observe that there is a high correlation between the teacher behavioral features extracted by the human posture estimation algorithm and the real scores given by the judges from the 6 dimensions.

Fig. 3 Feature and manual score correlation heat map

Table 2

Features database

2.2 Feature Extraction Based on Object Detection

The theory of deep learning in vision is mainly based on CNN. A convolutional neural network is a hierarchical neural network that extracts features layer by layer from the original image; these features represent visual information such as edges, colors, and shapes. Through multi-layer convolution and pooling operations, convolutional neural networks gradually abstract higher-level visual features and ultimately perform classification, detection, and segmentation of images. For face detection, this paper utilizes OpenCV's built-in Haar cascade detector to locate the head in each image. OpenCV (Open Source Computer Vision Library) is a lightweight, efficient, cross-platform computer vision and machine learning software library that supports many algorithms related to computer vision and machine learning. For gesture recognition, this paper uses the YOLOv5s model released by the Ultralytics team; YOLOv5s is an object detection model based on the YOLO (You Only Look Once) algorithm[17], which is mainly used to recognize and localize objects in videos or images.

(1) Face-CNN+OpenCV to extract facial expression features

First, the built-in face detector in OpenCV is used for the teacher's facial recognition. OpenCV provides many feature classifiers; for example, the Haar feature classifiers in the OpenCV library cover the face, facial organs, and the human body. This model is essentially a classifier, also known as a cascade classifier, which we use to detect faces before applying a Face-CNN network for facial expression classification. The FER2013 dataset, which comprises seven distinct expressions, is employed as the data source. During testing, the probability distribution over each teacher's 7 expressions is computed as the extracted facial expression feature. Figure 4 illustrates the architecture of the Face-CNN model; Conv, Pool, and FC stand for convolutional layer, pooling layer, and fully connected layer, respectively.
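A minimal sketch of this detection-then-classification pipeline is shown below, using OpenCV's bundled frontal-face Haar cascade. The 48×48 grayscale input and the Keras-style predict() interface of face_cnn follow the common FER2013 convention and are assumptions rather than the paper's exact implementation.

```python
import cv2
import numpy as np

# Face detection with OpenCV's bundled Haar cascade, followed by expression
# classification with the trained Face-CNN (loaded elsewhere). The 48x48
# grayscale input and the predict() call are assumptions about the interface.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def expression_probabilities(frame_bgr, face_cnn):
    """Return one 7-way expression probability vector per detected face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    probs = []
    for (x, y, w, h) in faces:
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48)).astype(np.float32) / 255.0
        probs.append(face_cnn.predict(face[None, :, :, None])[0])  # 7 expression scores
    return probs
```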

Fig. 4 Face-CNN model

(2) YOLOv5 extracts gesture features

Gestures are very important in a teacher's lecture. They not only help the teacher express ideas better during the lecture but also enhance the appeal and attraction of the lecture; with proper gestures, teachers appear more confident, lively, and engaging, while also reflecting their personality and style. With the rapid advancement of object detection technology, gesture detection has become an object detection task attracting considerable attention in both academia and industry. This paper builds on the YOLOv5s pre-trained model released by the Ultralytics team. The existing YOLOv5s pre-trained model was trained on the COCO dataset, so for gestures we constructed a gesture dataset ourselves and continued transfer learning from the pre-trained model on it to realize gesture recognition and, in turn, recognition of the teacher's gestures. The training data were derived from the HaGRID dataset, which we condensed and reduced in resolution because of its large size. The resulting dataset, referred to as HaGRID-Light, comprises 18 gesture categories, each containing 300 images for training and 30 images for testing. Gesture categories include one, ok, four, three, call, etc. Figure 5 displays the results of model training and validation, where box_loss, obj_loss, and cls_loss represent the box regression loss, object confidence loss, and classification loss, respectively. We trained the model using the open-source YOLOv5s network with the number of epochs set to 100.
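For reference, transfer learning with the Ultralytics YOLOv5 repository is typically driven by its training script, and the fine-tuned weights can then be loaded for inference on sampled lecture frames. The dataset config name and weight paths below are placeholders, not the authors' actual files.

```python
import torch

# Fine-tuning on HaGRID-Light is typically done with the YOLOv5 training script:
#   python train.py --img 640 --batch 16 --epochs 100 \
#       --data hagrid_light.yaml --weights yolov5s.pt
# where hagrid_light.yaml is a hypothetical dataset config listing the 18 gesture classes.

# Loading the fine-tuned weights for gesture detection on a sampled lecture frame:
model = torch.hub.load("ultralytics/yolov5", "custom",
                       path="runs/train/exp/weights/best.pt")  # placeholder path
results = model("teacher_frame.jpg")              # placeholder image path
detections = results.pandas().xyxy[0]             # boxes, confidences, class names
print(detections[["name", "confidence"]])
```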

Fig. 5 The performance of YOLOv5s on the HaGRID-Light dataset regarding training and validation

By utilizing facial expression recognition and gesture recognition techniques, we extracted distinctive features from the frame data of numerous teachers. Features include the probability distribution of 7 common facial expressions and the frequency of no-hand gestures. The database of teacher features generated is shown in Table 3.
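A small sketch of how the per-frame outputs might be aggregated into the per-teacher entries of Table 3 is given below; the averaging of expression probabilities and the normalization of gesture counts are assumptions about the aggregation scheme, which the text does not spell out.

```python
import numpy as np

# Aggregate per-frame detections into per-teacher features: the mean probability
# of each of the 7 expressions over all sampled frames, plus the relative
# frequency of each detected gesture class. Input shapes are assumptions
# (frames x 7 for expressions, a list of class names for gestures).
def teacher_features(frame_expression_probs, frame_gesture_labels, gesture_classes):
    expr = np.mean(np.asarray(frame_expression_probs), axis=0)        # 7 values
    counts = {c: 0 for c in gesture_classes}
    for label in frame_gesture_labels:
        counts[label] = counts.get(label, 0) + 1
    total = max(len(frame_gesture_labels), 1)
    gesture_freq = np.array([counts[c] / total for c in gesture_classes])
    return np.concatenate([expr, gesture_freq])
```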

Table 3

Teacher database on facial expressions and gesture features

3 Experiments

3.1 Evaluation of Lecture Posture Based on Human Posture Estimation

After validating the efficacy of the DCPose algorithm in extracting human behavioral features, we employed machine learning regression algorithms for prediction and analyzed the correlation and score deviation between the predicted results and the manual ratings. To ensure the stability of score prediction, this experiment used LR (Logistic Regression), SVM (Support Vector Machine), and CART (Classification and Regression Tree) regressors, evaluated through cross-validation. The specific steps are as follows (a code sketch of this pipeline is given after the steps):

Step 1: Convert the judges' full-mark standard from 5 points to 100 points, and use the average score of the 3 judging teachers as the final manual score;

Step 2: Using the regressors, cross-validated predictions are made for the sample of 563 teachers, and the predicted values are taken as the machine scores;

Step 3: The correlation and bias between the machine scores and the manual scores are calculated, as presented in Table 4. From Table 4, it can be seen that under a maximum score of 100 points, the proportion of machine-predicted scores within 10 points and within 20 points of the human scores reached 51.72% and 81.87%, respectively. Cases where the score difference exceeded 50 points accounted for only 0.12%.
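The sketch below illustrates Steps 1-3 with scikit-learn, assuming X holds the 563 teachers' behavioral features (Table 2) and y the manual scores rescaled to 100 points. LinearRegression is used here merely as a stand-in for the LR regressor named above; the paper's exact regressor settings are not specified.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

def evaluate(regressor, X, y, folds=5):
    """Cross-validated machine scores vs. manual scores: correlation and bias."""
    machine_scores = cross_val_predict(regressor, X, y, cv=folds)
    rho, _ = pearsonr(machine_scores, y)
    within_10 = np.mean(np.abs(machine_scores - y) <= 10)   # share within 10 points
    within_20 = np.mean(np.abs(machine_scores - y) <= 20)   # share within 20 points
    return rho, within_10, within_20

# for reg in (LinearRegression(), SVR(), DecisionTreeRegressor()):
#     print(evaluate(reg, X, y))   # X: behavioral features, y: manual scores (0-100)
```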

The analysis indicates a strong correlation between the predicted machine scores and the manual scores, with an overall correlation above 0.642. In particular, the hand and torso dimensions were well above the overall level, while the head relaxation and interaction dimensions were slightly lower. The model will therefore be further optimized at a later stage.

Table 4

Correlation and bias analysis of machine scores and manual scores based on human posture estimation

3.2 Evaluation of Lecture Posture Based on Multi-Model Fusion

For further optimization, object detection algorithms are added to extract features of the teacher's facial expressions and gestures in addition to the pose-based features. The overall process architecture for evaluating speech contests with multi-model fusion is shown in Fig. 6.

Fig. 6 Overall process architecture

The experiment demonstrates that fusing multiple sources of feature information under a multi-model framework significantly enhances the predictive performance of the regressor. The correlation and deviation analysis between machine scores and human scores based on multi-model fusion is presented in Table 5.
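Conceptually, the fusion step amounts to concatenating the pose-based behavioral features with the expression and gesture features before retraining the regressor. The sketch below illustrates this under placeholder variable names; it reuses the evaluate() helper from the earlier cross-validation sketch.

```python
import numpy as np

# Concatenate the pose-based features (Table 2) with the expression probabilities
# and gesture frequencies (Table 3); variable names are placeholders.
def fuse_features(pose_features, expression_probs, gesture_freqs):
    return np.concatenate([pose_features, expression_probs, gesture_freqs], axis=1)

# X_fused = fuse_features(X_pose, X_expr, X_gesture)
# print(evaluate(SVR(), X_fused, y))   # reuses evaluate() from the sketch above
```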

With multi-model fusion, facial expression and gesture recognition algorithms were introduced to extract more human features; the overall correlation between the predicted machine scores and the manual scores improved by 2.26%, and the percentage of deviations under 20 points increased by 1.54%.

Table 5

Correlation and bias analysis of machine scores and manual scores based on multi-model fusion

4 Conclusion

This paper proposes a deep learning-based approach for evaluating lecture posture. First, we employ a human pose estimation algorithm to extract key body points and calculate behavioral features based on these points. Then, the extracted features are utilized by a regressor to predict machine scores. Additionally, we introduce an object detection algorithm to extract facial expressions and gesture features. Experimental results demonstrate that multi-model fusion significantly improves prediction performance.

References

  1. Zhang X Q, Li C C, Tong X F, et al. Efficient human pose estimation via parsing a tree structure based human model[C]//2009 IEEE 12th International Conference on Computer Vision. New York: IEEE, 2009: 1349-1356.
  2. Sun M, Kohli P, Shotton J. Conditional regression forests for human pose estimation[C]//2012 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2012: 3394-3401.
  3. Dantone M, Gall J, Leistner C, et al. Human pose estimation using body parts dependent joint regressors[C]//2013 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2013: 3041-3048.
  4. Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
  5. Tompson J, Jain A, LeCun Y, et al. Joint training of a convolutional network and a graphical model for human pose estimation[EB/OL]. [2023-05-20]. https://arxiv.org/pdf/1406.2984.
  6. Toshev A, Szegedy C. DeepPose: Human pose estimation via deep neural networks[C]//2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2014: 1653-1660.
  7. He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2016: 770-778.
  8. Wei S H, Ramakrishna V, Kanade T, et al. Convolutional pose machines[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2016: 4724-4732.
  9. Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context[C]//European Conference on Computer Vision. Cham: Springer-Verlag, 2014: 740-755.
  10. Cao Z, Simon T, Wei S H, et al. Realtime multi-person 2D pose estimation using part affinity fields[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2017: 1302-1310.
  11. Chen Y L, Wang Z C, Peng Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 7103-7112.
  12. Li W B, Wang Z C, Yin B Y, et al. Rethinking on multi-stage networks for human pose estimation[EB/OL]. [2023-05-21]. http://arxiv.org/abs/1901.00148.
  13. Xiao B, Wu H P, Wei Y C. Simple baselines for human pose estimation and tracking[C]//European Conference on Computer Vision. Cham: Springer-Verlag, 2018: 472-487.
  14. Sun K, Xiao B, Liu D, et al. Deep high-resolution representation learning for human pose estimation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2019: 5686-5696.
  15. Li J F, Wang C, Zhu H, et al. CrowdPose: Efficient crowded scenes pose estimation and a new benchmark[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2019: 10855-10864.
  16. Andriluka M, Iqbal U, Insafutdinov E, et al. PoseTrack: A benchmark for human pose estimation and tracking[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 5167-5176.
  17. Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2016: 779-788.
