Open Access
Wuhan Univ. J. Nat. Sci.
Volume 29, Number 4, August 2024
Page(s) 315 - 322
DOI https://doi.org/10.1051/wujns/2024294315
Published online 04 September 2024

© Wuhan University 2024

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

0 Introduction

In traditional studies of teacher lecture evaluation, assessment has relied predominantly on the subjective judgments of evaluating teachers. This approach has limitations such as time-consuming real-time observation and substantial manpower investment. In the early phases of human pose estimation, researchers primarily used images to identify key points of the human body, connecting the limbs with tree-structure models to obtain the locations of the key points[1-3]. This approach determines human pose quickly, but it relies heavily on hand-designed features (e.g., SIFT and HOG features) and is prone to erroneous estimates under occlusion.

In the past ten years, significant advancements have been made in deep learning by utilizing Convolutional Neural Networks (CNN), and numerous computer vision challenges have been tackled effectively with deep learning methods. Since the breakthrough of AlexNet in the 2012 ImageNet competition, deep learning has advanced rapidly[4]. Its applications in computer vision are extensive, encompassing image classification, object detection, face recognition, image generation, and other related areas. Tompson et al[5] were the first to regress key points with convolutional heat maps, modeling the structural relationships between human body key points with a Markov random field. The DeepPose method, introduced by Toshev et al[6], pioneered the use of deep learning for human pose estimation: it learns a mapping directly from the input image space to the Cartesian coordinate space of the key points, using a cascaded convolutional neural network to extract features from the input image and refine the predictions stage by stage. In 2016, the ResNet network, which had won object detection and classification competitions, was published[7], and several networks incorporating multi-scale information fusion were proposed. Among these is the Convolutional Pose Machines (CPM) network introduced by Wei et al[8], which achieves an expansive receptive field by using large convolution kernels, multi-stage convolution, and pooling to capture constraint information from other body parts; multi-stage training and intermediate supervision are proposed to prevent the vanishing-gradient problem. In the same year, the highly influential COCO dataset emerged. COCO is a large-scale dataset for object detection tasks containing more than 1.51 million labeled objects[9]. The OpenPose network proposed by Cao et al[10], which incorporates CPM as a crucial component, won the COCO keypoint detection competition that year. The network first detects all key points in the image and then models the human skeleton with the proposed Part Affinity Fields (PAF), which model a vector field between two neighboring key points.

In 2017, Chen et al[11] proposed the Cascaded Pyramid Network (CPN), which consists of two stages: the GlobalNet and RefineNet sub-networks. GlobalNet, built on a ResNet backbone, is responsible for capturing global features and semantic information, while RefineNet fuses feature maps from multiple levels and refines GlobalNet's predictions to improve the accuracy of key point localization.

In 2018, Li et al[12] proposed the Multi-Stage Networks for Human Pose Estimation (MSPN), which employs adjacent-stage feature aggregation to reduce the loss of feature information. At each stage, the network fuses the feature maps of the corresponding resolution from the downsampling and upsampling paths of the previous stage with the downsampled feature maps of the current stage, making information interaction more complete and reducing training difficulty. In the same year, Xiao et al[13] proposed simple baselines for human pose estimation and tracking, a simple but effective baseline network that uses transposed convolutions to increase the resolution of the feature maps. In 2019, Sun et al[14] proposed HRNet, an architecture that emphasizes the impact of spatial resolution on detection accuracy. Previous methods often increased the computational burden of the network to maintain rich feature-map information and precise key point localization, whereas HRNet maintains high-resolution representations throughout the network by exchanging information across parallel multi-resolution branches. Li et al[15] proposed another important dataset, CrowdPose, which fills the gap of crowded environments in human pose estimation datasets.

Based on the research mentioned above, we propose a deep learning-based approach for evaluating posture during lectures. Human body features are extracted using the Deep Dual Consecutive Network for Human Pose Estimation (DCPose). Because the number of features extracted this way is limited, object detection algorithms are also introduced to extract additional features: YOLOv5 is used to recognize hand gestures, and Face-CNN with OpenCV is used for facial expression recognition, yielding hand and head expression features. With the fusion of multiple models, the effectiveness of evaluating the teacher's lecture posture is further improved.

1 Dataset and Evaluation Criteria

1.1 Teachers’ Lecture Dataset

In evaluating a teacher's delivery standard, we typically employ diverse criteria encompassing the clarity and fluency of the presentation, the ability to convey emotions effectively, and engagement with students during lectures. Judges can assign scores based on these aspects to provide an overall assessment of the speaker's performance. Consequently, we can observe elements such as the teacher's body language, posture, and movement to gain deeper insights into their presentation style and technique. To investigate this matter, we collected video data of lectures delivered by numerous teachers from various professional backgrounds and age groups in different settings. The experimental dataset comes from university classroom teaching videos and speech competition videos. By analyzing this video data, we can enhance our understanding of a teacher's lecture performance and skills.

We also gathered ratings from numerous evaluating teachers who manually assessed the performance of the lecturing teachers. These evaluating teachers appraised teaching performance across 6 dimensions: overall relaxation, head relaxation, hand relaxation, torso relaxation, interaction, and hand expression. Table 1 presents selected manual scores for hand relaxation.

Table 1

Manual scores under hand relaxation

1.2 Evaluation Criteria

When evaluating a teacher's teaching level, there are multiple aspects to consider. This experiment primarily assesses the teacher's in-class performance by focusing on body posture. The evaluation criteria for this study are categorized into three areas: head, torso, and hands, as depicted in Fig. 1.

Fig. 1 Evaluation criteria for teachers' lecture body-gesture

2 Methodology

2.1 Feature Extraction Based on Human Pose Estimation

Human key points and human behavior are inextricably linked: extracting the key points of the teacher's body is a prerequisite for judging the teacher's behavior, which in turn is an essential basis for evaluating the level of the teacher's lecture and for measuring behavioral characteristics such as interaction and expressiveness. In this paper, we utilize the open-source DCPose algorithm to extract teacher posture features, detect key points in the captured frames, and compute human behavior features from these points as the basis for machine scoring. The process is outlined below:

(1) Extraction of key points from the human body

In the PoseTracking17 dataset for human pose estimation, each teacher instance is annotated with 17 key points representing distinct body parts or feature points, accompanied by corresponding category labels[16], as depicted in Fig. 2(a). The DCPose algorithm predicts each teacher's human body key points and stores them in a JSON file, including label information, image pixel coordinates, and confidence levels of each key point. Additionally, visualization of the key points' skeleton connections was performed during the prediction process, as illustrated in Fig. 2(b).
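As a concrete illustration, the per-frame key point records can be loaded from the exported JSON for subsequent feature computation. The sketch below is a minimal example assuming a simple export format whose "keypoints" entries hold a label, pixel coordinates, and a confidence score; the actual field names of the DCPose export may differ.

```python
import json
import numpy as np

# Hypothetical loader for the per-frame key point JSON described above.
# Field names ("keypoints", "label", "x", "y", "score") are assumptions;
# the actual DCPose export may use a different schema.
def load_keypoints(json_path):
    """Return an array of shape (num_frames, 17, 3) holding [x, y, confidence]."""
    with open(json_path, "r", encoding="utf-8") as f:
        frames = json.load(f)
    poses = []
    for frame in frames:
        kps = sorted(frame["keypoints"], key=lambda k: k["label"])  # fixed joint order
        poses.append([[k["x"], k["y"], k["score"]] for k in kps])
    return np.asarray(poses, dtype=np.float32)
```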

Fig. 2 (a) PoseTracking17 dataset key points; (b) Skeleton diagram visualization results

(2) Key points calculation features

By employing the DCPose algorithm, we extracted salient human key points from a large collection of frames sampled from numerous teachers' lectures. Based on the extracted key points, we compute features of human behavior using the information transferred between frames. The 13 features are: 1) left-hand trajectory, 2) right-hand trajectory, 3) left wrist-elbow-shoulder angle, 4) right wrist-elbow-shoulder angle, 5) left elbow-shoulder-hip angle, 6) right elbow-shoulder-hip angle, 7) distance between the two wrists, 8) number of changes in the distance between the two wrists, 9) difference between the changes in the distance between the two wrists, 10) sum of body forward-tilt angles, 11) sum of left and right body rotation angles, 12) nose trajectory, and 13) angle between the left and right ears. The generated feature database is partially displayed in Table 2.
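The angle and distance features above can be computed directly from the pixel coordinates of the detected key points. The following is a minimal sketch assuming a COCO-style 17-point index layout (shoulders 5/6, elbows 7/8, wrists 9/10); the exact index assignment used in the paper is not specified and is an assumption here.

```python
import numpy as np

def joint_angle(a, b, c):
    """Angle at joint b (degrees) formed by points a-b-c, e.g. wrist-elbow-shoulder."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def frame_features(kps):
    """Angle and distance features for one frame of (17, 3) key points [x, y, score]."""
    return {
        "left_wrist_elbow_shoulder": joint_angle(kps[9, :2], kps[7, :2], kps[5, :2]),
        "right_wrist_elbow_shoulder": joint_angle(kps[10, :2], kps[8, :2], kps[6, :2]),
        "wrist_distance": np.linalg.norm(kps[9, :2] - kps[10, :2]),
    }
```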

Consequently, it is crucial to calculate the correlation between each extracted feature and the manual rating. In statistical analysis, Pearson's correlation coefficient is employed to assess the degree of linear association between two variables, X and Y, ranging from -1 to 1. The formula for computing this coefficient is as follows:

   \rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y} = \frac{E(XY)-E(X)E(Y)}{\sqrt{E(X^2)-E^2(X)}\,\sqrt{E(Y^2)-E^2(Y)}}   (1)

where cov(X, Y) is the covariance of X and Y, σ_X is the standard deviation of X, and μ_X = E(X) is the expectation of X.
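For illustration, Eq. (1) can be evaluated directly from its moment form. The snippet below is a small sketch; feature_values and manual_scores are placeholder names standing for one feature column of Table 2 and one manually scored dimension.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two variables, as in Eq. (1)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean(x * y) - np.mean(x) * np.mean(y)          # E(XY) - E(X)E(Y)
    std_x = np.sqrt(np.mean(x ** 2) - np.mean(x) ** 2)      # sqrt(E(X^2) - E^2(X))
    std_y = np.sqrt(np.mean(y ** 2) - np.mean(y) ** 2)
    return cov / (std_x * std_y)

# e.g. rho = pearson(feature_values, manual_scores)  # placeholder column names
```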

Figure 3 presents a heat map illustrating the correlation between each behavior characteristic in the teacher's lecture behavior database and the average score of 6 dimensions calculated manually. The horizontal coordinates correspond to the 13 features of human behavior and the vertical coordinates correspond to the 6 dimensions. One can observe that there is a high correlation between the teacher behavioral features extracted by the human posture estimation algorithm and the real scores given by the judges from the 6 dimensions.

Fig. 3 Feature and manual score correlation heat map

Table 2

Features database

2.2 Feature Extraction Based on Object Detection

The theory of deep learning in vision is mainly based on CNN. A convolutional neural network is a hierarchical neural network that extracts features layer by layer from the original image; these features represent visual information such as edges, colors, and shapes. Through multi-layer convolution and pooling operations, convolutional neural networks gradually abstract higher-level visual features and ultimately perform classification, detection, and segmentation of images. For face detection, this paper utilizes OpenCV's built-in Haar cascade detector to locate the head in each image. OpenCV (Open Source Computer Vision Library) is a lightweight, efficient, cross-platform computer vision and machine learning software library that supports many algorithms related to computer vision and machine learning. For gesture recognition, this paper uses the YOLOv5s model released by the Ultralytics team; YOLOv5s is an object detection model based on the YOLO (You Only Look Once) algorithm[17], which is mainly used to recognize and localize objects in videos or images.

(1) Face-CNN+OpenCV to extract facial expression features

First, the built-in face detector in OpenCV is used for the teacher's facial recognition. OpenCV provides many feature classifiers; for example, the Haar feature classifiers in the OpenCV library cover the face, facial organs, and the human body. This model is essentially a classifier, also known as a cascade classifier, which we use to detect faces before applying a Face-CNN network for facial expression classification. The FER2013 dataset, which comprises seven distinct expressions, is employed as the data source. During testing, the probability distribution over each teacher's 7 expressions is computed as the extracted facial expression feature. Figure 4 illustrates the architecture of the Face-CNN model; Conv, Pool, and FC stand for convolutional layer, pooling layer, and fully connected layer, respectively.
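A minimal sketch of this detection-then-classification pipeline is shown below, using OpenCV's bundled frontal-face Haar cascade. The 48×48 grayscale input and the Keras-style predict() interface of face_cnn follow the common FER2013 convention and are assumptions rather than the paper's exact implementation.

```python
import cv2
import numpy as np

# Face detection with OpenCV's bundled Haar cascade, followed by expression
# classification with the trained Face-CNN (loaded elsewhere). The 48x48
# grayscale input and the predict() call are assumptions about the interface.
cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def expression_probabilities(frame_bgr, face_cnn):
    """Return one 7-way expression probability vector per detected face."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    probs = []
    for (x, y, w, h) in faces:
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48)).astype(np.float32) / 255.0
        probs.append(face_cnn.predict(face[None, :, :, None])[0])  # 7 expression scores
    return probs
```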

Fig. 4 Face-CNN model

(2) YOLOv5 extracts gesture features

Gestures are very important in a teacher's lecture. They not only help the teacher express ideas better during the lecture but also enhance the appeal and attraction of the lecture; with proper gestures, teachers appear more confident, lively, and engaging, while also reflecting their personality and style. With the rapid advancement of object detection technology, gesture detection has become an object detection task attracting considerable attention in both academia and industry. This paper builds on the YOLOv5s pre-trained model released by the Ultralytics team. The existing YOLOv5s pre-trained model was trained on the COCO dataset, so for gestures we constructed a gesture dataset ourselves and continued transfer learning from the pre-trained model on it to realize gesture recognition and, in turn, recognition of the teacher's gestures. The training data were derived from the HaGRID dataset, which we condensed and reduced in resolution because of its large size. The resulting dataset, referred to as HaGRID-Light, comprises 18 gesture categories, each containing 300 images for training and 30 images for testing. Gesture categories include one, ok, four, three, call, etc. Figure 5 displays the results of model training and validation, where box_loss, obj_loss, and cls_loss represent the box regression loss, object confidence loss, and classification loss, respectively. We trained the model using the open-source YOLOv5s network with the number of epochs set to 100.
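For reference, transfer learning with the Ultralytics YOLOv5 repository is typically driven by its training script, and the fine-tuned weights can then be loaded for inference on sampled lecture frames. The dataset config name and weight paths below are placeholders, not the authors' actual files.

```python
import torch

# Fine-tuning on HaGRID-Light is typically done with the YOLOv5 training script:
#   python train.py --img 640 --batch 16 --epochs 100 \
#       --data hagrid_light.yaml --weights yolov5s.pt
# where hagrid_light.yaml is a hypothetical dataset config listing the 18 gesture classes.

# Loading the fine-tuned weights for gesture detection on a sampled lecture frame:
model = torch.hub.load("ultralytics/yolov5", "custom",
                       path="runs/train/exp/weights/best.pt")  # placeholder path
results = model("teacher_frame.jpg")              # placeholder image path
detections = results.pandas().xyxy[0]             # boxes, confidences, class names
print(detections[["name", "confidence"]])
```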

Fig. 5 The performance of YOLOv5s on the HaGRID-Light dataset regarding training and validation

By utilizing facial expression recognition and gesture recognition techniques, we extracted distinctive features from the frame data of numerous teachers. Features include the probability distribution of 7 common facial expressions and the frequency of no-hand gestures. The database of teacher features generated is shown in Table 3.
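A small sketch of how the per-frame outputs might be aggregated into the per-teacher entries of Table 3 is given below; the averaging of expression probabilities and the normalization of gesture counts are assumptions about the aggregation scheme, which the text does not spell out.

```python
import numpy as np

# Aggregate per-frame detections into per-teacher features: the mean probability
# of each of the 7 expressions over all sampled frames, plus the relative
# frequency of each detected gesture class. Input shapes are assumptions
# (frames x 7 for expressions, a list of class names for gestures).
def teacher_features(frame_expression_probs, frame_gesture_labels, gesture_classes):
    expr = np.mean(np.asarray(frame_expression_probs), axis=0)        # 7 values
    counts = {c: 0 for c in gesture_classes}
    for label in frame_gesture_labels:
        counts[label] = counts.get(label, 0) + 1
    total = max(len(frame_gesture_labels), 1)
    gesture_freq = np.array([counts[c] / total for c in gesture_classes])
    return np.concatenate([expr, gesture_freq])
```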

Table 3

Teacher database on facial expressions and gesture features

3 Experiments

3.1 Evaluation of Lecture Posture Based on Human Posture Estimation

After validating the efficacy of the DCPose algorithm in extracting human behavioral features, we employed machine learning regression algorithms for prediction and analyzed the correlation and score deviation between the predicted results and the manual ratings. To ensure the stability of score prediction, this experiment used LR (Logistic Regression), SVM (Support Vector Machine), and CART (Classification and Regression Tree) regressors, evaluated through cross-validation. The specific steps are as follows (a code sketch of this pipeline is given after the steps):

Step 1: Convert the judges' full-mark standard from 5 points to 100 points, and use the average score of the 3 judging teachers as the final manual score;

Step 2: Using the regressors, cross-validated predictions are made for the sample of 563 teachers, and the predicted values are taken as the machine scores;

Step 3: The correlation and bias between the machine scores and the manual scores are calculated, as presented in Table 4. From Table 4, it can be seen that under a maximum score of 100 points, the proportion of machine-predicted scores within 10 points and within 20 points of the human scores reached 51.72% and 81.87%, respectively. Cases where the score difference exceeded 50 points accounted for only 0.12%.
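The sketch below illustrates Steps 1-3 with scikit-learn, assuming X holds the 563 teachers' behavioral features (Table 2) and y the manual scores rescaled to 100 points. LinearRegression is used here merely as a stand-in for the LR regressor named above; the paper's exact regressor settings are not specified.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.model_selection import cross_val_predict
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

def evaluate(regressor, X, y, folds=5):
    """Cross-validated machine scores vs. manual scores: correlation and bias."""
    machine_scores = cross_val_predict(regressor, X, y, cv=folds)
    rho, _ = pearsonr(machine_scores, y)
    within_10 = np.mean(np.abs(machine_scores - y) <= 10)   # share within 10 points
    within_20 = np.mean(np.abs(machine_scores - y) <= 20)   # share within 20 points
    return rho, within_10, within_20

# for reg in (LinearRegression(), SVR(), DecisionTreeRegressor()):
#     print(evaluate(reg, X, y))   # X: behavioral features, y: manual scores (0-100)
```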

The analysis indicates a strong correlation between the predicted machine scores and the manual scores, with an overall correlation above 0.642. In particular, the hand and torso dimensions were well above the overall level, while the head relaxation and interaction dimensions were slightly lower. The model will therefore be further optimized at a later stage.

Table 4

Correlation and bias analysis of machine scores and manual scores based on human posture estimation

3.2 Evaluation of Lecture Posture Based on Multi-Model Fusion

For further optimization, object detection algorithms are added to extract features of the teacher's facial expressions and gestures in addition to the pose-based features. The overall process architecture for evaluating speech contests with multi-model fusion is shown in Fig. 6.

Fig. 6 Overall process architecture

The experiment demonstrates that fusing multiple sources of feature information under a multi-model framework significantly enhances the predictive performance of the regressor. The correlation and deviation analysis between machine scores and human scores based on multi-model fusion is presented in Table 5.
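Conceptually, the fusion step amounts to concatenating the pose-based behavioral features with the expression and gesture features before retraining the regressor. The sketch below illustrates this under placeholder variable names; it reuses the evaluate() helper from the earlier cross-validation sketch.

```python
import numpy as np

# Concatenate the pose-based features (Table 2) with the expression probabilities
# and gesture frequencies (Table 3); variable names are placeholders.
def fuse_features(pose_features, expression_probs, gesture_freqs):
    return np.concatenate([pose_features, expression_probs, gesture_freqs], axis=1)

# X_fused = fuse_features(X_pose, X_expr, X_gesture)
# print(evaluate(SVR(), X_fused, y))   # reuses evaluate() from the sketch above
```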

With multi-model fusion, facial expression and gesture recognition algorithms were introduced to extract more human features; the overall correlation between the predicted machine scores and the manual scores improved by 2.26%, and the percentage of deviations under 20 points increased by 1.54%.

Table 5

Correlation and bias analysis of machine scores and manual scores based on multi-model fusion

4 Conclusion

This paper proposes a deep learning-based approach for evaluating lecture posture. First, we employ a human pose estimation algorithm to extract key body points and calculate behavioral features based on these points. Then, the extracted features are utilized by a regressor to predict machine scores. Additionally, we introduce an object detection algorithm to extract facial expressions and gesture features. Experimental results demonstrate that multi-model fusion significantly improves prediction performance.

References

  1. Zhang X Q, Li C C, Tong X F, et al. Efficient human pose estimation via parsing a tree structure based human model[C]//2009 IEEE 12th International Conference on Computer Vision. New York: IEEE, 2009: 1349-1356.
  2. Sun M, Kohli P, Shotton J. Conditional regression forests for human pose estimation[C]//2012 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2012: 3394-3401.
  3. Dantone M, Gall J, Leistner C, et al. Human pose estimation using body parts dependent joint regressors[C]//2013 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2013: 3041-3048.
  4. Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84-90.
  5. Tompson J, Jain A, LeCun Y, et al. Joint training of a convolutional network and a graphical model for human pose estimation[EB/OL]. [2023-05-20]. https://arxiv.org/pdf/1406.2984.
  6. Toshev A, Szegedy C. DeepPose: Human pose estimation via deep neural networks[C]//2014 IEEE Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2014: 1653-1660.
  7. He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2016: 770-778.
  8. Wei S H, Ramakrishna V, Kanade T, et al. Convolutional pose machines[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2016: 4724-4732.
  9. Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context[C]//European Conference on Computer Vision. Cham: Springer-Verlag, 2014: 740-755.
  10. Cao Z, Simon T, Wei S H, et al. Realtime multi-person 2D pose estimation using part affinity fields[C]//2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2017: 1302-1310.
  11. Chen Y L, Wang Z C, Peng Y X, et al. Cascaded pyramid network for multi-person pose estimation[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 7103-7112.
  12. Li W B, Wang Z C, Yin B Y, et al. Rethinking on multi-stage networks for human pose estimation[EB/OL]. [2023-05-21]. http://arxiv.org/abs/1901.00148.
  13. Xiao B, Wu H P, Wei Y C. Simple baselines for human pose estimation and tracking[C]//European Conference on Computer Vision. Cham: Springer-Verlag, 2018: 472-487.
  14. Sun K, Xiao B, Liu D, et al. Deep high-resolution representation learning for human pose estimation[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2019: 5686-5696.
  15. Li J F, Wang C, Zhu H, et al. CrowdPose: Efficient crowded scenes pose estimation and a new benchmark[C]//2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2019: 10855-10864.
  16. Andriluka M, Iqbal U, Insafutdinov E, et al. PoseTrack: A benchmark for human pose estimation and tracking[C]//2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition. New York: IEEE, 2018: 5167-5176.
  17. Redmon J, Divvala S, Girshick R, et al. You only look once: Unified, real-time object detection[C]//2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2016: 779-788.
