Issue 
Wuhan Univ. J. Nat. Sci.
Volume 29, Number 2, April 2024



Page(s)  145-153
DOI  https://doi.org/10.1051/wujns/2024292145  
Published online  14 May 2024 
Computer Science
CLC number: TP751
Image Semantic Segmentation Approach for Studying Human Behavior on Image Data
^{1}
School of Communication, Wuhan Textile University, Wuhan 430073, Hubei, China
^{2}
Walnut Street (Shanghai) Information Technology Co., Ltd., Shanghai 200051, China
^{3}
College of Economics & Management, Zhejiang University of Water Resources and Electric Power, Hangzhou 310018, Zhejiang, China
^{4}
Research Center for Digital Economy and Sustainable Development of Water Resources, Hangzhou 310018, Zhejiang, China
^{†} Corresponding author. Email: soloda@mail.ustc.edu.cn
Received: 28 August 2023
Image semantic segmentation is an essential technique for studying human behavior through image data. This paper proposes an image semantic segmentation method for human behavior research. First, an end-to-end convolutional neural network architecture is proposed, which consists of a depthwise separable skip-connected fully convolutional network and a conditional random field network. Second, skip-connected convolution is used to classify each pixel in the image, yielding a semantic segmentation method based on convolutional neural networks. Third, a conditional random field network is used to improve the segmentation of human behavior images, and linear and nonlinear modeling methods for conditional-random-field-based semantic segmentation are proposed. Finally, the proposed segmentation network is applied to entrepreneur image data to obtain the contour features of the person, and to images in the medical field. The experimental results show that the image semantic segmentation method is effective. It offers a new way to use image data to study human behavior and can be extended to other research areas.
Key words: human behavior research / image semantic segmentation / skip-connected fully convolutional network / conditional random field network / deep learning
Cite this article: ZHENG Zhan, CHEN Da, HUANG Yanrong. Image Semantic Segmentation Approach for Studying Human Behavior on Image Data[J]. Wuhan Univ J of Nat Sci, 2024, 29(2): 145-153.
Biography: ZHENG Zhan, female, Ph. D., Associate professor, research direction: image processing. Email: czheng@wtu.edu.cn
Foundation item: Supported by the Major Consulting and Research Project of the Chinese Academy of Engineering (2020CQZD1), the National Natural Science Foundation of China (72101235) and Zhejiang Soft Science Research Program (2023C35012)
© Wuhan University 2024
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
0 Introduction
The research on human behavior is based on the big data method. It includes three aspects: numerical data analysis, text data analysis, and image data analysis.
With the rapid development of mobile devices and Internet technology, images have become an essential carrier with which more and more people record their behavior. An image records the moment of a behavior and is an important information source for analyzing behavior characteristics. With the iterative innovation of artificial intelligence technology, image semantic segmentation provides a creative means to extract and understand the information in an image. The development of this technology can be roughly divided into three stages ^{[1]}: ① the traditional stage, whose methods can only achieve image segmentation, not image semantic segmentation; ② the stage combining deep learning with traditional techniques, which uses Convolutional Neural Networks (CNN) to achieve semantic segmentation; ③ the stage of Fully Convolutional Networks (FCN), which are the mainstream image semantic segmentation technique today.
CNNs have shown extraordinary ability in image data processing. However, image semantic segmentation must classify every pixel of the input image, which a general CNN cannot do directly; the network has to be designed manually, and such designs suffer from long training time and low accuracy.
To address these problems, this paper proposes an end-to-end convolutional neural network architecture for semantic segmentation of human behavior images. The network consists of a deep Separator Stride Fully Convolutional Network (SSFCN) and a Dense Conditional Random Field (DCRF). The algorithm is named the "SSFCN-DCRF image semantic segmentation deep learning algorithm for human behavior research".
This article has the following innovations:
1) An end-to-end convolutional neural network architecture (Separator Stride Fully Convolutional Network and Dense Conditional Random Field, SSFCN-DCRF) for semantic segmentation of human behavior images is proposed, consisting of a depthwise separable SSFCN and a DCRF.
2) A skip-connected fully convolutional network, SSFCN, is proposed, which combines convolutional and pooling layers with skip-connected convolution to classify each pixel in the image, giving a semantic segmentation method based on convolutional networks.
3) The DCRF network is used to improve the segmentation of human behavior images. We implement linear modeling of conditional-random-field-based image semantic segmentation by using the Gibbs distribution and introducing an energy function.
We use fully connected conditional random fields and introduce two random variables, hidden and observed, to implement nonlinear modeling of conditional-random-field-based image semantic segmentation.
1 Related Work
1.1 Traditional Image Semantic Segmentation Algorithms
The traditional approach to image semantic segmentation is to design features manually and then use a machine learning classifier to complete the segmentation task. According to the classification algorithm, it can be divided into three categories: ① Image semantic segmentation based on threshold segmentation. The image is converted to a gray image, several segmentation thresholds are determined, and the pixels in the gray image are assigned to categories according to these thresholds. Each segmentation threshold corresponds to a segmentation map. ② Image semantic segmentation based on clustering. The segmentation problem is transformed into a clustering problem in machine learning, and the segmentation map is output after the clustering algorithm has iterated. Common clustering algorithms include K-means clustering^{ [2]}, mean shift clustering^{ [3]}, Gaussian mixture model clustering^{ [4]} and agglomerative hierarchical clustering^{ [5]}. Depending on the clustering target, these methods can be divided into RGB color value clustering^{ [6]}, gray value clustering^{ [7]} and pixel spatial position clustering^{ [8]}. ③ Image semantic segmentation based on graph theory, which treats image semantic segmentation as the minimum cut problem in graph theory. The image is regarded as an undirected weighted graph G=(V, E), where V represents the pixels in the image, E represents the connections between pixels, and the weight of an edge represents the difference or correlation between adjacent pixels. A segmentation of the image corresponds to a partition S of the undirected weighted graph, and each subregion C (C ∈ S) in the partition corresponds to a subgraph. Common segmentation methods based on graph theory include GraphCut^{ [9]}, GrabCut^{ [10]} and Random Walk^{ [11]}.
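The threshold-based category ① can be illustrated with a minimal sketch. The image, the two thresholds and the helper names `to_gray` and `threshold_segment` are hypothetical, and the gray conversion assumes the common ITU-R BT.601 luma weights; the paper does not specify any of these.

```python
from bisect import bisect_right

def to_gray(rgb):
    """Convert an (R, G, B) pixel to a gray level (assumed BT.601 luma weights)."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b

def threshold_segment(image, thresholds):
    """Label each pixel with the index of the gray-level interval it falls into."""
    cuts = sorted(thresholds)
    return [[bisect_right(cuts, to_gray(px)) for px in row] for row in image]

# Hypothetical 2x3 RGB image with dark, mid-gray and bright pixels.
image = [
    [(10, 10, 10), (120, 120, 120), (250, 250, 250)],
    [(30, 20, 25), (130, 140, 125), (240, 245, 250)],
]
# Two thresholds split the gray range into three categories: 0, 1 and 2.
labels = threshold_segment(image, thresholds=[64, 192])
print(labels)
```

With n thresholds the pixels fall into n + 1 categories; no semantic meaning is attached to the categories, which is why such methods achieve segmentation but not semantic segmentation.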
1.2 State-of-the-Art Research on Image Semantic Segmentation Based on CNN
With the success of CNN in some fields of computer vision (such as face recognition^{ [12]}, object detection^{ [13]}, pedestrian recognition^{ [14]}, etc.), more and more researchers have applied them to image semantic segmentation^{[15-17]}.
The network architecture based on fully convolutional networks^{[18]} predicts each pixel directly and is trained end to end. The network architecture of FCN^{ [19]} takes the form of an encoder-decoder. The encoder, namely the feature extraction module, extracts the features of the image. The decoder, namely the upsampling module, outputs the final segmented image. The decoder needs to enlarge the feature map, for example by linear or bilinear interpolation. Decoders can also perform upsampling by deconvolution, as in SegNet^{ [20]}, DeconvoNet^{ [21]}, and Cipola SegNet^{ [22]}.
1.3 Human Behavior Research by Image Semantic Segmentation Method
1.3.1 Human behavior research
The use of image semantic segmentation to study human behavior can be divided into three levels. The first level, low-level vision, is human detection, which extracts the category "human" from the picture. The second level, intermediate vision, is human tracking, which targets the person in the picture and recognizes the person's movement characteristics. The third level, high-level vision, is behavioral understanding, the process of assigning meaning to human actions.
According to the complexity of the technology used, behavior research based on image semantic segmentation can be divided into three categories. ① Recognition and detection of individual human behavior. For example, Zhang et al^{[23]} built a video-based process for judging abnormal human behavior, and Shao et al^{[24]} proposed a key-frame cascade recognition network and the space-time graph convolution AST-GCN (Attention ST-GCN). ② Recognition and detection of multi-person interaction behavior. Wang et al^{[25]} pointed out that multi-person interaction behavior recognition and detection need to pay attention to the relationships between people, which carry the key information for interpreting group behavior. They summarized group behavior recognition methods and classified them into the conventional non-interaction type, models based on interaction relationships, models based on key-person interaction, and the multiple decision fusion type^{[25]}. ③ Detection of behavior tracks. Behavior is often not an instantaneous event but a process composed of multiple nodes. Based on deep network learning, Hu ^{[26]} extracted the internal regularities of complex pedestrian movement paths and predicted people's movement paths in the next stage.
1.3.2 Dynamic scenario research in time and space dimensions
Modeling real dynamic scenes in both the time and space dimensions is currently a hot topic. Ji et al^{[27]} proposed capturing spatial and temporal information from video, and Wald et al^{[28]} presented a new neural network architecture for 3D data that explores the relationships between entities and returns semantics from a given 3D scene through learning. Fernando et al^{[29]} encoded motion dynamics into an image and fed the image into any standard CNN for end-to-end learning. Bilen et al^{[30]} proposed a dynamic method to represent the motion of an image sequence over time. Khowaja et al^{[31]} performed local sparse segmentation using global clustering to construct semantic images.
To summarize, image semantic segmentation based on convolutional neural networks has the following two problems: first, image semantic segmentation needs to classify each pixel of the input image, which a general convolutional neural network cannot do directly, so the network is mainly designed manually; second, the image segmentation algorithm suffers from long training time and low accuracy. To address these two problems, this paper proposes a deep learning algorithm for image semantic segmentation for the study of human behavior.
2 Image Semantic Segmentation Method Based on Convolutional Neural Networks (CNN)
2.1 End-to-End CNN Architecture
To achieve image semantic segmentation of human behaviors, this paper proposes an end-to-end convolutional neural network architecture consisting of a depthwise separable skip-connected fully convolutional network (SSFCN) and a conditional random field network (DCRF). SSFCN classifies each pixel in the image, and DCRF refines the segmentation of human behavior images.
2.2 Basic Building Blocks of CNN
2.2.1 Convolution
The convolution layer is composed of multiple convolution kernels. Its input is a multichannel image. Each convolution kernel slides over the input with a specific stride, performs the convolution operation, and outputs the result. The convolution layer has two characteristics: ① Sparse connection. Each output value of a convolution kernel depends only on a local region of the input feature map. ② Weight sharing. Each convolution kernel applies the same parameters at every position of the input. Because of these two characteristics, the number of parameters of the convolutional neural network is significantly reduced. The role of the convolution kernel is to extract the visual features in the image, and convolution kernels of different sizes can extract different levels of feature information. Figure 1 shows the process of extracting different levels of feature information with convolution kernels of various sizes.
Fig. 1 The process of extracting different levels of feature information with convolution kernels of different sizes 
A convolutional neural network is a multilayer structure. The convolution kernels of each layer receive the output of the previous layer as input, and their output becomes the input of the next layer. Generally, lower-layer convolution kernels detect more basic visual features (such as horizontal lines, vertical lines, etc.), and higher-layer convolution kernels detect more specific visual features (such as circles, boxes, etc.). The convolution kernels of different layers can be combined to extract most features. The final result of the convolution operation is a feature map.
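As an illustration of how one small kernel detects a basic visual feature, here is a minimal sketch of a single-channel convolution without padding. The 4×4 image and the hand-picked 2×2 kernel are hypothetical; in a trained CNN the kernel weights are learned rather than chosen by hand.

```python
def conv2d(image, kernel, stride=1):
    """Valid (no padding) 2D cross-correlation, the operation used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            # The same kernel weights are reused at every position (weight sharing),
            # and each output only sees a kh x kw window (sparse connection).
            for i in range(kh):
                for j in range(kw):
                    acc += image[r * stride + i][c * stride + j] * kernel[i][j]
            row.append(acc)
        out.append(row)
    return out

# Hypothetical 4x4 image with a horizontal edge between rows 1 and 2.
image = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
# Hand-picked kernel that responds to a dark-to-bright change from top to bottom.
horizontal_edge = [
    [-1, -1],
    [1, 1],
]
feature_map = conv2d(image, horizontal_edge)
print(feature_map)
```

The feature map is non-zero only in the row where the window straddles the edge, which is the sense in which a low-level kernel "detects" a horizontal line.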
2.2.2 Pooling layer
The role of pooling layers is to solve the problem of parameter explosion caused by simply stacking convolution layers. A pooling layer compresses the input feature map by downsampling. Common pooling methods include maximum pooling and average pooling.
The pooling layer has two characteristics: ① Feature invariance. That is, when an image undergoes small transformations such as flipping, translation, rotation and scaling, the pooling layer can still extract largely the same features. ② Feature dimension reduction. That is, after the pooling layer, the feature map is smaller, which reduces the input size of the next layer and therefore the amount of calculation and the number of parameters of the entire network. For example, if the input image is 224×224×3, after a maximum pooling operation with a stride of 2 its feature map is compressed to 112×112×3.
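The dimension reduction can be checked with a small sketch of maximum pooling. The 2×2 window is an assumption (the text only states the stride of 2), and the function names and the 4×4 feature map are hypothetical.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Maximum pooling over one channel; each window is replaced by its largest value."""
    out = []
    for r in range(0, len(feature_map) - size + 1, stride):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, stride):
            window = [feature_map[r + i][c + j] for i in range(size) for j in range(size)]
            row.append(max(window))
        out.append(row)
    return out

def pooled_size(n, size=2, stride=2):
    """Output side length for an input side length n (no padding)."""
    return (n - size) // stride + 1

# Hypothetical 4x4 single-channel feature map.
fm = [
    [1, 3, 2, 0],
    [4, 2, 1, 1],
    [0, 1, 5, 6],
    [2, 2, 7, 8],
]
pooled = max_pool2d(fm)
print(pooled)            # 2x2 map of window maxima
print(pooled_size(224))  # side length after pooling a 224-pixel side
```

Pooling acts on each channel independently, so the channel count (3 in the example from the text) is unchanged while height and width are halved.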
2.2.3 Fully connected layer
The fully connected (FC) layer acts as a "classifier". By introducing a nonlinear transformation, it maps the learned "distributed feature representation" to the sample label space and passes it to the classifier.
At the fully connected layer, the feature map is mapped to probabilities so that the whole network can classify the input. As shown in Fig. 2, a feature map of size 3×3 is split by row into three one-dimensional vectors of size 1×3, which are then concatenated in sequence into a one-dimensional vector of size 1×9 that provides the input for the classifier.
Fig. 2 The process of flattening the characteristic image output by the convolution layer into a onedimensional vector 
After the fully connected layer, a score z_{j} in (−∞,+∞) is obtained for each of the K categories. The output of the fully connected layer cannot directly represent the probability that the image belongs to a category. To obtain these probabilities, each score is first mapped to (0,+∞) through ${\mathrm{e}}^{{z}_{j}}$ and then normalized to (0, 1). This conversion is the softmax ^{[32]} formula, see formula (1).
$\alpha {(Z)}_{j}=\frac{{\mathrm{e}}^{{z}_{j}}}{{\displaystyle \sum _{k=\mathrm{1}}^{K}}{\mathrm{e}}^{{z}_{k}}}$(1)
where j=1,…, K, and K represents the total number of categories to be classified. In this way, the output of each neuron is mapped to the probability of belonging to a specific category, and the mapped values of all neurons sum to 1.
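The flattening step and formula (1) can be sketched as follows. The scores are hypothetical, and subtracting the maximum score before exponentiation is a standard numerical-stability trick that leaves the result of formula (1) unchanged; it is not part of the paper.

```python
import math

def flatten(feature_map):
    """Concatenate the rows of a 2D feature map into one vector (3x3 -> 1x9)."""
    return [v for row in feature_map for v in row]

def softmax(scores):
    """Formula (1): map K real-valued scores to probabilities that sum to 1."""
    m = max(scores)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp(z - m) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

flat = flatten([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
probs = softmax([2.0, 1.0, 0.1])  # hypothetical scores for K = 3 categories
print(flat)
print(probs, sum(probs))
```

The ordering of the scores is preserved, so the category with the largest score also has the largest probability.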
The fully connected layer plus softmax can be understood from three perspectives: ① Weighting, which treats each weight as the importance of the corresponding feature dimension; ② Template matching, which helps to interpret visualizations of the parameters; ③ Geometry, in which each feature is regarded as a point in a multidimensional space. The distribution of points of different classes helps to explain the design idea behind some loss functions.
2.3 Image Semantic Segmentation Based on FCN
The image semantic segmentation algorithm based on a convolutional neural network is implemented with an FCN. A convolution layer replaces the final fully connected layer so that each pixel in the image is classified. The network structure is divided into two parts: feature extraction and upsampling. First, a combination of convolution and pooling layers extracts the visual features of the image. Then deconvolution restores the feature map to the size of the original image, and the final segmented image is generated.
3 Deep Learning Algorithm for Image Semantic Segmentation in Conditional Random Field Networks
The key to image recognition of human behavior is semantic segmentation of the image. To solve the problem of poor edge segmentation caused by directly applying a fully convolutional network, this paper proposes a dense conditional random field network (DCRF) to optimize the segmentation results output by the fully convolutional network SSFCN.
3.1 Segmenting Image Semantics with DCRF Networks
Because the classical conditional random field model has a large number of connection edges, applying it directly to image semantic segmentation is computationally too expensive, since an image contains many pixels. The mean field approximation algorithm from variational inference is therefore used for approximate calculation. One iteration of the mean field approximation algorithm is expressed as convolution layers, and multiple iterations are expressed as a recurrent neural network, which forms the conditional random field network DCRF. The deep learning algorithm for human behavior image segmentation proposed in this paper combines the conditional random field network DCRF with the fully convolutional network SSFCN to form an end-to-end image semantic segmentation network, SSFCN-DCRF, called the SSFCN-DCRF image semantic segmentation deep learning algorithm. The whole network can be trained using the backpropagation algorithm.
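The mean field iteration can be made concrete with a naive sketch on a tiny one-dimensional "image". This is not the paper's convolutional implementation: it loops over all pixel pairs, assumes energies act as costs (probabilities proportional to the exponential of the negative cost), uses a Potts-style penalty for differing labels, and uses a position-only Gaussian affinity. All names and numbers are hypothetical.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mean_field(unary, affinity, iterations=5):
    """Naive mean field inference for a fully connected CRF with Potts compatibility.

    unary[i][l]    : cost of giving label l to pixel i (e.g. from the FCN output)
    affinity[i][j] : non-negative similarity between pixels i and j
    """
    n, n_labels = len(unary), len(unary[0])
    q = [softmax([-u for u in row]) for row in unary]  # initialise from unary costs
    for _ in range(iterations):  # each pass plays the role of one recurrent step
        new_q = []
        for i in range(n):
            scores = []
            for l in range(n_labels):
                # Message passing: similar pixels that currently disagree with
                # label l add a penalty for assigning l to pixel i.
                penalty = sum(affinity[i][j] * (1.0 - q[j][l])
                              for j in range(n) if j != i)
                scores.append(-unary[i][l] - penalty)
            new_q.append(softmax(scores))  # local normalisation
        q = new_q
    return q

# Five pixels in a row; pixel 2 is a noisy outlier that weakly prefers label 1.
unary = [[0.0, 2.0], [0.0, 2.0], [0.7, 0.5], [0.0, 2.0], [0.0, 2.0]]
sigma = 1.0
n = len(unary)
affinity = [[math.exp(-((i - j) ** 2) / (2 * sigma ** 2)) if i != j else 0.0
             for j in range(n)] for i in range(n)]

initial_labels = [row.index(min(row)) for row in unary]
refined_labels = [row.index(max(row)) for row in mean_field(unary, affinity)]
print(initial_labels, refined_labels)
```

In this toy case the outlier is pulled to the label of its neighbours, which is the smoothing effect the DCRF is meant to add on top of the per-pixel predictions. The pairwise loop costs O(N²) per iteration, which is exactly the cost that efficient implementations avoid.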
The key of this algorithm is image semantic segmentation modeling based on DCRF, including linear and nonlinear parts.
3.2 Linear Modeling for Semantic Segmentation of Images in DCRF Networks
The Gibbs distribution is used for linear modeling of image semantic segmentation based on the conditional random field^{[32]}. The Gibbs distribution is the probability distribution of an undirected graph model expressed as a product of factors. Its specific expression is shown in formula (2).
$P\left({X}_{\mathrm{1}},{X}_{\mathrm{2}},\cdots ,{X}_{n}\right)=\frac{\mathrm{1}}{Z\left(X\right)}\tilde{P}\left({X}_{\mathrm{1}},{X}_{\mathrm{2}},\cdots ,{X}_{n}\right)$(2)
where
$\tilde{P}\left({X}_{\mathrm{1}},{X}_{\mathrm{2}},\cdots ,{X}_{n}\right)={\displaystyle \prod _{i=k}^{m}}{\phi}_{i}\left(X\right)$(3)
$Z\left(X\right)=\sum \prod {\phi}_{i}\left(X\right)$(4)
In equation (2), $Z\left(X\right)$, defined in equation (4), is the normalization coefficient, and ${\phi}_{i}\left(X\right)$ in equation (3) are the factor functions. To make this model convenient for image semantic segmentation, the factor function is redefined as shown in formula (5).
$\phi \left(X\right)=\mathrm{e}\mathrm{x}\mathrm{p}\left(\xi \left(X\right)\right)$(5)
where $\xi \left(X\right)$ is the energy function.
Finally, the linear model of image semantic segmentation based on a conditional random field is obtained, as shown in formula (6).
$P\left({X}_{\mathrm{1}},{X}_{\mathrm{2}},\cdots ,{X}_{n}\right)=\frac{\mathrm{1}}{Z\left(X\right)}\mathrm{e}\mathrm{x}\mathrm{p}\left({\displaystyle \sum _{i=k}^{m}}{\xi}_{i}\left(X\right)\right)$(6)
It can be seen that, with the energy function introduced, the product of factors in equation (3) becomes a sum of energies inside the exponential. That is, after taking the natural logarithm, the multiplicative relationship between the elements becomes an additive one.
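The equivalence between the product form of equations (2) to (4) and the exponential-of-a-sum form can be checked numerically. The sketch below uses two hypothetical positive factors over two binary variables and keeps the normalization coefficient explicit in both forms; the factor values are illustrative only.

```python
import math
from itertools import product

# Hypothetical factors over a state x = (x1, x2) of two binary variables.
def phi_1(x):
    return 2.0 if x[0] == x[1] else 0.5   # favours equal values

def phi_2(x):
    return 3.0 if x[0] == 1 else 1.0      # favours x1 = 1

factors = [phi_1, phi_2]
states = list(product([0, 1], repeat=2))

def gibbs_from_factors(factors, states):
    """Equations (2)-(4): normalised product of factors."""
    unnorm = {x: math.prod(f(x) for f in factors) for x in states}
    z = sum(unnorm.values())
    return {x: p / z for x, p in unnorm.items()}

def gibbs_from_energies(factors, states):
    """Same distribution with phi = exp(xi): normalised exponential of summed energies."""
    energy_sum = {x: sum(math.log(f(x)) for f in factors) for x in states}
    z = sum(math.exp(e) for e in energy_sum.values())
    return {x: math.exp(e) / z for x, e in energy_sum.items()}

p_prod = gibbs_from_factors(factors, states)
p_exp = gibbs_from_energies(factors, states)
print(p_prod)
print(p_exp)
```

Both functions return the same distribution up to floating-point error, because the logarithm of a product of factors is the sum of their energies.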
3.3 Nonlinear Modeling of Semantic Segmentation of DCRF Images
The nonlinear modeling of image semantic segmentation based on a conditional random field uses the fully connected conditional random field (DCRF) to model the semantic segmentation problem. The DCRF follows the Gibbs distribution, as shown in formula (7).
$P\left(Y\mid X\right)=\frac{\mathrm{1}}{Z\left(X\right)}\tilde{P}\left(Y,X\right)$(7)
where
$\tilde{P}\left(Y,X\right)=\mathrm{e}\mathrm{x}\mathrm{p}\left(\sum _{i}{w}_{i}\mathrm{*}{f}_{i}\left(Y,X\right)\right)$(8)
$Z\left(X\right)=\sum _{Y}\mathrm{e}\mathrm{x}\mathrm{p}\left(\sum _{i}{w}_{i}\mathrm{*}{f}_{i}\left(Y,X\right)\right)$(9)
In equations (7), (8), and (9), Y is called the hidden variable and X the observed variable. In image semantic segmentation, Y represents the category label to which each pixel belongs, and X represents the information that can be directly observed at each pixel. Modeling with DCRF requires the unnormalized joint probability $\tilde{P}\left(Y,X\right)$ of the two random variables, which is determined by the energy function. In this section, we design several energy functions to obtain the final DCRF model for the image semantic segmentation task, that is, nonlinear modeling based on conditional random field image semantic segmentation, as shown in formula (10):
$\tilde{P}\left(Y,X\right)=\mathrm{e}\mathrm{x}\mathrm{p}\left(\sum _{i}{f}_{\mathrm{1}}\left({X}_{i},{Y}_{i}\right)+{f}_{\mathrm{2}}\left({Y}_{i},{Y}_{i+\mathrm{1}}\right)\right)$(10)
In formula (10), ${f}_{\mathrm{1}}\left({X}_{i},{Y}_{i}\right)$ is a first-order energy function, representing the information entropy incurred by assigning a category label to pixel i; ${f}_{\mathrm{2}}\left({Y}_{i},{Y}_{i+\mathrm{1}}\right)$ is a second-order energy function, which covers the case in which labels are assigned to two pixels at the same time.
When all pixels in the image are connected to each other in pairs, the second-order energy function can be expanded as in formula (11).
${\psi}_{p}\left({x}_{i},{x}_{j}\right)=\mu \left({x}_{i},{x}_{j}\right){\displaystyle \sum _{m=\mathrm{1}}^{K}}{w}^{m}{k}^{m}\left({f}_{i},{f}_{j}\right)$(11)
where ${k}^{m}$ is the m-th kernel function, ${w}^{m}$ is its weight, ${f}_{i}$ is the feature vector of pixel i, and K here denotes the number of kernels.
Using two Gaussian kernels, formula (11) is instantiated as:
$\begin{array}{l}{\psi}_{p}\left({x}_{i},{x}_{j}\right)=\mu \left({x}_{i},{x}_{j}\right)\\ \left[{w}_{\mathrm{1}}\mathrm{exp}\left(-\frac{{\Vert {p}_{i}-{p}_{j}\Vert}^{\mathrm{2}}}{\mathrm{2}{\sigma}_{\alpha}^{\mathrm{2}}}-\frac{{\Vert {I}_{i}-{I}_{j}\Vert}^{\mathrm{2}}}{\mathrm{2}{\sigma}_{\beta}^{\mathrm{2}}}\right)+{w}_{\mathrm{2}}\mathrm{exp}\left(-\frac{{\Vert {p}_{i}-{p}_{j}\Vert}^{\mathrm{2}}}{\mathrm{2}{\sigma}_{\gamma}^{\mathrm{2}}}\right)\right]\end{array}$(12)
In formula (12), $\mu ({x}_{i},{x}_{j})$ is the label compatibility between pixel i and pixel j: $\mu \left({x}_{i},{x}_{j}\right)=\mathrm{1}$ only if ${x}_{i}\ne {x}_{j}$, otherwise $\mu \left({x}_{i},{x}_{j}\right)=\mathrm{0}$. This means that only nodes with different labels generate information entropy, and the information entropy between nodes with the same label is 0. Formula (12) also defines two Gaussian kernels over different feature spaces: the first considers both the pixel position, denoted p, and the pixel gray value, denoted I, while the second considers only the spatial position p. Hyperparameters w_{1} and w_{2} control the weights of the two Gaussian kernels. Hyperparameters ${\sigma}_{\alpha}$, ${\sigma}_{\beta}$ and ${\sigma}_{\gamma}$ control the scale of the position and color information in each Gaussian kernel. In effect, the first Gaussian kernel encourages pixels with similar color and position to receive the same category label, whereas the second Gaussian kernel considers only the spatial correlation between pixels.
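The behavior of the two Gaussian kernels can be explored with a short sketch of the pairwise term for a single pixel pair. The function name and all hyperparameter values are placeholders chosen for illustration, not values reported in the paper, and the kernels are assumed to carry the usual negative sign inside the exponential.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def pairwise_potential(label_i, label_j, p_i, p_j, I_i, I_j,
                       w1=1.0, w2=1.0,
                       sigma_alpha=3.0, sigma_beta=10.0, sigma_gamma=3.0):
    """Pairwise term for one pixel pair (placeholder hyperparameters).

    p_* are pixel positions, I_* are colour or gray-value vectors.
    """
    if label_i == label_j:  # mu = 0: equal labels incur no penalty
        return 0.0
    # First kernel: nearby pixels with similar appearance.
    appearance = w1 * math.exp(-sq_dist(p_i, p_j) / (2 * sigma_alpha ** 2)
                               - sq_dist(I_i, I_j) / (2 * sigma_beta ** 2))
    # Second kernel: spatial proximity only.
    smoothness = w2 * math.exp(-sq_dist(p_i, p_j) / (2 * sigma_gamma ** 2))
    return appearance + smoothness

# Two adjacent pixels given different labels.
similar = pairwise_potential(0, 1, (5, 5), (5, 6), (100,), (102,))
different = pairwise_potential(0, 1, (5, 5), (5, 6), (100,), (200,))
same_label = pairwise_potential(1, 1, (5, 5), (5, 6), (100,), (102,))
print(similar, different, same_label)
```

Giving different labels to two adjacent pixels costs more when their gray values are close than when they differ strongly, so label boundaries are pushed toward intensity edges.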
4 Case
This section presents two cases of using the proposed method. Case 1 is used for entrepreneur image segmentation and case 2 is used for medical image segmentation, which is an extension of human behavior research.
4.1 SSFCN Image Segmentation of Entrepreneurial Images
4.1.1 Data set: entrepreneur images
The data set used in this paper consists of entrepreneur images crawled from a government website. By defining different fields, 39 entrepreneurs representing various industries were selected. There are 600 original images in total. The original images are expanded to 1 950 using data enhancement methods (random rotation, cropping, random adjustment of image brightness and contrast, and random left-right flipping during training to generate more images), with 50 for each entrepreneur. Of these, 1 500 are used as the training set and 450 as the test set.
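A minimal sketch of this kind of data enhancement is given below for a gray image stored as nested lists. It covers only left-right flipping and brightness/contrast adjustment; the parameter ranges, the function names and the choice to keep the originals inside the expanded set are assumptions, since the paper does not give these details.

```python
import random

def flip_left_right(image):
    """Mirror each row of the image."""
    return [row[::-1] for row in image]

def adjust_brightness_contrast(image, brightness=0.0, contrast=1.0):
    """Scale around mid-gray, shift, and clip to the valid range [0, 255]."""
    return [[min(255, max(0, round(contrast * (px - 128) + 128 + brightness)))
             for px in row] for row in image]

def augment(image, rng):
    """Apply a random flip and a random brightness/contrast change (assumed ranges)."""
    out = flip_left_right(image) if rng.random() < 0.5 else image
    return adjust_brightness_contrast(out,
                                      brightness=rng.uniform(-20, 20),
                                      contrast=rng.uniform(0.8, 1.2))

def expand(images, target, rng):
    """Keep the originals and add augmented copies until `target` images exist."""
    result = list(images)
    while len(result) < target:
        result.append(augment(rng.choice(images), rng))
    return result

rng = random.Random(0)  # fixed seed so the sketch is reproducible
originals = [[[10, 120, 250], [30, 135, 245]]]   # one hypothetical 2x3 gray image
per_entrepreneur = expand(originals, target=50, rng=rng)
print(len(per_entrepreneur), 39 * len(per_entrepreneur))
```

With 50 images for each of the 39 entrepreneurs, the expanded set has 39 × 50 = 1 950 images, which matches the total stated above. For segmentation, any geometric transformation such as flipping has to be applied to the label map as well.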
4.1.2 Image segmentation based on SSFCN
The fully convolutional network is trained on the entrepreneur data set. The structure of the fully convolutional network is shown in Fig. 3.
Fig. 3 Full convolution neural network structure 
As shown in Fig. 3, the fully convolutional network is composed of a feature extraction path (left side) and an upsampling path (right side). The feature extraction path is composed of three feature extraction modules. Each feature extraction module contains three depthwise separable skip-connected convolutions with different convolution kernel sizes (3×3, 5×5, 7×7). To strengthen the feature extraction capability of the network, the number of convolution kernels in the three feature extraction modules is set to 64, 128 and 256, respectively. Note that the output of the merge layer in each feature extraction module is saved as an intermediate result, which is used again in the upsampling path. In the upsampling path, each upsampling module consists of an ordinary convolution layer with a kernel size of 3×3 (stride=1) and a transposed convolution layer with a kernel size of 2×2 (stride=2). After each upsampling module, the feature map is enlarged to twice its previous size. Before entering the next upsampling module, the output feature map of each upsampling module is fused with the corresponding intermediate result saved in the feature extraction path (added pixel by pixel) to generate a new feature map. For example, the output of the first upsampling module is fused with the output of the third feature extraction module, and the new feature map generated by the fusion is used as the input of the second upsampling module. At the last layer of the network, a convolution layer with a kernel size of 1×1 maps the 64-dimensional feature vector to the required number of classes (here there are two classes, so the number of convolution kernels is set to 2).
In this case, the image segmentation network proposed in this paper is used for semantic segmentation of the input image, and the segmentation results show the contour features of the persons.
The segmentation model proposed in this paper not only supports the study of human behavior but also performs well on image segmentation in the medical field, which suggests that the method has some generality.
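Two properties of this structure can be checked with simple arithmetic: the doubling of the feature map in each upsampling module, and the parameter saving of depthwise separable convolutions. The sketch below uses the standard output-size formulas; the padding of 1 for the 3×3 convolution, the 28×28 starting size, the three upsampling steps and the equal input and output channel counts are assumptions for illustration, and bias terms are ignored.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Output side length of an ordinary convolution."""
    return (n + 2 * padding - kernel) // stride + 1

def transposed_conv_out(n, kernel, stride=1, padding=0):
    """Output side length of a transposed convolution."""
    return (n - 1) * stride - 2 * padding + kernel

def standard_conv_params(k, c_in, c_out):
    """Weights of a k x k convolution from c_in to c_out channels (no bias)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """k x k depthwise filter per input channel plus a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

# Upsampling module: 3x3 convolution (stride 1, assumed padding 1), then a
# 2x2 transposed convolution (stride 2). Start from a hypothetical 28x28 map.
size = 28
sizes = []
for _ in range(3):
    size = conv_out(size, kernel=3, stride=1, padding=1)
    size = transposed_conv_out(size, kernel=2, stride=2)
    sizes.append(size)
print(sizes)

# Parameter counts for the three kernel sizes with 64 input and output channels.
for k in (3, 5, 7):
    print(k, standard_conv_params(k, 64, 64), depthwise_separable_params(k, 64, 64))
```

Under these assumptions the 3×3 convolution preserves the spatial size and the transposed convolution doubles it, while the separable form needs far fewer weights than a standard convolution, with the gap widening as the kernel grows.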
4.2 Segmentation of Images in the Medical Field
The dataset used in this case is the Kaggle lung CT dataset. This dataset is a set of two-dimensional CT scans of the lungs of cancer patients and the corresponding manually calibrated segmentation maps. The CT scans and the corresponding segmentation maps are shown in Fig. 4 (a) and (b), respectively. There are 58 original images, each of size 512×512 pixels. To make better use of the dataset, we applied data enhancement to the 58 original images and obtained 500 images. We divided the enhanced dataset into two parts: 80% of the images (400 images) form the training set and 20% (100 images) form the validation set. For the test set, we used the 58 images of the original dataset. The results indicate that our segmentation model also segments medical images well, which suggests some generalization ability.
Fig. 4 Kaggle lung CT lung scans 
4.3 Comparison of Segmentation Effects
The comparison of segmentation performance (accuracy) between the method proposed in this article and current mainstream methods is shown in Table 1. The current mainstream methods include: IDSIA^{[33]} (inverse distance spatial interpolation algorithm), SegNet^{[20]} (a deep convolutional encoder-decoder architecture for image segmentation), U-Net^{[34]} (convolutional networks for biomedical image segmentation), and DeepLabV3^{[35]} (semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs).
Table 1 Comparison of segmentation accuracy (unit: %)
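The text does not define how segmentation accuracy is computed. One common choice, assumed here purely for illustration, is pixel accuracy: the percentage of pixels whose predicted label equals the manually calibrated label. The function name and the 2×2 label maps are hypothetical.

```python
def pixel_accuracy(pred, target):
    """Percentage of pixels whose predicted label matches the reference label."""
    total = 0
    correct = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += 1
            correct += int(p == t)
    return 100.0 * correct / total

# Hypothetical 2x2 label maps: three of the four pixels agree.
pred = [[0, 1], [1, 1]]
target = [[0, 1], [0, 1]]
accuracy = pixel_accuracy(pred, target)
print(accuracy)
```

Pixel accuracy can look high when one class dominates the image, so region-overlap measures are often reported alongside it in segmentation studies.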
5 Conclusion
To study human behavior from image data that describe it, this paper proposes an image semantic segmentation method for human behavior research. Specifically, the following work and innovations are presented.
1) An end-to-end convolutional neural network architecture (Separator Stride Fully Convolution Network-Dense Conditional Random Field, SSFCN-DCRF) is proposed, which consists of a depthwise separable skip-connected fully convolutional network SSFCN and a conditional random field network DCRF.
2) Skip-connected convolution is used to classify each pixel in an image, and a semantic segmentation method for images based on convolutional neural networks is proposed.
3) A conditional random field network (Dense Conditional Random Field, DCRF) is used to improve the segmentation of human behavior images, and linear and nonlinear modeling of conditional-random-field-based image semantic segmentation are proposed.
4) Fully convolutional network image segmentation is implemented for entrepreneur images and for images in the medical field.
However, the experimental part of our research has some limitations. For the entrepreneur images, this paper only reports the segmentation results of the fully convolutional network and has not yet carried out a comparative analysis of the segmentation effect. In the future, we will continue to collect entrepreneur image data, conduct a more comprehensive comparative analysis of image segmentation effects, and further expand into other areas.
Fig. 1 The process of extracting different levels of feature information with convolution kernels of different sizes  
Fig. 2 The process of flattening the feature map output by the convolution layer into a one-dimensional vector
Fig. 3 Full convolution neural network structure  
Fig. 4 Lung CT scans from Kaggle