A Fault Diagnosis Model for Complex Industrial Process Based on Improved TCN and 1D CNN

Fast and accurate fault diagnosis of strongly coupled, time-varying, multivariable complex industrial processes remains a challenging problem. We propose an industrial fault diagnosis model built on the temporal convolutional network (TCN) and the one-dimensional convolutional neural network (1DCNN). We add a batch normalization layer before the TCN layer and replace the TCN activation function, originally ReLU, with LeakyReLU. To extract local correlations of features, a 1D convolution layer is added after the TCN layer, followed by a multi-head self-attention mechanism before the fully connected layer to enhance the model's diagnostic ability. The extended Tennessee Eastman Process (TEP) dataset is used as the benchmark to evaluate the performance of our model. The experimental results show high fault recognition accuracy and better generalization performance, which supports the model's effectiveness. Additionally, applying the model to the diesel engine failure dataset from our partner's project validates its effectiveness in industrial scenarios.


Introduction
With the development of Industry 4.0 and intelligent manufacturing, industrial equipment has become increasingly integrated, complex, and intelligent; along with the rapid growth in the quantity of data, the ways in which equipment failures manifest are becoming more complex. The failure of complex precision industrial equipment often causes immense losses, so accurate fault diagnosis is increasingly becoming a research priority.
Commonly used fault diagnosis methods are mainly divided into model-based, knowledge-based, and data-driven methods [1]. Model-based methods establish mathematical models of the equipment, such as state space equations [2,3], to study the dynamic parameters and responses of the equipment in normal and faulty states. Knowledge-based reasoning methods are used when prior knowledge of equipment failures is available; by combining practical experience, system principles and historical fault information, these methods can infer the cause of a failure, for example with fault trees based on Bayesian networks [4,5]. Data-driven methods extract features from the collected historical fault data of the equipment to diagnose faults, for example with the K-means clustering algorithm [6,7] or the principal component analysis (PCA) algorithm [8,9]. Model-based methods require professional knowledge of the relevant equipment to establish models, yet the resulting model is complex, has poor adaptability and reliability, and is prone to false positives and false negatives. Fault diagnosis based on knowledge reasoning relies heavily on prior knowledge of equipment faults, requires a great deal of practical experience to identify faults, and often cannot identify unknown faults. With the continuous breakthroughs of deep learning in image and natural language processing [10][11][12][13][14][15], applying deep learning to fault diagnosis to improve its efficiency and accuracy has become a research hotspot. In the literature, recurrent neural networks (RNN) [16] such as the Long Short-Term Memory (LSTM) [17] network and the Gated Recurrent Unit (GRU) [18] network are the deep learning networks commonly used in fault diagnosis. However, deep RNN models face the problems of vanishing and exploding gradients [19][20][21], and because each time step depends on the previous one, RNN models parallelize poorly.
As CNN models have recently attracted attention in time series processing, Wu et al [22] first introduced a deep convolutional neural network (CNN) to multivariate time series fault diagnosis, arranging the data into a [variate, time step] matrix to apply two-dimensional convolution, and achieved 88.2% accuracy on the test set; Song et al [23] used a multi-scale two-dimensional convolution to identify chemical process faults and achieved 88.54% accuracy on the test set; Deng et al [24] introduced a genetic algorithm to reorder features before the CNN and achieved 89.72% accuracy on the test set. These CNN models use two-dimensional convolution and pooling layers to identify faults, which may lose accuracy and still leave room for improvement. The temporal convolutional network (TCN) proposed by Bai et al [25] uses dilated convolution to better perceive long sequences and exceeds LSTM and GRU in time-series prediction tasks. However, TCN can only process one-dimensional time-series data and cannot process multi-dimensional complex industrial data.
To accurately identify faults in complex multivariate industrial processes that are strongly coupled and time-varying, a fault identification model based on improved TCN and one-dimensional convolution is proposed. In the model, the activation function of TCN is replaced by the LeakyReLU function, an extra one-dimensional convolution layer is introduced in the feature dimension to extract local correlation features, and a multi-head self-attention layer is introduced before the fully connected layer to establish a TCN-1DCNN-Attention (TCA) model. Finally, the effectiveness and generalization of this model are verified by comparing its fault recognition rate with those of traditional RNN models such as LSTM and GRU and of the Transformer model.
The paper is organized as follows: Section 1 introduces the preliminaries, including TCN and attention. Section 2 gives a comprehensive description of our proposed model. Section 3 presents the experiments and analysis, and Section 4 is the conclusion.

Temporal Convolutional Neural Network
TCN builds on the Time-Delay Neural Network (TDNN) proposed by Waibel et al [26], which has been widely used in time series modeling [27][28][29]. TDNN is composed of one-dimensional fully convolutional layers and causal convolution, but achieving an effective perception of long sequence data requires an extremely deep network or a large convolution kernel. To solve this problem, TCN adds dilated convolutions, which achieve an exponentially growing receptive field by inserting zero taps between the taps of the convolution kernel.
Figure 1 shows the principle of the dilated causal convolution of TCN. By using a 3-layer one-dimensional causal convolution with dilation factors d = 1, 2, 4 and filter size k = 3, every tap of the output layer achieves a receptive field of 15 input samples.
Specifically, for a given 1D time-series input $x \in \mathbb{R}^n$ and convolution kernel $f: \{0, \ldots, k-1\} \to \mathbb{R}$, a single dilated convolution operation on sequence element $s$ is:

$$F(s) = (x *_d f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - d \cdot i} \tag{1}$$

where $d$ is the dilation factor, $k$ is the convolution kernel size, $f(i)$ is the $i$-th tap of the convolution kernel, and $x_{s - d \cdot i}$ is the element of the sequence corresponding to that tap of the causal convolution kernel, with a sample interval of $d$ between taps. The dilated convolution therefore adds a fixed step interval between adjacent convolution kernel taps. In particular, when the dilation factor is $d = 1$, the dilated convolution is equivalent to an ordinary convolution. In addition, to ensure the effective transfer of temporal information, TCN introduces residual connections. The causal convolution ensures that the convolution is carried out from the past to the future, so that future data is never introduced into the historical data; the residual connection applies a 1×1 convolution to the input data and adds the result to the output of the dilated convolutions to realize cross-layer information transfer [30]. As a result, more historical details are retained, which improves model accuracy. Figure 2 shows the residual connection structure of TCN.
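The dilated causal convolution described above can be sketched in plain Python. This is a minimal illustration rather than the implementation used in the paper: the function name `dilated_causal_conv`, the all-ones kernel and the zero padding for indices before the start of the sequence are assumptions made for the example. Stacking three layers with k = 3 and d = 1, 2, 4 and feeding a single impulse shows how far one input sample propagates, which corresponds to the receptive field discussed for Figure 1.

```python
def dilated_causal_conv(x, f, d):
    """Compute F(s) = sum_i f[i] * x[s - d*i], treating x[t] as 0 for t < 0."""
    k = len(f)
    out = []
    for s in range(len(x)):
        acc = 0.0
        for i in range(k):
            t = s - d * i  # look back d*i steps; never ahead of s (causal)
            if t >= 0:
                acc += f[i] * x[t]
        out.append(acc)
    return out


# Three stacked layers with k = 3 and d = 1, 2, 4.
kernel = [1.0, 1.0, 1.0]       # hypothetical all-ones kernel, for illustration
impulse = [1.0] + [0.0] * 19   # a single non-zero input at time step 0
y = impulse
for d in (1, 2, 4):
    y = dilated_causal_conv(y, kernel, d)

# Output time steps that are still influenced by the input at step 0.
influenced = [s for s, v in enumerate(y) if v != 0.0]
print(len(influenced))
```

The number of influenced output steps equals the number of input steps that a single output tap can see, so the printed count can be compared directly with the receptive field stated in the text.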

Advantages of TCN are as follows:
Strong parallelism: Convolutional neural networks adopt the same convolution kernel in each layer, and long input sequences can be processed in parallel as a whole.
Flexible receptive field: The receptive field can be flexibly changed by stacking dilated convolution layers, increasing the dilation coefficient, and enlarging the convolution kernel.
Gradient stability: Unlike an RNN, which shares parameters across time steps, different layers have their own parameters and gradients, so gradients are less prone to exploding or vanishing.
Low memory requirement: No memory unit is required except the convolution kernels, and convolution kernels are shared among the same layers, contributing to low memory requirement.
Variable input length: Input data is received by sliding one-dimensional convolution, and zeros can be padded automatically when the input sequence is too short, so data of any length can be received.
In an industrial process, faults occur at random, so the length of a fault sequence is not fixed, and TCN can flexibly process such time series fault data of different lengths. Owing to its parallelism and lower memory requirements, training the TCN model requires fewer resources, which means less training time. In addition, the residual connection of TCN can better convey historical information of long time series data, allowing more features to be learned.

Multi-Head Self-Attention Mechanism
As the number of input features increases, a traditional convolutional neural network requires a very deep network to obtain global feature correlation, which significantly increases model size, while the self-attention mechanism can obtain the global feature correlation directly and assign a higher weight to important information. Self-attention mechanisms also require fewer parameters than other neural networks.
The essence of the self-attention mechanism is to map the input matrix $X = \{x_i\}$, $i \in (1, 2, 3, \ldots, T)$, into a query matrix $Q = \{q_i\}$, a key matrix $K = \{k_i\}$, and a value matrix $V = \{v_i\}$, $i \in (1, 2, 3, \ldots, T)$, through the matrices $W^Q$, $W^K$, $W^V$ [31]. By multiplying the mapped matrix $Q$ with $K^T$ and applying softmax normalization, we obtain the weight coefficient corresponding to each $v_i$; the $v_i$ are then weighted and summed to give the attention result, as shown in formula (2):

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V \tag{2}$$

where the scaling by $\sqrt{d_k}$, with $d_k$ the dimension of the key vectors, prevents the gradient from vanishing when the result of the matrix multiplication is too large. In practice, single-head self-attention often pays too much attention to itself and omits detailed information, so the multi-head self-attention mechanism [13] was proposed to solve this issue. Compared with the single-head version, multi-head self-attention can extract information at different levels, effectively improving the diagnostic performance of the model.
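Multi-head self-attention uses multiple sets of self-attention to process the input sequence, then concatenates the results and applies a linear transformation to produce the output. Taking the $i$-th head as an example, the calculation process is shown in formula (3):

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(h_1, \ldots, h_n) W^O \tag{3}$$

where $n$ is the number of heads, $W^O$ is the output transformation matrix, $h_i$ is the result of the $i$-th head self-attention, and

$$h_i = \mathrm{Attention}(Q W_i^Q, K W_i^K, V W_i^V) \tag{4}$$

In formula (4), $W_i^Q$, $W_i^K$ and $W_i^V$ are the matrices that map the corresponding $Q$, $K$, and $V$ matrices into the query matrix $Q_i$, key matrix $K_i$ and value matrix $V_i$ of the $i$-th head.

Fig. 2 The residual connection of TCN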
In our model, the self-attention layer is connected to the one-dimensional convolution layer to extract important features. By mapping the features into the corresponding query, key, and value matrices, calculating the correlation weight between each pair of features, and performing a weighted summation to obtain the final weighted time-series signal, we obtain the important features used to recognize different faults.
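The attention computation described in this section can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the layer used in the paper: the helper names, the toy input and the projection matrices are hypothetical, and the input projection and the per-head projection are folded into a single matrix per head for brevity.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def multi_head_self_attention(x, w_q, w_k, w_v, w_o):
    """Run one attention per head, concatenate the heads, then apply W^O."""
    heads = []
    for wq_i, wk_i, wv_i in zip(w_q, w_k, w_v):
        h_i, _ = attention(matmul(x, wq_i), matmul(x, wk_i), matmul(x, wv_i))
        heads.append(h_i)
    concat = [sum((h[t] for h in heads), []) for t in range(len(x))]
    return matmul(concat, w_o)


# Hypothetical toy input: T = 3 time steps, 4 features, 2 heads of width 2.
x = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
pick_first = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
pick_last = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
projections = [pick_first, pick_last]  # illustrative, not learned weights
out = multi_head_self_attention(x, projections, projections, projections, identity)
print(len(out), len(out[0]))
```

Each row of the attention weights sums to one, so every output row is a weighted average of the value rows; in a trained layer the projection matrices would be learned rather than fixed as in this example.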

Improvement of TCN
The original TCN uses the ReLU activation function, whose output is zero when the input is negative, which may lead to neuron death. Therefore, we replace ReLU with the LeakyReLU activation function, which gives a small gradient α when the input is negative; this can effectively avoid neuron death and at the same time accelerate the convergence of the model, as shown in formulas (5) and (6):

$$\mathrm{ReLU}(x) = \max(0, x) \tag{5}$$

$$\mathrm{LeakyReLU}(x) = \begin{cases} x, & x \ge 0 \\ \alpha x, & x < 0 \end{cases} \tag{6}$$
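The difference between the two activation functions for negative inputs can be illustrated with a short Python sketch. The slope value `alpha = 0.01` is an assumption chosen for illustration; the value of α used in the model is not stated here.

```python
def relu(x):
    """ReLU: zero for negative inputs."""
    return max(0.0, x)


def leaky_relu(x, alpha=0.01):
    """LeakyReLU: a small slope alpha for negative inputs (alpha is assumed)."""
    return x if x >= 0 else alpha * x


def relu_grad(x):
    """Gradient of ReLU: zero on the negative side, so the unit stops learning."""
    return 1.0 if x > 0 else 0.0


def leaky_relu_grad(x, alpha=0.01):
    """Gradient of LeakyReLU: alpha on the negative side, never exactly zero."""
    return 1.0 if x > 0 else alpha


for value in (-2.0, 0.5):
    print(value, relu(value), leaky_relu(value), relu_grad(value), leaky_relu_grad(value))
```

For a negative input the ReLU gradient is zero, which is the neuron death case, while the LeakyReLU gradient stays at α, so the weights feeding that unit can still be updated.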

Model Structure
In this paper, the one-dimensional convolution and self-attention mechanism are introduced after TCN for improvement, and the network structure is shown in Fig. 3. As we can see, our model mainly consists of TCN layers, a one-dimensional convolution layer and a self-attention layer.
To start with, we apply batch normalization to each feature to accelerate model fitting and then apply the four-layer TCN, whose activation function has been replaced by the LeakyReLU function, to each feature. The second part is the 1DCNN layer, which applies two channels of 1DCNN to extract the local correlations of different features at each time step. The third part is the self-attention layer, which further extracts key features from the extracted local correlations and passes the result to the final fully connected layer for one-hot classification. Weight normalization is used in the TCN layer, self-attention layer and fully connected layer to speed up model fitting.
Specifically, take the input data format [52, 200] as an example, as shown in Fig. 4, where 52 is the number of features and 200 is the sampling length of the time-series data. Our model first uses four channels of TCN to extract temporal correlations for each feature, which changes the data format to [208, 200], and then sequentially applies one-dimensional convolution and multi-head self-attention to the feature data of each sample. Finally, a fully connected network classifies the result and outputs the one-hot vectors of the 21 operating states.

Parameter Settings
For the TCN layer, we apply the TCN to each of the 52 variables. The TCN has four layers with dilation coefficients of 1, 2, 4 and 8, respectively; a 4-channel convolution is used in the first layer and a 1-channel convolution in the second, third and fourth layers. The kernel size is set to 9, and the stride is 1. The structure of the TCN layer is shown in Fig. 5.
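As a rough check on these settings, the receptive field of the stacked dilated convolutions can be computed and compared with the 200-sample input window. The sketch below is an illustration under assumptions: the function `receptive_field` is hypothetical, and the number of convolutions per layer is not stated in the text, so both one convolution per layer and two per layer (as in a typical TCN residual block) are shown.

```python
def receptive_field(kernel_size, dilations, convs_per_layer=1):
    """Receptive field of stacked dilated causal convolutions with stride 1."""
    return 1 + convs_per_layer * (kernel_size - 1) * sum(dilations)


WINDOW = 200  # sampling length of each input example

for convs in (1, 2):  # convolutions per layer: an assumption, not from the paper
    rf = receptive_field(9, (1, 2, 4, 8), convs)
    print(convs, rf, rf >= WINDOW)
```

Under these assumptions the receptive field is 121 time steps with one convolution per layer and 241 with two, so only the second configuration lets a single output tap cover the whole 200-sample window.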
The one-dimensional convolution layer uses a two-channel convolution on each time step. The kernel size is 8, and the stride is 1. The 512 hidden units of the multi-head self-attention layer are divided into 4 heads, followed by a fully connected layer with 21 output dimensions, which is used to classify the 21 operating states of the device. We set the respect dropout of

Lab Environment and Training Process
The hardware environment is as follows: the CPU is an Intel i7-4710MQ, the GPU is an NVIDIA GeForce GTX 860M with 2 GB of GDDR5 memory, and the RAM is DDR3L.
The training steps of our model are as follows:
1) Normalize the data.
2) Shuffle the data set randomly, generate the training set and validation set from the training subset at a ratio of 8:2, and convert the classification labels into one-hot vectors.
3) Initialize each layer of the model, set the optimization algorithm to Adam with an initial learning rate of 0.001 that is lowered by a reduce-on-plateau scheduler, set the maximum number of training epochs to 50, and use the cross-entropy loss function.
4) Save the model state of each epoch, and take the state with the smallest loss on the validation set as the best state.
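The data handling in steps 2) and 4) can be sketched in plain Python. This is a minimal sketch with hypothetical helper names and toy data; the fixed random seed and the loss values are assumptions for illustration, and the optimizer and scheduler of step 3) are not modeled here.

```python
import random


def one_hot(label, num_classes):
    """Convert an integer class label into a one-hot vector."""
    vec = [0] * num_classes
    vec[label] = 1
    return vec


def split_train_val(examples, train_fraction=0.8, seed=0):
    """Shuffle the examples and split them into training and validation sets."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed is an assumption
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def best_epoch(val_losses):
    """Index of the epoch with the smallest validation loss."""
    return min(range(len(val_losses)), key=val_losses.__getitem__)


# Hypothetical toy data: 100 example ids with labels cycling over 21 states.
examples = [(i, one_hot(i % 21, 21)) for i in range(100)]
train, val = split_train_val(examples)

# Hypothetical per-epoch validation losses, for illustration only.
losses = [0.9, 0.4, 0.25, 0.3, 0.27]
print(len(train), len(val), best_epoch(losses))
```

Selecting the checkpoint by validation loss rather than by the last epoch means that a late rise in validation loss, as in the toy values above, does not affect which model state is kept.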

Experiments and Discussion
To illustrate the model's diagnostic ability on multi-source time series faults, the model's performance is evaluated on the Tennessee Eastman Process (TEP) dataset.

Tennessee Eastman Process
TEP is a simulated multivariate time series dataset created by Eastman Chemical Company [32]. It has nonlinear characteristics such as strong coupling and time variation [33], and is a widely used benchmark for evaluating a model's ability to diagnose faults in complex industrial processes. The simulation process was introduced by Downs et al [34] and optimized by Ricker et al [35]. It consists of five components (reactor, condenser, compressor, stripper, and separator) and provides 52 features, including 41 process measurements and 11 manipulated variables. The optimized TEP proposed by Bathelt et al [36] is shown in Fig. 6. The process has 21 operating states, including one normal state and twenty fault states, which are described in Table 1. The standard TEP dataset can only generate limited examples for each fault, which is insufficient for training deep learning models. Therefore, we use the extended TEP dataset proposed by Rieth et al [37], who use random seeds to generate more examples for each state. The running state was sampled every three minutes. Each training example was sampled for 25 hours, i.e., 25 * 60/3 = 500 samples, and the fault state starts after one hour; each test example was sampled for 48 hours, i.e., 48 * 60/3 = 960 samples, and the fault state starts after eight hours.
In this paper, we use the training subset of the raw dataset to generate the training set and validation set at a ratio of 8:2, and the test set is generated from the whole test subset. The fault sampling length used in our dataset is 200; that is, sampling points 21-220 of the training subset are used in the training set and the validation set, and sampling points 161-360 of the test subset are used in the test set.
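The sample counts and window boundaries above follow from the three-minute sampling interval, and the arithmetic can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes 1-based sampling point indices with the first faulty sample immediately after the fault-free period.

```python
SAMPLE_INTERVAL_MIN = 3  # one sample every three minutes
WINDOW = 200             # fault sampling length used in the dataset


def samples_in(hours):
    """Number of samples recorded in the given number of hours."""
    return hours * 60 // SAMPLE_INTERVAL_MIN


def fault_window(onset_hours, window=WINDOW):
    """1-based inclusive range of the first `window` samples after fault onset."""
    first = samples_in(onset_hours) + 1
    return first, first + window - 1


train_len, test_len = samples_in(25), samples_in(48)
train_win, test_win = fault_window(1), fault_window(8)
print(train_len, train_win, test_len, test_win)

# Slicing a 1-based inclusive range [a, b] out of a 0-based Python list.
run = list(range(1, train_len + 1))  # stand-in for one training run
segment = run[train_win[0] - 1:train_win[1]]
print(len(segment), segment[0], segment[-1])
```

With these conventions the computed windows match the ranges quoted in the text: sampling points 21-220 for the training subset and 161-360 for the test subset.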

The impact of different activation functions
To demonstrate the effect of different activation functions in the TCN layer on our model, we conduct an experiment comparing the model's performance when using the LeakyReLU or ReLU activation function in the TCN layer. The experimental results in Table 2 show the salient advantages of the TCA model using the LeakyReLU activation function over the one using the ReLU activation function.
The training process of the TCA model using the LeakyReLU function and the ReLU function is shown in Fig. 7. Compared with the ReLU version, the model using the LeakyReLU function converges more stably and reaches higher accuracy.

The impact of different modules in our model
In this paper, the model is improved by introducing 1DCNN and self-attention on the basis of TCN. To illustrate the validity of the different modules in the model, the 1DCNN and attention modules are removed in turn to obtain four different models: TCN, TCN+Attention, TCN+1DCNN, and TCA, in which the parameters of the 1DCNN layer and attention layer remain the same as those in TCA. All models take the epoch with the smallest loss on the validation set as the best epoch.
As shown in Table 3, the addition of 1DCNN and self-attention can enhance the ability to detect failures, improving recognition accuracy, and reducing training loss.Still, added modules will also increase epoch time.
The TCA model with both 1DCNN and self-attention has the longest training time per epoch, but the highest accuracy and smallest loss.
Specifically, take the results of the above model on

The impact of different convolution kernels in the 1DCNN layer
To explore the performance of the TCA model with different convolution kernels in the 1DCNN layer, an experiment is conducted. The kernel size of the 1DCNN layer is varied from 3 to 16, and the TCN and attention parts remain unchanged. The performance of the different kernels is shown in Table 4.
From Table 4, as the kernel size increases, the training time per epoch increases as well, because a single convolution involves more operations; the training time increases by about 29.5 s per epoch from kernel size 3 to 16. The performance of models with different kernels varies. The model with kernel size 13 has the worst performance on the validation set, which is inferior to the TCN+Attention model by 0.5%. The commonly used kernels of size 3 and 5 can improve the model's performance, with accuracies of 95.07% and 94.91%, respectively, but they do not reach 96% and are inferior to the models with kernel sizes of 6, 8 and 16. Among all these models, the model with kernel size 8 achieves the best accuracy of 97.14% and the best loss of 0.063.

Through the above experiments, we have explored the effects of replacing the activation function and adding the 1DCNN or self-attention layer on the performance of our model. Further, the performance of the model using different convolution kernels is studied. Results in Table 2 show that replacing the activation function in TCN with LeakyReLU can effectively improve the model performance. Results in Table 3 show the effectiveness of the additional 1DCNN layer and attention layer, and the model with both the 1DCNN layer and the attention layer shows the best performance. Results in Table 4 report the performance of the model with different kernel sizes in the 1DCNN layer. In most cases the 1DCNN layer is effective, except with a kernel size of 13, where the accuracy of 92.50% is even lower than that of the TCN+Attention model (93.00%) without the 1DCNN layer. The performance of the other models is improved by 0.54%-4.14% compared with the TCN+Attention model. As a result, the 1DCNN layer in our model proves its effectiveness.

Comparison of TCA model and other neural networks
Several commonly used neural networks, including recurrent neural networks such as LSTM and GRU as well as the Transformer, are selected for comparison with our model. The LSTM and GRU models have 2 or 4 layers, and the 2-layer or 4-layer model is marked with "2L" or "4L", respectively.
Since a bidirectional RNN model might leak future samples, the RNN models here are unidirectional. The Transformer model stacks 6 encoders. All these RNN and Transformer models have 52 dimensions in the input layer and 128 dimensions in the hidden layers. Additional attention layers are added to the RNN models separately, marked with "+A" at the end, and the hidden units of these attention layers are set to 512, the same as in the TCA model. To speed up model fitting, the dropout of the above models is set to 0.4.
The performance of all models is shown in Table 5.
In Table 5, we can see that RNN models have natural advantages in processing time-series data; they can achieve good results with a small number of parameters and a short training time. Stacking more RNN layers does not significantly enhance the models' performance, but directly increases the training time. Adding attention can effectively improve model performance: the performance of RNN models with an attention layer is improved by 1.97%-4.22%. The GRU2L+A model achieves the best accuracy of 97.46% and the smallest loss of 0.0578 on the validation set. The Transformer model composed of encoders achieves an accuracy of 92.80% on the validation set, which is lower than that of the RNN models with attention, indicating that the RNN network and attention can complement each other.
The TCA model proposed in this paper achieves an accuracy of 97.14% on the validation set, which is slightly inferior to that of the GRU2L+A model, by 0.32%, but surpasses those of the Transformer model and the other RNN models. However, due to the extensive use of convolution, the average epoch time of the TCA model reaches 162.8 s, which is 2.78 times that of the GRU2L+A model.

Comparison of generalization ability of TCA and other models on the test set
An extra experiment is conducted on the test set to test the generalization ability of the above models, and the widely used F1 score [38,39] is used as the evaluation indicator. Formula (7) shows the method for calculating the F1 score. The F1 score, accuracy, and loss of the above models on the test set are shown in Table 6.
As shown in Table 6, our TCA model achieves the best accuracy (94.27%), loss (0.3319) and average F1 score (0.9405) on the test set, and achieves the best F1 score on 16 of the 21 faults. On the most difficult faults 3, 9 and 15, the respective F1 scores of TCA reach 0.7528, 0.9007, and 0.7697, which are the best F1 scores for faults 9 and 15. The confusion matrix in Fig. 9 shows in detail that our model can accurately identify most faults, especially faults 9 and 15, whose accuracy reaches 93% and 71%, respectively. Although the GRU2L+A model performs better than the TCA model on the validation set, its generalization ability on the test set is slightly worse than that of the TCA model. Across all 21 faults, the F1 score of the TCA model is not inferior to that of the GRU2L+A model for 20 faults and is superior for 9 faults, but it is substantially inferior for fault 5. The confusion matrix shows the identification result for fault 5 clearly: although our model identifies fault 5 accurately, the misidentification of large numbers of fault 3 samples as fault 5 results in a low F1 score for it. Meanwhile, the matrix shows that the ability of our model to distinguish between fault 0 (normal state) and fault 15 still needs to be strengthened; about 27% of fault 0 states are recognized as fault 15 and 14% of fault 15 states as fault 0, which is the main reason for the accuracy loss; the 29% of fault 5 states recognized as fault 3 is another cause. However, our model still achieves the best F1 score of 0.6614 on fault 0, indicating the most robust ability to discriminate between the normal state and fault conditions.

In addition, the model is tested on the diesel engine failure dataset of our partner, on whose project this paper relies. Failures such as reduced compressor efficiency, extended combustion duration and reduced fuel injection can lead to a slow decrease in the output power of a diesel engine and aggravate the wear of its parts, which reduces operational stability. We use the collected time-series sensor data as the model input. The results show that the model detects faults quickly and effectively compared with the manual detection method.

Conclusion
For complex multivariate industrial process faults that are strongly coupled and time-varying, a TCA model based on TCN together with 1DCNN and multi-attention is proposed, and we further improve the model by replacing the activation function of TCN. The introduction of 1DCNN can effectively extract the local correlations of multivariate data, and the following multi-head self-attention can automatically assign higher weights to important features. The experimental results validate its effectiveness.

Fig. 7 The training process of the model using the LeakyReLU function and the ReLU function

$$F1 = \frac{2 \times \mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \tag{7}$$

where $\mathrm{precision} = \frac{TP}{TP + FP}$ indicates the percentage of true positive samples among all samples predicted as positive, and $\mathrm{recall} = \frac{TP}{TP + FN}$ indicates the percentage of true positive samples among all actual positive samples.
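The per-class F1 score of formula (7) can be computed from a confusion matrix as in the following Python sketch. The function name and the 2-class counts are hypothetical and are not taken from the experiments; they only illustrate how a class that is always recognized can still receive a reduced F1 score when samples of another class are assigned to it, as discussed for fault 5.

```python
def per_class_scores(confusion, cls):
    """Precision, recall and F1 for one class.

    confusion[i][j] counts samples of true class i predicted as class j.
    """
    tp = confusion[cls][cls]
    fp = sum(row[cls] for i, row in enumerate(confusion) if i != cls)
    fn = sum(v for j, v in enumerate(confusion[cls]) if j != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts: class 1 is always recognized (recall = 1), but
# 40 samples of class 0 are also labelled as class 1, lowering precision.
confusion = [[60, 40],
             [0, 100]]
precision, recall, f1 = per_class_scores(confusion, 1)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Because F1 is the harmonic mean of precision and recall, it is pulled toward the smaller of the two, so false positives alone are enough to lower the score of an otherwise well-recognized class.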