Issue 
Wuhan Univ. J. Nat. Sci.
Volume 27, Number 6, December 2022



Page(s)  521  530  
DOI  https://doi.org/10.1051/wujns/2022276521  
Published online  10 January 2023 
CLC number: TP 391
SelfSupervised Time Series Classification Based on LSTM and Contrastive Transformer
School of Electronic and Information Engineering, Tongji University, Shanghai 201804, China
^{†} To whom correspondence should be addressed. Email: tjcadzhaoxd@tongji.edu.cn
Received:
25
September
2022
Time series data has attached extensive attention as multidomain data, but it is difficult to analyze due to its high dimension and few labels. Selfsupervised representation learning provides an effective way for processing such data. Considering the frequency domain features of the time series data itself and the contextual feature in the classification task, this paper proposes an unsupervised Long ShortTerm Memory (LSTM) and contrastive transformerbased time series representation model using contrastive learning. Firstly, transforming data with frequency domainbased augmentation increases the ability to represent features in the frequency domain. Secondly, the encoder module with three layers of LSTM and convolution maps the augmented data to the latent space and calculates the temporal loss with a contrastive transformer module and contextual loss. Finally, after selfsupervised training, the representation vector of the original data can be got from the pretrained encoder. Our model achieves satisfied performances on Human Activity Recognition (HAR) and sleepEDF reallife datasets.
Key words: selfsupervised learning / contrastive learning / time series classification
Biography: ZOU Yuanhao, male, Master candidate, research direction: cloud manufacturing resource combination. Email: 2133062@tongji.edu.cn
Supported by the National Key Research and Development Program of China (2019YFB1706401)
© Wuhan University 2022
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
0 Introduction
In real life, the application of time series data covers various aspects such as finance, manufacturing, weather forecasting, and biomedicine^{[1]}. To obtain feature in time series data, researchers in data mining have proposed a large number of algorithms, such as the Hidden Markov Model (HMM)^{[2]}, Linear Dynamical System (LDS)^{[3]}. However, with the continuous expansion of the application field, people's requirements for tasks such as time series data classification and prediction are constantly increasing and traditional algorithms may not meet the requirements.
In recent years, people have tried to use deep learning methods to perform data mining on time series data. However, time series data have high dimensionality and high cost in directly processing the data, so the use of representation learning to reduce the dimension of time series data has become an important direction of time series data research^{[4]}. Representation learning can reduce the dimensionality of time series data with few loss in features, thereby reducing the cost of processing the data^{[5]}. Another major feature of time series is the lack of labels, and time series data often exist in a form that is not easy for people to learn and recognize, which makes it difficult for traditional supervised deep learning methods to be applied to actual time series data classification^{[6]}.
Selfsupervised representation learning techniques have been widely used in the field of computer vision in recent years^{[7]}, which construct pseudolabel to help learn features. Doersch et al^{[8]} proposed a model which learned the hidden feature in photos by constructing a prediction task of different positions from the same photo. Gidaris et al^{[9] }obtained the latent space representation of the image by feeding the rotation of the same image at different angles into the model. After that, contrastive learning^{[10]} has become a hot direction in selfsupervised learning. Contrastive learning artificially constructs labels by performing different augmentation on the same data to obtain multiple different data with the same hidden feature. Using the encoder module, the augmented data is mapped into latent space and the representation vector is got. Augmented data from the same original data are considered positive samples, and the rest of the samples are considered negative samples. Then loss function is used to maximize the similarity of vectors in the positive sample and minimize the similarity of vectors in the negative samples. In this way, the encoder can represent the original data well by learning the hidden feature.
In this paper, an improved model based on the time series representation learning framework via Temporal and Contextual Contrasting (TSTCC) model^{[11]} is proposed to classify time series data. Compared with the original model, this method can improve the model's ability to get the feature in the frequency domain of time series data and represent global feature in time series data. The model outperformed the original model on some test sets. Specifically, the innovations of this paper are as follows:
1) We adopt the Fast Fourier Transform (FFT) as the way of data augmentation, which improves the representation ability of the model for frequency domain features and increases the robustness of the model.
2) We design a multilayer Long ShortTerm Memory (LSTM) encoder module using composite convolution downsampling, which can obtain both local and global features on the augmented data.
3) We propose a contrastive transformer module for hidden feature extraction. This module increases the similarity of hidden features between augmented data in positive samples, thereby improving the representation ability of the model.
1 Related Work
Due to the excellent performance of selfsupervised representation learning methods in the unsupervised domain and transfer learning, more and more scholars have applied selfsupervised representation learning to time series data analysis in recent years. Sarkar et al^{[12]} proposed a selfsupervised learning model for electrocardiogram(ECG) data analysis by applying six different transformations to ECG data and using the transformation to create labels. By dividing the deep neural network into a set of gradient isolation modules, Löwe et al^{[13]} used the InfoNCE loss to calculate the loss inside the module, and the greedy training method was used between the modules to maximally preserve the feature of its inputs. The Contrastive Predictive Coding (CPC) model proposed by Oord et al^{[14]} uses the autoregressive model to predict future data with latent space data. By introducing probabilistic contrastive loss and negative sampling methods, the model has good results in speech, images, and other datasets. Franceschi et al^{[15]} proposed a dilated causal convolutionbased encoder and a triple loss function based on timebased negative sampling, which treated segments from different time series as pairs of negative samples and subsegments from the same sample as positive sample pairs. This model was mainly used to classify time series data with inconsistent lengths. The Temporal Neighborhood Coding (TNC) model proposed by Tonekaboni et al^{[16] }solved the problem of nonstationary time series classification. This model defined a time neighborhood. The samples in the neighborhood are considered positive samples, and the samples outside the domain are considered negative samples. To solve the problem that similar samples are located outside the time neighborhood caused by periodic time series, the author used positiveunlabeled learning to introduce a negative sample weight as a supplement to the loss function. The TSTCC model proposed by Eldele et al^{[11]} treated the data with two different augmentations as a pair of positive samples, mapped the augmented data to the latent space through the convolutional layer, and additionally learned the feature from latent space through the Transformer module. For further learning, the temporal loss function was calculated internally by the positive sample pair in a way similar to CPC^{[14]}. The InfoNCE loss was used to calculate the loss between the positive and negative samples as a contextual loss.
The common problem of the above methods is that the time series data is only represented from the features in the time domain, but not from the perspective of the frequency domain. What's more, TSTCC does not introduce more global features in the encoder stage, resulting in fewer global features obtained in the latent space, which affects the effect of the model.
2 Methods
The model proposed in this paper is an improvement to the TSTCC^{[11]} model, and its architecture is shown in Fig. 1. The data is first augmented in two different ways, then mapped to the latent space through encoder to get a representation vector. The representation vectors from the same sample then calculate the temporal loss through the contrastive Transformer module. The result of the Transformer encoder is used as the contextual feature of the representation vector from the encoder. The contextual features from the same sample are considered positive samples, and the other features in the same batch are considered negative samples. The contextual loss is calculated by using the InfoNCE loss function with positive samples and negative samples. After learning for several epochs, the encoder module can transfer data into a representation vector with hidden features.
Fig. 1 Architecture of our model 
2.1 FFT Data Augmentation
Data augmentation is a commonly used way to generate positive sample pairs in contrastive learning. However, at present, most of the enhancement methods for time series data use the augmentation method in the time domain and rarely in the frequency domain^{[17]}. In order to obtain the both hidden features in the frequency domain and time domain, an augmentation method based on FFT is adopted in this paper. And in order to highlight the hidden feature, the paper also combines the FFT augmented data with weak augmented data into a positive sample pair to facilitate the comparison between them. This augmentation method also enhances the robustness of the model.
This paper performs two different augmentation operations on the original data X∈$\mathbb{R}$^{dim}, X={x_{1}, x_{2},…, x_{n}}. We denote X^{F}as the augmented sample with FFT augmentation, and X^{W} as the augmented sample with weak augmentation. The FFT augmentation method used in this paper is to first perform a permutation operation on the data, that is, randomly split data with maximum segments M and shuffle them, and then perform a FFT on the data to transfer it to the frequency domain. Scaling and warping operations are performed to the frequency domain data, that is, randomly warp and scale the data in the frequency domain, and the corresponding warping function follows a beta(α, α) distribution. Then the data is converted back to the time domain. For weak augmentation, only random Gaussian noise is added to it. Figure 2 shows the difference between augmentation in the time domain and the frequency domain. The scaling ratio is 2 and the parameter α is 0.5. The data is about the accelerometer of the volunteer.
Fig. 2 Augmentation in time or frequency domain 
For time series, the characteristics of the sequence change over time, and the change in frequency in some tasks is so closely related to the task that working in the frequency domain is more useful than that in the time domain^{[5]}. The advantage of adding frequency domain augmentation is that the feature in the frequency domain is considered without destroying the hidden features of the data, which is beneficial to further classify data with highfrequency similarity. As can be seen from Fig. 2, the augmentation based only on the time domain has serious feature loss in the two time periods of 1020 ms and 4050 ms, while the augmentation based on the frequency domain well preserves the change trend of the original data although the values are different.
2.2 MultiLayer LSTM Encoder
The encoder module used in the TSTCC^{[11]} model is a 3block convolution architecture, but the representation vector obtained by a simple 3block convolution is difficult to contain the global feature of the time series. After referring to the downsampling method of Informer^{[18]}, this paper designs the encoder architecture as shown in Fig. 3. This module obtains local feature of time series data through convolution blocks and global feature through the LSTM module, and downsampling operation can reduce the processing time of data with less feature loss. Meanwhile, the LSTM module can well protect the time series logic of the data.
Fig. 3 Encoder with 3layers LSTM 
The function of the encoder module is to map the augmented highdimensional time series data to the latent space $\mathbb{R}$^{d} of dimensional d to obtain hidden vectors Z = f_{encoder}(X), Z∈$\mathbb{R}$^{d}. Suppose Z={z_{0}, z_{1}, …, z_{T}}, and its total length is T. Also suppose Z^{F} is the representation vector of FFT augmented data and Z^{W} is the representation vector of weak augmented data.
The time series data T in real life is generally large, and the 3block convolution method can only collect its local feature, while the LSTM module can solve the problem of longterm data dependence and effectively improve the model's ability to obtain global feature. So adding a singlelayer LSTM to process the downsampling resulting from the convolution block can improve the mapping result of the encoder.
The output of the encoder is the representation vector of augmented data. It will also be used to transform the original data into representation vectors when making final classification predictions after selfsupervised learning is complete. One Fully Connected (FC) layer is used to predict the class to which the representation vector belongs.
2.3 Contrastive Transformer
Transformer has been widely used in computer vision^{[19]} and nature language processing fields^{[20]} in recent years. It can effectively collect local and global feature in data, and use attention mechanisms to help predict future input. Therefore, Transformer shows good performance in translation^{[20]}, and this paper tries to use this property for time series data processing.
Compared with other models based on contrastive prediction such as CPC^{[14]} and TSTCC^{[11]}, our model utilizes the hidden feature of the predicted data segment to a greater extent. Those models only use the contextual feature of the previous piece of data to predict the next piece of data, but our proposed contrast Transformer obtains both the contextual and local feature of the previous piece of feature. At the same time, the time continuity of the later piece of data is preserved in the prediction stage. Figure 4 shows the architecture of the contrastive Transformer module.
Fig. 4 Contrastive Transformer architecture 
Like TSTCC^{[11]}, this paper divides the time series into two parts: the first part is 0 to t items of the sequence{z_{0}, z_{1}, …, z_{t}}, and the second part is t+1 to t+k items of the sequence Z_{output} = {z_{t+}_{1}, z_{t+}_{2}, …, z_{t+k} }. Suppose c^{t}is the contextual vector and add it to the first part, so the first part Z_{input } = {c^{t} , z_{0}, z_{1}, …, z_{t} }. In this way, when the transformer encoder is learning, c^{t} can learn the attention relationship between itself and other elements in Z_{input}, and these relationships can represent the context feature of z. However, TSTCC^{[11]} uses c^{t}to predict data in Z_{output}like the CPC^{[14]} model, which hardly takes use of the hidden features in the Z_{output}.
This paper proposes a Contrastive Transformer module. Firstly, the representation vector Z_{input} is used as the input of the Transformer encoder f_{TransEncoder}(·) to get contextual vector c^{t}and hidden feature Z_{hidden} as follows:
$\{\begin{array}{l}{\mathit{Z}}_{\mathrm{h}\mathrm{i}\mathrm{d}\mathrm{d}\mathrm{e}\mathrm{n}}^{\mathrm{W}}={f}_{\mathrm{T}\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{s}\mathrm{E}\mathrm{n}\mathrm{c}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{r}}({\mathit{Z}}_{\mathrm{i}\mathrm{n}\mathrm{p}\mathrm{u}\mathrm{t}}^{\mathrm{W}})\\ {c}_{t}^{\mathrm{W}}={\mathit{Z}}_{\mathrm{h}\mathrm{i}\mathrm{d}\mathrm{d}\mathrm{e}\mathrm{n}}^{\mathrm{W}}[\mathrm{0}]\\ {\mathit{Z}}_{\mathrm{h}\mathrm{i}\mathrm{d}\mathrm{d}\mathrm{e}\mathrm{n}}^{\mathrm{F}}={f}_{\mathrm{T}\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{s}\mathrm{E}\mathrm{n}\mathrm{c}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{r}}({\mathit{Z}}_{\mathrm{i}\mathrm{n}\mathrm{p}\mathrm{u}\mathrm{t}}^{\mathrm{F}})\\ {c}_{t}^{\mathrm{F}}={\mathit{Z}}_{\mathrm{h}\mathrm{i}\mathrm{d}\mathrm{d}\mathrm{e}\mathrm{n}}^{\mathrm{F}}[\mathrm{0}]\end{array}$(1)
Since Z^{F} and Z^{W} are essentially the results of two different augmented data maps from the same sample X, they have similar hidden features. Suppose Z^{W}_{hidden}is the hidden feature get by the Transformer encoder and Z^{F}_{output} is the vector that needs to be translated. Take both of them into the Transformer decoder, f_{TransDecoder}(·), and get the result P^{W}. In the same way, we can get P^{F}:
$\{\begin{array}{l}{\mathit{P}}^{\mathrm{F}}={f}_{\mathrm{T}\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{s}\mathrm{D}\mathrm{e}\mathrm{c}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{r}}({\mathit{Z}}_{\mathrm{h}\mathrm{i}\mathrm{d}\mathrm{d}\mathrm{e}\mathrm{n}}^{\mathrm{F}},{\mathit{Z}}_{\mathrm{o}\mathrm{u}\mathrm{t}\mathrm{p}\mathrm{u}\mathrm{t}}^{\mathrm{W}})\\ {\mathit{P}}^{\mathrm{W}}={f}_{\mathrm{T}\mathrm{r}\mathrm{a}\mathrm{n}\mathrm{s}\mathrm{D}\mathrm{e}\mathrm{c}\mathrm{o}\mathrm{d}\mathrm{e}\mathrm{r}}({\mathit{Z}}_{\mathrm{h}\mathrm{i}\mathrm{d}\mathrm{d}\mathrm{e}\mathrm{n}}^{\mathrm{W}},{\mathit{Z}}_{\mathrm{o}\mathrm{u}\mathrm{t}\mathrm{p}\mathrm{u}\mathrm{t}}^{\mathrm{F}})\end{array}$(2)
If the representation vector learned well in encoder, P^{W} and P^{F} should have the same hidden feature as Z^{W}_{hidden} and Z^{F}_{hidden}, respectively. So the temporal loss function L_{T} can be constructed.
$\{\begin{array}{l}{L}_{\mathrm{T}}^{\mathrm{F}}=\frac{\mathrm{1}}{K}{\displaystyle \sum _{k=\mathrm{1}}^{K}}\mathrm{l}\mathrm{o}\mathrm{g}\frac{\mathrm{e}\mathrm{x}\mathrm{p}(({P}_{t+k}^{\mathrm{F}}{)}^{\mathrm{T}}{z}_{t+k}^{\mathrm{F}})}{{\sum}_{n\in [t,k]}\mathrm{e}\mathrm{x}\mathrm{p}(({P}_{t+k}^{\mathrm{F}}{)}^{T}{z}_{n}^{\mathrm{F}})}\\ {L}_{\mathrm{T}}^{\mathrm{W}}=\frac{\mathrm{1}}{K}{\displaystyle \sum _{k=\mathrm{1}}^{K}}\mathrm{l}\mathrm{o}\mathrm{g}\frac{\mathrm{e}\mathrm{x}\mathrm{p}(({P}_{t+k}^{\mathrm{W}}{)}^{\mathrm{T}}{z}_{t+k}^{\mathrm{W}})}{{\sum}_{n\in [t,k]}\mathrm{e}\mathrm{x}\mathrm{p}(({P}_{t+k}^{\mathrm{W}}{)}^{\mathrm{T}}{z}_{n}^{\mathrm{W}})}\end{array}$(3)
At the same time, c_{t}^{W} and c_{t}^{F} from the same sample are regarded as a positive sample pair c_{t}^{+}, and other context features from different samples are regarded as a negative sample pair c_{t}^{}. The context loss function L_{C} can be constructed, and the overall loss is calculated as shown in Eq. (4) where τ is the temperature parameter, λ_{1} and λ_{2} are the weight parameters, and Sim_{cos}(·) means cosine similarity function.
$\{\begin{array}{l}{L}_{\mathrm{C}}={\displaystyle \sum _{i=\mathrm{1}}^{N}}\mathrm{l}\mathrm{o}\mathrm{g}\frac{\mathrm{e}\mathrm{x}\mathrm{p}(\mathrm{S}\mathrm{i}\mathrm{m}{}_{\mathrm{c}\mathrm{o}\mathrm{s}}{}^{}({c}_{t}^{+})/\tau )}{{\displaystyle \sum _{m=\mathrm{1}}^{\mathrm{2}N}}\mathrm{e}\mathrm{x}\mathrm{p}(\mathrm{S}\mathrm{i}\mathrm{m}{}_{\mathrm{c}\mathrm{o}\mathrm{s}}{}^{}({c}_{t}^{})/\tau )}\\ L={\lambda}_{\mathrm{1}}({L}_{\mathrm{T}}^{\mathrm{F}}+{L}_{\mathrm{T}}^{\mathrm{W}})+{\lambda}_{\mathrm{2}}{L}_{\mathrm{C}}\end{array}$(4)
Combining the above modules, the method flow of our proposed model is shown in algorithm 1.
Algorithm 1 Our model  

Input:  Time series data X, model hyperparameter: output length k, weight λ_{1} and λ_{2} 
Output:  The learned multilayer LSTM encoder 
1:  Randomly initialize the multilayer LSTM encoder as LSTMencoder() 
2:  Randomly initialize the contrastive Transformer module as TransformerEncoder() and TransformerDecoder() 
3:  Get augmentation data by X^{F} = FFT_augmentation(X) and X^{W} = weak_augmentation(X) 
4:  For epoch = 1 to Maximum do 
5:  Representation voctor Z^{F} = LSTMencoder(X^{F}) and Z^{W} = LSTMencoder(X^{W}) 
6:  Randomly initialize the variable t 
7:  Divide vector Z with variable t : Z^{W}_{input} = Z^{W}[0:t+1], Z^{F}_{input} = Z^{F}[0:t+1] Z^{W}_{output} = Z^{W}[t+1:t+k], Z^{F}_{output} = Z^{F}[t+1:t+k] 
8:  Get hidden feature Z_{input} and contextual feature C_{t} C^{W}_{t},Z^{W}_{hidden}= TransformerEncoder(Z^{W}_{input}) and C^{F}_{t }, Z^{F}_{hidden}= TransformerEncoder(Z^{F}_{input}) 
9:  The decoder result P gets by contrastive Z P^{W} = TransformerDecoder(Z^{W}_{hidden}, Z^{F}_{output}) and P^{F} = TransformerDecoder(Z^{F}_{hidden}, Z^{W}_{output}) 
10:  Get temporal loss L^{W}_{T} by P^{W}, Z^{W}_{output} and L^{F}_{T} by P^{F}, Z^{F}_{output} 
11:  Get contextual loss L_{C} by positive sample C^{W}_{t} and C^{F}_{t} and negative sample in X 
12:  Total loss L = λ_{1} * (L^{W}_{T} + L^{F}_{T}) + λ_{2} * L_{C} 
13:  Update model by L 
14:  Get trained multilayer LSTM encoder 
3 Experimental Results
3.1 Experimental Environment
1) Dataset description
Human Activity Recognition (HAR): HAR^{[21]} dataset contains embedded inertial sensors data from about 30 subjects doing 6 classes of Activities of Daily Living (ADL): walking, walking upstairs, walking downstairs, sitting, standing, and laying. Since there are sensors on each person recording data, the channel of time series data is 9 and the length of time series data is 128.
SleepEDF: A dataset about Electroencephalogram(EEG) signals in PhysioBank^{[22]}. This dataset contains data from two experiments: One is the effect of age on sleep, and the other is the effect of temazepam on sleep. Five classes are used to represent the subject's sleep state: wake, nonrapid eye movement which has three substates, and rapid eye movement. The length of the time series is 3 000.
Epilepsy: A dataset of surface EEG recordings from healthy volunteers with eyes closed and eyes opened^{[23]}. It is divided into two categories, with epilepsy and without epilepsy. The length of the dataset is 5 120.
2) Running environment
The running environment is the same as TSTCC^{[11]}. The data is divided into training set, validation set, and test set according to the ratio of 3:1:1. The epoch of the selfsupervised training is 100. We used Adam optimizer with a learning rate of 3E^{4}, weight decay of 3E^{4}, β1 = 0.9, and β2 = 0.99. Batch size is 128, the maximum segment M in HAR is 8, sleepEDF is 12, and Epilepsy is 5. The temperature parameter is 0.2, and the number of contrastive Transformer layers is 4. The weight parameter λ_{1} is 1 and λ_{2} is 0.7. The parameters α of the beta distribution in the FFT augmentation module in HAR is 0.5, sleepEDF is 0.5, and Epilepsy is 1. We have run our model on Pytorch 1.10.1 with CUDA 11.3 and it is trained on NVIDIA GeForce RTX 3060 GPU.
3.2 Overall Performance and Discussion
We test our model on HAR, sleepEDF, and Epilepsy datasets, and compare the accuracy rate between SSLECG^{[12]}, CPC^{[14]}, SimCLR^{[24]}, and TSTCC^{[11]}. After getting the representation vector from selfsupervised learning, we will use one FC layer to classify time series data and calculate the accuracy. The model will be trained 5 times with 5 seeds and show the mean and standard deviation. The records of other models all come from Eldele^{[11]}. The accuracy of the models is shown in Table 1.
The average accuracy rate can be used to represent the model's ability to represent and classify the data. As can be seen from Table 1, on the HAR and sleepEDF datasets, the average accuracy of our model reaches 91.13% and 83.69%, which is higher than the average accuracy of all other models. However, on the Epilepsy dataset, our model accuracy is 96.98%, which is lower than 97.23% of the TSTCC model, but higher than other models.
The standard deviation of the accuracy rate reflects the stability of the model under different random numbers. The smaller the standard deviation of the accuracy rate, the more stable the model, and the less the random number interferes with the model. Our model has the smallest standard deviation on both the HAR and sleepEDF datasets, 0.31 and 0.17. Especially on the sleepEDF dataset, the standard deviation of the accuracy of our model is much lower than that of other models. The above experiments show that our model has high accuracy, good stability and robustness in most cases.
In order to more clearly reflect the effect of time series data representation, the tdistributed stochastic neighbor embedding(tSNE)^{[25]} method maps the representation data from d dimensions to 2 dimensions to make a scatter plot as shown in Fig. 5.
Fig. 5 Representation vector visualization on HAR using tSNE 
In Fig. 5, the closer the points of the same class, the higher the degree of representation of the model for the class, the more obvious the point features of the class for distinguishing from other categories, indicating that the representation vector contains more hidden features. It can be seen from Fig. 5 that the fifth class is the easiest to distinguish, and the third and fourth are the most difficult. Our model has certain advantages over TSTCC in classifying class 1 because our class 1 is relatively farther away from classes 2 and 3 showing that our representation vector works better. Table 2 shows the precision and recall of our model and TSTCC model for each class on the HAR dataset with selfsupervised learning, verifying the effect of the tSNE method.
It can also be seen from Table 2 that the sixth class has the highest precision and recall rate, While the fourth and fifth categories are very low, which is also consistent with the results of tSNE. From the recall rate, the first class of HAR can be classified well with other classes. In the tSNE scatter plot, our first and second classes are close to spherical and denser than the TSTCC model, so the recall rate of these two classes in Table 2 is also higher.
Different model accuracy on three datasets (unit:%)
Precision and recall rate of our model and TSTCC model on HAR with selfsupervised learning (unit:%)
3.3 Test on Encoder Performance
Higher accuracy rate on the data set means a better representation ability of the feature vector for the original data, and better effect of the encoder. In this experiment, to verify the effectiveness of the encoder module, we compare the effects of selfsupervised and supervised training modes of encoder module of our model and TSTCC model on the accuracy of the results of the three datasets: HAR, sleepEDF, and Epilepsy. Selfsupervised training first learns the representation vectors through the encoder, and then classifies the feature vectors through one FC layer. When the FC layer is trained, the parameters in the encoder are frozen, that is to say, the representation vector remains unchanged, and only the FC layer is trained. However, in supervised training, the encoder is trained with the FC layer. The result is shown in Table 3.
As can be seen from Table 3, in supervised learning, the accuracy of our model is higher than that of the TSTCC model on all three datasets, which indicates that the encoder module of our model has stronger representation ability than the encoder of TSTCC model. When applied to the unsupervised domain, our model still outperformed the TSTCC model in terms of accuracy on HAR and sleepEDF datasets. From the above comparison, it can be seen that the encoder in this paper provides better representation in both unsupervised and supervised domains. Table 4 specifically reflects the classification performance of the encoder for each class of the HAR dataset with supervised learning.
As can be seen from Table 4, the multilayer LSTM encoder module can learn more hidden information about the first type of data. In Table 4, the accuracy of our model on the first class is much higher than that of TSTCC indicating the higher ability of the proposed multilayer LSTM encoder module in learning more hidden information about the first type of data, which is consistent with the result of Table 2. Moreover, we can find that supervised learning always represents higher classification accuracy in each class than unsupervised learning in our model and TSTCC model. This may be due to the fact that the encoder used in the unsupervised field is easy to ignore some details that can only be noticed by supervision.
In order to intuitively reflect the changing trend of the accuracy rate when the pretrained representation vectors of the two models classify the time series data, the test loss and accuracy rate on the test set are shown in Fig. 6. As can be seen from Fig. 6(a), the TSTCC model converged faster than our model during the training process. The converge starts in the 8th epoch for TSTCC model, but 20th epoch for our model. Furthermore, the training loss of our model is smaller than that of TSTCC model from the fifth epoch. This may be because the representation vector we learn contains more hidden feature, and it is more difficult for the FC layer to converge. From Fig. 6(b), The accuracy curves of the two models become flat from the fifth epoch, indicating that the representation features obtained from the two models learn the differences between the classes of the dataset. The accuracy curves of the two models start to flatten out at the fifth epoch, which indicates that the FC layer can easily classify most of the data correctly, and also indicates that the encoder module learns many classification features of the data. Meanwhile, our accuracy curves are always higher than the TSTCC model, which indicates that our model can learn better than the TSTCC model in the same short time as TSTCC does.
Fig. 6 Test loss and test accuracy of our model and TSTCC 
Encoder performance on TSTCC and our model %
Precision and recall rate of our model and TSTCC model on HAR with supervised learning (unit:%)
3.4 FFT Augmentation and Contrastive Transformer Sensitivity Testing
This experiment tests the performance of the model under different hyperparameters on the HAR dataset, which includes the parameters α of the beta distribution in the FFT augmentation module and the weight parameters λ_{1} and λ_{2}. As can be seen in Eq. (4), λ_{1} and λ_{2} respectively represent the influence of temporal loss and contextual loss in loss function of the contrastive Transformer module. The result is shown in Fig. 7.
Fig. 7 Hyperparameter experiment result of parameters α (a), λ_{1}(b), λ_{2}(c) 
As can be seen from Fig. 7(a), the accuracy fluctuates above the dotted line (90.31%, TSTCC) with the change of parameters α, meaning the parameters α of the beta distribution in the FFT enhancement have less influence on the overall experimental results. From Fig. 7(b), when the value of λ_{1} is small, the accuracy curve tends to level off and remains above 90%, which indicates that the model can provide high accuracy even when the temporal loss has little effect on the loss function. Temporal loss may actually play an auxiliary role in the model, which is used to improve the model's representation of hidden feature that is difficult to learn. Figure 7(c) shows that when the value of λ_{2} is small, the accuracy of the model increases with the value of λ_{2}. After the value of λ_{2} exceeds 1, the change in accuracy tends to level off. This result indicates that the effect of the model is strongly related to the presence or absence of contextual loss. When the proportion of contextual loss in the loss function is too small, the performance of the model is poor, and simply increasing the contextual loss will not make the model effect continue to rise, but remains constant at a certain value. The above results show that our model is insensitive to the parameters within a certain range.
4 Conclusion
This paper proposes an improved model based on TSTCC. This model uses an FFTbased data augmentation method. The positive sample pairs are obtained by different augmentations to the original data, and the encoder is used to convert the data into a representation vector. In order to obtain more hidden features, our model uses a contrastive Transformer architecture and context loss. In this paper, our proposed model is tested with other time series contrastive learning models on three datasets, HAR, sleepEDF, Epilepsy, based on selfsupervised learning for classification. The experimental result show that except for the Epilepsy dataset where it is slightly lower than the TSTCC model, our model is more accurate and stable than the other models on the three datasets. In the experiments with supervised and unsupervised learning, our designed multilayer LSTM encoder module can indeed learn the hidden feature of some indistinguishable classes. Moreover, the sensitivity test shows our designed module is insensitive to hyperparameters.
References
 Aghabozorgi S, Seyed Shirkhorshidi A, Ying Wah T. Timeseries clustering—A decade review [J]. Information Systems, 2015, 53: 1638. [CrossRef] [Google Scholar]
 Rabiner L, Juang B. An introduction to hidden Markov models [J]. IEEE ASSP Magazine, 1986, 3(1): 416. [CrossRef] [Google Scholar]
 Luenberger D G. Theory, Models, and Applications [M]. New York: Wiley, 1979. [Google Scholar]
 Wang X Y, Mueen A, Ding H, et al. Experimental comparison of representation methods and distance measures for time series data [J]. Data Mining and Knowledge Discovery, 2013, 26(2): 275309. [CrossRef] [MathSciNet] [Google Scholar]
 Längkvist M, Karlsson L, Loutfi A. A review of unsupervised feature learning and deep learning for timeseries modeling [J]. Pattern Recognition Letters, 2014, 42: 1124. [CrossRef] [Google Scholar]
 Wang H S, Zhang Q, Wu J, et al. Time series feature learning with labeled and unlabeled data [J]. Pattern Recognition, 2019, 89: 5566. [NASA ADS] [CrossRef] [Google Scholar]
 Jing L L, Tian Y L. Selfsupervised visual feature learning with deep neural networks: A survey [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(11): 40374058. [CrossRef] [PubMed] [Google Scholar]
 Doersch C, Gupta A, Efros A A. Unsupervised visual representation learning by context prediction [C]// IEEE International Conference on Computer Vision. New York: IEEE, 2015: 14221430. [Google Scholar]
 Gidaris S, Singh P, Komodakis N. Unsupervised representation learning by predicting image rotations [EB/OL]. [20220911]. https://arxiv.org/abs/1803.07728. [Google Scholar]
 Jaiswal A, Babu A R, Zadeh M Z, et al. A survey on contrastive selfsupervised learning [J]. Technologies, 2020, 9(1): 2. [Google Scholar]
 Eldele E, Ragab M, Chen Z H, et al. Timeseries representation learning via temporal and contextual contrasting [EB/OL]. [20220911]. https://arxiv.org/abs/2106.14112. [Google Scholar]
 Sarkar P, Etemad A. Selfsupervised ECG representation learning for emotion recognition [J]. IEEE Transactions on Affective Computing, 2022, 13(3): 15411554. [CrossRef] [Google Scholar]
 Löwe S, O'Connor P, Veeling B S. Putting an end to endtoend: Gradientisolated learning of representations [EB/OL]. [20220911]. https://arxiv.org/abs/1905.11786. [Google Scholar]
 Oord A V D, Li Y Z, Vinyals O. Representation learning with contrastive predictive coding [EB/OL]. [20220911]. https://arxiv.org/abs/1807.03748. [Google Scholar]
 Franceschi J Y, Dieuleveut A, Jaggi M. Unsupervised scalable representation learning for multivariate time series [EB/OL]. [20220809]. https://arxiv.org/abs/1901.10738. [Google Scholar]
 Tonekaboni S, Eytan D, Goldenberg A. Unsupervised representation learning for time series with temporal neighborhood coding [EB/OL]. [20220920]. https://arxiv.org/abs/2106.00750. [Google Scholar]
 Iwana B K, Uchida S. An empirical survey of data augmentation for time series classification with neural networks [J]. PLoS One, 2021, 16(7): e0254841. [Google Scholar]
 Zhou H Y, Zhang S H, Peng J Q, et al. Informer: Beyond efficient transformer for long sequence timeseries forecasting [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2021, 35(12): 1110611115. [CrossRef] [Google Scholar]
 Bazi Y, Bashmal L, Rahhal M M A, et al. Vision transformers for remote sensing image classification [J]. Remote Sensing, 2021, 13(3): 516. [NASA ADS] [CrossRef] [Google Scholar]
 Devlin J, Chang M W, Lee K, et al. BERT: Pretraining of deep bidirectional transformers for language understanding [EB/OL]. [20221009]. https://arxiv.org/abs/1810.04805. [Google Scholar]
 Anguita D, Ghio A, Oneto L, et al. A public domain dataset for human activity recognition using smartphones[J]. 21st European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning, 2013: 437442. [Google Scholar]
 Goldberger A L, Amaral L A N, Glass L, et al. PhysioBank, PhysioToolkit, and PhysioNet [J]. Circulation, 2000, 101(23): e215e220. [PubMed] [Google Scholar]
 Andrzejak R G, Lehnertz K, Mormann F, et al. Indications of nonlinear deterministic and finitedimensional structures in time series of brain electrical activity: Dependence on recording region and brain state [J]. Physical Review E, 2001, 64(6): 061907. [Google Scholar]
 Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations [C]// Proceedings of the 37th International Conference on Machine Learning. New York: ACM, 2020: 15971607. [Google Scholar]
 van der Maaten L, Hinton G. Visualizing data using tSNE [J]. Journal of Machine Learning Research, 2008, 9(11): 2579260. [Google Scholar]
All Tables
Precision and recall rate of our model and TSTCC model on HAR with selfsupervised learning (unit:%)
Precision and recall rate of our model and TSTCC model on HAR with supervised learning (unit:%)
All Figures
Fig. 1 Architecture of our model  
In the text 
Fig. 2 Augmentation in time or frequency domain  
In the text 
Fig. 3 Encoder with 3layers LSTM  
In the text 
Fig. 4 Contrastive Transformer architecture  
In the text 
Fig. 5 Representation vector visualization on HAR using tSNE  
In the text 
Fig. 6 Test loss and test accuracy of our model and TSTCC  
In the text 
Fig. 7 Hyperparameter experiment result of parameters α (a), λ_{1}(b), λ_{2}(c)  
In the text 
Current usage metrics show cumulative count of Article Views (fulltext article views including HTML views, PDF and ePub downloads, according to the available data) and Abstracts Views on Vision4Press platform.
Data correspond to usage on the plateform after 2015. The current usage metrics is available 4896 hours after online publication and is updated daily on week days.
Initial download of the metrics may take a while.