Issue: Wuhan Univ. J. Nat. Sci., Volume 27, Number 6, December 2022
Page(s): 508-520
DOI: https://doi.org/10.1051/wujns/2022276508
Published online: 10 January 2023
CLC number: TP 301
MpFedcon: Model-Contrastive Personalized Federated Learning with the Class Center
School of Electronic and Electrical Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
Received: 24 August 2022
Federated learning is an emerging distributed privacy-preserving framework in which parties are trained collaboratively by sharing model or gradient updates instead of sharing private data. However, the heterogeneity of local data distributions poses a significant challenge. This paper focuses on label distribution skew, where each party can access only a partial set of the whole class set. Aggregating these biased local models makes the global updates drift. In addition, many studies have shown that deep leakage from gradients endangers the reliability of federated learning. To address these challenges, this paper proposes a new personalized federated learning method named MpFedcon. It addresses the data heterogeneity problem and the privacy leakage problem from global and local perspectives. Our extensive experimental results demonstrate that MpFedcon effectively resists label leakage, achieves better performance on various image classification tasks, and is robust under partial participation, non-iid data, and heterogeneous parties.
Key words: personalized federated learning / layered network / model contrastive learning / gradient leakage
Biography: LI Xingchen, male, Master candidate, research direction: federated learning. Email: 351977119@qq.com
Supported by the Scientific and Technological Innovation 2030—Major Project of "New Generation Artificial Intelligence" (2020AAA 0109300)
© Wuhan University 2022
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
0 Introduction
Data resources have become the lifeline of modern enterprise value creation and the new engine of digital technology power. In the process of industrial digital transformation, a large amount of valuable data scattered among all parties is generated. Due to increasing privacy concerns and data protection regulations^{[1]}, parties cannot send their private data to a central server to train models. Federated learning (FL) is an emerging distributed machine learning paradigm that uses decentralized data from multiple parties to jointly train a shared global model without sharing the individuals' raw data^{[2-6]}. FL has achieved remarkable success in various industrial applications such as autonomous driving^{[7]}, wearable devices^{[8]}, medical diagnostics^{[9,10]}, and cell phones^{[11, 12]}. However, non-independent and identically distributed (non-iid) data poses a significant challenge. The data distribution of parties in FL might be highly variable since parties separately collect local data based on their preferences and sampling space. Label distribution skew is a common and serious category of non-iid data^{[3]}. Some studies have proved that non-iid data causes drift in the local updates of parties^{[13,14]}. In addition, the global model is further scattered by a collection of mismatched local optimal solutions, which eventually leads to slow and unstable convergence of the overall training process^{[15-17]}.
A variety of efforts attempt to address the challenges of non-iid data. Some studies have shown that reducing data variability can improve the convergence of the global FL model^{[18,19]}. However, they usually need to modify the local distribution, which might result in the loss of important information about the inherent diversity of consumer behavior. Some methods, such as FedProx^{[20]} and SCAFFOLD^{[13]}, stabilize the local training phase by adjusting the deviation between the local and global models in the parameter space. Other studies, such as Ditto^{[19]} and APFL^{[21]}, improve the generalization ability of the model by mixing global and local models. We acknowledge that the local optimal points of parties are fundamentally inconsistent with the global optimal point in the heterogeneous FL setup. The majority of prior FL methods, however, compel local models to be consistent with the global model and ignore the problem of privacy leakage. For instance, DLG^{[22]} and iDLG^{[23]} have revealed that existing gradient-based privacy attacks mainly infer private information from the properties of the last layer of the neural network.
Based on the above observations, we propose model-contrastive personalized federated learning with the class center, dubbed MpFedcon, a personalized federated learning framework based on FedAvg (FederatedAveraging). Specifically, we apply a layered network that decouples the target neural network into a base encoder that participates in collaborative training and a locally preserved personalization layer. The base encoder layer learns global knowledge, while the personalization layer retains sensitive information to resist deep leakage from gradients. Each party's local training is corrected from a global perspective by contrastive learning against the global class centers. A global class center is defined as the average representation vector of each class^{[24]}. Further, inspired by SimSiam^{[25]}, MpFedcon greatly reduces the computational complexity by using only positive samples for training, rather than negative sample pairs and large batches, through model-contrastive learning^{[26]}. MpFedcon significantly outperforms the other state-of-the-art federated learning algorithms on various image classification datasets, including CIFAR-10, CIFAR-100, and FEMNIST^{[25,27]}. For instance, MpFedcon achieves 83.3% top-1 accuracy on FEMNIST with 100 parties, while the best top-1 accuracy of existing studies is 78.56%. Compared with the classic FedAvg in the non-iid setting, MpFedcon improves the convergence speed by 3.7 and 28.5 times and reduces the communication cost by 73.2% and 96.5% on CIFAR-10 and CIFAR-100, respectively. The rest of this paper is arranged as follows: Section 1 reviews related work on FL, contrastive learning, and leakage from gradients. Section 2 explores the influence of local drift in FL. Section 3 gives the problem statement and motivation. Section 4 describes the proposed method. The experimental results are presented in Section 5 to demonstrate the efficiency of our method. Finally, Section 6 concludes our work.
Overall, the main contributions of this paper are as follows:
1) We propose a new personalized federated learning framework to solve the label distribution skew in FL, which mitigates the local and global drift problem by introducing global class center model-contrastive learning to correct local training.
2) We explore the causes of gradient-based privacy leakage, then design and verify the effectiveness of layered networks for defending against gradient leakage attacks.
3) We design the local layered network architecture to effectively learn the global underlying knowledge through supervised loss and contrastive loss functions, which promotes tight intra-class and separable inter-class sample sets in the classification space.
4) We implement MpFedcon and conduct extensive experiments on different datasets. The results show that MpFedcon outperforms state-of-the-art methods regarding inference accuracy and computational efficiency.
1 Related Work
1.1 Federated Learning
The standard federated learning approach aims at learning a single shared model that performs well on average across all parties. The classical federated learning method FedAvg^{[4]} follows the typical four-step protocol shown in Fig. 1. ① The server randomly initializes the parameters of the global model and sends them to each party. ② Upon receiving the global model, each party updates the model based on its local training data using stochastic gradient descent (SGD). ③ The selected party uploads its local model parameters back to the server. ④ The server averages the model parameters to generate the global model for the next training round. These steps are repeated until convergence.
Fig. 1 The federated learning process ① Transfer model parameters; ② Security aggregation; ③ Uploading local parameters; ④ Local training; MLP: Multilayer perceptron; FC: Fully connected 
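The four-step protocol can be illustrated with a minimal Python sketch. It is not the authors' implementation: a two-parameter linear model, plain lists instead of tensors, full participation, and the names `local_sgd`, `aggregate`, and `fedavg` are all assumptions made for illustration.

```python
import random

def local_sgd(weights, data, lr=0.1, epochs=1):
    """Step 2: a party fits y = w*x + b on its own data with plain SGD."""
    w, b = weights
    for _ in range(epochs):
        for x, y in data:
            err = (w * x + b) - y      # derivative of 0.5 * err**2 w.r.t. the prediction
            w -= lr * err * x
            b -= lr * err
    return [w, b]

def aggregate(local_weights, sizes):
    """Step 4: average the parameters, weighted by local dataset size."""
    total = sum(sizes)
    return [sum(n * w[k] for w, n in zip(local_weights, sizes)) / total
            for k in range(len(local_weights[0]))]

def fedavg(parties, rounds=50, seed=0):
    rng = random.Random(seed)
    global_w = [rng.uniform(-1, 1), rng.uniform(-1, 1)]   # step 1: random init
    for _ in range(rounds):
        # steps 2-3: every party trains from the global model and uploads its result
        local_ws = [local_sgd(list(global_w), d) for d in parties]
        # step 4: the server builds the next global model
        global_w = aggregate(local_ws, [len(d) for d in parties])
    return global_w
```

With noise-free data drawn from one common linear function, the averaged model approaches that function even though each party sees only part of the input range.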
The non-iid problem has been addressed in a wealth of studies along three main lines: local training improvements, aggregation, and personalized models. Among the improvements in local training, FedProx^{[20]} adds a proximal term that penalizes the Euclidean distance between the local and global models, and SCAFFOLD^{[13]} corrects the drift in local updates by introducing control variables. Other works, such as FedNova^{[18]}, improve aggregation efficiency. APFL^{[21]} explores adaptive adjustment of global and local models to achieve personalized models. Fedper^{[28]}, Fedrep^{[29]}, and others explore layered network architectures, which aim to train personalized models for individual parties rather than a shared global model.
1.2 Contrastive Learning
The core idea of contrastive learning is to pull positive sample pairs together and push negative sample pairs apart. Contrastive learning is widely used in self-supervised representation learning. Supervised contrastive learning extends contrastive learning by using label information to compose positive and negative samples. In practice, contrastive learning methods benefit from a large number of negative samples. InfoDist^{[30]} uses a memory bank to store negative sample pairs. SimCLR^{[31]} directly uses the negative samples coexisting in the current batch, so it requires a large batch size. However, selecting representative and informative negative samples is a critical and challenging task. SimSiam^{[25]} proposes a simple twin network to learn representations without negative sample pairs, large batches, or momentum encoders.
Contrastive learning in federated learning has recently emerged as an effective approach to solving non-iid problems. Some existing approaches use a contrastive loss to compare different image representations, and they can utilize the huge amount of unlabeled data on distributed edge devices^{[32, 33]}. Wang et al^{[34]} used supervised contrastive learning to improve the quality of learned features and solve the long-tail distribution problem in classification tasks. Wang et al^{[35]} explored the application of contrastive federated learning in medical image segmentation. However, they ignored the need for personalized models and did not explore the issue of gradient-based privacy leakage. In contrast to previous work, we introduce model-contrastive learning with the global class center into supervised learning to address the inconsistency in the embedding space across parties.
1.3 Leakage from Gradients
It was generally believed that exchanging gradients across parties does not leak private training data in distributed learning systems, such as collaborative learning^{[36]} and federated learning^{[2, 3]}. Recently, Zhu et al^{[22]} proposed a method called DLG, which shows the possibility of obtaining private training data from publicly shared gradients. DLG^{[22]} synthesizes virtual data and corresponding labels under the supervision of shared gradients. iDLG^{[23]} further demonstrates that the shared gradients of the last layer must leak ground-truth labels when the activation function is non-negative. Wainakh et al^{[37]} further explored the properties of gradient-based leakage of true labels with large batches. Common techniques for protecting privacy include adding noise, gradient compression, discretization, and differential privacy. However, all these methods reduce the model accuracy to different degrees.
2 Local Drift in Federated Learning
In FedAvg, all parties optimize their models on the local dataset for each training round. Then the server updates the global model based on the expectation of the local model parameters. The objective is to solve:
$$\arg\min_{w} L(w) = \sum_{i=1}^{N} \frac{|D_i|}{|D|} L_i(w)$$
where $N$ is the number of parties, $D_i$ is the private local dataset of party $i$, and $L_i(w)$ is the expected loss of party $i$. The overall goal is to obtain a globally optimal model on the global dataset $D = \bigcup_{i=1}^{N} D_i$.
There is a drift between the local and global models due to the label distribution skew, a special kind of non-iid scenario in which each party can access only a partial set of the whole class set^{[38]}. The performance of FedAvg is significantly reduced by highly skewed non-iid data in FL^{[13,20,39]}, indicating that ignoring local drift results in deviation of the global model. For this purpose, we give a baseline approach called SOLO, in which each party trains the model only on its local data without federated learning. In Fig. 2, we use a simple example to illustrate that local drift in the parties leads to a biased global model in FedAvg. It assumes that the model has a nonlinear transformation function $f$ (e.g., Leaky ReLU). Suppose $w_1$ and $w_2$ are the local parameters of party 1 and party 2, $x$ is a data point, and the corresponding outputs of party 1 and party 2 are $y_1 = f(w_1, x)$ and $y_2 = f(w_2, x)$. The parameters of the model generated by FedAvg can then be expressed as $w_f = (w_1 + w_2)/2$. $w_c$ is the parameter of the centralized model that can produce the ideal output. As shown in Fig. 2, we have $w_f \neq w_c$ and $f(w_f, x) \neq (y_1 + y_2)/2$, indicating that the global model in FedAvg is skewed, which may lead to slow convergence and poor accuracy.
Fig. 2 Illustration of the local drift in FedAvg with a Leaky ReLU activation 
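A small numerical check makes the drift concrete. The sketch below is only an illustration under stated assumptions: a one-parameter model with a Leaky ReLU activation, hypothetical values for the two local parameters and the data point, and the mean of the two local outputs taken as the ideal output.

```python
def leaky_relu(v, slope=0.01):
    return v if v > 0 else slope * v

def f(w, x):
    """Toy one-parameter model with a non-linear activation."""
    return leaky_relu(w * x)

w1, w2, x = 2.0, -1.0, 1.0        # hypothetical local parameters and data point
y1, y2 = f(w1, x), f(w2, x)       # local outputs of party 1 and party 2
w_f = (w1 + w2) / 2               # parameter produced by FedAvg-style averaging
fedavg_output = f(w_f, x)         # output of the averaged model
ideal_output = (y1 + y2) / 2      # assumed ideal: mean of the local outputs

drift = abs(fedavg_output - ideal_output)
```

Because the activation is non-linear, averaging parameters and averaging outputs do not commute, so `drift` is non-zero for these values.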
Figure 3 shows the accuracy of training using only local datasets and the MSE (mean square error) distance between the SOLO and FedAvg models under the same conditions. It indicates that the accuracy does not improve noticeably, and the inter-party drift becomes more severe as the number of local iterations increases.
Fig. 3 Impact of different epochs when the party uses only local data. "ep" represents the number of local epochs; the bar chart shows the MSE distance between the SOLO and FedAvg models; the curves indicate the accuracy of the different local epochs for each round 
In this case, each party should have a personalized model to suit its unique data distribution. It is also necessary to correct the local optimization direction from a global perspective, aligning it with the global optimization direction, to improve the effectiveness of FL.
3 Problem Statement and Motivation
Suppose there are $N$ parties ($P_1, \dots, P_N$), where party $P_i$ has a local dataset $D_i$. The server and parties attempt to jointly learn the parameters of the global representation, while each party tries to learn its unique model locally. Personalized federated learning solves:
$$\min_{q_1, \dots, q_N} \frac{1}{N} \sum_{i=1}^{N} L_i(q_i), \qquad L_i(q_i) = \mathbb{E}_{(x,y) \sim D_i}\left[\ell_i(q_i; x, y)\right]$$
where $L_i$ is the empirical loss of $P_i$, and $\ell_i$ and $q_i$ are the error function and learning model of $P_i$. In practical federated learning scenarios, most participants do not have sufficient local data and can observe only a subset of the total categories. Parties may be unable to obtain solutions with the expected low risk through local training alone. Therefore, parties need to learn the model through federated learning to use the cumulative data from all parties. MpFedcon is based on an intuitive idea: the accuracy of classification tasks can be improved by correcting the consistency between local and global distributions in label-absent scenarios in FL; a layered network facilitates the construction of a personalized model, whose personalized layers further fit the local data distribution and prevent the leakage of sensitive information. The effectiveness of layered networks against gradient leakage is analyzed in Section 4.4.
To further verify this intuition, we now discuss the observations that motivate the correction of local training. We explore a more skewed data imbalance issue: label distribution skew, which means each party can access only a subset of the entire class collection^{[40]}. Specifically, we first train a CNN (Convolutional Neural Network) model on CIFAR-10 as a center model. Then, we partition the dataset into 10 subsets in an unbalanced manner and train a CNN model on each subset as a SOLO model, where each subset contains 5 classes of data. We use t-SNE^{[41]} to visualize the hidden vectors of images from a randomly selected SOLO model and the center model, as shown in Fig. 4(a) and Fig. 4(b). The SOLO method learns better features, but its clustering degree and clustering centers differ significantly from the global distribution in the ideal condition. This may hinder the accuracy of downstream classification tasks. Figure 4(c) shows the representation learned by the FedAvg algorithm. We can observe that points of the same class are more confused in Fig. 4(c) than in Fig. 4(a). FedAvg even leads the model to learn a worse representation due to the skewed local data distribution. This further verifies that the inconsistency between local and global data distributions significantly affects the performance of federated learning. MpFedcon corrects the local update direction by introducing a global class center from the perspective of global clustering. As shown in Fig. 4(d), the local party data are restricted to the same region as the global distribution after applying MpFedcon, leaving room to further improve the aggregation effect of the central model and enhance the classification effect of downstream tasks.
Fig. 4 t-SNE visualizations of hidden vectors on CIFAR-10 
4 Method
Based on the above ideas, we propose MpFedcon, a simple and effective FL framework based on FedAvg. Since there is a fundamental contradiction between the local and global optima, MpFedcon aims to constrain the local update direction to be consistent with the global optimum, and further fits each party's unique data distribution with personalized layers while sensitive information is retained locally. In the following, we present the local network architecture, the global class center, the local objective, and privacy protection against gradient leakage.
4.1 Local Network Architecture
As shown in Fig. 5, the local network consists of three components: a base encoder, a projection head, and an output layer. Specifically, since the heterogeneous data distributed across tasks may share a common representation, we use the base encoder to extract common representation vectors from inputs to improve the quality of each party's model. Then the representation is mapped to a space with a fixed dimension using an additional projection head. We use a multilayer perceptron (MLP) with hidden layers to implement the projection head, which helps to improve the representation of the layers that precede it^{[31]}. Finally, the output layer predicts values for each class. The locally retained personalized layers comprise the projection head and the output layer, which protect privacy and adapt to the local data distribution. This further mitigates the impact of non-iid data on model training.
Fig. 5 Overview of the ith local network architecture in MpFedcon. The feature extraction network (including the initial encoder, base encoder and MLP) extracts the representation, and then the local network is combined with the global center features to calculate the contrastive loss $\ell_{con}$. The output layer FC predicts the class-wise logits to compute the cross-entropy loss $\ell_{sup}$ 
For ease of presentation, with model weight $w$, we use $F_w(\cdot)$, $R_w(\cdot)$, $H_w(\cdot)$, and $O_w(\cdot)$ to denote the entire network, base encoder, projection head, and output layer, respectively. In the supervised setup, the base encoder extracts the feature representation $r = R_w(x)$ from the input $x$. The feature representation is mapped to the low-dimensional space through the projection head, $z = H_w(r)$, for computing the contrastive loss $\ell_{con}$. The output layer predicts class-wise logits $O_w(z)$, which are used to calculate typical loss terms in supervised learning. The model of party $P_i$ is the composition of its local parameters and the representation: $q_i = h_i \circ \phi$, where $h_i$ denotes the locally retained personalized layers, including a projection head and an output layer, and $\phi$ denotes the common representation extracted by the base encoder.
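The decoupling into a shared base encoder and locally retained personalized layers can be sketched as follows. This is a toy illustration rather than the architecture used in the experiments: the layer sizes, the single-layer encoder, and the names `LayeredNet`, `forward`, and `shared_parameters` are assumptions, and plain Python lists stand in for tensors.

```python
import random

def linear(W, b, v):
    """Affine map W v + b on plain Python lists."""
    return [sum(wij * vj for wij, vj in zip(row, v)) + bi for row, bi in zip(W, b)]

def relu(v):
    return [max(0.0, a) for a in v]

def init_layer(n_out, n_in, rng):
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return (W, [0.0] * n_out)

class LayeredNet:
    def __init__(self, n_in=8, n_feat=6, n_proj=4, n_classes=3, seed=0):
        rng = random.Random(seed)
        self.base = init_layer(n_feat, n_in, rng)            # shared with the server
        self.proj_hidden = init_layer(n_feat, n_feat, rng)   # personalized, kept local
        self.proj_out = init_layer(n_proj, n_feat, rng)      # personalized, kept local
        self.output = init_layer(n_classes, n_proj, rng)     # personalized, kept local

    def forward(self, x):
        r = relu(linear(*self.base, x))                      # base encoder representation
        z = linear(*self.proj_out, relu(linear(*self.proj_hidden, r)))  # projection head
        logits = linear(*self.output, z)                     # class-wise logits
        return z, logits

    def shared_parameters(self):
        """Only the base encoder would be uploaded for aggregation."""
        return {"base": self.base}
```

Returning both `z` and `logits` from `forward` mirrors the two uses described above: `z` would feed a contrastive term and `logits` a supervised term.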
4.2 The Global Class Center
As shown in Fig. 5, we introduce the global class center as the optimization target for each class from a global perspective. The global server stores and maintains the class centers through a memory bank^{[42]}. In the supervised scenario, samples of the same class are restricted to the class center region, thus effectively solving the problem of a skewed optimization direction caused by the label distribution skew. The class centers are updated as follows:
$$c_k^{t} = \frac{1}{n_k} \sum_{x \in S_k} z(x)$$
where $S_k$ denotes the samples of class $k$, $n_k$ is the number of samples of class $k$, $z(x)$ denotes the feature output of the projection head, and $c_k^{t}$ is the local class center obtained after training on local data in round $t$. The class centers of all parties are aggregated and averaged on the server to obtain the global class center $\bar{c}_k^{t}$, which the server distributes to the participants in the next round. Aggregated data is more conducive to training in federated learning than skewed data. We aim to find more desirable class center locations from a global perspective and thus improve the classification performance of downstream tasks.
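The class center computation can be sketched in a few lines of Python. The sketch assumes that a local class center is the mean projection-head output over a party's samples of that class, and that the server takes an unweighted mean over the parties that reported a given class; the source states only that the centers are "aggregated and averaged", so the weighting is an assumption, as are the function names.

```python
def local_class_centers(features, labels):
    """Mean feature vector per class on one party."""
    sums, counts = {}, {}
    for z, k in zip(features, labels):
        if k not in sums:
            sums[k] = [0.0] * len(z)
            counts[k] = 0
        sums[k] = [s + v for s, v in zip(sums[k], z)]
        counts[k] += 1
    return {k: [s / counts[k] for s in sums[k]] for k in sums}

def global_class_centers(party_centers):
    """Server side: average each class center over the parties that hold it."""
    collected = {}
    for centers in party_centers:
        for k, c in centers.items():
            collected.setdefault(k, []).append(c)
    return {k: [sum(col) / len(cs) for col in zip(*cs)]
            for k, cs in collected.items()}
```

A party that never sees class `k` simply contributes nothing to that class, which matches the label distribution skew setting in which each party holds only a subset of the classes.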
4.3 Local Objective
The local loss consists of two parts. The first part is a typical loss term in supervised learning (e.g., the cross-entropy loss), denoted as $\ell_{sup}$. The second part is our proposed global class center model-contrastive loss term, denoted as $\ell_{con}$. In the $t$-th training round, party $P_i$ receives a common base encoder model $\phi^{t}$ and the global class center set $C^{t}$, which are combined with the locally retained personalized layers $h_i^{t-1}$ as the initialized model $w_i^{t}$ for this round. Let $w_{init}^{t} = w_i^{t}$, where $w_{init}^{t}$ denotes the initial model parameters in this round and does not participate in the gradient update. Let $c_k \in C^{t}$ represent the class center feature vector of the $k$-th class. $z$ represents the feature representation from the local model being updated, and $z_{init}$ is the mapped representation of input $x$ by the initial model $w_{init}^{t}$. Since the global model has a more robust representation, we correct the local update direction through the global class center $c_k$ by reducing the distance between $z$ and $c_k$ and increasing the distance between $z$ and $z_{init}$. The model-contrastive loss is defined as:
$$\ell_{con} = -\log \frac{\exp(\mathrm{sim}(z, c_k)/\tau)}{\exp(\mathrm{sim}(z, c_k)/\tau) + \exp(\mathrm{sim}(z, z_{init})/\tau)}$$
where $\tau$ denotes the temperature parameter and $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity. The local objective is to minimize
$$\ell = \ell_{sup} + \mu\,\ell_{con}$$
where $\mu$ is the hyperparameter that regulates the weights of the two terms. The overall algorithm is described in Algorithm 1.
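As a hedged illustration of how the two loss terms combine, the Python sketch below evaluates a model-contrastive term with a temperature and cosine similarity, and then the weighted local objective. The exact form of the contrastive term is an assumption: it treats the global class center of the sample's class as the positive and the representation from the frozen initial model as the negative, in the style of model-contrastive learning^{[26]}. The function names and the default temperature are also hypothetical.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def contrastive_loss(z, center, z_init, tau=0.5):
    """Assumed form: pull z toward its class center, push it from z_init."""
    pos = math.exp(cosine(z, center) / tau)
    neg = math.exp(cosine(z, z_init) / tau)
    return -math.log(pos / (pos + neg))

def local_objective(sup_loss, con_loss, mu=1.0):
    """Supervised loss plus the contrastive loss weighted by the coefficient mu."""
    return sup_loss + mu * con_loss
```

Under this assumed form the loss falls as `z` aligns with its class center and rises as it aligns with the initial-model representation; a larger `mu` gives the class center term more influence relative to the supervised term.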
In the first round, the server initializes the model and sends it to all parties. In the other rounds, the server receives the local base encoder models from the participants, aggregates them by weighted averaging to obtain the global base encoder, and sends it to the participants in the next round. Apart from initialization, the communication process transmits only part of the network parameters. In party-side training, the party updates the model using local data via SGD and updates each class center.
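The partial parameter exchange described above can be sketched as a split of each party's model into a shared part and a personalized part. The dictionary layout, the key name `base_encoder`, and the helper names are hypothetical; weighting by local dataset size is one common choice for the weighted average and is an assumption here.

```python
def split_parameters(model, shared_keys=("base_encoder",)):
    """Separate what is uploaded from what stays on the party."""
    shared = {k: v for k, v in model.items() if k in shared_keys}
    personal = {k: v for k, v in model.items() if k not in shared_keys}
    return shared, personal

def server_average(uploads, sizes):
    """Weighted average of the uploaded (shared) parameters only."""
    total = sum(sizes)
    return {k: [sum(n * u[k][j] for u, n in zip(uploads, sizes)) / total
                for j in range(len(uploads[0][k]))]
            for k in uploads[0]}

def start_round(global_shared, personal):
    """A party rebuilds its model from the new encoder and its own layers."""
    return {**personal, **global_shared}
```

Because only the entries in `shared_keys` ever leave the party, the projection head and output layer gradients are never visible to the server.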
4.4 Privacy Protection Based on Gradient Leakage
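Neural network models are usually trained with a cross-entropy loss over one-hot labels, which can be defined as:
$$\ell(x, c) = -\sum_{i=1}^{K} y_i \log p_i, \qquad p_i = \frac{e^{o_i}}{\sum_{j=1}^{K} e^{o_j}}$$
where $x$ is the input data, $c$ is the corresponding ground-truth label, and $K$ is the number of classes. We have $y_i = 1$ when $i = c$, otherwise $y_i = 0$. $o_i$ is the prediction score of input $x$ through the neural network, and $p_i$ denotes the output of $o_i$ after the activation function.
The gradient vector of the weight connected to the $i$th logit can be written as:
$$\nabla W_L^{i} = \frac{\partial \ell}{\partial o_i} \cdot \frac{\partial o_i}{\partial W_L^{i}} = (p_i - y_i)\, a_{L-1}^{\top}$$
where $a_{L-1}$ is the output of the layer preceding the output layer $L$. Since $p_i - y_i$ is negative only for $i = c$, and $a_{L-1}$ is non-negative whenever the activation function is non-negative, the signs of the rows of $\nabla W_L$ reveal the label. Based on this rule, which is independent of the model architecture and parameters, it is possible to identify the ground-truth label of the private training data from the shared gradient $\nabla W_L$. In other words, this inference is applicable to any network in any training phase from any random initialization of the parameters^{[23]}.
Gradient-based attacks require access to the complete gradient information, especially that of the last layer. An intuitive defense strategy is gradient masking, which transmits incomplete gradient information without affecting collaborative modeling. We design a layered network structure that locally preserves the gradient information of the personalized layers. For instance, when the last layer is masked, the attacker can only infer the label from the gradient information of the second-to-last layer. The gradient vector of the weight connected to the $j$th neuron of that layer can be written as:
$$\nabla W_{L-1}^{j} = \frac{\partial \ell}{\partial a_{L-1}^{j}} \cdot \frac{\partial a_{L-1}^{j}}{\partial W_{L-1}^{j}} = \left(\sum_{i=1}^{K} (p_i - y_i)\, w_L^{i,j}\right) \frac{\partial a_{L-1}^{j}}{\partial W_{L-1}^{j}}$$
where $w_L^{i,j}$ denotes the weight parameter of the output layer $L$ that connects the hidden-layer neuron $j$ to the $i$th logit. The sign of $\nabla W_{L-1}^{j}$ is associated with the uncertain value of $w_L^{i,j}$, so the relationship between the gradient information and the label cannot be accurately determined by the above conclusion. The experimental validation process is described in Section 5.10.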
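The sign argument can be checked numerically. The following Python sketch assumes a softmax output trained with cross-entropy, for which the gradient with respect to the $i$th logit is the softmax probability minus the one-hot label, and non-negative activations feeding the last layer; the function names and the toy values are hypothetical. The last function shows why masking the last layer removes the simple sign rule: the back-propagated signal now depends on the unknown last-layer weights.

```python
import math

def softmax(o):
    m = max(o)
    e = [math.exp(v - m) for v in o]
    s = sum(e)
    return [v / s for v in e]

def logit_grads(logits, label):
    """d(loss)/d(logit_i) = p_i - y_i for softmax cross-entropy."""
    p = softmax(logits)
    return [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]

def last_layer_weight_grad(logits, label, activations):
    """Row i is (p_i - y_i) times the non-negative input activations."""
    return [[g * a for a in activations] for g in logit_grads(logits, label)]

def infer_label(grad_rows):
    """Only the ground-truth row has a negative sum."""
    return min(range(len(grad_rows)), key=lambda i: sum(grad_rows[i]))

def signal_to_hidden(logits, label, last_weights):
    """Signal reaching hidden neuron j: sum_i (p_i - y_i) * w[i][j]."""
    g = logit_grads(logits, label)
    return [sum(g[i] * last_weights[i][j] for i in range(len(g)))
            for j in range(len(last_weights[0]))]
```

With the last layer shared, `infer_label` recovers the label from signs alone; with it withheld, flipping the sign of the hidden weights flips the sign of `signal_to_hidden`, so the same rule no longer applies.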
5 Experiment Studies
To demonstrate the superiority of this work, MpFedcon is compared with state-of-the-art federated learning algorithms. The global FL approaches include FedAvg^{[4]}, FedProx^{[20]}, and SCAFFOLD^{[13]}. Among the personalized FL approaches, PerFedAvg^{[27]} uses meta-learning to learn an initial model that is then fine-tuned for each task, APFL^{[21]} interpolates between local and global models, and Ditto^{[19]} learns local models and encourages them to stay tightly coupled through global regularization. Fedper^{[28]} and Fedrep^{[29]} also use a layered network architecture, as they learn a global representation and a personalized head. However, these methods do not explore privacy protection. We use SOLO as a baseline method. Recall that in the SOLO approach each party trains a model with local data without federated learning. Further, we compare against a single global model and its fine-tuned variant. To obtain the fine-tuning results, we first train the global model for the entire training cycle; each party then fine-tunes it on its local training data for only 10 SGD epochs, and the final test accuracy is calculated.
5.1 Experimental Setup
Experiments are conducted on three standard datasets: CIFAR-10, CIFAR-100^{[43]} and FEMNIST^{[44]}. The heterogeneity of CIFAR-10 and CIFAR-100 is controlled by assigning different numbers of classes to each party. Each party is assigned the same number of training samples. For FEMNIST, the dataset is restricted to 10 handwritten letters, and samples are assigned to the parties according to a log-normal distribution^{[38]}. The resulting partition contains 150 parties, with an average of 148 samples per party. As in previous work^{[28]}, a 5-layer CNN model is used as the base encoder for CIFAR-10 and CIFAR-100, and a 2-layer MLP for FEMNIST. The projection head for all methods consists of a 2-layer MLP, while the output layer is a single linear layer. MpFedcon performs 10 local epochs of SGD with momentum to train the local head, followed by one epoch for the base encoder layer in the case of CIFAR-10 and five epochs in all other cases. All other methods use the same number of local epochs as MpFedcon to update the base encoder layer. The accuracy is calculated by taking the average local accuracy of all users over the last 10 rounds of communication.
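The reported metric can be read as a mean over parties followed by a mean over the final rounds. The short Python sketch below encodes that reading; the function name and the nested-list input layout (one list of per-party accuracies per round) are assumptions.

```python
def final_accuracy(per_round_party_acc, last=10):
    """Average local accuracy over all parties in the last `last` rounds."""
    tail = per_round_party_acc[-last:]
    round_means = [sum(accs) / len(accs) for accs in tail]
    return sum(round_means) / len(round_means)
```

Averaging over several final rounds rather than reading off the last one smooths out round-to-round fluctuation caused by local training on skewed data.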
5.2 Accuracy Results
Table 1 lists the top-1 test accuracy of all methods. The SOLO method performs relatively well because the data assigned to each party is small and biased, so a purely local model fits it closely. The skewed data distribution severely impairs the performance of FedAvg. The SCAFFOLD and FedProx methods, which build on FedAvg, perform much worse than FedAvg, suggesting that it is difficult for them to find the right direction to correct for data heterogeneity. Furthermore, APFL and Ditto outperform the classical FedAvg because the mixing and regularization methods partly bridge the drift between the local and global models. Surprisingly, the fine-tuned FedAvg method performs well, probably because the fine-tuning adapts to the unique data distribution. The Fedper and Fedrep methods, based on layered networks, further improve the accuracy. However, none of the above methods addresses the inconsistency between local and global optimization objectives caused by data heterogeneity, which affects the final performance. It can be observed that MpFedcon performs the best on the datasets with different degrees of heterogeneity. Subject to similar semantic interference, MpFedcon has a 0.27% to 0.84% accuracy improvement on CIFAR-100 with hyperclassification, more than 1.65% on CIFAR-10, and more than 4.74% on FEMNIST. This shows that MpFedcon effectively improves federated learning performance.
Table 1 The top-1 accuracy of MpFedcon and the other methods on test datasets (unit: %)
5.3 Effect of Data Heterogeneity
To evaluate the effect of heterogeneity, we control the degree of heterogeneity of each party by varying the number of classes. For the CIFAR datasets, the number of training samples per party is equal to 50 000/n, where n represents the number of parties, so columns with 100 parties have 500 training samples per party. Comparatively, columns with 1 000 parties have only 50 training samples per party. As we can see from Table 1, MpFedcon achieves the best accuracy in all cases. The advantage of MpFedcon is the introduction of class centers, which serve as global knowledge to correct local training, while the personalized classification layers further fit local data to improve classification.
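One way to produce such a partition is sketched below in Python. It is not the authors' partitioning code: the cyclic assignment of classes to parties, the function name, and the requirement that labels are integers from 0 to K-1 are all assumptions made to keep the example short. Each party receives the same number of samples, drawn from a fixed number of classes.

```python
import random

def label_skew_partition(labels, n_parties, classes_per_party, seed=0):
    """Give every party equally many samples from only a few classes."""
    rng = random.Random(seed)
    n_classes = len(set(labels))
    pools = {k: [i for i, y in enumerate(labels) if y == k] for k in range(n_classes)}
    for pool in pools.values():
        rng.shuffle(pool)
    per_class = len(labels) // n_parties // classes_per_party
    parties = []
    for p in range(n_parties):
        # cyclic class assignment: consecutive parties take consecutive class blocks
        classes = [(p * classes_per_party + j) % n_classes
                   for j in range(classes_per_party)]
        indices = []
        for k in classes:
            if len(pools[k]) < per_class:
                raise ValueError("not enough samples left in class %d" % k)
            indices.extend(pools[k][:per_class])
            del pools[k][:per_class]
        parties.append(indices)
    return parties
```

For a 50 000-sample dataset this yields 50 000/100 = 500 samples per party with 100 parties and 50 000/1 000 = 50 with 1 000 parties, matching the sample counts stated above.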
5.4 Impact of Global Communication Rounds (T)
Figure 6 shows the accuracy of each round during the training period. MpFedcon achieves the best performance at the end of training. In addition, the curves in Fig. 6 show that MpFedcon sacrifices convergence speed in the early stages because learning the class center features affects the overall optimization direction early on. FedAvg, FedProx, and SCAFFOLD converge slowly and fluctuate greatly as the number of communication rounds increases. This shows that methods that share the same network or reduce the gap between the local and global networks are not well suited to heterogeneous settings. Although Fedper and Fedrep, based on simple layered networks, learn quickly in the early stage, MpFedcon performs better in the later stages. In other words, a better class-centered representation gives the classifier better classification ability at a later stage.
Fig. 6 Top-1 test accuracy with different numbers of communication rounds (T) 
5.5 Influence of Local Epoch Number (E)
We study the influence of local epoch numbers on the accuracy of the final model. Figure 7 shows the effect on accuracy and convergence speed during training. The accuracy and convergence speed are reduced when the number of local epochs is 1, especially FedAvg. It can be observed in Fig. 8 that when the number of local epochs most methods have the highest accuracy and faster convergence. This is because when E is small, the local networks cannot be fully trained and converge slowly. However, the improvement of accuracy and convergence speed will be slight when , and there may be overfitting for local training of skewed data, which leads to a decrease in the accuracy of the global model.
Fig. 7 Top-1 test accuracy curves of different local epoch numbers 
Fig. 8 Top-1 test accuracy line chart of local epoch number (E) of different algorithms 
5.6 Scalability
To demonstrate the scalability of MpFedcon, we train with different numbers of parties on the CIFAR-10 dataset. Specifically, we try two settings: 1) the dataset is divided among 50 parties and 5 parties are randomly selected in each round; 2) the dataset is divided among 100 parties and 10 parties are randomly selected in each round of federated training. The results are shown in Table 2. For MpFedcon, the results are shown for the best-performing value of μ, which outperforms Fedrep by over 2% accuracy at 200 rounds with 50 parties and by 5% accuracy at 200 rounds with 100 parties. Partial party participation means that the active data in each round is only a subset of all training data, which leads to unstable training and slower convergence. MpFedcon consistently achieves the best performance across the participation settings in Table 2, which shows that its performance is not degraded as the number of parties increases.
Top-1 test accuracy with varying number of parties (m) and communication rounds (T) on CIFAR-10 (heterogeneity: 100/5) (unit: %)
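The two partial-participation settings can be sketched as follows. This is only a minimal illustration, assuming parties are drawn uniformly at random without replacement in each round; the function name `sample_parties` and the fixed seed are hypothetical and do not come from the paper.

```python
import random


def sample_parties(num_parties, fraction, rng):
    """Pick the parties that take part in one communication round.

    Uniform sampling without replacement is an assumption of this sketch.
    """
    num_selected = max(1, round(num_parties * fraction))
    return sorted(rng.sample(range(num_parties), num_selected))


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
for num_parties in (50, 100):
    selected = sample_parties(num_parties, 0.1, rng)
    print(num_parties, "parties ->", len(selected), "selected per round")
```

With a sample fraction of 0.1, this selects 5 of 50 parties and 10 of 100 parties per round, matching the two settings described above.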
5.7 Effect of Coefficient in the Loss Function (μ)
In this work, we use the coefficient μ to adjust the relative weights of class-center feature learning and classifier learning during training. Experiments with different values of μ are conducted on CIFAR-10. Specifically, μ is a hyperparameter that weighs the optimization direction given by the class centers against the optimization direction given by the party's own dataset. As shown in Table 2, MpFedcon achieves the best results when μ = 10. A smaller μ increases the fitting effect of the personalization layer on a small amount of local data, thus improving the model accuracy. A larger μ slows down convergence in the short term, but it improves the overall classification effect in subsequent communication rounds.
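As a rough illustration of how μ trades off the two objectives, the sketch below assumes the local objective has the additive form "cross-entropy + μ × contrastive term", with the contrastive term computed as a softmax over cosine similarities between a sample's representation and the class centers. Both the additive form and this particular contrastive term are assumptions made for illustration; the exact MpFedcon loss is the one defined earlier in the paper. The function names, the temperature, and all numeric values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def log_softmax_at(scores, index):
    """Numerically stable log-softmax of `scores`, evaluated at `index`."""
    top = max(scores)
    log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
    return scores[index] - log_norm


def local_objective(logits, feature, centers, label, mu, temperature=0.5):
    """Cross-entropy plus mu times a class-center contrastive term (assumed form)."""
    cross_entropy = -log_softmax_at(logits, label)
    similarities = [cosine(feature, c) / temperature for c in centers]
    contrastive = -log_softmax_at(similarities, label)
    return cross_entropy + mu * contrastive


# Toy two-class example: one center per class, values made up for illustration.
centers = [[1.0, 0.0], [0.0, 1.0]]
for mu in (0.1, 1.0, 10.0):
    loss = local_objective([2.0, 0.5], [0.9, 0.1], centers, label=0, mu=mu)
    print(f"mu={mu}: loss={loss:.3f}")
```

In this sketch, a larger μ makes the distance to the correct class center dominate the objective, which is one way to read the slower short-term convergence and better later-stage classification described above.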
5.8 Communication Efficiency
The communication overhead of federated learning is mainly caused by the transfer of data (e.g., models, parameters) between the parties and the central server. Many current studies focus on reducing only one aspect, such as the number of communications, without considering the cost of a single transmission. We believe that a more credible metric for judging the communication cost is the total amount of data communicated until convergence. It can be expressed as the product of the number of communications and the traffic volume of a single communication.
For a fair comparison, each algorithm uses the same network structure, with a single-transfer traffic of 1.2 MB for CIFAR-10 and 2.2 MB for CIFAR-100.
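The metric above can be computed with a few lines of code. The sketch below assumes the total is simply the number of communications multiplied by the single-transfer volume; the per-transfer sizes are the ones quoted in the text, while the method names and round counts are hypothetical placeholders, not results from the paper.

```python
# Single-transfer traffic in MB, as quoted in the text.
SINGLE_TRANSFER_MB = {"CIFAR-10": 1.2, "CIFAR-100": 2.2}


def total_communication_mb(num_communications, dataset):
    """Total traffic = number of communications x single-transfer volume."""
    return num_communications * SINGLE_TRANSFER_MB[dataset]


# Hypothetical numbers of communications needed to reach a target accuracy.
hypothetical_rounds = {"method_a": 100, "method_b": 60}
for name, rounds in hypothetical_rounds.items():
    total = total_communication_mb(rounds, "CIFAR-10")
    print(f"{name}: {rounds} communications, {total:.1f} MB in total")
```

Under this metric, a method that needs fewer communications wins only if it does not transmit proportionally more data per communication.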
FedAvg reduces the number of communications by increasing the number of local updates, and it converges under both iid and non-iid data. However, its convergence speed is limited by the distribution of the dataset. In Table 3, the most representative algorithms are compared in heterogeneous environments by the cost of reaching the same accuracy. FedAvg pays a high communication cost to improve the model's accuracy. Fedprox and Fedrep have the same single-transfer cost as FedAvg but converge faster, so their total communication cost is smaller. In particular, Fedrep's personalized model dramatically improves the convergence speed and has the smallest communication cost. Compared with FedAvg and Fedrep, MpFedcon transmits a small amount of additional class-center features. However, as the data volume and the number of communication rounds increase, MpFedcon has a greater advantage in terms of computational cost. The contrastive loss term effectively improves accuracy without reducing the overall convergence speed.
Accuracy with 50 parties and 100 parties (sample fraction = 0.1) on CIFAR-10 and CIFAR-100 (heterogeneity: 100/5)
5.9 Effectiveness of MpFedcon
For demonstration purposes, we use SOLO and the classical FedAvg method to evaluate the effectiveness of MpFedcon. We take each party's SOLO result as the baseline, and Fig. 9 visualizes the improvement of each party after applying MpFedcon and FedAvg. As shown in Fig. 9, MpFedcon effectively improves precision for more than 70% of the participating parties in a highly heterogeneous setting. In contrast, the classical FedAvg brings almost no accuracy improvement for the participating parties; that is, it almost fails under data heterogeneity.
Fig. 9 Effectiveness of precision improvement of MpFedcon and FedAvg for 100 party segments involved in training 
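The per-party comparison against the SOLO baseline can be expressed as a simple gain computation. The sketch below is a minimal illustration; the accuracy values are made up and the helper names are hypothetical.

```python
def improvement_over_solo(solo_acc, method_acc):
    """Per-party gain of a federated method over the party's SOLO baseline."""
    return [m - s for s, m in zip(solo_acc, method_acc)]


def fraction_improved(gains):
    """Share of parties whose result is strictly better than their baseline."""
    return sum(g > 0 for g in gains) / len(gains)


# Made-up accuracies for four parties, for illustration only.
solo = [0.50, 0.60, 0.70, 0.80]
federated = [0.60, 0.70, 0.80, 0.70]
gains = improvement_over_solo(solo, federated)
print("fraction of parties improved:", fraction_improved(gains))
```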
5.10 Gradient Leakage-Based Privacy Defense
For a fair comparison, experiments are conducted on the CIFAR-10 and CIFAR-100 classification tasks following the settings in iDLG^{[22]}. LeNet is randomly initialized for all experiments, and we use L-BFGS^{[45]} with a learning rate of 1 as the optimizer.
The gradient attack is visualized under the same conditions for the same random image in Fig. 10. The curve represents the MSE between the generated image and the real image, and we also visualize the final image generated by each method. MpFedcon masks the gradient information of the classification layer, so the attacker cannot accurately know the number of the parties' personalized layers or their sensitive information. To test the possible cases of gradient attack, the experiments vary the number of FC layers on the party side relative to the number of FC layers on the attacker side. Even with only 1 FC layer, the attacker still fails to recover the image effectively after many iterations, whereas the DLG and iDLG methods accurately restore the image after 50 iterations.
Fig. 10 The effectiveness of various defense strategies 
In addition, Fig. 10 shows that the more FC layers the party masks, the larger the reconstruction error and the more difficult the attack becomes. Table 4 shows that the traditional defenses based on Gaussian noise and Laplace noise with a large variance effectively defend against gradient attacks, but both severely degrade the accuracy^{[22]}. The results show that MpFedcon effectively resists the privacy leakage caused by gradient-based attacks while preserving the model's accuracy.
Testing datasets performance (unit:%)
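For reference, the noise-based baselines and the MSE metric mentioned above can be sketched as follows. This is a minimal illustration using only the Python standard library; the noise scales and the toy gradient are hypothetical, since the variance used in the paper is not reproduced here, and the helper names are not taken from any library.

```python
import random


def add_gaussian_noise(grad, sigma, rng):
    """Perturb each gradient entry with zero-mean Gaussian noise."""
    return [g + rng.gauss(0.0, sigma) for g in grad]


def add_laplace_noise(grad, scale, rng):
    """Perturb each gradient entry with zero-mean Laplace noise.

    The difference of two i.i.d. exponentials with mean `scale`
    follows a Laplace(0, scale) distribution.
    """
    rate = 1.0 / scale
    return [g + rng.expovariate(rate) - rng.expovariate(rate) for g in grad]


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
grad = [0.02, -0.01, 0.005, 0.0]  # toy gradient, illustration only
for scale in (1e-4, 1e-2):
    noisy = add_gaussian_noise(grad, scale, rng)
    print(f"sigma={scale}: gradient distortion (MSE) = {mse(grad, noisy):.2e}")
```

The same `mse` helper applies to flattened images, which is how the distance between a generated image and the real image is measured in Fig. 10. A larger noise scale distorts the shared gradient more, which hinders the attack but also disturbs training; this is the accuracy trade-off noted above.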
6 Conclusion
Non-iid data is a significant obstacle to the practical use of federated learning. To improve the performance of federated learning models on non-iid datasets, we propose a new algorithm, MpFedcon, that also resists label leakage. Specifically, MpFedcon uses all parties' data to learn a global representation and corrects the local optimization direction to be consistent with the global distribution through a model-contrastive loss with the class center. Using the parties' computing resources to perform numerous local updates further fits the local data distribution while keeping sensitive information local to prevent label disclosure. Extensive experiments on various image classification datasets demonstrate the advantage of MpFedcon on non-iid data. As MpFedcon does not require the inputs to be images, it can potentially be applied to non-vision problems.
References
[1] Weber P A, Zhang N, Wu H M. A comparative analysis of personal data protection regulations between the EU and China [J]. Electronic Commerce Research, 2020, 20(3): 565-587.
[2] Guo P F, Wang P Y, Zhou J Y, et al. Multi-institutional collaborations for improving deep learning-based magnetic resonance image reconstruction using federated learning [C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2021: 2423-2432.
[3] Kairouz P, McMahan H B, Avent B, et al. Advances and open problems in federated learning [J]. Foundations and Trends® in Machine Learning, 2021, 14(1/2): 1-210.
[4] McMahan H B, Moore E, Ramage D, et al. Communication-efficient learning of deep networks from decentralized data [EB/OL]. [2022-09-17]. https://arxiv.org/abs/1602.05629.
[5] Mothukuri V. A survey on security and privacy of federated learning [J]. Future Generation Computer Systems, 2021, 115: 619-640.
[6] Wang X F, Wang C Y, Li X H, et al. Federated deep reinforcement learning for Internet of Things with decentralized cooperative edge caching [J]. IEEE Internet of Things Journal, 2020, 7(10): 9441-9455.
[7] Samarakoon S, Bennis M, Saad W, et al. Distributed federated learning for ultra-reliable low-latency vehicular communications [J]. IEEE Transactions on Communications, 2020, 68(2): 1146-1159.
[8] Begum A M, Mondal M R H, Podder P, et al. Detecting spinal abnormalities using multilayer perceptron algorithm [C]// Innovations in Bio-Inspired Computing and Applications. Cham: Springer International Publishing, 2022: 654-664.
[9] Dong J H, Cong Y, Sun G, et al. What can be transferred: Unsupervised domain adaptation for endoscopic lesions segmentation [C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2020: 4022-4031.
[10] Yang Q, Zhang J Y, Hao W T, et al. FLOP: Federated learning on medical datasets using partial networks [C]// Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. New York: ACM, 2021: 3845-3853.
[11] Ramaswamy S, Mathews R, Rao K, et al. Federated learning for emoji prediction in a mobile keyboard [EB/OL]. [2022-09-23]. https://arxiv.org/abs/1906.04329.
[12] Duan M M, Liu D, Chen X Z, et al. Self-balancing federated learning with global imbalanced data in mobile systems [J]. IEEE Transactions on Parallel and Distributed Systems, 2021, 32(1): 59-71.
[13] Karimireddy S P, Kale S, Mohri M, et al. SCAFFOLD: Stochastic controlled averaging for on-device federated learning [EB/OL]. [2022-09-23]. https://arxiv.org/abs/1910.06378.
[14] Khaled A, Mishchenko K, Richtárik P. Tighter theory for local SGD on identical and heterogeneous data [EB/OL]. [2022-09-23]. https://arxiv.org/abs/1909.04746.
[15] Jiang M R, Wang Z R, Dou Q. HarmoFL: Harmonizing local and global drifts in federated learning on heterogeneous medical images [J]. Proceedings of the AAAI Conference on Artificial Intelligence, 2022, 36(1): 1087-1095.
[16] Li T, Sahu A K, Zaheer M, et al. Federated optimization in heterogeneous networks [EB/OL]. [2022-09-23]. https://arxiv.org/abs/1812.06127.
[17] Hsieh K, Phanishayee A, Mutlu O, et al. The non-IID data quagmire of decentralized machine learning [C]// Proceedings of the 37th International Conference on Machine Learning. New York: ACM, 2020: 4387-4398.
[18] Wang J Y, Liu Q H, Liang H, et al. Tackling the objective inconsistency problem in heterogeneous federated optimization [EB/OL]. [2022-09-23]. https://arxiv.org/abs/2007.07481.
[19] Li T, Hu S Y, Beirami A, et al. Ditto: Fair and robust federated learning through personalization [EB/OL]. [2022-09-23]. https://arxiv.org/abs/2012.04221.
[20] Li T, Sahu A K, Zaheer M, et al. Federated optimization in heterogeneous networks [EB/OL]. [2022-09-23]. https://arxiv.org/abs/1812.06127.
[21] Deng Y Y, Kamani M M, Mahdavi M. Adaptive personalized federated learning [EB/OL]. [2022-09-23]. https://arxiv.org/abs/2003.13461.
[22] Zhu L, Liu Z, Han S. Deep leakage from gradients [EB/OL]. [2022-09-23]. https://arxiv.org/pdf/1906.08935.
[23] Zhao B, Mopuri K R, Bilen H. iDLG: Improved deep leakage from gradients [EB/OL]. [2022-09-23]. https://arxiv.org/abs/2001.02610.
[24] Snell J, Swersky K, Zemel R. Prototypical networks for few-shot learning [C]// Proceedings of the 31st International Conference on Neural Information Processing Systems. New York: ACM, 2017: 4080-4090.
[25] Chen X L, He K M. Exploring simple Siamese representation learning [C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2021: 15745-15753.
[26] Li Q B, He B S, Song D. Model-contrastive federated learning [C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2021: 10708-10717.
[27] Fallah A, Mokhtari A, Ozdaglar A. Personalized federated learning with theoretical guarantees: A model-agnostic meta-learning approach [C]// Proceedings of the 34th International Conference on Neural Information Processing Systems. New York: ACM, 2020: 3557-3568.
[28] Arivazhagan M G, Aggarwal V, Singh A K, et al. Federated learning with personalization layers [EB/OL]. [2022-09-30]. https://arxiv.org/abs/1912.00818.
[29] Collins L, Hassani H, Mokhtari A, et al. Exploiting shared representations for personalized federated learning [EB/OL]. [2022-09-23]. https://arxiv.org/abs/2102.07078.
[30] He K M, Fan H Q, Wu Y X, et al. Momentum contrast for unsupervised visual representation learning [C]// 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2020: 9726-9735.
[31] Chen T, Kornblith S, Norouzi M, et al. A simple framework for contrastive learning of visual representations [EB/OL]. [2022-09-23]. https://arxiv.org/abs/2002.05709.
[32] Khosla P, Teterwak P, Wang C, et al. Supervised contrastive learning [EB/OL]. [2022-09-23]. https://arxiv.org/abs/2004.11362.
[33] van Berlo B, Saeed A, Ozcelebi T. Towards federated unsupervised representation learning [C]// Proceedings of the Third ACM International Workshop on Edge Systems, Analytics and Networking. New York: ACM, 2020: 31-36.
[34] Wang P, Han K, Wei X S, et al. Contrastive learning based hybrid networks for long-tailed image classification [C]// 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). New York: IEEE, 2021: 943-952.
[35] Wang W, Zhou T, Yu F, et al. Exploring cross-image pixel contrast for semantic segmentation [C]// 2021 IEEE/CVF International Conference on Computer Vision (ICCV). New York: IEEE, 2021: 7283-7293.
[36] Song G C, Chai W. Collaborative learning for deep neural networks [C]// Proceedings of the 32nd International Conference on Neural Information Processing Systems. New York: ACM, 2018: 1837-1846.
[37] Wainakh A, Ventola F, Müßig T, et al. User-level label leakage from gradients in federated learning [J]. Proceedings on Privacy Enhancing Technologies, 2022, 2022(2): 227-244.
[38] Li T, Sahu A K, Zaheer M, et al. FedDANE: A federated Newton-type method [C]// 2019 53rd Asilomar Conference on Signals, Systems, and Computers. New York: IEEE, 2019: 1227-1231.
[39] Zhao Y, Li M, Lai L Z, et al. Federated learning with non-IID data [EB/OL]. [2022-09-23]. https://arxiv.org/abs/1806.00582.
[40] Yu F X, Rawat A S, Menon A K, et al. Federated learning with only positive labels [C]// Proceedings of the 37th International Conference on Machine Learning. New York: ACM, 2020: 10946-10956.
[41] van der Maaten L, Hinton G. Visualizing data using t-SNE [J]. Journal of Machine Learning Research, 2008, 9: 2579-2605.
[42] Oord A V D, Li Y Z, Vinyals O. Representation learning with contrastive predictive coding [EB/OL]. [2022-09-23]. https://arxiv.org/abs/1807.03748.
[43] Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images [EB/OL]. [2022-09-23]. https://www.semanticscholar.org/paper/LearningMultipleLayersofFeaturesfromTinyKrizhevsky/5d90f06bb70a0a3dced62413346235c02b1aa086.
[44] Caldas S, Duddu S M K, Wu P, et al. LEAF: A benchmark for federated settings [EB/OL]. [2022-09-23]. https://arxiv.org/abs/1812.01097.
[45] Liu D C, Nocedal J. On the limited memory BFGS method for large scale optimization [J]. Mathematical Programming, 1989, 45(1): 503-528.
[46] Zhang F D, Kuang K, You Z Y, et al. Federated unsupervised representation learning [EB/OL]. [2022-09-23]. https://arxiv.org/abs/2010.08982.