MpFedcon: Model-Contrastive Personalized Federated Learning with the Class Center

Abstract: Federated learning is an emerging distributed privacy-preserving framework in which parties collaboratively train models by sharing model or gradient updates instead of private data. However, the heterogeneity of local data distributions poses a significant challenge. This paper focuses on label distribution skew, where each party can access only a partial set of the whole class set, which makes global updates drift while aggregating these biased local models. In addition, many studies have shown that deep leakage from gradients endangers the reliability of federated learning. To address these challenges, this paper proposes a new personalized federated learning method named MpFedcon. It addresses the data heterogeneity problem and the privacy leakage problem from both global and local perspectives. Our extensive experimental results demonstrate that MpFedcon effectively resists label leakage and achieves better performance on various image classification tasks, remaining robust under partial participation, non-iid data, and heterogeneous parties.


Introduction
Data resources have become the lifeline of modern enterprise value creation and the new engine of digital technology. In the process of industrial digital transformation, a large amount of valuable data scattered among many parties is generated. Due to increasing privacy concerns and data protection regulations [1], parties cannot send their private data to a central server to train models. Federated learning (FL) is an emerging distributed machine learning paradigm that uses decentralized data from multiple parties to jointly train a shared global model without sharing individuals' raw data [2][3][4][5][6]. FL has achieved remarkable success in various industrial applications such as autonomous driving [7], wearable devices [8], medical diagnostics [9,10], and cell phones [11,12]. However, non-independent identically distributed (non-iid) data poses a significant challenge. The data distributions of parties in FL can be highly variable, since parties separately collect local data based on their own preferences and sampling spaces. Label distribution skew is a common and serious category of non-iid data [3]. Some studies have shown that non-iid data causes drift in the local updates of parties [13,14]. In addition, the global model is further scattered by a collection of mismatched local optima, which eventually leads to slow and unstable convergence of the overall training process [15][16][17].
A variety of efforts attempt to address the non-iid data challenge. Some studies have shown that reducing data variability can improve the convergence of the global FL model [18,19]. However, they usually need to modify the local distribution, which might result in the loss of important information about the inherent diversity of consumer behavior. Some methods stabilize the local training phase by penalizing the deviation between the local and global models in parameter space, such as FedProx [20] and SCAFFOLD [13]. Other studies, such as Ditto [19] and APFL [21], improve the generalization ability of the model by mixing global and local model strategies. We acknowledge that the local optima of the parties are fundamentally inconsistent with the global optimum in the heterogeneous FL setup. The majority of prior FL methods, however, compel local models to be consistent with the global model and ignore the problem of privacy leakage. For instance, DLG [22] and iDLG [23] have revealed that existing gradient-based privacy breaches mainly attack by inference through the properties of the last layer of the neural network.
Based on the above observations, we propose model-contrastive personalized learning with the class center, dubbed MpFedcon, a personalized federated learning framework based on FedAvg (Federated Averaging). Specifically, we apply a layered network that decouples the target neural network into a base encoder that participates in collaborative training and a locally preserved personalization layer. The base encoder learns global knowledge, while the personalization layer retains sensitive information to resist deep leakage from gradients. Each party's local training is corrected from a global perspective by contrastive learning with the global class center. A global class center is defined as each class's average vector of representations [24]. Further, inspired by SimSiam [25], MpFedcon greatly reduces computational complexity by training with only positive samples through model-contrastive learning [26], rather than negative sample pairs and large batches. MpFedcon significantly outperforms other state-of-the-art federated learning algorithms on various image classification datasets, including CIFAR-10, CIFAR-100, and FEMNIST [25,27]. For instance, MpFedcon achieves 83.3% top-1 accuracy on FEMNIST with 100 parties, while the best top-1 accuracy among existing studies is 78.56%. Compared with the classic FedAvg in the non-iid setting, MpFedcon improves the convergence speed by 3.7 and 28.5 times and reduces the communication cost by 73.2% and 96.5% on CIFAR-10 and CIFAR-100, respectively. The rest of this paper is arranged as follows: Section 1 reviews related work on FL, contrastive learning, and leakage from gradients. Section 2 explores the influence of local drift in FL. Section 3 gives the problem statement and motivation. Section 4 describes the proposed method. The experimental results are presented in Section 5 to demonstrate the efficiency of our method. Finally, Section 6 concludes our work. Overall, the main
contributions of this paper are as follows: 1) We propose a new personalized federated learning framework to solve the label distribution skew in FL, which mitigates the local and global drift problem by introducing the global class center model-contrastive learning to correct local training.
2) We explore the causes of gradient-based privacy leakage, then design and verify the effectiveness of layered networks for defending against gradient leakage attacks.
3) We design the local layered network architecture to effectively learn the global underlying knowledge through supervised and contrastive loss functions, which promotes tight intra-class and separable inter-class sample sets in the classification space.
4) We implement MpFedcon and conduct extensive experiments on different datasets. The results show that MpFedcon outperforms state-of-the-art methods in inference accuracy and computational efficiency.

Federated Learning
The standard federated learning approach aims to learn a single shared model that performs well on average across all parties. The classical federated learning method FedAvg [4] follows the typical four-step protocol shown in Fig. 1. ① The server randomly initializes the parameters of the global model and sends them to each party. ② Upon receiving the global model, each party updates the model on its local training data using stochastic gradient descent (SGD). ③ Each selected party uploads its local model parameters back to the server. ④ The server averages the model parameters to generate the global model for the next training round. These steps repeat until convergence.
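As a concrete sketch of this four-step loop, the toy Python code below runs FedAvg on a hypothetical linear-regression task; the parties, data, and learning rate are invented for illustration and are not part of the paper's setup:

```python
import numpy as np

def fedavg_round(global_w, local_datasets, lr=0.1, local_steps=10):
    """One FedAvg round: broadcast, local SGD, upload, weighted average.

    `local_datasets` holds hypothetical (X, y) regression parties; the
    inner loop stands in for each party's real training procedure.
    """
    local_models, sizes = [], []
    for X, y in local_datasets:                   # step 2: local updates
        w = global_w.copy()
        for _ in range(local_steps):
            grad = 2 * X.T @ (X @ w - y) / len(y)  # MSE gradient
            w -= lr * grad
        local_models.append(w)                    # step 3: upload
        sizes.append(len(y))
    # step 4: dataset-size-weighted average of the local models
    return np.average(local_models, axis=0, weights=np.array(sizes, float))

rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0])
parties = []
for _ in range(3):
    X = rng.normal(size=(50, 2))
    parties.append((X, X @ true_w))               # noiseless labels

w = np.zeros(2)
for _ in range(20):                               # repeat until convergence
    w = fedavg_round(w, parties)
```

With iid parties (as here) the averaged model converges to the shared optimum; the sections below show why this breaks down under label distribution skew.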
The non-iid problem has been addressed in a wealth of studies along three main lines: local training improvements, aggregation, and personalized models. Among local training improvements, FedProx [20] adds a proximal term to penalize the Euclidean distance between the local and global models, and SCAFFOLD [13] corrects the drift in local updates by introducing control variates. Other works improve aggregation efficiency, such as FedNova [18]. APFL [21] explores adaptive mixing of global and local models to achieve personalization. Fedper [28], Fedrep [29], and others explore layered network architectures, which aim to train personalized models for individual parties rather than a single shared global model.

Contrastive Learning
The core idea of contrastive learning is to attract positive sample pairs and repel negative ones. Contrastive learning is widely used in self-supervised representation learning. Supervised contrastive learning extends this idea by using label information to compose positive and negative samples. In practice, contrastive learning methods benefit from abundant negative samples. InfoDist [30] uses a memory bank to store negative sample pairs. SimCLR [31] directly uses the negative samples coexisting in the current batch, so it requires a large batch size. However, selecting representative and informative negative samples is a critical and challenging task. SimSiam [25] proposes a simple twin network that learns representations without negative sample pairs, large batches, or momentum encoders.
Contrastive learning in federated learning has recently emerged as an effective approach to the non-iid problem. Some existing approaches use a contrastive loss to compare different image representations and can thus exploit the huge amount of unlabeled data on distributed edge devices [32,33]. Wang et al. [34] used supervised contrastive learning to improve the quality of learned features and address the long-tail distribution problem in classification tasks. Wang et al. [35] explored contrastive federated learning for medical image segmentation. However, these works ignore the need for personalized models and do not explore gradient-based privacy leakage. In contrast to previous work, we introduce model-contrastive learning with the global class center into supervised learning to address the inconsistency of the embedding spaces across parties.

Leakage from Gradients
It was long accepted that exchanging gradients across parties would not leak private training data in distributed learning systems such as collaborative learning [36] and federated learning [2,3]. Recently, Zhu et al. [22] proposed DLG, which shows the possibility of recovering private training data from publicly shared gradients: DLG synthesizes virtual data and corresponding labels under the supervision of the shared gradients. iDLG [23] further demonstrates that the gradients of the last layer must leak the ground-truth labels when the activation function is non-negative. Wainakh et al. [37] further explored gradient-based leakage of true labels under large batches. Common privacy-protection techniques include adding noise, gradient compression, discretization, and differential privacy, but all of these reduce model accuracy to different degrees.

Local Drift in Federated Learning
In FedAvg, all parties optimize their models on their local datasets in each training round. Then the server updates the global model by solving

min_w f(w) = Σ_{i=1}^{n} (|D_i| / |D|) f_i(w),

where n is the number of parties, D_i is the private local dataset of party i, and f_i(w) is the expected loss of party i. The overall goal is to obtain a globally optimal model w* on the global dataset D = ∪_i D_i. There is a drift between the local and global models due to label distribution skew, a special kind of non-iid scene in which each party can only access a partial set of the whole class set [38]. The performance of FedAvg is significantly reduced under highly skewed non-iid data [13,20,39], indicating that ignoring local drift biases the global model. For comparison, we define a baseline approach called SOLO, in which each party trains a model only on its local data, without federated learning. In Fig. 2, we use a simple example to illustrate that local drift leads to a biased global model in FedAvg. Assume the model has a non-linear transformation function f (e.g., leaky ReLU). Suppose w_1 and w_2 are the local parameters of party 1 and party 2, x is a data point, and the corresponding outputs are y_1 = f(w_1 x) and y_2 = f(w_2 x). The parameters of the model generated by FedAvg can then be expressed as w_f = (w_1 + w_2)/2. Let w_c be the parameters of a centralized model that produces the ideal output. As shown in Fig. 2, we have w_f ≠ w_c and f(w_f x) ≠ (y_1 + y_2)/2, indicating that the global model in FedAvg is skewed, which may lead to slow convergence and poor accuracy.
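This averaging argument can be checked numerically. The snippet below uses made-up scalar parameters w_1 and w_2 and a leaky-ReLU f to show that the output of the averaged model differs from the average of the local outputs:

```python
def leaky_relu(v, slope=0.01):
    """Non-linear transformation f used in the example."""
    return v if v >= 0 else slope * v

# Illustrative local parameters and a data point (values are made up).
w1, w2, x = 2.0, -3.0, 1.0

y1 = leaky_relu(w1 * x)          # party 1 output: 2.0
y2 = leaky_relu(w2 * x)          # party 2 output: -0.03

w_f = (w1 + w2) / 2              # FedAvg-averaged parameter: -0.5
averaged_output = (y1 + y2) / 2  # average of local outputs: 0.985
fedavg_output = leaky_relu(w_f * x)  # output of averaged model: -0.005
```

Because f is non-linear, f((w1 + w2)/2 · x) need not equal (f(w1 x) + f(w2 x))/2, which is precisely the bias FedAvg suffers under drifted local models.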
Figure 3 shows the accuracy of training using only local datasets and the MSE (mean squared error) distance between the SOLO and FedAvg models under the same conditions. It indicates that accuracy does not improve noticeably and that inter-party drift becomes more severe as the number of local iterations increases.
In this case, each party should have a personalized model suited to its unique data distribution. It is also necessary to correct the local optimization direction from a global perspective, aligning it with the global optimization direction to improve the effectiveness of FL.

Problem Statement and Motivation
Suppose there are n parties P_1, ..., P_n, where party P_i holds a private local dataset D_i. The server and parties attempt to jointly learn the parameters of a global representation, while each party learns its unique model locally. Personalized federated learning solves

min_{θ, h_1, ..., h_n} (1/n) Σ_{i=1}^{n} f_i(h_i, θ),

where θ denotes the shared global representation and h_i the personalized parameters of party i. To further verify this intuition, we now discuss the observations that motivate the correction of local training. We explore a severely skewed data imbalance: label distribution skew, meaning each party can only access a subset of the entire class collection [40]. Specifically, we first train a CNN (convolutional neural network) model on CIFAR-10 as a centralized model. Then, we partition the dataset into 10 subsets in an unbalanced manner, where each subset contains 5 classes of data, and train a CNN model on each subset as a SOLO model. We use t-SNE [41] to visualize the hidden vectors of images from a randomly selected SOLO model and from the centralized model, as shown in Fig. 4(a) and Fig. 4(b). The SOLO method learns reasonable features, but its cluster compactness and cluster centers differ significantly from the global distribution in the ideal condition, which may hinder the accuracy of downstream classification tasks.

Method
Based on the above ideas, we propose MpFedcon, a simple and effective FL framework based on FedAvg. Since there is a fundamental contradiction between the local and global optima, MpFedcon constrains the local update direction to be consistent with the global optimum, and further fits each party's unique data distribution through personalized layers while sensitive information is retained locally. In the following, we present the local network architecture, the global class center, the local objective, and privacy protection against gradient leakage.

Local Network Architecture
As shown in Fig. 5, the local network consists of three components: a base encoder, a projection head, and an output layer. Since heterogeneous data distributed across tasks may share a common representation, we use the base encoder to extract common representation vectors from the inputs, improving the quality of each party's model. The representation is then mapped to a space with a fixed dimension by an additional projection head, implemented as a multilayer perceptron (MLP) with hidden layers, which helps improve the representation of the layers that precede it [31]. Finally, the output layer predicts values for each class. The locally retained personalized layers comprise the projection head and the output layer; they protect privacy and adapt to the local data distribution, further mitigating the impact of non-iid data on model training.
For ease of notation, with model weights w, we use F_w(·), D_w(·), P_w(·), and O_w(·) to denote the entire network, the base encoder, the projection head, and the output layer, respectively. In the supervised setup, the base encoder extracts the feature representation from the input x. The feature representation is mapped to a low-dimensional space through the projection head for computing the contrastive loss l_con. The output layer predicts class-wise logits s, which are used to calculate the typical supervised loss term. The model of party P_i is the composition of its local parameters and the shared representation, F_{w_i}(x) = h_i(d_w(x)), where h_i denotes the locally retained personalized layers (the projection head and output layer) and d_w denotes the common representation extracted by the base encoder.
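A minimal sketch of this decoupling is given below, using hypothetical NumPy weight matrices in place of real trained layers; only the base encoder would be shared with the server, while the projection head and output layer stay local:

```python
import numpy as np

rng = np.random.default_rng(1)

def mlp_layer(in_dim, out_dim):
    """Hypothetical randomly initialized linear layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

class LocalNetwork:
    """Three-part party network: shared base encoder + private head."""
    def __init__(self, in_dim=32, rep_dim=16, proj_dim=8, n_classes=10):
        self.encoder = mlp_layer(in_dim, rep_dim)   # shared with server
        self.proj = mlp_layer(rep_dim, proj_dim)    # kept local
        self.out = mlp_layer(proj_dim, n_classes)   # kept local

    def forward(self, x):
        d = np.maximum(x @ self.encoder, 0)  # D_w: common representation
        z = np.maximum(d @ self.proj, 0)     # P_w: projection for l_con
        s = z @ self.out                     # O_w: class-wise logits
        return d, z, s

    def shared_parameters(self):
        # Only the base encoder is uploaded; proj/out never leave the party.
        return self.encoder

net = LocalNetwork()
d, z, s = net.forward(rng.normal(size=(4, 32)))
```

The dimensions and initialization here are arbitrary; the point is the split between what is communicated (`encoder`) and what is retained (`proj`, `out`).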

The Global Class Center
As shown in Fig. 5, we introduce the global class center as the optimization target for each class from a global perspective. The global server stores and maintains the class centers through a memory bank [42]. In the supervised scenario, samples of the same class are restricted to the region around their class center, effectively correcting the skewed optimization direction caused by label distribution skew. The class centers are updated as

c_i^t = (1/m) Σ_{x ∈ class i} P_w(x),

where the sum runs over the samples x of class i, m is the number of such samples, P_w(·) denotes the feature output of the projection head, and c_i^t is the local class center obtained after training on local data in round t. The class centers of all parties are aggregated and averaged on the server to obtain the global class centers c^{t+1}, which the server distributes to the participants in the next round. Aggregated centers are more conducive to federated training than skewed local ones. We aim to find more desirable class center locations from a global perspective and thus improve the classification performance of downstream tasks.
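Assuming the plain per-class averaging described above, the local center update and server-side aggregation could be sketched as follows (the feature values are invented for illustration):

```python
import numpy as np

def local_class_centers(features, labels, n_classes):
    """c_i = mean of projection-head outputs over the samples of class i."""
    centers = np.zeros((n_classes, features.shape[1]))
    for c in range(n_classes):
        mask = labels == c
        if mask.any():                      # classes absent locally stay zero
            centers[c] = features[mask].mean(axis=0)
    return centers

def aggregate_centers(party_centers):
    """Server side: average the parties' local centers into global ones."""
    return np.mean(party_centers, axis=0)

# Toy projection-head outputs for one party: two class-0 samples, one class-1.
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
c_local = local_class_centers(feats, labels, n_classes=2)
c_global = aggregate_centers([c_local, c_local])  # two identical parties
```

In the real system the aggregation would need to handle classes missing at some parties (here they simply contribute zeros); the paper does not specify that detail, so this is one plausible choice.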

Local Objective
The local loss consists of two parts. The first is a typical supervised loss term (e.g., cross-entropy loss), denoted l_sup. The second is our proposed global class center model-contrastive loss term, denoted l_con. In the t-th training round, party i receives the common base encoder model θ^t and the global class centers, and combines them with its locally retained personalized layers h_i to form the initialized w_i^t for this round. Let d_i^t ← θ^t, where d_i^t denotes the initial model parameters in this round, which do not participate in the gradient update. Let z_glob = c_i^t represent the class center feature vector of the sample's class, z = P_w(x) the feature representation from the local model w_i^t being updated, and z_con = P_d(x) the representation of input x mapped by the initial model d_i^t. Since the global model has a more robust representation, we correct the local update direction by decreasing the distance between z and z_glob and increasing the distance between z and z_con. The model-contrastive loss is defined as

l_con = -log( exp(sim(z, z_glob)/τ) / (exp(sim(z, z_glob)/τ) + exp(sim(z, z_con)/τ)) ),

where τ denotes the temperature parameter and sim(·,·) is the cosine similarity. The local objective is to minimize

l = l_sup + μ l_con,

where μ is a hyperparameter that balances the two terms. The overall algorithm is described in Algorithm 1.
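A sketch of this loss is shown below; the two-dimensional feature vectors are made up, and the implementation assumes the model-contrastive form of [26] with a single positive (z_glob) and a single negative (z_con):

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity sim(a, b)."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def l_con(z, z_glob, z_con, tau=0.5):
    """Model-contrastive term: pull z toward the global class center
    feature z_glob, push it away from the previous-round feature z_con."""
    pos = np.exp(cos_sim(z, z_glob) / tau)
    neg = np.exp(cos_sim(z, z_con) / tau)
    return -np.log(pos / (pos + neg))

def local_objective(l_sup, z, z_glob, z_con, mu=1.0, tau=0.5):
    """l = l_sup + mu * l_con."""
    return l_sup + mu * l_con(z, z_glob, z_con, tau)

z_glob = np.array([1.0, 0.0])          # hypothetical global class center
z_con = np.array([0.0, 1.0])           # hypothetical previous-round feature
aligned = l_con(np.array([0.9, 0.1]), z_glob, z_con)  # near the center
drifted = l_con(np.array([0.0, 1.0]), z_glob, z_con)  # far from the center
```

As expected, a representation close to its class center incurs a smaller loss than one that has drifted toward the stale local feature, which is the corrective pressure the method relies on.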

Fig. 5 Overview of i-th local network architecture in MpFedcon
The feature extraction network (comprising the initial encoder, base encoder, and MLP) extracts the representations z and z_con, which are combined with the global center feature z_glob to calculate the contrastive loss l_con. The output layer FC predicts the class-wise logits used to compute the cross-entropy loss l_sup. When round t = 0, the server initializes the model w^0 and the centers c^0 and sends them to all clients. In the other rounds, the server receives the local base encoder models θ_i^t from the participants, updates them by weighted averaging to obtain θ^{t+1}, and sends it to the participants of the next round. Apart from initialization, the communication process transmits only partial network parameters. During party-side training, each party updates its model w_i^t on local data via SGD and updates each class center c_i.

Privacy Protection Based on Gradient Leakage
Neural network models are usually trained with a one-hot cross-entropy loss function, defined as

L = -Σ_{i=1}^{M} y_i log p_i,

where x is the input data, k is the corresponding ground-truth label, and M is the number of classes. We have y_i = 1 when i = k and y_i = 0 otherwise. s = [s_1, s_2, ..., s_M] is the prediction score of input x through the neural network, and p_i denotes the output of s_i after the softmax activation function.
The gradient vector ∇W_i^L of the weights W_i^L connected to the i-th logit can be written as

∇W_i^L = (∂L/∂s_i) · h^{L-1} = (p_i - y_i) · h^{L-1},

which is negative only when i = k, since p_i ∈ (0, 1) and the activations h^{L-1} are non-negative. Based on this rule, which is independent of the model architecture and parameters, the ground-truth label k of the private training data x can be identified from the shared gradient ∇W. In other words, this inference applies to any network at any training phase, from any random initialization of the parameters [23].
Gradient-based attacks require access to the complete gradient information, especially in the last layer. An intuitive defense strategy is gradient masking, which transmits incomplete gradient information that does not affect collaborative modeling. We design a layered network structure that locally preserves the gradient information of the personalized layers. For instance, when the last layer is masked, the attacker can only infer the label from the gradient information of the second-to-last layer. The gradient vector ∇W_i^{L-1} of the weights W_i^{L-1} connected to the i-th hidden neuron can be written as

∇W_i^{L-1} = Σ_j (p_j - y_j) W_{ji}^L · (∂h_i^{L-1}/∂W_i^{L-1}),

where W_{ji}^L denotes the last-layer weight associated with the hidden neuron h_i^{L-1}. The sign of ∇W_i^{L-1} depends on the uncertain values of W_{ji}^L, so the relationship between gradient information and labels can no longer be determined by the above rule. The experimental validation is described in Section 5.10.
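The label-leakage rule above can be demonstrated directly: with non-negative activations h, the last-layer gradient row of the true class is the only non-positive one, so an attacker seeing that layer recovers the label immediately. The network below is a hypothetical single FC layer, not the paper's model:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

rng = np.random.default_rng(0)
h = np.abs(rng.normal(size=5))    # non-negative activations (e.g. after ReLU)
W = rng.normal(size=(10, 5))      # last FC layer: 10 classes
k = 3                             # ground-truth label
y = np.zeros(10); y[k] = 1.0

p = softmax(W @ h)
grad_W = np.outer(p - y, h)       # dL/dW_i = (p_i - y_i) * h^{L-1}

# iDLG-style inference: with h >= 0, only row k has non-positive entries,
# so the row with the smallest sum reveals the label.
inferred = int(np.argmin(grad_W.sum(axis=1)))
```

Masking this layer, as the layered architecture does, removes exactly the quantity `grad_W` that makes the inference trivial; the attacker is left with the earlier layer, whose gradient sign also depends on the unknown masked weights.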

Experiment Studies
To demonstrate the superiority of this work, MpFedcon is compared with state-of-the-art federated learning algorithms. The global FL approaches include FedAvg [4], FedProx [20], and SCAFFOLD [13]. Among the personalized FL approaches, Per-FedAvg [27] uses meta-learning to learn an initial model that each task then fine-tunes; APFL [21] interpolates between local and global models; and Ditto [19] learns local models while encouraging them to stay close through global regularization. Fedper [28] and Fedrep [29] also use a layered network architecture, learning a global representation plus a personalized head. However, none of these methods explores privacy protection. We use SOLO as a baseline method; recall that in SOLO each party trains a model on its local data without federated learning. Further, we compare against the single global model and its fine-tuned variant. To obtain the fine-tuning results, we first train the global model for the entire training cycle, then each party fine-tunes on its local training data for 10 epochs of SGD before we calculate the final test accuracy.

Experimental Setup
Experiments are conducted on three standard datasets: CIFAR-10, CIFAR-100 [43], and FEMNIST [44]. The heterogeneity of CIFAR-10 and CIFAR-100 is controlled by assigning a different number of classes S to each party, with each party receiving the same number of training samples. For FEMNIST, the dataset is restricted to 10 handwritten letters, and samples are assigned to parties according to a log-normal distribution [38]; one partition contains 150 parties with an average of 148 samples per party. As in previous work [28], a 5-layer CNN is used as the base encoder for CIFAR-10 and CIFAR-100, and a 2-layer MLP for FEMNIST. The projection head for all methods is a 2-layer MLP, and the output layer is a single linear layer. MpFedcon performs 10 local epochs of SGD with momentum to train the local head, followed by one epoch for the base encoder on CIFAR-10 and five epochs in all other cases. All other methods use the same number of local epochs as MpFedcon to update the base encoder. Accuracy is computed as the average local accuracy of all users over the last 10 communication rounds.

Accuracy Results
Table 1 lists the top-1 test accuracy of all methods. The SOLO method performs reasonably well because it fits the small, biased local data of each party. The skewed data distribution severely impairs the performance of FedAvg. The SCAFFOLD and FedProx methods built on FedAvg perform much worse than FedAvg, suggesting that it is difficult for them to find the right correction direction under data heterogeneity. Furthermore, APFL and Ditto outperform the classical FedAvg because their hybrid and regularization strategies partly bridge the local and global model drift. Surprisingly, the fine-tuned FedAvg performs well, probably because fine-tuning adapts the model to each unique data distribution. Fedper and Fedrep, based on layered networks, further improve accuracy. However, none of the above methods addresses the inconsistency between local and global optimization objectives caused by data heterogeneity, which limits their final performance. MpFedcon performs the best on datasets with different degrees of heterogeneity. Despite the interference of similar semantics, MpFedcon achieves a 0.27% to 0.84% accuracy improvement on CIFAR-100 with its superclass structure, more than 1.65% on CIFAR-10, and more than 4.74% on FEMNIST. This shows that MpFedcon effectively improves federated learning.

Effect of Data Heterogeneity
To evaluate the effect of heterogeneity, we control the degree of heterogeneity of each party by varying the number of classes S. For the CIFAR datasets, the number of training samples per party equals 50 000/n, where n is the number of parties, so the columns with 100 parties have 500 training samples per party, while the columns with 1 000 parties have only 50. As Table 1 shows, MpFedcon achieves the best accuracy in all cases. Its advantage comes from the class centers, which serve as global knowledge to correct local training, while the personalized classification layers further fit the local data to improve classification.

Impact of Global Communication Rounds (T)
Figure 6 shows the accuracy of each round during training. MpFedcon achieves the best performance at the end of training. The curves in Fig. 6 also show that MpFedcon sacrifices some convergence speed in the early stages, because learning the class center features influences the overall optimization direction early on. FedAvg, FedProx, and SCAFFOLD converge slowly and fluctuate greatly as the communication rounds increase, showing that sharing a single network or merely narrowing the gap between local and global networks is not suitable under heterogeneous settings. Although Fedper and Fedrep, based on simple layered networks, learn quickly in the early stage, MpFedcon performs better in the later stages. In other words, a better class-center representation gives the classifier stronger classification ability later in training.

Influence of Local Epoch Number (E)
We study the influence of the number of local epochs on the accuracy of the final model. Figure 7 shows its effect on accuracy and convergence speed during training. Accuracy and convergence speed drop when the number of local epochs is 1, especially for FedAvg. As Fig. 8 shows, most methods achieve their highest accuracy and faster convergence at E = 10. When E is small, the local networks cannot be fully trained and converge slowly. However, the improvement in accuracy and convergence speed becomes slight when E > 10, and local training may overfit the skewed data, decreasing the accuracy of the global model.

Scalability
To demonstrate the scalability of MpFedcon, we vary the number of parties participating in training on the CIFAR-10 dataset. Specifically, we try two settings: 1) the dataset is divided into 50 parties and 5 parties are randomly selected per round; 2) the dataset is divided into 100 parties and 10 parties are randomly selected per round. The results are shown in Table 2. For MpFedcon, results are reported for μ ∈ {0.5, 5, 10}; the best setting outperforms Fedrep by over 2% accuracy at 200 rounds with 50 parties and by 5% at 200 rounds with 100 parties. Partial participation means that the active data is only a subset of all training data, which leads to unstable training and slower convergence. Nevertheless, MpFedcon consistently achieves the best performance under both participation settings in Table 2, showing that its performance does not degrade as the number of parties increases.

Effect of Coefficient in the Loss Function (μ)
In this work, we use the coefficient μ to balance class-center feature learning against classifier learning during training. Experiments with different values of μ are conducted on CIFAR-10. Specifically, μ weighs the class-center optimization direction against the optimization direction of the local dataset. As shown in Table 2, MpFedcon achieves the best results when μ = 10. A smaller μ increases the fitting of the personalization layer to the small amount of local data, improving accuracy in the short term, while a larger μ slows convergence initially but improves the overall classification effect in subsequent rounds.

Communication Efficiency
The communication overhead of federated learning mainly comes from transferring data (e.g., models and parameters) between the parties and the central server. Many studies focus on reducing only one aspect, such as the number of communication rounds, without considering the cost of a single transmission. We believe a more credible metric for communication cost is the total amount of data communicated until convergence:

Traffic = rounds × traffic_1,   (10)

where Traffic is the total communication volume, rounds is the number of communication rounds, and traffic_1 denotes the volume of a single transmission. For a fair comparison, each algorithm uses the same network structure, with a single transmission of 1.2 MB for CIFAR-10 and 2.2 MB for CIFAR-100.
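Eq. (10) can be made concrete with a small helper; the round counts below are illustrative, and only the 1.2 MB per-round figure comes from the text:

```python
def total_traffic(rounds, traffic_per_round_mb):
    """Eq. (10): total communication volume (MB) until convergence."""
    return rounds * traffic_per_round_mb

# Hypothetical comparison: a slow-converging method needing 100 rounds
# vs. a fast one needing 40, both at 1.2 MB per round (CIFAR-10 setting).
slow = total_traffic(100, 1.2)   # 120 MB total
fast = total_traffic(40, 1.2)    # 48 MB total
```

The metric rewards methods that converge in fewer rounds only when their per-round payload is not correspondingly larger, which is why the comparison in Table 3 fixes the network structure.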
FedAvg reduces the number of communication rounds by increasing the number of local updates, and it converges under both iid and non-iid data. However, its convergence speed is limited by the distribution of the dataset. As shown in Table 3, the most representative algorithms are compared in heterogeneous environments at the same target accuracy. FedAvg sacrifices communication cost to improve model accuracy. FedProx and Fedrep have the same single-transmission cost as FedAvg, benefiting from

Effectiveness of MpFedcon
For demonstration purposes, we use SOLO and the classical FedAvg to evaluate the effectiveness of MpFedcon. Taking each party's SOLO result as the baseline, Fig. 9 visualizes the improvement of each party under MpFedcon and FedAvg. As shown in Fig. 9, MpFedcon effectively improves accuracy for more than 70% of the participating parties in a highly heterogeneous setting, whereas the classical FedAvg yields almost no accuracy improvement for the participating parties and nearly fails under data heterogeneity.

Gradient Leakage-Based Privacy Defense
For a fair comparison, experiments on the classification tasks are conducted on the CIFAR-10 and CIFAR-100 datasets following the settings in iDLG [23]. LeNet is randomly initialized in all experiments, and we use L-BFGS [45] with a learning rate of 1 as the optimizer.
The gradient attack is visualized under the same conditions for the same random image in Fig. 10. The curve represents the MSE between the generated image and the real image; we then visualize the final image generated by each method. MpFedcon masks the gradient information of the classification layer, so the attacker cannot know the parties' personalized layers and sensitive information. To test the possible cases of gradient attack, the experiments vary the setting cl/at, where cl is the number of FC layers on the party side and at is the number of FC layers assumed by the attacker. Even with only 1 masked FC layer, the attacker still fails to reconstruct the image effectively after many iterations, whereas DLG and iDLG accurately restore the image after 50 iterations.
In addition, Fig. 10 shows that the more FC layers a party masks, the larger the reconstruction error and the harder it is to attack. Table 4 shows that the traditional defenses based on Gaussian and Laplace noise with a large variance of 10^-2 defend effectively against the attack, but both severely degrade accuracy [22]. Overall, the results show that MpFedcon effectively resists gradient-based privacy leakage while preserving model accuracy.

Conclusion
Non-iid data is a significant obstacle to the usability of federated learning. To improve the performance of federated learning models on non-iid datasets, we pro-

Fig. 3 Impact of different epochs when the party uses only local data. "ep" denotes the number of local epochs; the bar chart shows the MSE distance between the SOLO and FedAvg models; the curves indicate the accuracy of different local epoch settings in each round.

Figure 4(c) shows the representation learned by the FedAvg algorithm. We can observe that points of the same class are more confused in Fig. 4(c) than in Fig. 4(a). FedAvg can even lead the model to learn a worse representation due to the skewed local data distribution, further verifying that the inconsistency between local and global data distributions significantly affects the performance of federated learning. MpFedcon corrects the local update direction by introducing a global class center from the perspective of global clustering. As shown in Fig. 4(d), the local party data are restricted to the same region as the global distribution under MpFedcon, leaving room to further improve the aggregation of the central model and enhance the classification of downstream tasks.

Fig. 10 The effectiveness of various defense strategies

Algorithm 1: The MpFedcon framework
Input: number of communication rounds T, number of parties n, number of local epochs E, participation rate r, step size α, temperature τ, learning rate η, hyper-parameter μ, number of local updates for the common representation τ AE , number of local updates for the head τ h

Table 1 The top-1 accuracy (%) of MpFedcon and the other methods on test datasets
* means the number of clients is 100 and the number of classes is 2

Table 4 Test-set performance (%)
Note: G means Gaussian noise, and L means Laplace noise