Wuhan Univ. J. Nat. Sci.
Volume 28, Number 5, October 2023
Page(s): 451-460
DOI: https://doi.org/10.1051/wujns/2023285451
Published online: 10 November 2023
Information Technology
CLC number: TP301.6
Improved Hybrid Collaborative Filtering Algorithm Based on Spark Platform
1 College of Computer Information Engineering, Jiangxi Normal University, Nanchang 330022, Jiangxi, China
2 National-Level International Science and Technology Cooperation Base of Networked Supporting Software, Nanchang 330022, Jiangxi, China
Received: 1 July 2022
An improved Hybrid Collaborative Filtering algorithm (H-CF) is proposed to address the data sparsity, low recommendation accuracy, and poor scalability of traditional collaborative filtering algorithms. The core of H-CF is a linear weighted hybrid algorithm based on the Latent Factor Model (LFM) and the Improved Item Clustering and Similarity Calculation Collaborative Filtering Algorithm (ITCSCF). To begin with, the items are clustered based on their attribute dimensions, which accelerates the computation of the nearest neighbor set. Subsequently, H-CF enhances the scoring-similarity formula by penalizing popular items and tuning the weights of unpopular items, which makes the scoring similarity more rational and reduces the impact of data sparseness. Furthermore, a weighting function is employed to combine the improved algorithms, and the balance factor of the weighting function is dynamically adjusted to attain the optimal recommendation list. To address real-time and scalability concerns, the algorithm leverages the Spark big data distributed cluster computing framework. Experiments were conducted on the public MovieLens dataset, comparing the improved algorithm against the algorithm before enhancement and against the same algorithm running on a single machine. The experimental results demonstrate that the improved algorithm outperforms its counterparts in terms of data sparsity handling, recommendation personalization, accuracy, recall, and efficiency.
Key words: recommendation algorithm / collaborative filtering / latent factor model / score weighting / item clustering / Spark / similarity calculation
Biography: YOU Zhen, female, Associate professor, research direction: software formalization, concurrent distributed computing, virtual reality, big data algorithm. E-mail: youzhenjxnu@163.com
Foundation item: Supported by the Natural Science Foundation of Jiangxi Province (20212BAB202018), Provincial Virtual Simulation Experiment Education Project of Jiangxi Education Department (2020-2-0048) and the Science and Technology Research Project of Jiangxi Province Educational Department (GJJ210333)
© Wuhan University 2023
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
0 Introduction
With the rapid development of Internet information technology, people have entered the era of big data, leading to an explosive growth of data information. Faced with this vast amount of information resources, people often have to invest significant time and effort in filtering the content they are interested in. To address this issue of information overload[1], the recommendation algorithm emerged[2]. Serving as a novel form of implicit information service, the recommendation algorithm has found extensive application in the Internet industry, yielding favorable outcomes in e-commerce, short video platforms, and other domains. Furthermore, it substantially reduces the cost of user information retrieval.
Among recommendation algorithms, collaborative filtering is the most widely used[3]; it can be categorized into item-based collaborative filtering[4], user-based collaborative filtering[5], and model-based collaborative filtering[6]. The fundamental idea is to compute user similarity or item similarity solely from users' interaction behavior data, without extracting item features, and thereby recommend content that might interest users. However, traditional collaborative filtering also has shortcomings, such as data sparsity, low accuracy, lack of timeliness, and scalability issues[7]. Numerous scholars have conducted extensive research on recommendation algorithms and systems. Liu et al[8] proposed a multi-factor-weight collaborative filtering recommendation algorithm that introduces time data and penalty factors into the similarity coefficient, yielding a new similarity calculation method; by considering the dynamic changes of interest factors, it significantly improves recommendation accuracy. Nonetheless, this approach suffers from high computational complexity, and its effectiveness on sparse data is unclear. Tao et al[9] proposed a recommendation algorithm based on grey incidence clustering, which employs the grey incidence degree to determine user similarity. This method guarantees recommendation quality under high sparsity but can only handle datasets with internal correlations between users and items. Other approaches, like the Kullback-Leibler (KL)-based Similarity Measure (KLCF)[10], incorporate item-similarity weights into the user similarity formula, enhancing the accuracy of similarity to some extent, but they involve a substantial number of Cartesian products and thus high calculation costs. Zhang et al[11] proposed an explicit-implicit feedback algorithm with a similarity weighting strategy, which effectively improved recommendation accuracy under a differential privacy index but failed to improve accuracy when the dataset was sparse. Most of the references above focus on enhancing recommendation accuracy by improving similarity, but each addresses a single direction, neglecting the simultaneous problems of low accuracy and data sparsity. Furthermore, with the exponential growth of data volume, previous stand-alone recommendation models struggle with large-scale recommendation calculations: the processing efficiency of stand-alone computing is low, resulting in poor scalability and slow real-time recommendation responses.
To address the issues of data sparsity, low accuracy, and poor scalability in traditional collaborative filtering algorithms, we propose an enhanced Hybrid Collaborative Filtering algorithm (H-CF). This algorithm incorporates several improvements. Firstly, we cluster items by their attributes to accelerate the calculation of nearest neighbor sets. Secondly, we enhance the scoring-similarity formula to bring the computed similarity between items closer to its actual value, thereby reducing the impact of data sparsity and improving recommendation accuracy. Thirdly, the two improved algorithms are combined by linear weighting with a balance factor, allowing us to fully exploit users' personalization and potential preferences and obtain the optimal recommendation list. Finally, we leverage Spark distributed computing and the clustering technique to enhance the scalability of the recommendation system and the responsiveness of the algorithm.
1 Collaborative Filtering Algorithm
1.1 Alternating Least Squares (ALS) Algorithm Based on the Latent Factor Model (LFM)
LFM[12] is a type of matrix factorization, belonging to the realm of machine learning algorithms. It incorporates latent factors to establish the relationship between user interests and items. The fundamental idea of this model lies in breaking down the high-dimensional data matrix of user ratings on items. To tackle this problem, the key algorithm employed is the ALS algorithm, which involves three main stages. Firstly, the user-item rating matrix is constructed. Then, the rating matrix is decomposed into the product of two low-rank matrices. Finally, ratings are predicted and recommendations are made. The calculation process is as follows.
1) Build the user-item rating matrix: construct a rating matrix $R_{m\times n}$ with m rows and n columns, as shown in Fig. 1, where $R_{ui}$ denotes user u's interest score for item i. Since each user need not rate all items, $R_{m\times n}$ is often a sparse matrix.
Fig. 1 LFM principle: user-item rating representation diagram
2) ALS matrix dimension-reduction calculation: approximate the matrix $R_{m\times n}$ by the product of two low-rank matrices $X_{m\times f}$ and $Y_{f\times n}$, where $X_{m\times f}$ is the users' implicit preference matrix for the items, $Y_{f\times n}$ is the implicit feature matrix contained in the items, and f is the number of hidden-class features, with $f\le\min(m,n)$. The formula is as follows:

$$R_{m\times n}\approx X_{m\times f}\,Y_{f\times n}\tag{1}$$
To make the product as close as possible to $R_{m\times n}$, a loss function that minimizes the squared error is used, and a regularization term is added to prevent overfitting:

$$C(X,Y)=\sum_{(u,i)}\left(r_{ui}-x_u^{\mathrm T}y_i\right)^2+\lambda\left(\sum_{u}\left\|x_u\right\|^2+\sum_{i}\left\|y_i\right\|^2\right)\tag{2}$$

Here $x_u$ is the implicit feature vector of user u's preference for items, $y_i$ is the implicit feature vector of item i, $r_{ui}$ is the actual rating of the i-th item by the u-th user, $x_u^{\mathrm T}y_i$ is the approximate score of the item for user u, and $\lambda\left(\sum_{u}\left\|x_u\right\|^2+\sum_{i}\left\|y_i\right\|^2\right)$ is the regularization term, where the regularization coefficient $\lambda$ can be obtained by cross-validation.
Because formula (2) contains the two coupled variables $x_u$ and $y_i$, we solve it with the ALS (alternating least squares) method. Fixing X, taking the partial derivative of the loss function C(X,Y) with respect to $y_i$ and setting it to zero, we can get:

$$y_i=\left(X^{\mathrm T}X+\lambda I\right)^{-1}X^{\mathrm T}R_i\tag{4}$$
In the same way, fixing Y and taking the partial derivative with respect to $x_u$, we can get:

$$x_u=\left(YY^{\mathrm T}+\lambda I\right)^{-1}YR_u\tag{5}$$

where $R_i$ and $R_u$ denote the rating vectors of item i and user u, respectively.
Equations (4) and (5) are computed alternately and repeatedly, and the root mean square error (RMSE)[13] is introduced as the stopping criterion for the iteration.
3) Prediction score: the result is judged to have converged when the RMSE value fluctuates only slightly and reaches a certain level of accuracy, or when the maximum number of iterations is reached. The final trained matrices are then used to predict scores according to formula (6):

$$P_{als}(u,i)=x_u^{\mathrm T}y_i\tag{6}$$
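As a concrete illustration, the following is a minimal PySpark sketch of this train-and-evaluate loop using Spark MLlib's built-in ALS. The file path, column names, and hyperparameter values are illustrative assumptions rather than the paper's exact configuration (coldStartStrategy also requires Spark ≥ 2.2):

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

spark = SparkSession.builder.appName("lfm-als").getOrCreate()

# MovieLens-style ratings: userId, movieId, rating (schema is an assumption)
ratings = spark.read.csv("ratings.csv", header=True, inferSchema=True)
train, test = ratings.randomSplit([0.9, 0.1], seed=42)

# rank = f (number of hidden-class features); regParam = lambda in formula (2)
als = ALS(rank=10, maxIter=15, regParam=0.1,
          userCol="userId", itemCol="movieId", ratingCol="rating",
          coldStartStrategy="drop")  # drop NaN predictions for unseen users/items
model = als.fit(train)

# RMSE on the held-out split, matching the paper's evaluation metric
preds = model.transform(test)
rmse = RegressionEvaluator(metricName="rmse", labelCol="rating",
                           predictionCol="prediction").evaluate(preds)
print(f"RMSE = {rmse:.4f}")
```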
1.2 Item-Based Collaborative Filtering Algorithm
The Item-Based Collaborative Filtering algorithm (Item CF) is based on the similarity of user ratings of items and recommends items using the nearest neighbor set as a reference; it suggests items similar to the ones the user has previously liked. The dataset used is the user-item rating data, i.e., the $R_{m\times n}$ rating matrix of Fig. 1, where $R_{ui}$ represents user u's interest score for item i. Since each user has rated only a subset of items, the matrix is sparse. The implementation of the algorithm is divided into two stages: first, the similarity between items is calculated to obtain the nearest neighbor set; then, scores are predicted and recommendations are made accordingly. The procedure is as follows:
1) Similarity calculation: the similarity calculation is the most crucial component of the algorithm. Its objective is to characterize the similarity between items and derive the nearest neighbor set, a pivotal factor in recommendation accuracy. A modified (adjusted) cosine similarity is employed, because score scales vary among individual users and the scores must be normalized to remove the user dimension. The formula is as follows:

$$\mathrm{sim}(i,j)=\frac{\sum_{u\in U_{i,j}}\left(R_{ui}-\bar R_u\right)\left(R_{uj}-\bar R_u\right)}{\sqrt{\sum_{u\in U_{i,j}}\left(R_{ui}-\bar R_u\right)^2}\sqrt{\sum_{u\in U_{i,j}}\left(R_{uj}-\bar R_u\right)^2}}\tag{7}$$

The larger sim(i,j) is, the higher the similarity between items i and j. The set $U_{i,j}$ contains the users who have rated both i and j, $R_{ui}$ is user u's rating of item i, and $\bar R_u$ is the average rating over all items rated by user u.
2) Predicted score: the score is calculated from the similarity between the user's unrated item and the items in its nearest neighbor set. The scoring model formula is as follows:

$$P(u,i)=\bar R_i+\frac{\sum_{j\in N_i}\mathrm{sim}(i,j)\left(R_{uj}-\bar R_j\right)}{\sum_{j\in N_i}\left|\mathrm{sim}(i,j)\right|}\tag{8}$$

where P(u,i) is the predicted score of item i by user u, $N_i$ is the set of the top n neighbors with the largest similarity to item i, and $\bar R_i$ and $\bar R_j$ are the mean scores of items i and j.
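For concreteness, here is a pure-Python sketch of formulas (7) and (8) on a toy rating dictionary; the data, function names, and neighbor count are illustrative assumptions:

```python
from math import sqrt

# Toy user -> item -> rating data (illustrative only)
ratings = {
    "u1": {"i1": 5.0, "i2": 3.0, "i3": 4.0},
    "u2": {"i1": 4.0, "i2": 2.0},
    "u3": {"i2": 5.0, "i3": 1.0},
}

def user_mean(u):
    vals = ratings[u].values()
    return sum(vals) / len(vals)

def sim(i, j):
    """Adjusted cosine similarity of formula (7): scores are centered by each
    co-rating user's mean to remove individual rating scales."""
    users = [u for u in ratings if i in ratings[u] and j in ratings[u]]
    num = sum((ratings[u][i] - user_mean(u)) * (ratings[u][j] - user_mean(u)) for u in users)
    den_i = sqrt(sum((ratings[u][i] - user_mean(u)) ** 2 for u in users))
    den_j = sqrt(sum((ratings[u][j] - user_mean(u)) ** 2 for u in users))
    return num / (den_i * den_j) if den_i and den_j else 0.0

def item_mean(i):
    vals = [ratings[u][i] for u in ratings if i in ratings[u]]
    return sum(vals) / len(vals)

def predict(u, i, n=10):
    """Neighbor-weighted prediction of formula (8)."""
    rated = [j for j in ratings[u] if j != i]
    neighbors = sorted(rated, key=lambda j: sim(i, j), reverse=True)[:n]
    num = sum(sim(i, j) * (ratings[u][j] - item_mean(j)) for j in neighbors)
    den = sum(abs(sim(i, j)) for j in neighbors)
    return item_mean(i) + num / den if den else item_mean(i)

print(predict("u3", "i1"))
```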
2 Improved Hybrid Collaborative Filtering(H-CF) Algorithm
Data sparsity, accuracy, and scalability have long been the most prominent issues in recommendation algorithms. A traditional single collaborative filtering algorithm can neither address all of these problems comprehensively nor resolve them by improving one aspect in isolation. To overcome these limitations, this paper introduces an enhanced H-CF algorithm that compensates for these shortcomings. Its key components are item clustering built on the traditional Item CF algorithm, an enhanced scoring-similarity calculation, and a linear weighted fusion of the improved algorithms.
2.1 Improved Collaborative Filtering Algorithm Based on Item CF
To reduce the impact of data sparsity and the computational complexity, an enhancement of the traditional Item CF algorithm is designed, named the Improved Item Clustering and Similarity Calculation Collaborative Filtering Algorithm (ITCSCF); it also serves as a sub-algorithm of the final hybrid collaborative filtering algorithm. First, items are clustered by their attributes, which narrows the search scope of the nearest neighbor set and reduces the complexity of similarity calculations. Then, the scoring-similarity formula is improved: since users show different degrees of interest in popular and unpopular items when data is sparse, a penalty is applied to popular items and the weights of unpopular items are tuned, reducing the impact of data sparsity and enhancing the accuracy of the similarity.
2.1.1 Item clustering
Item CF needs to traverse the entire item dataset when finding the nearest neighbors of a target item: for matrix data with m rows and n columns, the time complexity of the traditional collaborative filtering algorithm is O(n·m²)[14]. To mitigate the impact of data sparsity and reduce computational complexity, the scored items can be clustered. The approach selects M items as the initial cluster centers, then traverses all items, computing the similarity between each item and the center points; each item is assigned to the cluster of its most similar center. Next, the average value of each cluster is calculated and used to update the current cluster center.
This process is repeated until the center points no longer change. As a result, the neighborhood calculation is reduced from the entire item space to a few clusters, significantly reducing computational complexity. The time complexity after clustering is O(m·k·t), where m is the number of data points, k the number of cluster centers, and t the number of iterations. The clustering algorithm is presented as follows:
Clustering algorithm
Input: Scored item set UIDB and number of clusters M
Output: M cluster classes
Step 1: Select the M initial cluster centers: traverse the item set UIDB, count the number of ratings S of each item i, sort all items by S, and take the first M items with the largest S values as the initial centers. The cluster center set is recorded as C = {c1, c2, …, cM}, and the set of clusters around these centers as UC = {C1, C2, …, CM}.
Step 2: Traverse and calculate the cosine similarity sim(i, cj) of all items i and the cluster center point cj.
Step 3: Assign each item i to the cluster set Cj whose center has the highest similarity, then calculate the average value of each cluster as the new cluster center.
Step 4: Repeat Steps 2 and 3 until the position of every center point no longer changes, then stop and return the result.
Item clustering groups similar items into the same cluster. When computing the nearest neighbor set, one only needs to select the few clusters that exhibit the highest similarity to the target item and search within them (see the sketch below). This significantly reduces computational complexity and improves the efficiency of the algorithm. Additionally, using cosine similarity in the clustering helps mitigate the issue of "similarity not being the same", which can arise from variations in individual attributes.
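A compact NumPy sketch of Steps 1-4, assuming items are represented as dense rating vectors (the toy data, the cluster count M, and the dense representation are illustrative assumptions):

```python
import numpy as np

def cosine(a, b):
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return float(a @ b / (na * nb)) if na and nb else 0.0

def cluster_items(item_vecs, M, max_iter=100):
    """Steps 1-4: seed centers with the M most-rated items, then alternate
    cosine-similarity assignment and center averaging."""
    items = list(item_vecs)
    # Step 1: popularity seeding -- items with the most nonzero ratings
    seeds = sorted(items, key=lambda i: np.count_nonzero(item_vecs[i]), reverse=True)[:M]
    centers = [item_vecs[i].astype(float) for i in seeds]
    assign = {}
    for _ in range(max_iter):
        # Steps 2-3: assign each item to its most similar center
        new_assign = {i: max(range(M), key=lambda c: cosine(item_vecs[i], centers[c]))
                      for i in items}
        if new_assign == assign:  # Step 4: stable assignment implies stable centers
            break
        assign = new_assign
        for c in range(M):        # update each center as its cluster's mean vector
            members = [item_vecs[i] for i in items if assign[i] == c]
            if members:
                centers[c] = np.mean(members, axis=0)
    return assign

# Toy item-by-user rating vectors (rows: items, cols: users); 0 = unrated
vecs = {"i1": np.array([5, 4, 0]), "i2": np.array([3, 2, 5]),
        "i3": np.array([4, 0, 1]), "i4": np.array([0, 1, 5])}
print(cluster_items(vecs, M=2))
```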
2.1.2 Improved scoring similarity calculation formula
At the heart of the collaborative filtering algorithm lies the similarity calculation, and its accuracy directly impacts the quality of recommendations. The varying degrees of user interest in popular and unpopular items can influence the precision of similarity calculation. To address this, this paper introduces two enhancements to the similarity calculation formula. The details are as follows:
1) Tuning for unpopular items
In formula (7), the user set $U_{i,j}$, which contains the users who have scored both items i and j, is used to calculate the similarity. However, when the data is sparse, the number of users providing joint scores in $U_{i,j}$ is very small, leading to significant deviations in the similarity calculation. For example, when there is only one user in $U_{i,j}$, the calculated similarity is 100%; whether or not the two items are actually similar, they will be selected into the $N_i$ set, which directly interferes with the calculation accuracy of P(u,i). To solve this problem, the similarity is multiplied by a weight function g(x). g(x) is chosen as an increasing function of x whose growth flattens as x grows, so that the weighting shrinks similarities backed by few users while disturbing well-supported similarities as little as possible, keeping the calculation accuracy of P(u,i) high. To this end we choose g(x) = lg(1+x), and the improved formula is as follows:

$$\mathrm{sim}_g(i,j)=g(x)\cdot\mathrm{sim}(i,j)=\lg(1+x)\cdot\mathrm{sim}(i,j)\tag{9}$$
where x is the number of common users in $U_{i,j}$. When x is very small, the similarity is multiplied by a small weight, so such item pairs are unlikely to dominate the $N_i$ set; even in the rare event that one is included, its impact on the accuracy of the P(u,i) calculation is minimal. Multiplying by g(x) therefore brings the resulting similarity value closer to the actual similarity.
2) Penalty for popular items
In formula (7), if item j is extremely popular and receives ratings from numerous users, then every other item will appear very similar to it, which in turn distorts the ratings of less popular items. To mitigate this, we apply a penalty that reduces the popular item's weight in the similarity calculation:

$$PP_j=\frac{\left|U_{i,j}\right|}{\sqrt{\left|U_i\right|\cdot\left|U_j\right|}}\tag{10}$$
where $PP_j$ is the penalty for popular item j, $|U_{i,j}|$ is the number of users who have rated both items i and j, $|U_i|$ is the number of users who have rated item i, and $|U_j|$ is the number of users who have rated the popular item j. The penalty $PP_j$ is applied when calculating the item similarity, as shown in formula (11):

$$\mathrm{sim}'(i,j)=g(x)\cdot PP_j\cdot\mathrm{sim}(i,j)\tag{11}$$
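A short Python sketch of how the two corrections compose per formulas (9)-(11); since formulas (10) and (11) are reconstructed above from the text's definitions, the exact composition (and the input format of `raters`) should be read as an assumption:

```python
from math import log10, sqrt

def improved_sim(i, j, base_sim, raters):
    """Formula (11): weight the base similarity by the unpopular-item tuning
    g(x) of formula (9) and the popular-item penalty of formula (10).
    raters[i] is the set of users who rated item i (assumed input format)."""
    common = raters[i] & raters[j]
    x = len(common)                 # number of co-rating users
    g = log10(1 + x)                # g(x): shrinks similarities backed by few users
    pp = x / sqrt(len(raters[i]) * len(raters[j])) if raters[i] and raters[j] else 0.0
    return g * pp * base_sim(i, j)

# Toy usage: a very popular item j is penalized relative to the base similarity
raters = {"i": {"u1", "u2"}, "j": {"u1", "u2", "u3", "u4", "u5"}}
print(improved_sim("i", "j", lambda a, b: 1.0, raters))
```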
2.2 Hybrid Collaborative Filtering Algorithm Based on ITCSCF and LFM
To fully leverage user personalization and enhance recommendation accuracy, the H-CF algorithm proposed in this paper primarily utilizes ITCSCF and LFM to apply linear weighting of scores. Specifically, the clustering technology employed in the ITCSCF algorithm reduces computational complexity and significantly improves the recommendation response rate. Additionally, the enhanced similarity calculation effectively mitigates the impact of data sparsity, resulting in improved recommendation accuracy. To accommodate the diverse preferences of individual users, the algorithm maximizes the potential of ITCSCF personalization and LFM preferences by linearly combining the scores from both algorithms. This approach enhances the accuracy of prediction score calculations and increases the interpretability of recommendations. Moreover, the algorithm takes advantage of Spark's big data distributed computing and storage capabilities to address scalability issues faced by traditional recommendation methods. Building upon clustering, it further enhances data processing capabilities and reduces computing time. The algorithm flow is depicted in Fig. 2.
Fig. 2 A hybrid collaborative filtering algorithm model based on ITCSCF and LFM
The final recommendation result is the hybrid weighted prediction score based on ITCSCF and the ALS of the LFM; the score expression is shown in equation (12):

$$P_{h\text{-}cf}=\alpha P_{als}+\beta P_{item}\tag{12}$$
where Ph-cf represents the prediction score of the hybrid recommendation algorithm, Pals represents the prediction score of the ALS algorithm, α is the Pals weighted balance factor, Pitem represents the improved ITCSCF algorithm prediction score, and β is the Pitem weighted balance factor.
Regarding the weight values, cross-validation is employed to train the model and capture the variation of the sub-algorithms across diverse environments. The weights are adjusted dynamically with iterative weighted-regression calculations: in each iteration, the data to be predicted is fed into the regression model and the predicted value is updated continuously until the pre-defined target value is achieved[15].
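As a simplified stand-in for that procedure, the sketch below picks the balance factor by grid search on held-out RMSE under the assumption β = 1 − α; the constraint, the grid, and the toy data are assumptions, not the paper's iterative weighted regression:

```python
from math import sqrt

def rmse(pred, truth):
    return sqrt(sum((pred[k] - truth[k]) ** 2 for k in truth) / len(truth))

def blend(p_als, p_item, alpha):
    """Formula (12): P_hcf = alpha * P_als + beta * P_item, here with beta = 1 - alpha."""
    return {k: alpha * p_als[k] + (1 - alpha) * p_item[k] for k in p_als}

def best_alpha(p_als, p_item, truth, steps=21):
    """Pick the balance factor that minimizes RMSE on a validation split."""
    grid = [s / (steps - 1) for s in range(steps)]
    return min(grid, key=lambda a: rmse(blend(p_als, p_item, a), truth))

# Toy validation predictions from the two sub-algorithms (illustrative only)
p_als  = {("u1", "i1"): 4.2, ("u1", "i2"): 3.1}
p_item = {("u1", "i1"): 3.6, ("u1", "i2"): 3.9}
truth  = {("u1", "i1"): 4.0, ("u1", "i2"): 3.5}
print(best_alpha(p_als, p_item, truth))
```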
3 Spark Big Data Distributed Implementation
3.1 Spark Big Data Distributed Computing Platform
Spark[16] is a parallel distributed computing framework based on Resilient Distributed Datasets (RDD). Its most prominent feature is the in-memory computing capability of RDD, which stores intermediate data generated during the calculation process in memory, effectively reducing disk I/O overhead. As a result, Spark is particularly well-suited for iterative data processing, and its distributed approach represents the optimal means to enhance system scalability. The Spark framework structure is illustrated in Fig. 3.
Fig. 3 Spark framework
The Driver Program serves as a task control node responsible for processing user code logic. The Cluster Manager is in charge of managing cluster resources, scheduling tasks and monitoring their progress. The Worker Node functions as a computing node within the cluster, capable of executing tasks concurrently using multiple threads, which constitutes one of its key advantages. The Executor is tasked with executing individual tasks, while the Task represents a computing sub-unit within a node. Upon task execution, the Driver requests resources from the Cluster Manager and delegates them to the Executor for processing. Once the execution is completed, the resulting data is returned to the Driver.
3.2 Algorithm Distributed Implementation
Different from the traditional recommendation algorithms, the H-CF algorithm adopts the Spark big data distributed framework, deploying computing and data storage across multiple servers. It leverages Spark's in-memory computing and distributed multi-threading capabilities, which enable maximum computational efficiency for both model training and offline computing. This framework offers two key advantages: 1) All operations are based on RDD in-memory computing, making it highly suitable for a large number of iterative recommendation algorithms and significantly accelerating computing efficiency; 2) The Spark framework supports various distributed structures, including concurrent computing, real-time stream processing, and data storage, effectively enhancing the scalability of recommender systems. The distributed flow of the algorithm is illustrated in Fig. 4.
Fig. 4 Distributed computing process of the H-CF algorithm on the Spark platform
Initially, data is gathered from distributed data sources, and items undergo pre-classification through clustering, thereby reducing the computational requirements of the nearest neighbor set. Next, by taking into account the popularity of items and data sparsity, an enhanced cosine similarity measure denoted as sim is employed to determine the similarity between items. Various sub-algorithms are then distributed to compute the predicted scores, resulting in individual temporary score lists. Ultimately, weight parameters are calculated through iterative regression to obtain the optimal recommendation list, which is subsequently stored in the distributed database.
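To make advantage 1) above concrete, here is a tiny PySpark fragment: once the parsed ratings RDD is cached, each iterative pass reuses the in-memory partitions instead of re-reading from disk. The HDFS path and the per-iteration computation are placeholders standing in for the H-CF iterations:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hcf-demo").getOrCreate()
sc = spark.sparkContext

# Parse (user, item, rating) triples and pin them in executor memory; every
# subsequent iteration reuses the cached partitions, avoiding disk I/O.
ratings = (sc.textFile("hdfs:///data/ratings.csv")   # placeholder path
             .map(lambda line: line.split(","))
             .map(lambda f: (f[0], f[1], float(f[2])))
             .cache())

for step in range(10):   # stand-in for the ALS / similarity iterations
    loss = ratings.map(lambda t: t[2] ** 2).sum()   # toy per-iteration pass
    print(step, loss)
```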
4 Experiments and Results Analysis
4.1 Experimental Dataset
The experiment uses the public MovieLens[17] dataset provided by GroupLens to evaluate the algorithm. The dataset comes in three main scales, 100K, 1M, and 10M, containing 100 000, 1 000 000, and 10 000 000 ratings, respectively. It contains user information, movie information, and each user's ratings of movie items. Ratings range from 1 to 5, where higher values indicate a greater level of user interest in the item.
4.2 Experimental Environment
The experimental system is deployed on a Spark distributed cluster of 5 nodes, of which 1 is the master node and the other 4 are slave nodes. Hardware configuration: CentOS 7.6; Intel Xeon E5-2650 v3 CPU, 2.3 GHz; 32 GB memory. Software versions: Java JDK 1.8.0.4; Spark 2.1; ZooKeeper 3.4 (cluster management).
4.3 Evaluation Standard
1) RMSE
To assess the accuracy of the algorithm, the RMSE is used for evaluation. The idea is to calculate the deviation between the score predicted by the algorithm and the actual score: the smaller the deviation, the more accurate the recommendation. Here $\hat R_i$ denotes the predicted value and $R_i$ the actual value, over N test ratings. The formula is as follows:

$$\mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\hat R_i-R_i\right)^2}\tag{13}$$
2) Recall rate and accuracy rate
The quality and personalization of the algorithm's recommendations are typically evaluated with recall and precision. The recall rate measures the proportion of the user's interested items in the recommendation list out of all the user's interested items in the system; the precision rate measures the proportion of items in the recommendation list that users are interested in, out of all items recommended to users. Note that recall and precision are often inversely correlated. The formulas are as follows:

$$R_{\mathrm{recall}}=\frac{\sum_{u}\left|R(u)\cap T(u)\right|}{\sum_{u}\left|T(u)\right|}\tag{14}$$

$$P_{\mathrm{precision}}=\frac{\sum_{u}\left|R(u)\cap T(u)\right|}{\sum_{u}\left|R(u)\right|}\tag{15}$$
Here $R_{\mathrm{recall}}$ is the recall rate, $P_{\mathrm{precision}}$ is the precision rate, R(u) is the list of items recommended to the user based on the training set, and T(u) is the list of items the user has actually evaluated and interacted with in the test set.
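These definitions translate directly into a few lines of Python (the toy lists are illustrative):

```python
def recall_precision(recommended, interacted):
    """recommended[u] = R(u), interacted[u] = T(u); formulas (14)-(15)."""
    hits = sum(len(set(recommended[u]) & set(interacted[u])) for u in interacted)
    recall = hits / sum(len(interacted[u]) for u in interacted)
    precision = hits / sum(len(recommended[u]) for u in interacted)
    return recall, precision

R = {"u1": ["i1", "i2", "i3"], "u2": ["i4", "i5"]}
T = {"u1": ["i2", "i9"], "u2": ["i4"]}
print(recall_precision(R, T))   # -> (0.666..., 0.4)
```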
4.4 Experimental Results and Analysis
To assess the performance of the H-CF algorithm, we will conduct experiments on the algorithm from various perspectives.
Experiment 1: the impact of the weighted balance factor α at different data sparsity levels. The value of α plays a crucial role in achieving the best comprehensive prediction score for the H-CF algorithm. In this experiment, the 1M dataset is taken as the sample, and three levels of data sparsity are produced by splitting it into training and test sets at ratios of 9:1, 7:3, and 5:5. Holding other variables fixed, we compare the effect of different α values on the RMSE; the results are depicted in Fig. 5.
Fig. 5 Effect of the balance factor on RMSE under different data sparsity levels
The experimental results indicate that a significant enhancement in the recommendation accuracy of the H-CF algorithm can be achieved by increasing the Pals scoring weight of the LFM when the density of scoring data is high. Similarly, in sparse data scenarios, elevating the Pitem score weight of the ITCSCF algorithm can effectively address accuracy issues.
Experiment 2: a comparison between the H-CF algorithm and traditional algorithms. To assess the recommendation accuracy of H-CF, we compared the RMSE of its predicted scores against those of several classical algorithms, each configured with its optimal parameters for this environment. We used the 1M dataset and selected various training/test proportions to produce different levels of data sparsity. The outcomes are depicted in Fig. 6.
Fig. 6 Comparison of root mean square error between the H-CF algorithm and traditional algorithms
The experimental results indicate that the enhanced H-CF algorithm improves on the RMSE of the Item CF and LFM algorithms by 8.7% and 7.6%, respectively, at the 5:5 ratio with sparse data; at the 9:1 ratio, the improvements are 6.3% and 2.8%. A lower RMSE corresponds to higher recommendation accuracy, and the H-CF algorithm consistently achieves a lower RMSE across the data-split ratios, highlighting its superior recommendation accuracy and robustness to data sparsity.
Experiment 3: The variations in the recommended recall rate and accuracy rate of the H-CF algorithm were examined under different data scales. The results are depicted in Fig. 7.
Fig. 7 Changes in precision and recall
The experimental results indicate that as the data scale increases, the accuracy of recommendations also improves, but the recall rate decreases. This observation suggests that, under unchanged conditions, a higher density of high-rated data corresponds to a greater user interest in popular items, resulting in more accurate recommendations but lower item novelty. These factors exhibit an inverse relationship.
Experiment 4: algorithm running-time test, in which the enhanced H-CF algorithm is deployed on the 5-node Spark cluster and on a stand-alone machine, and their running times are compared. The training-to-test ratio is set at 9:1, and the data scale is gradually increased. The results are presented in Fig. 8.
Fig. 8 Comparison of computing time between single machine and Spark cluster under different data scales
As shown in Fig. 8, operating on the Spark cluster is significantly more efficient than stand-alone operation, and the gap in running time grows with the data scale. There are two main reasons: 1) the clustering preprocessing reduces the time complexity from O(n·m²) to O(m·k·t), saving iteration time; 2) communication and data transfer between Spark cluster nodes consume running time, so the cluster's advantage is negligible at small data scales, but its parallelism greatly accelerates computation on large-scale data. Moreover, the combination of the algorithm with the Spark cluster greatly enhances the improved algorithm's scalability, surpassing the traditional algorithm in this respect.
5 Conclusion
An improved H-CF algorithm has been proposed to address the data sparsity, low accuracy, and poor scalability of traditional collaborative filtering algorithms. The enhancements can be summarized in three aspects: 1) building on the traditional Item CF algorithm, we cluster items by their attributes, which significantly reduces the calculation time of the nearest neighbor set; 2) leveraging the hot/cold (popularity) characteristics of items, we enhance the score-similarity calculation within the Item CF algorithm, mitigating the impact of data sparsity and bringing the computed similarity between items closer to its actual value; 3) by combining the newly devised ITCSCF algorithm with the LFM through linear weighting, we leverage their individual strengths in personalization and potential preferences, dynamically adjusting the balance factor to obtain the optimal recommendation list. Together these improvements tackle data sparsity, enhance accuracy, and boost the scalability of collaborative filtering.
The experimental results demonstrate that, under identical conditions, the enhanced hybrid collaborative filtering algorithm achieves higher recommendation accuracy, better personalization, and higher computational efficiency than the conventional Item CF and LFM algorithms. Moreover, by leveraging the Spark distributed platform and the clustering technique, robustness to data sparsity and scalability are noticeably improved. Nevertheless, the clustering parameters and the weighted balance factor are crucial factors influencing recommendation accuracy; these parameters will be tested and fine-tuned thoroughly in future experiments and research.
References
- Chen J F, Yuan Y, Ruan T, et al. Hyper-parameter-evolutionary latent factor analysis for high-dimensional and sparse data from recommender systems[J]. Neurocomputing, 2021, 421: 316-328.
- Yan J, Zeng Q T, Zhang F Q. Summary of recommendation algorithm research[J]. Journal of Physics: Conference Series, 2021, 1754(1): 012224.
- Chen Y C, Hui L, Thaipisutikul T. A collaborative filtering recommendation system with dynamic time decay[J]. The Journal of Supercomputing, 2021, 77(1): 244-262.
- Xue F, He X N, Wang X, et al. Deep item-based collaborative filtering for top-N recommendation[J]. ACM Transactions on Information Systems, 2019, 37(3): 1-25.
- Wu Y T, Zhang X M, Yu H, et al. Collaborative filtering recommendation algorithm based on user fuzzy similarity[J]. Intelligent Data Analysis, 2017, 21(2): 311-327.
- George G, Lal A M. Hy-MOM: Hybrid recommender system framework using memory-based and model-based collaborative filtering framework[J]. Cybernetics and Information Technologies, 2022, 22(1): 134-150.
- Jia R, Li R, Gao M. Study on data sparsity in social network-based recommender system[J]. International Journal of Computational Science and Engineering, 2019, 20(1): 15.
- Liu C H, Han C F, Chen T C, et al. Collaborative filtering recommendation algorithm based on penalty factors and time weights[J]. Cyber Security and Data Governance, 2020, 39(5): 17-21 (Ch).
- Tao W C, Dang Y G. Collaborative filtering recommendation algorithm based on grey incidence clustering[J]. Operations Research and Management Science, 2018, 27(1): 84-88 (Ch).
- Wang Y, Deng J, Gao J, et al. A hybrid user similarity model for collaborative filtering[J]. Information Sciences, 2017, 418: 102-118.
- Zhang R L, Zhang R, Wu X N, et al. Collaborative filtering recommendation algorithm based on mixed similarity and differential privacy[J]. Application Research of Computers, 2021, 38(8): 2334-2339 (Ch).
- Chen Y, Liu Z Q. Research on improved recommendation algorithm based on LFM matrix factorization[J]. Computer Engineering and Applications, 2019, 55(2): 116-120 (Ch).
- Wang W J, Lu Y M. Analysis of the mean absolute error (MAE) and the root mean square error (RMSE) in assessing rounding model[J]. IOP Conference Series: Materials Science and Engineering, 2018, 324: 012049.
- Xiang L. Practical Combat of Recommendation System[M]. Beijing: People's Post and Telecommunications Press, 2012 (Ch).
- Anand R, Beel J. Auto-surprise: An automated recommender-system (AutoRecSys) library with tree of parzens estimator (TPE) optimization[C]// Fourteenth ACM Conference on Recommender Systems. New York: ACM, 2020: 585-587.
- Apache Spark. Spark MLlib programming guide[EB/OL]. [2022-10-23]. https://spark.apache.org/mllib.
- GroupLens. MovieLens data guide[EB/OL]. [2022-11-03]. https://grouplens.org/datasets/movielens.