Open Access
Wuhan Univ. J. Nat. Sci.
Volume 28, Number 1, February 2023
Page(s) 29 - 34
Published online 17 March 2023

© Wuhan University 2023

Licence: Creative Commons. This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

0 Introduction

Online news websites collect news content from a variety of sources and present it to a large number of users. However, so many articles are published every day that it is almost impossible for users to read them all. It is therefore critical to identify users' reading interests and make personalized recommendations[1-5].

To improve the accuracy of recommendation systems, recent research has focused on learning more comprehensive news representations. Deep Knowledge-aware Network (DKN)[4] embeds each news item from three perspectives (word, entity and entity context) and then uses a CNN model to aggregate these features. RippleNet[5] captures the potential interests of users by automatically and iteratively propagating their preferences over the knowledge graph. However, DKN and RippleNet ignore the rich semantic topics in news titles, and they fail to use the relevance between topics and users' preferences for those topics to learn more precise news representations.

As shown in Fig. 1, a news title may contain not only a variety of entities, such as politicians, celebrities, companies, or institutions, but also multiple topics, such as politics, entertainment, or sports, all of which often play important roles in the title. Long- and short-term user representations (LSTUR)[1] uses explicitly given topic information to learn the representation of news titles. Explicit topic labels can represent the information in the news accurately, but when a news title covers two or more different topics, simple explicit topic information may not be detailed enough to represent the news topic comprehensively. Therefore, we need latent topic information to model news titles in more detail.

Fig. 1 Illustration of news title with a variety of entities and topics

For example, the news title "Donald Trump vs. Madonna: Everything We Know" appears to belong to a music topic, whereas the content of the news is more relevant to politics. Such misinterpretation in news modeling can lead to serious errors in learning users' topic preferences. Therefore, considering only the explicit topic information and ignoring the latent topic information of news reduces the accuracy of news recommendation systems.

To address the limitations of existing methods, and inspired by the wide success of leveraging knowledge graphs, we propose a news recommendation approach based on the topic and entity preferences in users' historical behavior. The core of our approach is a news encoder and a user encoder. In the news encoder, we jointly train news title and word vectors to obtain the topic information of the news, and we extract entities to construct the knowledge graph. In the user encoder, we combine a long short-term memory network with a self-attention mechanism to mine users' topic preferences, and we use a graph attention algorithm to mine users' potential preferences for the entities in the knowledge graph based on their historical behavior. Extensive experiments on a real-world dataset demonstrate the effectiveness of our news recommendation method.

1 Our Approach

In this section, we first introduce the overall framework of our news recommendation system based on topic embedding and knowledge embedding (NRTK), as illustrated in Fig. 2, and then discuss each module in turn. NRTK contains three parts: a news encoder, a user encoder and a click predictor. For each news item, the news encoder extracts a news representation vector using two feature extraction modules, which gives us the set of embedding vectors for a user's clicked news. In the user encoder, we use a long short-term memory network (LSTM) combined with self-attention to learn the user's topic preferences, and then use a graph attention algorithm to aggregate the user's entity preferences to obtain the final representation of the user. In the click predictor, we use a scoring function to calculate the probability of a user clicking a candidate news item.

Fig. 2 The framework of our NRTK approach

1.1 News Encoder

The news encoder learns news representations from news titles. It contains two modules. The first is word embedding and knowledge graph embedding. Each news title t is composed of a sequence of words. To construct the semantic space, we use the word2vec model to pretrain a matrix of word vectors and a matrix of context word vectors. In addition, each word w may be associated with an entity e in the knowledge graph. We use TransE[6] to obtain the entity embeddings, where d is the size of the vector learned for each entity in the news title, and take their average k as the knowledge graph embedding of the title.
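As a minimal sketch of the knowledge graph embedding step, the following Python snippet scores a triple with the TransE translation distance and averages the entity embeddings linked to one title to obtain k. The toy vectors and the helper names `transe_distance` and `average_embedding` are illustrative assumptions rather than part of NRTK, and the L2 norm is only one common choice for the TransE dissimilarity.

```python
import math


def transe_distance(head, relation, tail):
    """TransE dissimilarity ||h + r - t||_2; small values mean a plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def average_embedding(vectors):
    """Element-wise mean of equally sized vectors (the title-level embedding k)."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Hypothetical 3-dimensional embeddings of the entities linked in one title.
entity_embeddings = [
    [1.0, 0.0, 2.0],
    [3.0, 2.0, 0.0],
]
k = average_embedding(entity_embeddings)
```

A triple whose tail equals head plus relation has distance zero under this measure, which is the geometric intuition behind TransE.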

The second module is topic-level embedding. We use doc2vec Distributed Bag of Words (DBOW)[7] to learn jointly embedded news title and word vectors. The doc2vec DBOW model consists of a matrix of news title vectors with one row per news title in the corpus, where m is the size of the vector to be learned for each news title. For each news title in the corpus, the context vector of each word in the news title is used to predict the news title's vector.

During learning, each news title vector is pulled close to the word vectors of the words it contains and pushed away from the word vectors of the words it does not contain. This results in a semantic space where news titles are closest to the words that best describe them and far from dissimilar words. In this space, an area with a high concentration of news titles means that the news titles in that area are highly similar. Such a dense area indicates that its news titles share one or more common latent topics. We assume that the number of dense areas is equal to the number of topics.

We use uniform manifold approximation and projection for dimension reduction (UMAP)[8] to reduce the dimension of the news title vectors. We then use hierarchical density-based spatial clustering of applications with noise (HDBSCAN)[9,10] to identify dense clusters of news titles and noise news titles in the UMAP-reduced space, and mark each news title in the semantic embedding space with either a noise label or the label of its dense cluster.

The topic vectors are calculated from the labels assigned to each dense news title cluster in the semantic embedding space. Our method is to calculate the centroid, i.e., the arithmetic mean of all news title vectors in the same dense cluster.
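The centroid computation can be sketched in Python as follows. The sketch assumes integer cluster labels in which noise titles carry the label -1 (the convention of the `hdbscan` library); the function name `topic_vectors` and the toy vectors are hypothetical.

```python
from collections import defaultdict

NOISE_LABEL = -1  # assumed marker for titles that belong to no dense cluster


def topic_vectors(title_vectors, labels):
    """Return {cluster label: centroid of its title vectors}, ignoring noise."""
    clusters = defaultdict(list)
    for vector, label in zip(title_vectors, labels):
        if label != NOISE_LABEL:
            clusters[label].append(vector)
    return {
        label: [sum(column) / len(members) for column in zip(*members)]
        for label, members in clusters.items()
    }


# Toy 2-dimensional title vectors: two dense clusters and one noise title.
title_vectors = [[0.0, 0.0], [2.0, 2.0], [10.0, 10.0], [12.0, 8.0], [50.0, -50.0]]
labels = [0, 0, 1, 1, -1]
topics = topic_vectors(title_vectors, labels)
```

Each dictionary value plays the role of one topic vector, so the number of keys corresponds to the number of topics.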

Finally, we get a topic matrix, where x is the number of topics and m is the dimension of the topic vector. For each news title t, we get its topic embedding as follows:



where W_t is a weight matrix of topics, and the result is the news title's topic embedding.

The final representation of a news title is the concatenation of the averaged entity embedding and the topic embedding, formulated as:


1.2 User Encoder

The user encoder module is used to learn the representations of users from their browsed news. It contains two modules.

The first is the topic preference learning module, which learns long-term and short-term user topic preferences. Users have different degrees of interest in each news title they have clicked, and the attention mechanism can capture the topics a user is interested in. A long short-term memory network combined with the self-attention mechanism can therefore be used to mine users' topic preferences from their historical click behavior.

From the news encoder, we have obtained the topic embedding of each news title. Given the user's click history matrix Y, we obtain the query Q, the key K and the value of the self-attention mechanism by a nonlinear transformation of the click history matrix as follows:



where the two weight matrices are those of the query and the key. Then, the weight matrix P can be obtained as follows:


where P is the similarity matrix of the click history matrix Y. Finally, the output of self-attention is obtained by multiplying the similarity matrix P by the history matrix Y.


where a represents the user preferences. We average the self-attention results to obtain a single attention value.


where p is the user topic preference embedding.
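The attention steps of this module can be illustrated with the following Python sketch. Since the exact nonlinearity and normalisation are not specified in the text, the sketch assumes a tanh nonlinearity for the query and key projections and scaled dot-product scores with a row-wise softmax; the identity projection matrices stand in for learned weights, and all function and variable names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax of one row."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def topic_preference(history, w_query, w_key):
    """Self-attention over the click history Y, followed by mean pooling."""
    d = len(w_query[0])
    # Assumption: tanh stands in for the unspecified nonlinear transformation.
    query = [[math.tanh(v) for v in row] for row in matmul(history, w_query)]
    key = [[math.tanh(v) for v in row] for row in matmul(history, w_key)]
    key_t = [list(col) for col in zip(*key)]
    scores = matmul(query, key_t)
    # Assumption: scaled dot-product scores, normalised row by row (matrix P).
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    attended = matmul(weights, history)  # a = P Y
    n = len(attended)
    return [sum(col) / n for col in zip(*attended)]  # p: mean over clicked titles


identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for learned weight matrices
history = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy topic embeddings of 3 clicks
p = topic_preference(history, identity, identity)
```

Because every row of P sums to one, each row of the attended matrix is a convex combination of the clicked titles' topic embeddings, and p is their average.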

The second module is a knowledge graph-level preference propagation module. In the knowledge graph, the head entity is related to many entities through direct or indirect relationships, but the existence of relationships does not mean that users will have the same degree of interest in these entities. This module uses graph attention networks to learn semantic networks.

To describe users' hierarchically extended preferences based on the knowledge graph, we recursively define the set of n-hop relevant entities for a user as follows:


Here the initial set represents the entities contained in the news titles that the user has clicked on in the past.

We then define the n-hop triple set of a user as follows:


where the triple set contains the triples associated with the entities in the corresponding relevant entity set.
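A small Python sketch may make the recursive definition concrete. It assumes, in the style of RippleNet[5], that the n-hop triples are those whose head lies in the (n-1)-hop entity set and that the n-hop entities are their tails; the toy knowledge graph, its relation names and the function name `ripple_sets` are hypothetical.

```python
def ripple_sets(triples, seed_entities, n_hops):
    """Return (entity_sets, triple_sets) for one user.

    entity_sets[0] is the seed set taken from the click history;
    entity_sets[n] and triple_sets[n - 1] belong to hop n.
    """
    entity_sets = [set(seed_entities)]
    triple_sets = []
    for _hop in range(n_hops):
        frontier = entity_sets[-1]
        # Assumption: a triple is relevant when its head is in the previous set.
        hop_triples = [(h, r, t) for (h, r, t) in triples if h in frontier]
        triple_sets.append(hop_triples)
        entity_sets.append({tail for (_head, _rel, tail) in hop_triples})
    return entity_sets, triple_sets


# Hypothetical knowledge graph given as (head, relation, tail) triples.
knowledge_graph = [
    ("Donald Trump", "politician_of", "USA"),
    ("Madonna", "genre", "Pop"),
    ("USA", "capital", "Washington"),
    ("Pop", "subgenre_of", "Music"),
]
clicked_entities = {"Donald Trump", "Madonna"}
entity_sets, triple_sets = ripple_sets(knowledge_graph, clicked_entities, n_hops=2)
```

In this toy graph the 1-hop entities are USA and Pop, and the 2-hop entities are Washington and Music.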

Given the average k of the entity embeddings in the user's clicked news titles and the user's 1-hop triple set, we use an attention mechanism to learn the entities the user prefers.


where the two embeddings are those of the relation r_i and the head h_i, respectively. The resulting value can be regarded as the weight indicating the user's interest in the entity h_i under the relation r_i. Users may have different degrees of interest in the same entity under different relations, so taking the relations into account when calculating the weights helps to learn the user's interest in entities.

After obtaining the weights, we multiply the tails in the 1-hop triple set by them, and the vector hop1 is obtained by linear addition:


where the summed terms are the tails in the 1-hop triple set. Through this process, a user's preferences are transferred from the click history to the 1-hop relevant entities along the links in the 1-hop triple set.
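One hop of this propagation can be sketched in Python as follows. The scoring function is an assumption: each triple is scored by the inner product of k with the element-wise product of the relation and head embeddings, and the scores are normalised with a softmax. The function name `propagate_one_hop` and the toy vectors are hypothetical.

```python
import math


def propagate_one_hop(k, triples):
    """Attention over one triple set.

    triples: list of (head_vec, relation_vec, tail_vec).
    Returns (attention weights, weighted sum of tail vectors).
    """
    # Assumed score: k . (r * h), i.e. the relation rescales the head embedding.
    scores = [
        sum(ki * ri * hi for ki, ri, hi in zip(k, rel, head))
        for head, rel, _tail in triples
    ]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Linear addition of the tails, each multiplied by its weight.
    hop = [
        sum(w * tail[j] for w, (_head, _rel, tail) in zip(weights, triples))
        for j in range(len(k))
    ]
    return weights, hop


k = [1.0, 0.0]  # toy averaged entity embedding of the click history
one_hop_triples = [
    ([1.0, 0.0], [1.0, 1.0], [0.0, 2.0]),  # head aligned with k: higher weight
    ([0.0, 1.0], [1.0, 1.0], [4.0, 0.0]),  # head orthogonal to k: lower weight
]
weights, hop1 = propagate_one_hop(k, one_hop_triples)
```

Calling the same function with the returned vector in place of k and the next triple set would give the following hop under the same assumptions.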

By replacing k with hop1 in Eq. (11), the module iterates this procedure over the user's triple sets for the subsequent hops. A user's preference is therefore propagated N times along the triple sets from the click history, and N different preference sequences are generated: hop1, hop2, ···, hopN. To represent the user's final entity preference embedding, we merge all embeddings.


The embedding f is the output of this module.

The final user representation is the concatenation of the entity preference embedding and the topic preference embedding, formulated as:


1.3 Click Predictor

The click predictor predicts the probability of a user clicking a candidate news item. Given the representation of a candidate news item t, the click probability score is computed as follows:


where σ is the sigmoid function.
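A minimal Python sketch of such a predictor is given below, assuming that the score is the sigmoid of the inner product between the user representation and the candidate news representation; the function name `click_probability` and the toy vectors are hypothetical.

```python
import math


def click_probability(user_vec, news_vec):
    """Sigmoid of the inner product of user and candidate-news representations."""
    score = sum(u * t for u, t in zip(user_vec, news_vec))
    # Branch on the sign so that math.exp never receives a large positive value.
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)


user = [0.5, -1.0, 2.0]       # toy user representation
candidate = [1.0, 0.0, 1.0]   # toy candidate news representation
probability = click_probability(user, candidate)
```

A score of zero maps to a probability of 0.5, and larger inner products map to probabilities closer to 1.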

2 Experiments

2.1 Datasets and Experimental Settings

We use the Bing News server logs from May 16, 2017 to January 11, 2018 as our dataset. Each impression in the dataset contains a timestamp, a news ID, a title, and a category label. The basic statistics and distribution of the news dataset are shown in Table 1. In our experiments, we divide the dataset into a training set, a validation set and a test set in a 6:2:2 ratio. The word embeddings are 300-dimensional and initialized by the word2vec model. The entity embeddings are 50-dimensional and initialized by TransE. We set the hop number H = 2. These hyperparameters are tuned on the validation set. Each experiment is independently repeated 10 times, and the average area under curve (AUC) and accuracy (ACC) are used for performance analysis.
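For reference, the two evaluation metrics can be computed as in the following Python sketch. AUC is computed as the fraction of (clicked, non-clicked) pairs that are ranked correctly, with ties counted as one half; the 0.5 decision threshold for ACC is an assumption, as are the function names and the toy scores.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


def accuracy(labels, scores, threshold=0.5):
    """Share of impressions whose thresholded score matches the click label."""
    hits = sum((s >= threshold) == (y == 1) for y, s in zip(labels, scores))
    return hits / len(labels)


# Toy click labels (1 = clicked) and predicted click probabilities.
labels = [1, 0, 1, 0]
scores = [0.9, 0.4, 0.6, 0.7]
toy_auc = auc(labels, scores)
toy_acc = accuracy(labels, scores)
```

The pairwise form is quadratic in the number of impressions, so it is meant for illustration rather than for large logs.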

Table 1

Dataset statistics

2.2 Baselines

We use the following models as baselines in our experiments: 1) LSTUR [1], a neural news recommendation method; 2) Factorization Machine Library (LibFM)[11], a feature-based factorization model; 3) Deep Structured Semantic Model (DSSM)[2], a deep structured semantic model; 4) DeepWide[3], a popular neural recommendation method; 5) DeepFM [12], a deep model for recommendation; 6) DKN [4], a deep knowledge-aware network for news recommendation; 7) RippleNet[5], a memory-network-like approach.

2.3 Results

The results of all methods in click-through-rate (CTR) prediction are presented in Table 2. Experimental results show that our recommendation system performs best compared with other recommendation models. Specifically, NRTK outperforms baselines by 1.9% to 8.0% on AUC and 2.1% to 8.3% on ACC, respectively.

We also evaluate the influence of the maximal hop number H on NRTK performance. The results in Table 3 show that the best performance is achieved when H is 2 or 3. If H is too small, it is difficult to explore the connections and long-distance dependencies between entities; if H is too large, it brings in much more noise than useful signal.

Table 2

The results of AUC and ACC in CTR prediction

Table 3

The results of AUC with respect to different hop numbers

2.4 Ablation Study

To verify that attention mechanisms improve recommendation performance, we design an ablation study. Instead of using attention mechanisms to capture user preferences for topics and entities, the ablation model simply aggregates them. The experimental results are shown in Fig. 3. They indicate that both self-attention and graph attention are useful. This is because users have different degrees of interest in different topics and entities, and capturing these preferences is important for recommendation.

Fig. 3 Effectiveness of different attention networks

2.5 Parameter Sensitivity

In this section, we study the effect of the embedding dimension d and the training weight λ2 of the knowledge graph embedding term on model performance. We vary d from 2 to 128 and λ2 from 0 to 1.0, keeping the other parameters constant. The AUC results are shown in Fig. 4. Fig. 4(a) shows that the performance of the model initially improves as d increases, since larger embeddings can encode more useful information, but degrades after d = 64, possibly due to overfitting. Fig. 4(b) shows that NRTK performs best when λ2 = 0.01.

Fig. 4 Parameter sensitivity of NRTK

3 Conclusion

In this paper, we propose NRTK, an end-to-end framework that naturally incorporates a topic model and a knowledge graph into recommendation systems. NRTK overcomes the limitations of existing recommendation methods by addressing two major challenges in news recommendation: 1) explicit and latent topic features are extracted from news titles by topic-level embedding, and users' long-term and short-term preferences for them are mined; 2) through the knowledge graph-level preference propagation module, it automatically propagates users' potential preferences and explores their hierarchical interests in the knowledge graph. We conduct extensive experiments in a news recommendation scenario. The results show that NRTK has a significant advantage over strong baselines.

For future work, we plan to improve the efficiency and precision of finding topics and further investigate the methods of characterizing entity-relation interactions.


  1. An M X, Wu F Z, Wu C H, et al. Neural news recommendation with long- and short-term user representations[C]// Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2019: 336-345.
  2. Huang P S, He X D, Gao J F, et al. Learning deep structured semantic models for web search using clickthrough data[C]// Proceedings of the 22nd ACM International Conference on Information & Knowledge Management. New York: ACM, 2013: 2333-2338.
  3. Cheng H T, Koc L, Harmsen J, et al. Wide & deep learning for recommender systems[C]// Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. New York: ACM, 2016: 7-10.
  4. Wang H W, Zhang F Z, Xie X, et al. DKN: Deep knowledge-aware network for news recommendation[C]// Proceedings of the 2018 World Wide Web Conference. New York: ACM, 2018: 1835-1844.
  5. Wang H W, Zhang F Z, Wang J L, et al. RippleNet: Propagating user preferences on the knowledge graph for recommender systems[C]// Proceedings of the 27th ACM International Conference on Information and Knowledge Management. New York: ACM, 2018: 417-426.
  6. Bordes A, Usunier N, Garcia-Duran A, et al. Translating embeddings for modeling multi-relational data[C]// Advances in Neural Information Processing Systems. New York: ACM, 2013: 2787-2795.
  7. Řehůřek R, Sojka P. Software framework for topic modelling with large corpora[C]// Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks. Stroudsburg: Association for Computational Linguistics, 2010: 45-50.
  8. McInnes L, Healy J, Melville J. UMAP: Uniform manifold approximation and projection for dimension reduction[EB/OL]. [2022-05-18]. arXiv preprint arXiv:1802.03426.
  9. Campello R, Moulavi D, Sander J. Density-based clustering based on hierarchical density estimates[C]// Pacific-Asia Conference on Knowledge Discovery and Data Mining. New York: ACM, 2013: 160-172.
  10. McInnes L, Healy J. Accelerated hierarchical density based clustering[C]// 2017 IEEE International Conference on Data Mining Workshops (ICDMW). Washington D C: IEEE, 2017: 33-42.
  11. Rendle S. Factorization machines with libFM[C]// ACM Transactions on Intelligent Systems and Technology (TIST). New York: ACM, 2012: 1-22.
  12. Guo H F, Tang R M, Ye Y M, et al. DeepFM: A factorization-machine based neural network for CTR prediction[EB/OL]. [2022-05-18]. arXiv preprint arXiv:1703.04247.
