Click-through rate prediction model integrating user interest and multi-head attention mechanism
Journal of Big Data volume 10, Article number: 11 (2023)
Abstract
The purpose of click-through rate (CTR) prediction is to estimate how likely a user is to click on an advertisement or item. It is required by many internet applications, such as online advertising and recommendation systems. Previous CTR estimation approaches suffer from two flaws. First, input features (such as user id, user age, item id, item category) are usually sparse and high-dimensional, so effective prediction relies on high-order combinatorial features, and having domain experts craft these by hand takes a long time and is difficult to complete. Second, user interests are not all the same, and the accuracy of the model will increase significantly if this readily identifiable factor is incorporated into the prediction model. This study therefore proposes IARM (interactive attention rate estimation model), which incorporates user interest as well as a multi-head self-attention mechanism. The model uses a deep learning network to derive the user's interest representation from user attributes. A multi-head self-attention mechanism with a residual network is then employed to learn feature interactions, which increases the influence of significant features on the estimation result and improves its accuracy. According to results on public experimental data sets, the IARM model outperforms other recent prediction models on the evaluation metrics AUC and LOSS.
Introduction
The main purpose of a recommendation model is to automatically suggest items of potential interest to a user based on the user's personal and historical data [1, 2]. For example, in real-world online shopping, a system with a well-performing recommendation model can boost not just user satisfaction but also sales volume [3, 4]. Each individual produces distinct data as a result of gender, age, occupation, and other factors, and the recommendation model uses these data to assess the user's probable purchases, so the information provided by the user is crucial. The classic recommendation approach is based mostly on users' item ratings: the user's preferred category is inferred from the ratings, and then a recommendation is issued [5, 6]. This approach has the following drawbacks. First, many real users do not rate items at all, or deliberately give inflated or deflated ratings, so a model that relies only on rating features loses precision, generalizes poorly, and adapts badly to change [7, 8]. Second, every individual has characteristics that distinguish them from others, and these characteristics can be discovered from data [9,10,11]. Third, users and items have many features, which differ from one another and influence recommendation outcomes to different degrees. Traditional models pay too little attention to the crucial features, which wastes valuable data and reduces model accuracy [12,13,14].
Thanks to the rapid growth of artificial intelligence, machine learning and deep learning have become widely used in recommendation models [15]. A number of deep-learning-based models have been proposed during the last few years [16, 17], including PNN [18], xDeepFM [19], AFM [20], and a variety of others. Unlike traditional recommendation models, deep learning recommendation models can automatically capture complex relationships within the data, as well as nonlinear interactions between users and items, and obtain more complex and abstract high-order interactive feature representations [21, 22]. Inspired by visual attention, researchers developed the attention mechanism, which directs the neural network to concentrate on the most significant parts of the input and to give them more weight [23, 24]. In this way, the model can capture the crucial feature combinations of both users and items, and the weight of each feature can be displayed, which makes the model easier to interpret in the recommendation task [25, 26].
In summary, modeling user interest and automatically producing feature matrices with different weights can increase the accuracy of a recommendation model. This research therefore proposes a click-through rate estimation model (IARM) that merges user interest with a multi-head attention mechanism. The IARM model takes into consideration not just the impact of user interest on recommendation outcomes, but also the differences between features: it employs a multi-head self-attention mechanism and a residual network to create feature matrices with different weights. This work makes the following major contributions:

1. We propose a new IARM model. The model uses deep learning and the multi-head self-attention mechanism to automatically extract various kinds of information from the data, so that the data are used more fully.

2. The IARM model incorporates user interest, which widens the distinction between users and improves the accuracy of the recommended results.

3. The IARM model employs a residual network and a multi-head self-attention mechanism to identify the cross-feature combinations that deserve more weight, allowing key features to play a larger part in the recommendation process and improving the model's accuracy.

4. We ran comprehensive experiments on several real-world data sets. According to the experimental results on CTR prediction, our proposed method not only outperforms existing state-of-the-art approaches, but also provides good model explainability.
The rest of this paper is structured as follows. We describe the relevant work in "Related work". Our model's structure is introduced in "IARM model". The experimental findings and extensive analysis are presented in "Experiment". In "Conclusion", we wrap up this study and discuss the next steps.
Related work
User interest
User interest intuitively represents each user's unique qualities, so it plays a critical role in recommendation models [27,28,29]. For example, Google's Wide & Deep model, which combines the benefits of a shallow linear model with those of a deep model, uses the shallow model's memorization ability to capture each user's interest. The DIN model [30] proposed by Alibaba combines the user's historical behavior sequence with an attention mechanism to dynamically compute changes in the user's interest, which improves the accuracy of the recommendation results to a degree.
Learning feature interactions
Learning feature interactions is a fundamental problem that has received a lot of research attention. A well-known example is Factorization Machines (FM), which were designed mainly to capture first- and second-order feature interactions and have proved useful for a variety of tasks in recommender systems [31]. Many variants of factorization machines were subsequently proposed. Field-aware Factorization Machines (FFM), for instance, model fine-grained interactions between features from different fields. GBFM and AFM consider the importance of different second-order feature interactions. All of these methods, however, are geared toward modeling low-order feature interactions.
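To make the notion of first- and second-order interactions concrete, the following is a minimal sketch of the standard FM scoring function, in which every pair of features interacts through the inner product of two latent vectors. The names `fm_predict`, `w0`, `w`, and `v` are illustrative and do not come from this article, and the naive double loop is chosen for readability rather than speed.

```python
from typing import List

Vector = List[float]


def fm_predict(x: Vector, w0: float, w: Vector, v: List[Vector]) -> float:
    """Score one sample with a second-order factorization machine.

    x  : feature values (mostly zeros for sparse input)
    w0 : global bias
    w  : first-order weights, one per feature
    v  : latent vectors, one per feature (illustrative names)
    """
    # First-order part: bias plus a weighted sum of the raw features.
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    # Second-order part: each pair (i, j) is weighted by <v_i, v_j>.
    pairwise = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            inner = sum(a * b for a, b in zip(v[i], v[j]))
            pairwise += inner * x[i] * x[j]
    return linear + pairwise


# Toy input with three features, of which only the first two are active.
score = fm_predict(
    x=[1.0, 1.0, 0.0],
    w0=0.1,
    w=[0.2, 0.3, 0.4],
    v=[[1.0, 0.0], [0.5, 2.0], [3.0, 3.0]],
)
```

Because the third feature is zero, only the pair formed by the first two features contributes to the second-order term, which illustrates why FM copes well with sparse input.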
Recent research has attempted to model high-order feature interactions. NFM stacks deep neural networks on top of the output of second-order feature interactions to model higher-order features. Similarly, PNN [32], FFM, DeepCrossing, Wide & Deep, and DeepFM use feedforward neural networks to model high-order feature interactions. All of these methods, however, learn high-order feature interactions implicitly, resulting in poor model explainability. On the other hand, there are three lines of study that learn feature interactions explicitly. First, Deep & Cross and xDeepFM take the bit-wise and vector-wise outer product of features, respectively. Although they perform explicit feature interactions, it is not straightforward to determine which combinations are useful. Second, certain tree-based techniques [33] combine the strengths of embedding-based and tree-based models, but the training procedure has to be broken into several stages.
Self-attention and residual networks
Attention and residual networks are two recent deep learning techniques used in our proposed model. Attention was initially proposed in the context of neural machine translation and has since proved useful in a range of tasks, including question answering [34], text summarization, and recommender systems. Vaswani et al. went on to propose multi-head self-attention as a way to model complex word dependencies in machine translation [35]. Residual networks achieved state-of-the-art results in the ImageNet competition. The residual connection, which can be written as \(y = F(x) + x\), lets gradients flow across intermediate layers, making it a common building block for training very deep neural networks [36, 37].
In summary, this research proposes a CTR estimation model that combines user interest with a multi-head attention mechanism. The model first uses deep learning to automatically learn each user's individual interest representation, which distinguishes users from one another; next, using the multi-head attention mechanism and a residual network, it obtains feature combinations with different weights. The output layer then produces the prediction.
IARM model
This section first describes the proposed IARM approach, which can automatically learn feature interactions for CTR prediction. It then shows how to learn the user interest representation and how to use the multi-head attention mechanism to model high-order combinatorial features. The model's structure is depicted in Fig. 1.
Overview
The goal of the IARM model is to map the user's long-term interest matrix, together with high-order interaction features and feature matrices carrying different weights, into a low-dimensional space. The proposed approach takes the feature vector x as input and projects all features into the same low-dimensional space using an embedding layer. The interest layer then processes the user information to produce the user interest representation. The full field information is fed into the interaction layer to obtain a high-order cross-feature matrix and features with different weights. Finally, the three feature matrices are merged into the final feature matrix, which is passed to the output layer.
Input layer
We start with a sparse vector, which is the concatenation of all fields, to represent user profiles and item attributes. Specifically,
\[ x = [x_1; x_2; \ldots; x_M], \tag{1} \]
where \(M\) is the total number of feature fields and \(x_i\) is the feature representation of the \(i\)-th field. If the \(i\)-th field is categorical, \(x_i\) is a one-hot vector (e.g., \(x_1\) in Fig. 2). If the \(i\)-th field is numerical, \(x_i\) is a scalar value (e.g., \(x_M\) in Fig. 2).
Embedding layer
Because categorical feature representations are sparse and high-dimensional, mapping them into low-dimensional spaces is common practice (e.g., word embeddings). In particular, we use a low-dimensional vector to represent each categorical feature, i.e.
\[ e_i = V_i x_i, \tag{2} \]
where \(V_i\) is the embedding matrix for field \(i\) and \(x_i\) is a one-hot vector. Categorical features are frequently multi-valued, i.e., \(x_i\) is a multi-hot vector. Take movie watching prediction as an example: there might be a feature field Genre that identifies the genres of a movie and can be multi-valued (e.g., Drama and Romance for the movie "Titanic"). To make Eq. (2) compatible with multi-valued inputs, we extend it and represent the multi-valued feature field as the average of the corresponding feature embedding vectors:
\[ e_i = \frac{1}{q} V_i x_i, \tag{3} \]
where \(q\) is the number of values a sample has for the \(i\)-th field and \(x_i\) denotes the multi-hot vector representation of this field.
We also embed numerical features in the same low-dimensional feature space as categorical features to allow interaction between them. We represent a numerical feature as follows:
\[ e_m = v_m x_m, \tag{4} \]
where \(v_m\) is an embedding vector for field \(m\), and \(x_m\) is a scalar value.
The embedding layer's output would thus be a concatenation of numerous embedding vectors, as seen in Fig. 2.
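The three cases handled by the embedding layer (one-hot lookup, multi-hot averaging, and scaling of a numerical field vector) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy `genre_table`, and the `age_vector` are hypothetical, and the embedding table is stored row-wise so that `table[j]` plays the role of the j-th column of the embedding matrix of a field.

```python
from typing import List

Vector = List[float]


def embed_categorical(table: List[Vector], indices: List[int]) -> Vector:
    """Embed a one-hot or multi-hot categorical field.

    `indices` lists the active values of the field. With a single index
    this is a plain lookup; with several it is the average of the
    selected embedding vectors.
    """
    if not indices:
        raise ValueError("a categorical field needs at least one active value")
    dim = len(table[0])
    total = [0.0] * dim
    for j in indices:
        for t in range(dim):
            total[t] += table[j][t]
    q = len(indices)  # number of values this sample has for the field
    return [s / q for s in total]


def embed_numerical(field_vector: Vector, value: float) -> Vector:
    """Embed a numerical field by scaling its field vector with the scalar value."""
    return [value * v for v in field_vector]


# Hypothetical "Genre" field with three possible values and embedding size 2.
genre_table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
age_vector = [0.5, -1.0]

single = embed_categorical(genre_table, [2])     # one-hot lookup
multi = embed_categorical(genre_table, [0, 1])   # average of two genres
age = embed_numerical(age_vector, 4.0)           # scaled field vector
```

All three results have the same length, which is what allows categorical and numerical fields to interact in the later layers.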
Interest acquisition layer
To begin, the user information is collected into a feature matrix \(u\). This article then uses a multilayer perceptron to obtain the user's interest representation:
\[ Z_i = f(W_i Z_{i-1} + b_i), \qquad Z_0 = u, \]
where \(Z_i\) is the output of the \(i\)-th layer of the network, \(W_i\) is the weight matrix of that layer, \(b_i\) is its bias term, \(f\) is the ReLU activation function, and \(u\) is the user information feature matrix. The output of the final layer, \(U\), is the user's interest feature matrix.
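The layer-by-layer computation described above can be sketched as follows. This is an illustrative sketch only: it assumes a standard fully connected network in which each layer applies a weight matrix, adds a bias, and passes the result through ReLU, and it treats the user input as a single flat vector. The function names and the toy weights are hypothetical, and real weights would be learned during training.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]


def relu(z: Vector) -> Vector:
    return [max(0.0, value) for value in z]


def dense(weights: Matrix, bias: Vector, inputs: Vector) -> Vector:
    """Affine map W z + b for one layer."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def interest_mlp(u: Vector, layers: List[Tuple[Matrix, Vector]]) -> Vector:
    """Feed the user vector u through every (W_i, b_i) layer with ReLU."""
    z = u
    for weights, bias in layers:
        z = relu(dense(weights, bias, z))
    return z  # output of the last layer, used as the interest representation


# Toy two-layer network with hand-picked weights (assumed values).
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.5]),
    ([[2.0, 3.0]], [-0.5]),
]
interest = interest_mlp([2.0, 1.0], layers)
```

In the toy run, the first layer yields `[1.0, 0.0]` after ReLU clips the negative unit, and the second layer maps this to a single value.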
Interaction layer
Once the numerical and categorical features are in the same low-dimensional space, we can move on to modeling high-order combinatorial features. The main issue is determining which features should be combined to form meaningful high-order features. Traditionally, domain experts do this by hand-crafting combinations based on their knowledge. In this study, we address this issue with the multi-head self-attention mechanism.
Recently, multi-head self-attentive networks have shown remarkable effectiveness in modeling complicated relations. For example, they excel at modeling arbitrary word dependencies in machine translation and sentence embedding, and have been successfully applied to capturing node similarities in graph embedding. In this paper, we extend this approach to model the relationships between different feature fields. The structure of the interaction layer is shown in Fig. 3.
To be more specific, we use the key-value attention mechanism to decide which feature combinations are meaningful. Taking feature \(m\) as an example, we show how to identify multiple meaningful high-order features involving feature \(m\). We begin by defining the correlation between feature \(m\) and feature \(k\) under a specific attention head \(h\) as follows:
\[ \alpha_{m,k}^{(h)} = \frac{\exp\left(\psi^{(h)}(e_m, e_k)\right)}{\sum_{l=1}^{M} \exp\left(\psi^{(h)}(e_m, e_l)\right)}, \qquad \psi^{(h)}(e_m, e_k) = \left\langle W_{\mathrm{Query}}^{(h)} e_m,\; W_{\mathrm{Key}}^{(h)} e_k \right\rangle, \tag{5} \]
where \(\psi^{(h)}(\cdot,\cdot)\) is an attention function that determines the similarity between features \(m\) and \(k\). It can be defined as a neural network or as a simple inner product, i.e., \(\langle\cdot,\cdot\rangle\). We use the inner product in this work since it is simple and effective. \(W_{\mathrm{Query}}^{(h)}, W_{\mathrm{Key}}^{(h)} \in \mathbb{R}^{d' \times d}\) in Eq. (5) are transformation matrices that map the original embedding space \(\mathbb{R}^{d}\) into a new space \(\mathbb{R}^{d'}\). Following that, we update the representation of feature \(m\) in subspace \(h\) by combining all relevant features, guided by the coefficients \(\alpha_{m,k}^{(h)}\):
\[ \tilde{e}_m^{(h)} = \sum_{k=1}^{M} \alpha_{m,k}^{(h)} \left( W_{\mathrm{Value}}^{(h)} e_k \right), \tag{6} \]
where \(W_{\mathrm{Value}}^{(h)} \in \mathbb{R}^{d' \times d}\). Since \(\tilde{e}_m^{(h)} \in \mathbb{R}^{d'}\) is a combination of feature \(m\) and its relevant features (under head \(h\)), it represents a new combinatorial feature learned by our method. In addition, a feature is likely to be involved in several different combinatorial features, which we capture by using multiple heads that create different subspaces and learn distinct feature interactions separately. We collect the combinatorial features learned in all subspaces as follows:
\[ \tilde{e}_m = \tilde{e}_m^{(1)} \oplus \tilde{e}_m^{(2)} \oplus \cdots \oplus \tilde{e}_m^{(H)}, \tag{7} \]
where \(H\) is the total number of heads and \(\oplus\) is the concatenation operator. To preserve previously learned combinatorial features, including raw individual (i.e., first-order) features, we add standard residual connections to our network. Formally,
\[ e_m^{Res} = \mathrm{ReLU}\left( \tilde{e}_m + W^{Res} e_m \right), \tag{8} \]
where \(\mathrm{ReLU}(z) = \max(0, z)\) is a nonlinear activation function, and \(W^{Res} \in \mathbb{R}^{d'H \times d}\) is the projection matrix used when the dimensions do not match [38]. With such an interaction layer, the representation \(e_m\) of each feature is updated into a new feature representation \(e_m^{Res}\), which is a representation of high-order features. Multiple such layers can be stacked, with the output of the previous interaction layer feeding into the next one. In this way, we can model combinatorial features of arbitrary order.
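The interaction layer described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the attention function is the plain inner product named in the text (with no scaling factor), each head is given as a hypothetical triple of query, key, and value matrices, and the toy identity weights stand in for parameters that would normally be learned.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[Vector]
Head = Tuple[Matrix, Matrix, Matrix]  # (W_query, W_key, W_value), each d' x d


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(embeddings: List[Vector], head: Head):
    """Return (updated features, attention weights) for one head."""
    w_query, w_key, w_value = head
    queries = [matvec(w_query, e) for e in embeddings]
    keys = [matvec(w_key, e) for e in embeddings]
    values = [matvec(w_value, e) for e in embeddings]
    dim = len(values[0])
    outputs, weights = [], []
    for q in queries:
        # Inner-product similarity of this feature with every feature.
        alpha = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        weights.append(alpha)
        # Weighted sum of the value-projected features.
        outputs.append(
            [sum(a * v[t] for a, v in zip(alpha, values)) for t in range(dim)]
        )
    return outputs, weights


def interaction_layer(embeddings: List[Vector], heads: List[Head],
                      w_res: Matrix) -> List[Vector]:
    """Concatenate all heads, add the projected input, and apply ReLU."""
    per_head = [attention_head(embeddings, head)[0] for head in heads]
    updated = []
    for m, e_m in enumerate(embeddings):
        concat = [x for head_out in per_head for x in head_out[m]]
        shortcut = matvec(w_res, e_m)  # residual path, shape d'H
        updated.append([max(0.0, c + s) for c, s in zip(concat, shortcut)])
    return updated


# Toy setup: M = 3 fields, d = d' = 2, H = 2 heads, identity weights.
identity = [[1.0, 0.0], [0.0, 1.0]]
heads = [(identity, identity, identity), (identity, identity, identity)]
w_res = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
updated = interaction_layer(embeddings, heads, w_res)
_, attention = attention_head(embeddings, heads[0])
```

Stacking layers amounts to calling `interaction_layer` again on `updated`, with weight matrices whose input size matches the new feature length.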
Output layer
The output of the interaction layer is a set of feature vectors \(\{e_m^{Res}\}_{m=1}^{M}\), which include raw individual features preserved by the residual block as well as combinatorial features learned by the multi-head self-attention mechanism. For the final CTR prediction, we simply concatenate them all and apply a nonlinear projection:
\[ \hat{y} = \sigma\left( w^{\mathsf{T}} \left( e_1^{Res} \oplus e_2^{Res} \oplus \cdots \oplus e_M^{Res} \right) + b \right), \tag{9} \]
where \(w \in \mathbb{R}^{d'HM}\) is a column projection vector that linearly combines the concatenated features, \(b\) is the bias, and \(\sigma(x) = 1/(1 + e^{-x})\) converts the values into users' click probabilities.
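As a small illustration of the output layer, the sketch below concatenates the per-feature vectors, applies a linear projection with a bias, and squashes the result with the logistic function. The function names are hypothetical, and the two-branch form of the sigmoid is an implementation choice made here to avoid overflow for large negative inputs; it computes the same function as the formula in the text.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict_ctr(feature_vectors: List[Vector], w: Vector, b: float) -> float:
    """Concatenate the feature vectors and project them to a click probability."""
    concat = [x for vector in feature_vectors for x in vector]
    if len(concat) != len(w):
        raise ValueError("projection vector must match the concatenated length")
    logit = sum(wi * xi for wi, xi in zip(w, concat)) + b
    return sigmoid(logit)


# Two feature vectors of length 2, so the projection vector has length 4.
neutral = predict_ctr([[0.0, 0.0], [0.0, 0.0]], [1.0, 1.0, 1.0, 1.0], 0.0)
likely = predict_ctr([[1.0, 0.0], [0.0, 2.0]], [1.0, 1.0, 1.0, 1.0], 0.0)
```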
Training
Our loss function is Logloss, which is defined as follows:
\[ \mathrm{Logloss} = -\frac{1}{N} \sum_{j=1}^{N} \left( y_j \log \hat{y}_j + (1 - y_j) \log\left(1 - \hat{y}_j\right) \right), \tag{10} \]
where \(y_j\) and \(\hat{y}_j\) are the ground truth of user clicks and the estimated CTR, respectively, \(j\) indexes the training samples, and \(N\) is the total number of training samples. The parameters to learn in our model are \(\{V_i, v_m, W_{\mathrm{Query}}^{(h)}, W_{\mathrm{Key}}^{(h)}, W_{\mathrm{Value}}^{(h)}, W^{Res}, w, b\}\), which are updated by minimizing the total Logloss using gradient descent.
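The Logloss described above is the usual binary cross-entropy averaged over the samples, and it can be computed as in the sketch below. The clipping constant `eps` is an assumption added here so that predictions of exactly 0 or 1 do not produce an infinite loss; it is not part of the definition in the text.

```python
import math
from typing import Sequence


def log_loss(y_true: Sequence[int], y_pred: Sequence[float],
             eps: float = 1e-15) -> float:
    """Average binary cross-entropy between click labels and predicted CTRs."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # assumed clipping, keeps log finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(y_true)


# A predictor that always answers 0.5 scores log(2) regardless of the labels.
baseline = log_loss([1, 0], [0.5, 0.5])
```

A constant prediction of 0.5 gives a loss of log 2 (about 0.693), which is a convenient sanity check when reading reported Logloss values.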
Experiment
Experimental setup
Experimental data set
Data Sets. Three publicly available real-world data sets are used in this study. Table 1 summarizes their statistics.
Criteo. This is a CTR prediction benchmark data set with 45 million click records on displayed ads. It has 13 numerical and 26 categorical feature fields.
Avazu. This data set records users' mobile behavior, including whether or not a user clicks on a displayed mobile ad. It comprises 23 feature fields that range from user/device characteristics to ad properties.
MovieLens-1M. This data set contains users' ratings of movies. During binarization, we treat samples with a rating of less than 3 as negative samples, since a low score suggests that the user dislikes the movie. Samples with a rating greater than 3 are kept as positive samples, whereas neutral samples (those with a rating equal to 3) are discarded.
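The binarization of MovieLens ratings can be sketched as below. The sketch assumes that "neutral" means a rating of exactly 3, so that ratings below 3 become negative samples and ratings above 3 become positive samples; the function name and the toy rating tuples are hypothetical.

```python
from typing import Iterable, List, Tuple

Rating = Tuple[str, str, int]   # (user id, movie id, rating on a 1-5 scale)
Sample = Tuple[str, str, int]   # (user id, movie id, binary label)


def binarize_ratings(ratings: Iterable[Rating]) -> List[Sample]:
    """Turn star ratings into click-style labels, dropping neutral ratings."""
    samples = []
    for user, movie, rating in ratings:
        if rating > 3:
            samples.append((user, movie, 1))   # positive sample
        elif rating < 3:
            samples.append((user, movie, 0))   # negative sample
        # rating == 3 is treated as neutral and discarded (assumption)
    return samples


toy_ratings = [("u1", "m1", 5), ("u1", "m2", 3), ("u2", "m1", 1)]
labeled = binarize_ratings(toy_ratings)
```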
Evaluation metrics
To assess the performance of all methods, we use two widely used metrics.
AUC. The area under the ROC curve (AUC) measures the probability that a CTR predictor assigns a higher score to a randomly chosen positive item than to a randomly chosen negative item. The higher the AUC, the better the performance.
Logloss. Since all models try to minimize the Logloss defined by Eq. (10), we use it as a straightforward metric.
It is worth noting that for the CTR prediction task, a slightly higher AUC or lower Logloss at the 0.001 level is considered significant, as noted in previous work.
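The probabilistic reading of AUC given above translates directly into code: compare every positive sample with every negative sample and count how often the positive one receives the higher score. The sketch below follows that definition literally and counts ties as half a win, which is a common convention assumed here. It is quadratic in the number of samples, so it is meant for illustration rather than for data sets of the size used in the experiments.

```python
from typing import Sequence


def pairwise_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly by the scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half (assumed convention)
    return wins / (len(positives) * len(negatives))


# Three of the four positive-negative pairs are ordered correctly.
auc = pairwise_auc([1, 0, 1, 0], [0.9, 0.1, 0.4, 0.6])
```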
Comparison model
FM. FM models second-order feature interactions using factorization techniques.
AFM. AFM is one of the most advanced models for capturing second-order feature interactions. It extends FM by using the attention mechanism to distinguish the relative importance of second-order combinatorial features.
NFM. NFM stacks a deep neural network on top of the second-order feature interaction layer. High-order feature interactions are captured implicitly by the nonlinearity of the neural network.
DeepFM. DeepFM uses a deep component to learn high-order cross features while an FM component captures low-order ones, so that both kinds of cross features are learned at the same time.
Wide & Deep. In the Wide & Deep model, the wide component provides memorization while the deep component provides generalization.
DeepCrossing. The DeepCrossing model incorporates a residual network based on DeepFM, which enhances the model's interpretability.
DCN. DCN efficiently captures feature interactions of bounded degree, learns highly nonlinear interactions, and has a low computational cost. It requires no manual feature engineering or exhaustive search.
PNN. The PNN model uses inner and outer products to obtain high-order and low-order cross features, from which it produces the final recommendation result.
AutoInt. AutoInt employs a multi-head self-attention mechanism to produce weighted cross features.
Comparative experiment
Table 2 reports the experimental results, from which the following conclusions can be drawn. (1) The effect of the attention mechanism can be examined by comparing the FM and AFM models. AFM performs better than FM on all data sets, indicating that the attention mechanism benefits the recommendation model. (2) As Table 2 shows, the models that capture high-order cross-feature interactions each have advantages and disadvantages. Comparing DeepFM, which uses high-order cross features, with FM, which does not, shows that high-order cross features improve accuracy. (3) On all three data sets, the proposed IARM model has the highest AUC and the lowest LOSS of all models, which demonstrates that the IARM model provides more accurate and effective recommendations.
Ablation experiment of the model
To further validate and understand the proposed model, this paper conducts an ablation study that compares several variants of IARM.
The influence of personal interest on the model
The user interest module is integrated into the basic IARM model, allowing it to learn each user's individual interests. To test the interest module's effectiveness, this study removes it from the IARM model while keeping the other structures unchanged, yielding a variant called IARM*. As shown in Table 3, performance on all data sets drops when the interest module is removed: on the Criteo, Avazu, and MovieLens data sets, the IARM model outperforms the variant IARM*. This demonstrates that the interest module of the IARM model contributes significantly to the accuracy of the recommendation results.
The influence of network layer parameters on the model
The IARM model proposed in this study learns high-order feature combinations by stacking several interaction layers. This experiment therefore examines how the model's performance varies with the number of interaction layers, that is, with the order of the combinatorial features. When there is no interaction layer, the model works directly on the raw individual features and no combinatorial features are considered. The findings are summarized in Fig. 4. On the MovieLens data set, performance improves greatly once a single interaction layer is used, i.e., once feature interaction is taken into account, demonstrating that combinatorial features carry highly relevant information for prediction. Performance improves further as the number of interaction layers increases and higher-order combinatorial features are taken into account. When the number of layers reaches three, performance stabilizes, suggesting that adding extremely high-order features contributes little further predictive information (Fig. 4).
The influence of selfattention mechanism
The IARM model uses a multi-head self-attention mechanism to assign different weights to different features, which increases the influence of relevant features on recommendation outcomes and improves recommendation accuracy. To test the usefulness of this module, this research builds an IRM model that removes the multi-head self-attention mechanism while leaving the other structures unchanged. As can be seen in Table 4, overall performance on the Criteo, Avazu, and MovieLens data sets drops when the multi-head attention module is removed. This demonstrates that the multi-head self-attention mechanism contributes to the performance of the IARM model.
Influence of residual network
The standard IARM model in this article uses a residual network, which preserves all of the learned combinatorial features and thereby allows very high-order combinations to be modeled. To demonstrate its contribution, this research removes the residual network from the standard IARM model while keeping the other structures unchanged. As seen in Table 5, performance on all data sets drops when the residual network is removed. On the Criteo, Avazu, and MovieLens data sets, the full IARM model performs much better than the variant without the residual network, demonstrating that the residual connection is required for modeling high-order feature interactions in our proposed method.
Visual explanation
A good recommendation model should improve not just the quality of recommendations, but also their interpretability. In this section we use MovieLens as an example to explain how the IARM model finds useful cross-feature combinations.
This article also examines the correlation between the feature fields in the data, which is calculated from the average attention scores over all of the data. Fig. 5 summarizes the correlation between the different feature fields. It can be observed that the features sex and age (the light-colored patches) are strongly correlated, and this combination of features plays a significant role in the recommendation results (Fig. 5).
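Averaging attention scores into a field-by-field correlation matrix can be sketched as follows. The sketch assumes that each sample yields one square matrix of attention weights whose entry in row m and column k is the weight that field m places on field k; the function name and the toy matrices are hypothetical, and real values would come from the trained model.

```python
from typing import List

Matrix = List[List[float]]


def average_attention(per_sample: List[Matrix]) -> Matrix:
    """Average per-sample attention matrices into one field-by-field matrix."""
    if not per_sample:
        raise ValueError("at least one attention matrix is required")
    count = len(per_sample)
    size = len(per_sample[0])
    averaged = [[0.0] * size for _ in range(size)]
    for weights in per_sample:
        for m in range(size):
            for k in range(size):
                averaged[m][k] += weights[m][k] / count
    return averaged


# Two toy samples over two fields: one attends to itself, one to the other field.
sample_a = [[1.0, 0.0], [0.0, 1.0]]
sample_b = [[0.0, 1.0], [1.0, 0.0]]
correlation = average_attention([sample_a, sample_b])
```

Large entries in the averaged matrix mark the field pairs that the model combines most often, which is what a heat map such as Fig. 5 visualizes.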
Model generalization
Model generalization describes whether a model remains similarly accurate when applied to fresh data. This section uses the Criteo data set as an example and partitions it into training and test sets with test ratios of 0.2 and 0.3, respectively. The 0.3 split yields a larger test set, so the model is evaluated on more unseen data. As seen in Table 6, AUC and LOSS vary with the split ratio. Overall, the IARM model has the highest AUC and the lowest LOSS. Furthermore, its AUC changes relatively little as the test set grows and remains the highest. This demonstrates that the IARM model still outperforms other models on new data and has strong generalization ability.
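A random split of the kind used in this experiment can be sketched as follows. This is an illustrative sketch only: it assumes that 0.2 and 0.3 are the fractions of samples held out for testing, and the function name, the fixed seed, and the integer stand-in samples are hypothetical.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T], test_ratio: float,
                  seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle the samples and hold out `test_ratio` of them for testing."""
    if not 0.0 < test_ratio < 1.0:
        raise ValueError("test_ratio must lie strictly between 0 and 1")
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]


records = list(range(100))  # stand-in for click records
train_20, test_20 = split_dataset(records, 0.2)
train_30, test_30 = split_dataset(records, 0.3)
```

Evaluating the same model on `test_20` and `test_30` then shows how stable the metrics are as the amount of unseen data grows.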
Conclusion
This research proposes a recommendation model that incorporates both user interest and a multi-head attention mechanism. The model can automatically learn the user's preferences and the interactions between high-order features. The key to the technique is the newly added user interest layer together with the interaction layer based on multi-head self-attention, which allows each feature to interact with the other features and learns the importance of each feature. The model's structure, interpretability, and generalization are all discussed and analyzed in this article. The results of the experiments on three real data sets show that the model described in this research is more effective and accurate in its recommendations. In the future, we would like to incorporate contextual information into the model to further increase its accuracy.
Availability of data and materials
Not applicable.
References
1. Deep Session Interest Network for Click-Through Rate Prediction. arXiv:1905.06482v1 [cs.IR] 16 May 2019.
2. AutoInt: Automatic Feature Interaction Learning via Self-Attentive Neural Networks. arXiv:1810.11921v2 [cs.IR] 23 Aug 2019.
3. Wang R, Shivanna R, Cheng DZ, et al. DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to Rank Systems. 2020.
4. Yi T, Dehghani M, Bahri D, et al. Efficient Transformers: A Survey. 2020.
5. Shen X, Yi B, Liu H, et al. Deep variational matrix factorization with knowledge embedding for recommendation system. IEEE Trans Knowl Data Eng. 2019;99:1–1.
6. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. Attention is all you need. 2017.
7. Feature Generation by Convolutional Neural Network for Click-Through Rate Prediction. arXiv:1904.04447v1 [cs.IR] 9 Apr 2019.
8. Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167; 2015.
9. Deep & Cross Network for Ad Click Predictions. arXiv:1708.05123 [cs.LG] 17 Aug 2017.
10. Operation-aware Neural Networks for User Response Prediction. arXiv:1904.12579v1 [cs.IR] 2 Apr 2019.
11. Jianfang W, Xilin W, Xu Y, Qiuling Z. Deviation-based graph attention neural network recommendation algorithm. Control Decis 2021; 1–9.
12. Wide & Deep Learning for Recommender Systems. arXiv:1606.07792v1 [cs.LG] 24 Jun 2016.
13. LLC, Lee DL, Liu Z, et al. Multi-interest network with dynamic routing for recommendation at Tmall. ACM. 2019.
14. Jianing Z, Jingsheng L, Xuexue Z. Graph network social recommendation algorithm based on agrugnn. Comput Syst Appl. 2021;30(05):219–27.
15. Zou H, Zheng M. Random node recommend algorithm for influence maximization in social network. In: 2018 9th International Conference on Information Technology in Medicine and Education (ITME). IEEE Computer Society, 2018.
16. Shan Y, Hoens TR, Jiao J, Wang H, Yu D, Mao JC. Deep crossing: Web-scale modeling without manually crafted combinatorial features. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM; p. 255–262.
17. Pande H. Field-embedded factorization machines for click-through rate prediction. 2020.
18. Qu Y, Han C, Kan R, et al. Product-based neural networks for user response prediction. In: 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016.
19. xDeepFM: Combining explicit and implicit feature interactions for recommender systems. arXiv:1803.05170v3 [cs.LG] 30 May 2018.
20. Xiao J, Ye H, He X, et al. Attentional factorization machines: learning the weight of feature interactions via attention networks. 2017.
21. Kuifeng Yu, Guihua D, Xiang S. College entrance examination volunteer recommendation algorithm based on multi-feature weight fuzzy clustering. J Central South Univ (Nat Sci Edn). 2020;51(12):3418–29.
22. Rui WH, Sui L, Jian HL. News recommendation algorithm based on the combination of content recommendation and time function. Comput Digital Eng. 2020;48(12):2973–7.
23. Zhang W, Du T, Wang J. Deep learning over multi-field categorical data: a case study on user response prediction. 2016.
24. Pan J, Xu J, Ruiz AL, et al. Field-weighted factorization machines for click-through rate prediction in display advertising. 2018.
25. Ye Q, Xiongkai S, Rong G, Chunzhi W, Jing L. Social recommendation algorithm based on attention gated neural network. Comput Eng Appl. 2021; 1–9.
26. Yan G. Research on recommendation algorithm based on the convolutional neural network. Nanjing University of Posts and Telecommunications, 2020.
27. Gai K, Zhu X, Li H, et al. Learning piece-wise linear models from large scale data for ad click prediction. 2017.
28. Shengyun Z, Hengzhong J. Factorization machine. J Digital Content Soc Korea. 2017; 18.
29. Guo H, Tang R, Ye Y, Li X, He X. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247; 2017.
30. Zhou G, Song C, Zhu X, et al. Deep interest network for click-through rate prediction. 2017.
31. Slim A, Hush D, Ojha T, et al. An automated framework to recommend a suitable academic program, course and instructor. In: 2019 IEEE Fifth International Conference on Big Data Computing Service and Applications (BigDataService). IEEE, 2019.
32. Covington P, Adams J, Sargin E. Deep neural networks for YouTube recommendations. In: ACM Conference on Recommender Systems. ACM, 2016; p. 191–198.
33. Zhang W, Du T, Wang J. Deep learning over multi-field categorical data. In: European Conference on Information Retrieval. Springer, 2016; p. 45–57.
34. Zhu J, Shan Y, Mao JC, Yu D, Rahmanian H, Zhang Y. Deep embedding forest: Forest-based serving with deep embedding features. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017; p. 1703–1711.
35. WangCheng K, McAuley J. Self-attentive sequential recommendation. In: 2018 IEEE International Conference on Data Mining (ICDM). IEEE, 2018; p. 197–206.
36. Zhou G, Zhu X, Song C, Fan Y, Zhu H, Ma X, Yan Y, Jin J, Li H, Gai K. Deep interest network for click-through rate prediction. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2018; p. 1059–1068.
37. Xiao J, Ye H, He X, Zhang H, Wu F, Chua TS. Attentional factorization machines: learning the weight of feature interactions via attention networks. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence. AAAI Press, 2017; p. 3119–3125.
38. Zhang QL, Rao L, Yang Y. DGFFM: generalized field-aware factorization machine based on DenseNet. In: 2019 International Joint Conference on Neural Networks (IJCNN). 2019.
Acknowledgements
The authors would like to thank colleagues and all the anonymous reviewers for their insightful comments and valuable feedback, which helped improve the paper. This work was financially supported by the Bingtuan Science and Technology Public Relations Project, a data-driven regional smart education service key technology research and application demonstration (2021AB023).
Funding
Not applicable.
Author information
Contributions
WZ, YH wrote the main manuscript text and BY, ZZ prepared Tables 1, 2 and 3. All authors reviewed the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, W., Han, Y., Yi, B. et al. Click-through rate prediction model integrating user interest and multi-head attention mechanism. J Big Data 10, 11 (2023). https://doi.org/10.1186/s40537-023-00688-6
DOI: https://doi.org/10.1186/s40537-023-00688-6
Keywords
User interest
Multi-head self-attention mechanism
Residual network
Click-through rate prediction model