
A hybrid recommender system based-on link prediction for movie baskets analysis

Abstract

Over the past decade, recommender systems have been among the topics most sought after by researchers. Analyzing the baskets of online systems’ customers and recommending attractive products (movies) to them is very important. Offering a customer an attractive, preferred movie increases the sales rate and ultimately improves the system. Various methods have been proposed so far to analyze customer baskets and recommend entertaining movies, but each of them has challenges, such as low accuracy and high recommendation error. In this paper, a link prediction-based method is used to address the challenges of other methods. The proposed method consists of four phases: (1) Running the content-based recommender system (CBRS): in this phase, all users are clustered using the density-based spatial clustering of applications with noise (DBScan) algorithm, and new users are classified using a Deep Neural Network (DNN). (2) Running the Collaborative Recommender System (CRS) based on a hybrid similarity criterion, through which similarities between the new user and the users in the selected category are calculated based on a threshold (lambda). Similarity criteria are determined based on age, gender, and occupation. The collaborative recommender system extracts the users who are most similar to the new user. Then, the higher-rated movie services are suggested to the new user based on the adjacency matrix. (3) Running the improved Friendlink algorithm on the dataset to calculate the similarity between users who are connected through links. (4) Combining the output of the collaborative recommender system with that of the improved Friendlink algorithm. The results show that the Mean Squared Error (MSE) of the proposed model decreased by 8.59%, 8.67%, 8.45% and 8.15%, respectively, compared to basic models such as Naive Bayes, multi-attribute decision tree and randomized algorithm. In addition, the Mean Absolute Error (MAE) of the proposed method decreased by 4.5% compared to SVD and approximately 4.4% compared to ApproSVD, and its Root Mean Squared Error (RMSE) decreased by 6.05% compared to SVD and approximately 6.02% compared to ApproSVD.

Introduction

Because recommender systems are highly important in social networks, real life, e-commerce, shopping cart analysis, etc., a lot of research has been done on them in recent years [1,2,3]. Recommender systems are among the most popular systems and have attracted the attention of many researchers during the past decade. They are used to filter huge amounts of information, such as users’ carts [4], and are applied in a variety of settings such as shops, libraries, restaurants, tourism systems and shopping carts to provide attractive items such as movie services [5]. These systems play an important role in e-commerce [6]. Given the huge amount of information that exists, providing the most appealing services with high accuracy and in appropriate time is an important issue. The service recommender system enables users to review products by features such as the product’s name, manufacturer, production date, brand type, and so on. For new users about whom the system does not have enough information (the cold start problem), the recommender system offers a list of products that have been rated by other users [7]. User cold start is one of the most important challenges of recommender systems [8]; it occurs when the user has no activity or transactions in the system. A variety of recommender systems have been proposed to deal with the cold start problem. In general, recommender systems are divided into two categories:

  • Traditional recommender systems

    A. Content-based filtering [9].

    B. Collaborative filtering [10].

    C. Hybrid recommender systems [11].

  • Modern recommender systems

    A. Demographic-based approach [12].

    B. Knowledge-based approach [13].

Collaborative and content-based filtering systems are the methods that have been studied most by researchers. Content-based systems classify users based on their demographic information. Collaborative filtering is one of the most widely used recommendation techniques; it offers users the items that have been rated or selected by other similar users [14]. For example, if two users have similar interests and behaviors, the services (films) purchased by one are recommended to the other [15]. In this approach, unlike content-based systems, similar users are identified and highly rated items are offered to them. It is used to present a list of products to a group of users using data mining (clustering) techniques [16]. Similarity criteria for finding adjacent users or similar activities are one of the main requirements for making recommendations in collaborative systems. They make it possible to identify similar users or services based on their activity, category and demographic information. In this study, similarity criteria are used in the collaborative recommender system, in several steps, to offer the new user items that have been rated by similar users.

The most important challenges of online systems are the loss of customers and the lack of products that attract them. Various methods have been proposed so far to address these challenges, each of which has its drawbacks. Therefore, in this paper, we present a hybrid method that addresses the challenges of other methods. The proposed method is a combination of DNN and DBSCAN clustering algorithm in the CBR core and a combination of hybrid similarity criteria and the new Pro-FriendLink algorithm in CRS.

In general, this study presents a hybrid system based on a CBRS and a CRS for analyzing users’ carts in an online movie system. In the CBRS, the DBScan clustering algorithm and a DNN are used to determine basic categories for users based on demographic information and to classify new users. DBScan is used for the initial clustering of users mainly because of its speed and its ability to handle large amounts of information compared to other clustering algorithms. Likewise, the DNN is used to classify new users mainly because, compared to other classification methods, it can handle huge amounts of information and many hidden layers. The DNN enables new users to be assigned to the target group with high accuracy. The CRS uses a combination of similarity criteria and the improved FriendLink algorithm to determine the similarity between new users and other users. With the hybrid similarity criteria, the similarity between existing users and the new user is calculated with respect to a threshold. The improved FriendLink algorithm provides friend recommendations based on user communication in the online movie system. Therefore, in this paper, a combination of four phases is used to analyze customer baskets. The combination of a DNN and DBScan clustering for customer basket analysis is an innovation in itself. In addition, the core of the CRS uses a hybrid similarity criterion and a new improved link prediction algorithm called Pro-FriendLink, which have not been used in any paper so far; this is one of the most important innovations of the proposed design.

The main contributions of this paper are:

  • The combination of DNN and DBSCAN clustering algorithm in the CBR core.

  • The combination of hybrid similarity criteria and the new Pro-FriendLink algorithm in CRS.

  • The proposed Pro-FriendLink algorithm as a new method for RSs.

The remainder of this paper will be presented as follows: “Related works” section reviews the literature, “The proposed method” section describes the proposed approach and architecture, and in “Results” and “Discussion” sections, the results are presented and the conclusions are discussed.

Related works

Several researchers have tried to solve the cold start problem when making recommendations in movie systems. Kim et al. [17] addressed the cold start problem concerning movies and users. They introduced an important traditional collaborative filtering system. In this model, two similarity matrices were used: one showed the similarity between users and movies and the other showed the similarity between the users themselves. Then, based on the prediction mechanism discussed, recommendations were made to the users. One weakness of this study was its high memory usage with respect to users and movies, which was due to the construction of several similarity matrices [17]. Bobadilla et al. [18] used a neural network as a collaborative filtering RS to reduce cold start issues for new users. They evaluated the Movielens and Netflix datasets and, because the data were non-numeric, used the Jaccard similarity index [18]. Byström [19] recommended movies to users by clustering movies with the k-means algorithm, based on users’ comments about the movies. Byström studied the well-known Movielens dataset and implemented the approach on a data collection of 10,109 movies rated by 2113 users [19]. Lika et al. [20] introduced a model in which classification algorithms such as Naïve Bayes, decision tree, and a random classification algorithm were used as similarity metrics in order to recommend movies to users. They also evaluated the Movielens dataset [20]. To enhance system performance and solve the cold start problem, Pereira et al. [21] proposed a hybrid method that includes both collaborative filtering and demographic information. They used a hybrid co-clustering algorithm and machine learning to solve the cold start problem and evaluated the Movielens, Jester, and Netflix datasets [21]. Sperlì et al. [22] provided a recommendation system to improve the social networking approach. In that paper, an RS designed for big data applications is used to provide useful recommendations on online social networks. The proposed technique is a collaborative and user-centric approach that exploits the interactions between users and creates multimedia content on one or more social networks in a new and effective way. Experiments on data collected from several online social networks showed the feasibility of the approach for the problem of social media recommendation. Kutty et al. [23] reviewed the challenges and solutions of recommender systems for large social networks. The paper states that social networks are crucial for networking, communication, and content sharing. Social networking applications generate a great deal of information on a daily basis, and social networks are subject to extensive research due to the heterogeneity of their data and structures, their size and their dynamics. When such a large amount of data is used by recommender systems, the results can help to solve social business issues and to improve friend recommendations. It is a review paper that compares several trends with each other. Lin et al. [24] developed a neural network-based recommendation system for recommending movies to users. Challenges such as scalability, data sparsity and user trust, which have so far received less attention than user and movie cold start, were also addressed through preprocessing, clustering and classification. Walek et al. [25] proposed a hybrid recommender system predictor for recommending suitable movies. This system contains a recommender module combining a collaborative filtering system, a content-based system, and a fuzzy expert system.

Table 1 summarizes previous approaches to movie recommendation and the cold start challenge. This table outlines the advantages and disadvantages of each method.

Table 1 Summary of previous approaches to movie recommendation and the challenge of cold start users

Figure 1 shows the results of research conducted from 2000 to 2019 on link prediction and recommender systems.

Fig. 1 Research on link prediction from 2000 to 2019

As can be seen, a large number of studies on link prediction were carried out in 2013 and 2014. The results of another survey of the research conducted from 2000 to 2019 on recommender systems in social networks are shown in Fig. 2.

Fig. 2 Research from 2000 to 2019 on recommender systems in social networks

As can be seen in the figure above, a large number of studies on recommender systems were carried out in 2013 and 2014, and a great number were also conducted in 2015 and 2016. Figure 3 compares the research conducted from 2000 to 2019 in the fields of link prediction and recommender systems.

Fig. 3 Comparison of research conducted from 2000 to 2019 in link prediction and recommender systems

The number of link prediction studies is about 0.5 times higher than that of recommender systems. The literature review shows that various methods have been proposed to provide users with recommendations in movie services. Each of these methods has its own challenges, such as inadequate accuracy, high error rate and lack of appealing services. This paper presents a recommender system based on a combination of content-based and collaborative recommender systems that both solves the cold start problem and addresses the challenge of users’ trust.

According to the review of previous records and research in the field of customer basket analysis in online movie systems, it was observed that each of the proposed methods has challenges and problems such as high error and lack of accuracy. Therefore, in this paper, we will present a hybrid method that improves the challenges of other methods. The proposed method is a combination of DNN and DBSCAN clustering algorithm in the CBR core and a combination of hybrid similarity criteria and the new Pro-FriendLink algorithm in CRS.

The proposed method

In this section, we describe the proposed method with reference to the flowchart in Fig. 4; the items specified in the flowchart are described in full detail in the following sections.

Fig. 4 Block diagram of the proposed method

As can be seen in Fig. 4, the Movielens dataset is first introduced into the hybrid recommender system. This database has three sections: datasets of user communications, demographic information and rated movies. The datasets of demographic information and rated movies are used in CBRSs and the user communication dataset is used in collaborative systems. After entering the dataset into the proposed system, the first phase of the proposed method is executed. The first phase involves content-based recommendation system. In CBRS, DBScan clustering algorithm is used for the initial clustering of users’ dataset based on demographic information and DNN algorithm to classify new users. In this phase, all users are first clustered based on demographic information such as age, occupation and gender using DBScan algorithm.

Every user in the system is given a cluster label, and the category specified for each user is used as the label attribute. In phase 1, the CBRS performs the initial clustering and assigns new users to categories. After the categories are determined, phase 2 of the proposed method begins. This phase consists of a CRS based on new similarity criteria. In it, similarities between the new user and the users in the selected category are calculated based on a threshold (lambda). The similarity criteria are computed from age, gender, and occupation. The CRS extracts the users who are most similar to the new user. Then the higher-rated movie services are suggested to the new user based on the proximity matrix. After the movie services for the new user are determined, phase 3, the link prediction algorithm, begins. At this stage, movies are passed through the user communication dataset, and movies are selected based on the communication between associated users and the new user. Phase 4 of the proposed method combines the output of the CRS and the improved FriendLink algorithm. Based on the block diagram above, movie services are offered to new users. Finally, the results are evaluated. Following the phases described above, the steps of the proposed method are as follows:

Phase 1: Content-based recommender system (CBRS)

After loading the dataset into the proposed system, the first phase of the proposed method is executed. The first phase involves a CBRS. In the CBRS, DBScan is used for the initial clustering of users’ dataset according to demographic information and DNN algorithm in order to classify new users. In this phase, all users are first clustered based on demographic information such as age, occupation and gender using the DBScan algorithm. For all users in the system, a label is specified as a cluster. In phase 1, the CBRS clusters users and assigns them to categories.

Clustering all users with DBScan algorithm

At this point, the DBScan clustering algorithm separates users and determines the clusters based on the users’ demographic information. One of the most important features of DBScan compared to other clustering algorithms such as K-Means and X-Means is that it identifies and separates heterogeneous data. This reduces the complexity of the model and, in addition to improving processing speed, improves the initial clustering of users in the online movie system. DBScan is a density-based spatial clustering algorithm that can also identify anomalies in the dataset. It requires two user-defined parameters: the epsilon proximity distance (eps) and the minimum number of points (minpts). For a given point, the points within distance eps are called the adjacent points of that point. If the number of adjacent points is greater than or equal to minpts, this group of points is called a cluster.

The DBScan clustering algorithm labels data points as core points, boundary points and noise (remote) points; noise points lie in the lowest-density regions. The pseudo-code of the DBScan clustering algorithm is given in Algorithm 1. Its inputs are the dataset and the user-defined parameter values eps and minpts.

Algorithm 1 Pseudo-code of the DBScan clustering algorithm

In Alg. 1, a for loop over all points in the dataset runs between lines 2 and 18. In step 2, DBScan takes D[i] as the focal point. In step 3, the distance between the focal point D[i] and the remaining points is calculated, and in step 4 the points whose distance is less than or equal to eps are accepted as adjacent points of the focal point. In step 5, the number of adjacent points of the focal point is counted. Step 6 checks whether the core (main) point is on the adjacent points’ list. In step 7, the algorithm checks whether the number of adjacent points is greater than or equal to minpts. If so, the focal point is designated as a core point. Between lines 8 and 12, a unique class is assigned to the focal point D[i] and its adjacent users. If the focal point is not a core point but is close to a core point, it is defined as a boundary point in steps 13 to 15. If the focal point is neither a core point nor a boundary point, that is, its distance from the core points is greater than eps, it is considered out-of-range data. Finally, in line 19, the algorithm returns the results. Table 2 shows the demographic information of 5 users.
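As a minimal sketch of the labelling logic walked through above (it is not the authors' Algorithm 1 and its line numbers do not correspond to it), the following Python code clusters a few toy points. The helper names `region_query` and `dbscan`, the toy (age, gender code) points, and the values eps = 1.5 and minpts = 2 are assumptions for illustration only; they are not the rows of Table 2. A point is counted as its own neighbour here.

```python
from math import dist

NOISE = -1  # label used for out-of-range (noise) points


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, minpts):
    """Return one cluster label per point; NOISE marks out-of-range points."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned while expanding an earlier cluster
        neighbours = region_query(points, i, eps)
        if len(neighbours) < minpts:
            labels[i] = NOISE  # may later be relabelled as a boundary point
            continue
        labels[i] = cluster  # points[i] is a core point: start a new cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # boundary point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= minpts:
                queue.extend(j_neighbours)  # j is also a core point: keep growing
        cluster += 1
    return labels


# Hypothetical users described by (age, gender code).
toy_users = [(25, 0), (26, 0), (27, 0), (50, 1), (51, 1), (52, 1), (90, 0)]
toy_labels = dbscan(toy_users, eps=1.5, minpts=2)
print(toy_labels)
```

With these toy values the first three users form one cluster, the next three a second cluster, and the last user is left as noise.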

Table 2 Demographic information of 5 users

As can be seen in Table 2, the second column shows the gender, the third column the age, and the fourth column the user’s occupation code. Table 3 shows the values of the jobs defined in the users’ demographic file.

Table 3 The values of the jobs defined in the user demographic file

As can be seen, each occupation has a code. The number of users clustered in the current study is 6040. The results of applying the DBScan clustering algorithm are presented in Table 4.

Table 4 Clustered user data

After applying the clustering algorithm, all users are clustered based on demographic information (age, gender, and occupation). Each user is assigned a cluster label. These labels are used as categories for each user. After clustering, users should be divided into two groups. The first category, which accounts for 70 % of all users, is used to train DNNs and generate models. The second category includes new users who constitute 30 % of all users. The limitations of DBScan clustering algorithm are as follows:

  • It has difficulty detecting clusters with different densities and clusters that are close to each other.

  • Determining its parameters (eps and minpts) is one of its most important problems.

  • It does not work well in case of high dimensional data and high-volume databases.

Separation of training and test samples

An important phase in training the DNN is dividing the samples into two main parts. The first part is used to train the DNN model and the second part is used as test cases for categorizing new users. Sampling is one of the stages of data mining and is considered in the proposed solution.
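The 70/30 division of users into training samples and new (test) users can be sketched as follows. This is only an illustration under assumptions: the function name `split_users`, the random shuffling, and the fixed seed are not taken from the paper, which does not state how the samples are drawn.

```python
import random


def split_users(user_ids, train_fraction=0.7, seed=0):
    """Shuffle the user ids and split them into (training, test) lists."""
    shuffled = list(user_ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed: reproducible split
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# 6040 clustered users, as reported for the dataset in this study.
train_users, new_users = split_users(range(1, 6041))
print(len(train_users), len(new_users))
```

The training part would then be passed, together with the DBScan cluster labels, to the classifier, while the remaining users play the role of new users.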

Classification of new users with DNN

At this stage, using the demographic information of the new users and the clusters specified in the previous step, the appropriate category for the new users can be found. Alg. 2 shows the steps of cluster selection using DNN method.

As can be seen in Fig. 5, the training data, which is the output of the clustering stage, is processed by the DNN and the desired model is generated. Then the new user enters the system as test data and is assigned to a category. Once the cluster or category of the new user is determined, the adjacent users, that is, the users of that category, are extracted. The comments of the adjacent users are taken into consideration in movie recommendations. One of the most important reasons for using the DNN algorithm is that it supports a large number of hidden layers and achieves high classification accuracy on huge amounts of data.

Users are first clustered using the DBScan clustering algorithm and a label is assigned to each user. Users are clustered based on demographic information. Users are then divided into two categories of training and test data and are trained by the DNN algorithm. Eventually a model is produced and new users are categorized and placed in a category. All of these processes take place at the core of the content-based system.

Phase 2: CRS based on hybrid similarity criterion  

After the category for new users has been determined, phase 2 of the proposed method begins. This phase consists of a CRS based on new similarity criteria. In this phase, similarities between the new user and the users in the selected category are calculated based on a threshold (lambda). Similarity criteria are determined based on age, gender, and occupation. The CRS extracts the users who are most similar to the new user. Then the higher-rated movie services are suggested to the new user based on the adjacency matrix.

Suppose that the set of all users in the system is \(U=\{u_{1}, u_{2}, u_{3}, \dots, u_{m}\}\), the new users are \(N=\{n_{1}, n_{2}, n_{3}, \ldots, n_{n}\}\), the collection of products is \(I=\{i_{1}, i_{2}, i_{3}, \ldots, i_{k}\}\), and the demographic information of users is \(D=\{d_{1}, d_{2}, d_{3}, \ldots, d_{l}\}\). Once the target cluster or category is selected for the new user, the adjacent users who are similar to that user \(n_{j}\in N\) should be extracted. The corresponding algorithm for finding the nearest adjacent users is shown in Algorithm (2) in the next section.

As can be seen in the first line, the number of users is given to the algorithm as input. Line three defines the list of users. In line four, a category is defined for the new user. In lines 5 through 12, the users near the new user are computed. Once the adjacent users have been collected in the NG list, the similarity between the new user and the users in the NG list is calculated. Finally, the ultimate prediction is based on the ratings of the adjacent users. The similarity criteria are based on the demographic characteristics of the users, as given by the set D. The similarity between new users and adjacent users is obtained using the following equation.

$$sim\left(n, u\right)=\frac{\sum_{j=1}^{l} SF_j \cdot w_j}{\sum_{j=1}^{l} w_j}$$
(1)

where \(SF_j\) is the similarity value of characteristic j and \(w_j\) is the weight of that characteristic. In the present study, we use age, gender, and occupation to cluster users and calculate their similarities, and assign them different weights according to their importance. For example, 0.5 is assigned to age, 0.25 to gender and 0.25 to occupation. These values can be varied, but they must sum to 1. So:

\(D=\{d_{1}=\text{age}, d_{2}=\text{gender}, d_{3}=\text{occupation}\}\) and the set of weights is \(W=\{w_{1}=0.5, w_{2}=0.25, w_{3}=0.25\}\), with \(w_{1}+w_{2}+w_{3}=1\).

In this paper, a hybrid criterion is used to calculate the similarity between users. For each characteristic \(d_j\) we define a function SF(att1, att2) with values in [0, 1]. This function calculates the similarity of the values of one characteristic for a pair of users. Given the nature of the characteristics considered, there are two general groups of features:

Numeric features

For numeric features such as age, a similarity criterion is defined as follows (2):

$$w_{\text{age}}=\begin{cases}\left(1-\frac{\left|\text{Diff}\right|}{\text{Diff}_{\max}}\right)^{\beta} & \text{if } \left|\text{Diff}\right|\le \text{Diff}_{\max}\\ 0 & \text{if } \left|\text{Diff}\right|> \text{Diff}_{\max}\end{cases}$$
(2)

In the equation above, Diff is the age difference between the users and \(\text{Diff}_{\max}\) is the maximum difference defined by the researcher. To obtain a larger \(w_{\text{age}}\) for a given Diff, the researcher simply sets β to a value less than 1.

String features

In Eq. (3), the equation for calculating the similarity based on the string features is shown.

$$W_{\text{gender}}=\begin{cases}1 & \text{if } att_{1} = att_{2}\\ 0 & \text{if } att_{1} \ne att_{2}\end{cases}$$
(3)

The value is 1 if the feature values att1 and att2 are the same and 0 otherwise.
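Equations (1) to (3) can be combined in a short Python sketch. The function names, the dictionary layout of a user, and the value Diff_max = 20 are assumptions made for this example; the weights 0.5, 0.25 and 0.25 are the example weights given above.

```python
def age_similarity(age_a, age_b, diff_max=20, beta=1.0):
    """Eq. (2): numeric similarity; 0 when the age gap exceeds diff_max."""
    diff = abs(age_a - age_b)
    if diff > diff_max:
        return 0.0
    return (1 - diff / diff_max) ** beta


def string_similarity(att1, att2):
    """Eq. (3): 1 if the two categorical values match, otherwise 0."""
    return 1.0 if att1 == att2 else 0.0


WEIGHTS = {"age": 0.5, "gender": 0.25, "occupation": 0.25}


def user_similarity(new_user, other, weights=WEIGHTS, diff_max=20, beta=1.0):
    """Eq. (1): weighted average of the per-characteristic similarities."""
    sf = {
        "age": age_similarity(new_user["age"], other["age"], diff_max, beta),
        "gender": string_similarity(new_user["gender"], other["gender"]),
        "occupation": string_similarity(new_user["occupation"], other["occupation"]),
    }
    return sum(sf[k] * weights[k] for k in weights) / sum(weights.values())


# Hypothetical users: same gender, 5 years apart, different occupation codes.
new_user = {"age": 25, "gender": "M", "occupation": 4}
neighbour = {"age": 30, "gender": "M", "occupation": 7}
example_sim = user_similarity(new_user, neighbour)
print(example_sim)  # 0.5 * 0.75 + 0.25 * 1 + 0.25 * 0 = 0.625
```

Setting `beta` below 1 raises the age term for the same age gap, which matches the role of β described for Eq. (2).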

Formation of adjacency matrix

After the similarity between the new user and the adjacent users has been obtained using the similarity criteria, the adjacency matrix of the ratings given by the adjacent users is created. Then, using a prediction formula that is explained further below, the ratings given by each user to the desired product are combined with the similarities obtained in the previous step to predict the rating of the target product, which serves as its success factor when recommending it to the new user. It should be noted that the highest-rated products are recommended as superior products. The general form of the adjacency matrix for products and users is as follows.

$$\begin{array}{c|cccccccc} & I_{1} & I_{2} & I_{3} & I_{4} & I_{5} & I_{6} & I_{7} & I_{b}\\ \hline u_{1} & 5 & 4 & 4 & 5 & 2 & 5 & 5 & 3\\ u_{2} & 3 & 5 & 5 & 4 & 5 & 4 & 5 & 5\\ u_{3} & 4 & 3 & 5 & 4 & 3 & 3 & 4 & 4\\ u_{4} & 5 & 5 & 5 & 5 & 2 & 2 & 4 & 3\\ u_{5} & 1 & 2 & 4 & 3 & 1 & 3 & 4 & 5\\ u_{6} & 2 & 5 & 3 & 2 & 3 & 4 & 5 & 4\\ u_{7} & 5 & 4 & 5 & 1 & 5 & 5 & 5 & 5\\ n_{j} & & & & & & & & \end{array}$$
(4)

As relation (4) shows, user 1 gives item 1 a rating of 5 and item 2 a rating of 4. In this paper, we simulate all rated items in order to generate the adjacency matrix of the new user and all users in the adjacent list, and then predict the desired value for the new user in the list.

Predicting new user’s rating

After the adjacency matrix has been formed in the previous step, the ratings given by the adjacent users are used to compute a predicted rating for the new user. These predictions are made for the new user in the final phase. For each new user \(n_j\), the proposed model must predict a value for the item \(I_b\). \(R_{n_j, i_b}\) is the predicted rating assigned to item b by the new user. It is obtained using the following equation.

$$R_{n_{j}, i_{b}}=\frac{\sum_{u\in NG}Sim\left(n_{j}, u\right) \cdot r_{u, i_{b}}}{\sum_{u\in NG}Sim\left(n_{j}, u\right)}+TF$$
(5)

\(r_{u, i_b}\) is the rating given to item \(i_b\) by user u in the adjacent users’ list. Using the prediction formula above, a rating is predicted for each item, and the item with the highest predicted rating is selected. The TF value reflects whether the user is new or old and affects the predicted rating. If the user is new, TF = 0; otherwise, it is an additional rating set by the researcher, ranging from 0.1 to 1, which may affect the total predicted rating.
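A minimal sketch of Eq. (5) in Python follows. The function name `predict_rating` and the similarity values are hypothetical; the three ratings are those given by u1, u2 and u3 to item \(I_b\) in relation (4). Restricting the sums to adjacent users who actually rated the item is an assumption of this sketch.

```python
def predict_rating(similarities, ratings, tf=0.0):
    """Eq. (5): similarity-weighted mean of the neighbours' ratings, plus TF.

    similarities maps each adjacent user u to Sim(n_j, u);
    ratings maps each adjacent user u to r_{u, i_b}.
    Returns None when no adjacent user with non-zero similarity rated the item.
    """
    rated = [u for u in similarities if u in ratings]
    denominator = sum(similarities[u] for u in rated)
    if denominator == 0:
        return None
    numerator = sum(similarities[u] * ratings[u] for u in rated)
    return numerator / denominator + tf


# Hypothetical similarities between the new user and three adjacent users.
sims = {"u1": 0.5, "u2": 1.0, "u3": 0.5}
# Ratings of item I_b by u1, u2 and u3, taken from relation (4).
ratings_ib = {"u1": 3, "u2": 5, "u3": 4}

predicted = predict_rating(sims, ratings_ib)  # new user, so TF = 0
print(predicted)  # (0.5*3 + 1.0*5 + 0.5*4) / 2.0 = 4.25
```

Repeating the call for every candidate item and keeping the item with the highest predicted rating gives the recommendation step described above.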

After the movie services for the new user have been defined, phase 3, the FriendLink algorithm, begins. At this point, the movies are passed through the user communication dataset. In this phase, movies are selected based on the relationships between the users who are similar to the new user. So far, different similarity criteria have been proposed to calculate the similarity between users X and Y in a social network graph G. Most of these criteria use the degree of the nodes and their adjacency in the network to calculate overall link-based similarities. The proposed similarity criterion is based on the following four key factors in the social network graph. These factors include:

  • Degree of the nodes.

  • User popularity.

  • Number of routes.

  • Node Balance.


Based on these factors, relation (6) is defined:

$$Sim_{MyApp}\left(X, Y\right)=AVG\left(\frac{\left|Neib\left(Y\right)\right|}{N}+\frac{\left|Paths_{X, Y}^{l}\right|}{\left|Paths_{X, Y}\right|}+\frac{K_{Y}}{\sum_{i=1}^{n}D\left(K_{i}\right)}+\frac{\left|Neib\left(X\right)\cap Neib\left(Y\right)\right|}{Avg\left(K_{X}, K_{Y}\right)}\right)$$
(6)

In the above relationship, \(Sim_{MyApp}\left(X, Y\right)\) is the proposed similarity criterion for predicting links in social networks. \(\left|Neib\left(Y\right)\right|\) denotes the number of adjacent users of node Y, N is the total number of nodes in the graph, \(\left|Paths_{X, Y}^{l}\right|\) is the number of paths of length l from node X to node Y in graph G, \(\left|Paths_{X, Y}\right|\) is the total number of paths leading to node Y, \(K_{Y}\) is the degree of node Y, \(\sum_{i=1}^{n}D\left(K_{i}\right)\) is the total graph degree, n is the number of nodes of graph G, \(\left|Neib\left(X\right) \cap Neib\left(Y\right)\right|\) is the number of shared adjacent users of node X and node Y, and \(Avg(K_{X}, K_{Y})\) is the average degree of node X and node Y. Finally, relation (6) can be used to obtain the degree of similarity between users X and Y. One of the most important features of this criterion is that it calculates the similarity between two users of a social network with great accuracy. This level of accuracy makes the recommended users or friends more attractive.
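Relation (6) can be illustrated on a small undirected graph. The sketch below makes several assumptions that the text does not fix: AVG is read as the mean of the four terms, \(\left|Paths_{X, Y}\right|\) is taken as the number of simple paths from X to Y with at most `max_len` edges, and the graph, the function names and the values l = 2 and `max_len` = 3 are hypothetical.

```python
def simple_paths(graph, source, target, max_len):
    """Yield every simple path from source to target with at most max_len edges."""
    stack = [(source, [source])]
    while stack:
        node, path = stack.pop()
        if node == target:
            yield path
            continue
        if len(path) - 1 == max_len:
            continue  # path already uses the maximum number of edges
        for nxt in graph[node]:
            if nxt not in path:
                stack.append((nxt, path + [nxt]))


def proposed_similarity(graph, x, y, l=2, max_len=3):
    """Mean of the four terms of relation (6), under the stated assumptions."""
    n_nodes = len(graph)
    paths = list(simple_paths(graph, x, y, max_len))
    paths_l = [p for p in paths if len(p) - 1 == l]
    total_degree = sum(len(neigh) for neigh in graph.values())
    shared = len(set(graph[x]) & set(graph[y]))
    avg_degree = (len(graph[x]) + len(graph[y])) / 2
    terms = [
        len(graph[y]) / n_nodes,                                  # neighbours of Y
        len(paths_l) / len(paths) if paths else 0.0,              # share of length-l paths
        len(graph[y]) / total_degree if total_degree else 0.0,    # popularity of Y
        shared / avg_degree if avg_degree else 0.0,               # shared neighbours
    ]
    return sum(terms) / len(terms)


# Hypothetical four-user network given as adjacency lists.
toy_graph = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}
sim_ad = proposed_similarity(toy_graph, "A", "D")
print(sim_ad)  # terms 0.5, 0.5, 0.2 and 1.0, so the mean is about 0.55
```

Enumerating simple paths is exponential in the worst case, so this form is only suitable for small graphs or short path lengths.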

Phase 3: Improved Friendlink algorithm

In this phase, which is the penultimate phase, the improved Friendlink algorithm is run on the dataset to calculate the similarity between users who are connected through a link. We first describe link prediction via Friendlink and then the improved algorithm. Friendlink is a link prediction algorithm that is widely used to predict future links on social networks, especially in dating contexts.

$$Sim\left({n}_{j},{u}_{j}\right)=\sum _{i=2}^{l}\frac{1}{i-1}\cdot \frac{\left|{path}_{{n}_{j},{u}_{j}}^{i}\right|}{{\prod }_{y=2}^{i}\left(n-y\right)}$$
(7)

Here \({n}_{j}\) is a new user who has recently entered the social network graph, \({u}_{j}\) is the target user, \(l\) (written L below) is the maximum path length, and \(n\) is the total number of graph nodes.

\(\frac{1}{i-1}\) is a weighting factor that only has an effect for paths whose length is greater than 2. Suppose that the maximum path length is 2 (L = 2). In this case the only weight is \(\frac{1}{2-1}=1\), which has no effect on the final similarity. For paths of length 3 (L = 3), the weight becomes 0.5 and has a significant effect. \(\left|{path}_{{n}_{j},{u}_{j}}^{i}\right|\) is the number of paths of length i between the origin and destination users. Consider Fig. 5. Assuming that the new user is u1 and the associated users are u2, u3, u5, u6 and u8, the maximum path length to u4, which is a friend suggested to the new user (u2, u3), is 3. \({\prod }_{y=2}^{i}(n-y)\) is the number of possible paths of length i from the new user to the target user. One of the most important advantages of this similarity criterion is that it does not stop at paths of length 1 but continues to paths of length 2, 3 and 4 in the graph.
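As a rough illustration of relation (7), the sketch below counts simple paths of each length between two users and applies the \(\frac{1}{i-1}\) weight and the \({\prod }_{y=2}^{i}(n-y)\) normalisation. It assumes an undirected graph stored as a dictionary of neighbour sets and more nodes than the maximum path length. The function names and the six-node graph are hypothetical and are not the graph of Fig. 5.

```python
from math import prod


def count_paths_by_length(adj, src, dst, max_len):
    """Return {length: number of simple paths of that length from src to dst}."""
    counts = {}
    stack = [(src, (src,))]
    while stack:
        node, path = stack.pop()
        length = len(path) - 1
        if node == dst and length > 0:
            counts[length] = counts.get(length, 0) + 1
            continue
        if length == max_len:
            continue
        for nxt in adj[node]:
            if nxt not in path:
                stack.append((nxt, path + (nxt,)))
    return counts


def friendlink(adj, src, dst, max_len=3):
    """Relation (7): weighted, normalised count of paths of length 2..max_len."""
    n = len(adj)  # assumes n > max_len so that no factor (n - y) is zero
    counts = count_paths_by_length(adj, src, dst, max_len)
    score = 0.0
    for i in range(2, max_len + 1):
        possible = prod(n - y for y in range(2, i + 1))
        score += (1.0 / (i - 1)) * counts.get(i, 0) / possible
    return score


# Hypothetical undirected graph with edges 1-2, 1-3, 2-4, 3-4, 3-5, 4-6, 5-6.
graph = {
    1: {2, 3},
    2: {1, 4},
    3: {1, 4, 5},
    4: {2, 3, 6},
    5: {3, 6},
    6: {4, 5},
}
print(friendlink(graph, 1, 4))  # 0.5
print(friendlink(graph, 1, 6))  # 0.125
```

User 4 is reached from user 1 by two paths of length 2, giving 2 / (6 - 2) = 0.5. User 6 is reached only by three paths of length 3, giving 0.5 · 3 / ((6 - 2)(6 - 3)) = 0.125, so the longer paths contribute less.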

Fig. 5 An example of the social network and the Friendlink algorithm

As shown in the figure, the origin user (green arrow) in the graph has 5 direct friends (2, 3, 5, 6 and 8). The basic Friendlink algorithm first obtains the graph of the whole network and, after finding the adjacent users of the source node, calculates the best paths to the other nodes considering the path length. Finally, it calculates the similarity according to the presented formula. We have also modified the Friendlink formula to improve the accuracy of link prediction. When the adjacency degree of the nodes is applied, the formula changes as follows.

$$Sim\left({n}_{j},{u}_{j}\right)=\left(\sum _{i=2}^{l}\frac{1}{i-1}\cdot \frac{\left|{path}_{{n}_{j},{u}_{j}}^{i}\right|}{{\prod }_{y=2}^{i}\left(n-y\right)}\right)\cdot \left(\frac{{D}_{{u}_{j}}}{N}\right)$$
(8)

N is the sum of the degrees of all nodes and \({D}_{{u}_{j}}\) is the degree of the target node. We thus try to improve the results by using the Friendlink algorithm with an improved similarity formula. The resulting method is called Pro-Friendlink. The important property of the Pro-Friendlink algorithm is that it gives more importance to friends that are more connected and ultimately recommends users who have more credibility in the social network to the target user. This makes it possible to recommend more attractive and popular users to new users.
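The effect of the extra degree factor in relation (8) can be sketched as follows. This is a minimal illustration, assuming simple paths, an undirected graph stored as a dictionary of neighbour sets, and N taken as the sum of all node degrees. The function names and the toy graph are hypothetical.

```python
from math import prod


def paths_of_length(adj, node, dst, remaining, seen):
    """Count simple paths from node to dst that use exactly `remaining` more edges."""
    if remaining == 0:
        return 1 if node == dst else 0
    if node == dst:
        return 0  # a simple path must end the first time it reaches dst
    total = 0
    for nxt in adj[node]:
        if nxt not in seen:
            total += paths_of_length(adj, nxt, dst, remaining - 1, seen | {nxt})
    return total


def pro_friendlink(adj, src, dst, max_len=3):
    """Relation (8): the Friendlink score scaled by D_dst / N."""
    n = len(adj)  # assumes n > max_len
    total_degree = sum(len(nbrs) for nbrs in adj.values())
    base = sum(
        (1.0 / (i - 1))
        * paths_of_length(adj, src, dst, i, {src})
        / prod(n - y for y in range(2, i + 1))
        for i in range(2, max_len + 1)
    )
    return base * len(adj[dst]) / total_degree


# Hypothetical undirected graph with edges 1-2, 1-3, 2-4, 3-4, 3-5, 4-6, 5-6.
graph = {
    1: {2, 3},
    2: {1, 4},
    3: {1, 4, 5},
    4: {2, 3, 6},
    5: {3, 6},
    6: {4, 5},
}
# Rank the users that are not yet direct friends of user 1.
candidates = [u for u in graph if u != 1 and u not in graph[1]]
ranked = sorted(candidates, key=lambda u: pro_friendlink(graph, 1, u), reverse=True)
print(ranked)  # [4, 5, 6]
```

User 4 ranks first both because it is reached by more short paths and because its degree (3 of a total degree of 14) is the highest among the candidates, which is the behaviour the degree factor is meant to encourage.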

An example of the Pro-Friendlink algorithm is presented here. The Pro-Friendlink prediction method calculates the similarity between nodes in a unidirectional graph so that users’ credibility is taken into account. The algorithm receives the connections of graph G as input and, after generating the adjacency matrix, calculates the similarity between two nodes and returns it as output. Consequently, friend suggestions can be based on the weights calculated by the Pro-Friendlink prediction algorithm in the adjacency matrix. Figure 3 shows the Pro-Friendlink prediction algorithm. If we want to suggest a new friend to user U1, there is no direct indication of this in the adjacency matrix shown in Table 1. After running the Pro-Friendlink algorithm, we can obtain the similarity matrix between the nodes of graph G and suggest friends in order of importance.

In the proposed method, we first modify the adjacency matrix A displayed in Table 2 so that, instead of holding values of 0 and 1, entry (i, j) is a list of paths from node i to node j. The basic idea is that if the 0/1 adjacency matrix A of a graph is raised to the power N, then entry (i, j) of the result gives the number of paths of length N from node \({v}_{i}\) to node \({v}_{j}\). Then, instead of only counting the paths, we keep track of all the actual paths.
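The difference between counting paths with matrix powers and listing the actual paths can be sketched as follows. The code is an illustration only: it uses a hypothetical four-node graph, zero-based node indices and plain Python lists. Note that a matrix power counts walks, which may revisit nodes, whereas the path lists below keep only simple paths.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def walk_counts(adj_matrix, length):
    """Entry (i, j) of A**length: number of walks with `length` edges from i to j."""
    result = adj_matrix
    for _ in range(length - 1):
        result = mat_mul(result, adj_matrix)
    return result


def path_lists(adj_matrix, length):
    """Entry (i, j) holds every simple path with `length` edges from i to j."""
    size = len(adj_matrix)
    current = [
        [[[i, j]] if adj_matrix[i][j] else [] for j in range(size)]
        for i in range(size)
    ]
    for _ in range(length - 1):
        extended = [[[] for _ in range(size)] for _ in range(size)]
        for i in range(size):
            for k in range(size):
                for path in current[i][k]:
                    for j in range(size):
                        # Extend by one edge, skipping nodes already on the path.
                        if adj_matrix[k][j] and j not in path:
                            extended[i][j].append(path + [j])
        current = extended
    return current


# Hypothetical undirected graph with edges 0-1, 0-2, 1-2, 1-3, 2-3.
A = [
    [0, 1, 1, 0],
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 1, 1, 0],
]
print(walk_counts(A, 2)[0][3])  # 2
print(path_lists(A, 2)[0][3])   # [[0, 1, 3], [0, 2, 3]]
print(walk_counts(A, 2)[0][0])  # 2
print(path_lists(A, 2)[0][0])   # []
```

The last two lines show why listing real paths matters: the squared matrix reports two length-2 walks from node 0 back to itself (0-1-0 and 0-2-0), but neither is a usable path between two distinct users.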

Phase 4: Combining link system and recommender system

Phase 4 of the proposed method combines the output of the CRS with that of the improved Friendlink algorithm. In Phase 4, which is the final phase of the proposed system, the results of Phases 1 and 2 are combined with those of Phase 3, and the users selected by both are sent as the final proposal. Suppose that the results of Phases 1 and 2 were users 1, 2, 4, 5, 6, 9 and 10, and the results of Phase 3 were users 3, 4, 5, 6, 9 and 10. The two results are combined, and users 4, 5, 6, 9 and 10 are suggested as attractive users to the new user.
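Since the example keeps only the users that appear in both results, the combination can be read as a set intersection. A minimal sketch under that assumption (the function name `combine_phases` is hypothetical):

```python
def combine_phases(crs_users, friendlink_users):
    """Keep the users selected by both branches, in ascending order."""
    return sorted(set(crs_users) & set(friendlink_users))


phase_1_2 = [1, 2, 4, 5, 6, 9, 10]  # output of Phases 1 and 2 in the example
phase_3 = [3, 4, 5, 6, 9, 10]       # output of Phase 3 in the example
final_users = combine_phases(phase_1_2, phase_3)
print(final_users)  # [4, 5, 6, 9, 10]
```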

Results

In this section, the dataset, the evaluation criteria and the obtained results are explained.

Dataset

In this paper, simulations are performed on the MovieLens dataset to investigate the problem and evaluate the results. The data source can be accessed through [22], where the desired version can be selected and downloaded. The version used in this study is the 2013 version, whose size is 1 MB. The files contain 1,000,209 user ratings, 3900 movies and 6040 users, with each user having rated at least 20 movies.
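Reading such a ratings file could look like the sketch below. It assumes the `UserID::MovieID::Rating::Timestamp` line layout used by the MovieLens 1M release, whose record counts match the figures above. The text does not name the file layout, so this is an assumption, and the sample rows are made up for illustration.

```python
from collections import Counter


def parse_ratings(lines):
    """Parse 'UserID::MovieID::Rating::Timestamp' lines into (user, movie, rating)."""
    ratings = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, movie, rating, _timestamp = line.split("::")
        ratings.append((int(user), int(movie), float(rating)))
    return ratings


def ratings_per_user(ratings):
    """Count how many movies each user has rated."""
    return Counter(user for user, _movie, _rating in ratings)


# Made-up rows in the assumed format (not taken from the dataset).
sample = [
    "1::10::4::0",
    "1::20::5::0",
    "2::10::3::0",
]
ratings = parse_ratings(sample)
print(ratings[0])                 # (1, 10, 4.0)
print(ratings_per_user(ratings))  # Counter({1: 2, 2: 1})
```

With the real file, `ratings_per_user` can be used to check the statement that every user has rated at least 20 movies.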

Evaluation metrics

MAE and RMSE are used as evaluation criteria.

$$MAE=\frac{1}{k}\sum _{u,i}\left|{P}_{u,i}-{r}_{u,i}\right|$$
(9)
$$RMSE=\sqrt{\frac{1}{k}\sum _{u,i}{\left({P}_{u,i}-{r}_{u,i}\right)}^{2}}$$
(10)

where \({P}_{u,i}\) is the predicted rating of user u for movie i and \({r}_{u,i}\) is the actual rating of user u for movie i. Some of the scenarios considered are listed in Table 5.
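Both metrics can be computed directly from paired lists of predicted and actual ratings. The sketch below takes k as the number of (user, movie) pairs evaluated, which is an assumption, and uses made-up ratings.

```python
from math import sqrt


def mae(predicted, actual):
    """Mean absolute error over k paired ratings."""
    k = len(predicted)
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / k


def rmse(predicted, actual):
    """Root mean squared error: square each error, average, then take the root."""
    k = len(predicted)
    return sqrt(sum((p - r) ** 2 for p, r in zip(predicted, actual)) / k)


# Made-up ratings for four (user, movie) pairs.
predicted = [4.0, 3.0, 5.0, 2.0]
actual = [5.0, 3.0, 4.0, 4.0]
print(mae(predicted, actual))             # 1.0
print(round(rmse(predicted, actual), 4))  # 1.2247
```

The errors here are 1, 0, 1 and 2, so MAE is 4/4 = 1.0 and RMSE is the square root of 6/4. RMSE is larger because squaring penalises the single error of 2 more heavily.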

Table 5 Scenarios

As can be seen, different results are obtained with different weightings for age, gender, and occupation. The above scenarios are defined with different weights. A feature with a larger weight receives more focus and has a greater effect on the similarity. The following table reports the mean prediction errors of the proposed method using MAE and RMSE and compares them with those of other methods. The step considered for the following results is Step 1 (Table 6).
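How such weights might enter a similarity score can be sketched as below. This is not the paper's hybrid similarity criterion: the per-feature similarity functions, the weight values, the `max_age_gap` parameter, the threshold and the user records are all hypothetical and serve only to show that a larger weight gives a feature more influence.

```python
def demographic_similarity(user_a, user_b, weights, max_age_gap=50):
    """Weighted mean of per-feature similarities for age, gender and occupation."""
    age_sim = max(0.0, 1.0 - abs(user_a["age"] - user_b["age"]) / max_age_gap)
    gender_sim = 1.0 if user_a["gender"] == user_b["gender"] else 0.0
    occupation_sim = 1.0 if user_a["occupation"] == user_b["occupation"] else 0.0
    weighted = (
        weights["age"] * age_sim
        + weights["gender"] * gender_sim
        + weights["occupation"] * occupation_sim
    )
    return weighted / sum(weights.values())


def similar_users(new_user, candidates, weights, threshold):
    """Return the ids of candidates whose similarity reaches the threshold."""
    return [
        c["id"]
        for c in candidates
        if demographic_similarity(new_user, c, weights) >= threshold
    ]


# Hypothetical scenario weights and user records.
weights = {"age": 0.5, "gender": 0.25, "occupation": 0.25}
new_user = {"id": 0, "age": 25, "gender": "F", "occupation": "programmer"}
candidates = [
    {"id": 1, "age": 30, "gender": "F", "occupation": "programmer"},
    {"id": 2, "age": 50, "gender": "M", "occupation": "artist"},
]
print(similar_users(new_user, candidates, weights, threshold=0.7))  # [1]
```

Candidate 1 scores about 0.95 and candidate 2 scores 0.25, so only candidate 1 passes the threshold. Changing the weights changes which neighbours are selected and therefore the prediction errors reported in each scenario.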

Table 6 Comparison of MAE and RMSE of the proposed method with other methods with number of users = 100 (Step 1)

As can be seen, the proposed method with 100 users as input has an MAE of 0.35 and an RMSE of 0.59. Therefore, the proposed method is more accurate than the other methods listed in Table 6. The following table reports the mean prediction errors of the proposed method using MAE and RMSE and compares them with the other methods for Step 2 (Table 7).

Table 7 Comparison of MAE and RMSE of the proposed method with other methods with number of users = 100 (Step 2)

As can be seen, the proposed method with 100 users as input has an MAE of 0.35 and an RMSE of 0.59. Therefore, the proposed method is more accurate than the other methods listed in Table 7. The following table reports the mean prediction errors of the proposed method using MAE and RMSE and compares them with the other methods for Step 3 (Table 8).

Table 8 Comparison of MAE and RMSE of the proposed method with other methods with number of users = 100 (Step 3)

As can be seen, the proposed method with 100 users as input has an MAE of 0.35 and an RMSE of 0.59. Therefore, the proposed method is more accurate than the other methods listed in Table 8. The following table reports the mean prediction errors of the proposed method using MAE and RMSE with 500 users as input and compares them with the other methods for Step 1 (Table 9).

Table 9 Comparison of MAE and RMSE of the proposed method with other methods with number of users = 500 (Step 1)

As can be seen, the proposed method with 500 users as input has an MAE of 0.76 and an RMSE of 1.03. Therefore, the proposed method is more accurate than the other methods listed in Table 9. The following table reports the mean prediction errors of the proposed method using MAE and RMSE and compares them with the other methods for Step 2 (Table 10).

Table 10 Comparison of MAE and RMSE of the proposed method with other methods with number of users = 500 (Step 2)

As can be seen, the proposed method with 500 users as input has an MAE of 0.76 and an RMSE of 1.03. Therefore, the proposed method is more accurate than the other methods listed in Table 10. The following table reports the mean prediction errors of the proposed method using MAE and RMSE and compares them with the other methods for Step 3 (Table 11).

Table 11 Comparison of MAE and RMSE of the proposed method with other methods with number of users = 500 (Step 3)

As can be seen, the proposed method with 500 users as input has an MAE of 0.76 and an RMSE of 1.03. Therefore, the proposed method is more accurate than the other methods listed in Table 11. The following table reports the mean prediction errors of the proposed method using MAE and RMSE with 900 users as input and compares them with the other methods for Step 1 (Table 12).

Table 12 Comparison of MAE and RMSE of the proposed method with other methods with number of users = 900 (Step 1)

As can be seen, the proposed method with 900 users as input has an MAE of 0.73 and an RMSE of 0.95. Therefore, the proposed method is more accurate than the other methods listed in Table 12. The following table reports the mean prediction errors of the proposed method using MAE and RMSE and compares them with the other methods for Step 2 (Table 13).

Table 13 Comparison of MAE and RMSE of the proposed method with other methods with number of users = 900 (Step 2)

As can be seen, the proposed method with 900 users as input has an MAE of 0.73 and an RMSE of 0.95. Therefore, the proposed method is more accurate than the other methods listed in Table 13. The following table reports the mean prediction errors of the proposed method using MAE and RMSE and compares them with the other methods for Step 3 (Table 14).

Table 14 Comparison of MAE and RMSE of the proposed method with other methods with number of users = 900 (Step 3)

As can be seen, the proposed method with 900 users as input has an MAE of 0.73 and an RMSE of 0.95. Therefore, the proposed method is more accurate than the other methods listed in Table 14.

Finally, the table below summarizes the evaluation of the proposed method with 100 users and its comparison with other methods using MAE (Table 15).

Table 15 Evaluation of the proposed method with other methods with 100 users

The following table also summarizes the evaluations of steps 1, 2 and 3 with 500 users (Table 16).

Table 16 Evaluations of steps 1, 2 and 3 with 500 users

Table 17 also summarizes the evaluations of steps 1, 2 and 3 with 900 users.

Table 17 Evaluations of steps 1, 2 and 3 with 900 users

Based on the evaluation of the proposed method and its comparison with methods from research conducted in 2014 and 2015, the diagram below compares the MAE of the proposed method with that of basic methods such as decision tree algorithms, Naïve Bayes and random classification [20], as well as algorithms such as SVD and ApproSVD [26].

We also compare our method with [24]. Chu-Hsing Lin et al. [24] presented a neural network for a movie recommendation system. Table 18 shows the comparison of the proposed method, using a boosting approach, with Scikit-learn and TensorFlow. Scikit-learn is an early machine learning tool with many ready-made libraries and functions to call, and it has powerful and fast mathematical capabilities. TensorFlow is a tool that has emerged in recent years with the development of deep learning.

As can be seen in Table 18, the proposed method performs better than the Scikit-learn and TensorFlow methods, both with and without a neural network. This method also has a lower processing time compared with the other methods.

Table 18 Comparison of the proposed method using boosting approach with Scikit-learn and TensorFlow

Table 19 shows the comparison of the precision and accuracy metrics of the proposed method with those of other classification algorithms.

Table 19 Comparison of precision and accuracy metrics of the proposed method in this study with other classification algorithms

As can be seen in Table 19, the precision of the proposed method is 98.92 % and the accuracy of its recommendations is 93.9 %. The accuracy improvement of the proposed method over other classification methods, namely Decision Tree, Neural Network, SVM, Naïve Bayes, KNN and Random Forest, is 2.9 %, 7.9 %, 5.9 %, 6.8 %, 1.9 % and 11.4 %, respectively. The precision improvement of the proposed method over the same methods is 6.22 %, 5.52 %, 1.22 %, 2.42 %, 2.22 % and 4.82 %, respectively.

Discussion

The main purpose of this paper is to solve the cold start problem in online movie networks and to introduce appropriate movies to new users with acceptable accuracy. To do this, CBRSs and collaborative filtering, as well as clustering techniques and a DNN, were used. We applied clustering techniques, a DNN, hybrid similarity criteria, and the improved Friendlink algorithm, which together are more accurate than the methods used before, to provide new users with appropriate movies. Given the simulation of providing attractive movies to new users who are experiencing a cold start, the proposed method, compared to other methods, recommends desired movies to users in a timely manner. Another major issue is the trust issue that arises from disregarding older users. The proposed method consists of four steps:

  1. Initial clustering of all users and assigning new users to appropriate clusters.

  2. Assigning appropriate weights to the characteristics of the target cluster users and determining the adjacent users.

  3. Forming the adjacency matrix of adjacent users’ ratings of the existing movies and calculating the new user’s ratings from the adjacent users’ ratings and the similarity level between the users. At this stage, a doubled rating opportunity is created for loyal users.

  4. Using the Friendlink algorithm to introduce similar users to the new user, and combining this result with that of Step 3.

Finally, the MAE and RMSE evaluation criteria were used to compare the prediction error of the proposed method with that of similar methods such as C24.5, CM4.5, RCA and Naïve Bayes. The error rate of the proposed method is lower than that of the other methods, and the method shows acceptable accuracy in introducing movies to new users.

Conclusions

In our future work, we would like to focus on several areas. Here are some recommendations for further research: (1) In the present paper, the k-means clustering technique is used to cluster users, with the value of k obtained by trial and error. Techniques such as a random-walk clustering algorithm or an improved clustering algorithm could be used instead, and the results compared with the current and other clustering methods. (2) In this paper, a DNN is used to assign new users to the cluster that is most suitable for them. Future work could substitute other techniques and compare the results with the proposed method. The method could also be implemented with fog computing in distributed models for better operation. The most important limitations of this research are as follows:

  • Execution and processing time of the proposed method is longer than some methods.

  • The proposed method cannot be implemented on every online site and system.

  • The methods used need improvements on big data.

Availability of data and materials

The datasets generated and/or analysed during the current study are not publicly available due [REASON WHY DATA ARE NOT PUBLIC] but are available from the corresponding author on reasonable request.

References

  1. Eirinaki M, Gao J, Varlamis I, Tserpes K. Recommender systems for large-scale social networks: a review of challenges and solutions.

  2. Najafabadi MK, Mohamed AH, Mahrin MN. A survey on data mining techniques in recommender systems. Soft Comput. 2019;23(2):627–54.

  3. Silveira T, Zhang M, Lin X, Liu Y, Ma S. How good your recommender system is? A survey on evaluations in recommendation. Int J Mach Learn Cybern. 2019;10(5):813–31.

  4. Tatiana K, Mikhail M. Market basket analysis of heterogeneous data sources for recommendation system improvement. Procedia Comput Sci. 2018;136:246–54.

  5. Hoang L. HU-FCF++: a novel hybrid method for the new user cold-start problem in recommender systems. Eng Appl Artif Intell. 2015;41:207–22.

  6. Pera MS, Ng YK. A group recommender for movies based on content similarity and popularity. Inf Process Manag. 2013;49(3):673–87.

  7. Christensen IA, Schiaffino S. Entertainment recommender systems for group of users. Expert Syst Appl. 2011;38(11):14127–35.

  8. Camacho LA, Alves-Souza SN. Social network data to alleviate cold-start in recommender system: a systematic review. Inf Process Manag. 2018;54(4):529–44.

  9. Van Meteren R, Van Someren M. Using content-based filtering for recommendation. In: Proceedings of the machine learning in the new information age: MLnet/ECML2000 Workshop, vol. 30. 2000. p. 47–56.

  10. Basu C, Hirsh H, Cohen W. Recommendation as classification: using social and content-based information in recommendation. In: AAAI/IAAI. 1998. p. 714–20.

  11. Chen J, Wang H, Yan Z. Evolutionary heterogeneous clustering for rating prediction based on user collaborative filtering. Swarm Evol Comput. 2018;1:35–41.

  12. Benkhelifa R, Bouhyaoui N, Laallam FZ. A demographic-based approach for improved content categorization in social networking. In: 2018 2nd international conference on natural language and speech processing (ICNLSP). New York: IEEE; 2018. p. 1–5.

  13. Nirenburg S, Carbonell J, Tomita M, Goodman K. Machine translation: a knowledge-based approach. Burlington: Morgan Kaufmann Publishers Inc.; 1994.

  14. Ha T, Lee S. Item-network-based collaborative filtering: a personalized recommendation method based on a user’s item network. Inf Process Manag. 2017;53(5):1171–84.

  15. Sharma L, Gera A. A survey of recommendation system: research challenges. Int J Eng Trends Technol. 2013;4(5):1989–92.

  16. Ghazanfar MA, Prugel-Bennett A. A scalable, accurate hybrid recommender system. In: 2010 third international conference on knowledge discovery and data mining. New York: IEEE; 2010. p. 94–8.

  17. Kim HN, El-Saddik A, Jo GS. Collaborative error-reflected models for cold-start recommender systems. Decis Support Syst. 2011;51(3):519–31.

  18. Bobadilla J, Ortega F, Hernando A, Bernal J. A collaborative filtering approach to mitigate the new user cold start problem. Knowl Based Syst. 2012;26:225–38.

  19. Byström H. Movie recommendations from user ratings.

  20. Lika B, Kolomvatsos K, Hadjiefthymiades S. Facing the cold start problem in recommender systems. Expert Syst Appl. 2014;41(4):2065–73.

  21. Pereira AL, Hruschka ER. Simultaneous co-clustering and learning to address the cold start problem in recommender systems. Knowl Based Syst. 2015;1:11–9.

  22. MovieLens GroupLens. 2015. http://grouplens.org/datasets/movielens/2013.

  23. Kutty S, Chen L, Nayak R. A people-to-people recommendation system using tensor space models. In: Proceedings of the 27th annual ACM symposium on applied computing. 2012. p. 187–92.

  24. Lin CH, Chi H. A novel movie recommendation system based on collaborative filtering and neural networks. In: International conference on advanced information networking and applications. Cham: Springer; 2019. p. 895–903.

  25. Walek B, Fojtik V. A hybrid recommender system for recommending relevant movies using an expert system. Expert Syst Appl. 2020;13:113452.

  26. Zhou X, He J, Huang G, Zhang Y. SVD-based incremental approaches for recommender systems. J Comput Syst Sci. 2015;81(4):717–33.

Acknowledgements

Not applicable.

Funding

Not applicable.

Author information

Contributions

All authors contributed to developing the ideas and to writing and reviewing this manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Akbar Etebarian.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Vahidi Farashah, M., Etebarian, A., Azmi, R. et al. A hybrid recommender system based-on link prediction for movie baskets analysis. J Big Data 8, 32 (2021). https://doi.org/10.1186/s40537-021-00422-0

Keywords

  • Hybrid recommender system
  • Cold start problem
  • Link prediction
  • Hybrid similarity criterion
  • Deep neural network