In this study, a similarity algorithm called UPCSim is proposed. This UPCSim algorithm is embedded into an MBCF system, which was developed by Wu et al. [31]. Figure 1 details the proposed MBCF system.
We divide the system into four blocks: input, data preparation, the MBCF process, and output. The input block is the input dataset used in the MBCF system. The data preparation block consists of the data pre-processing stage, which results in a clean dataset. The data source includes the rating data and behavior data. While the behavior data used in Wu et al. [31] only employs the genre data (the green component in the data preparation block), our research also accommodates the user profile data (the red component in the data preparation block). The MBCF process block is the development of the MBCF method using the similarity weighting. The similarity weighting carried out by Wu et al. uses a threshold value ranging from 0 to 1 (the green component in the MBCF process). Our research's similarity weighting uses the correlation coefficients between the user profile data and the user rating or behavior values (the red component in the MBCF process). Finally, the output block evaluates the UPCSim algorithm in the MBCF method.
The proposed UPCSim algorithm is one component of the similarity calculation and is explained in the “Similarity calculation” subsection. The developed MBCF system is described in “The developed MBCF system” subsection.
Similarity calculation
In this study, we divide the similarity calculation between users into three components. The first is the \(S_r\) similarity calculation component (shown in the dashed blue box). The second is the \(S_b\) similarity calculation component (shown in the dashed green box). Finally, the UPCSim component (shown in the dashed red box) gives weights to both similarities. Figure 2 illustrates the three components in our similarity calculation.
Each component in Fig. 2 is described as follows.
\(S_r\) similarity
The \(S_r\) similarity component is the similarity calculation based on the user rating value. As an example of the similarity calculation, we used the MovieLens 100K dataset. The first stage in calculating the \(S_r\) similarity is to read the rating data from the pre-processed dataset. From the rating data, we obtain a user rating value matrix of order \(943 \times 1682\), where 943 is the number of users and 1682 is the number of movies in the dataset, as follows.
$$R= \left[ \begin{array}{cccccc} R_{11}& R_{12}& R_{13}& R_{14}& \cdots & R_{1\_1682}\\ R_{21}& R_{22}& R_{23}& R_{24}& \cdots & R_{2\_1682}\\ \cdots & \cdots & \cdots & \cdots & \cdots & \cdots \\ R_{943\_1}& R_{943\_2}& R_{943\_3}& R_{943\_4}& \cdots & R_{943\_1682} \end{array} \right]$$
\({R_{943\_1682}}\) is the rating value given by the \(943\text{rd}\) user to the \(1682\text{nd}\) item. The values of \(R_{11}\) to \(R_{943\_1682}\) range from 0 to 5, where a value of 0 indicates that the user has not rated the item.
After forming the user rating value matrix, the next step is to calculate the \(S_r\) similarity using the Cosine similarity formula given in (1). The result of the \(S_r\) similarity calculation is the \(S_r\) similarity matrix of order \(943 \times 943\), as follows.
$$S_r= \left[ \begin{array}{ccccc} S_{11}&S_{12}&S_{13}& \cdots &S_{1\_943}\\ S_{21}&S_{22}&S_{23}& \cdots &S_{2\_943}\\ \cdots & \cdots & \cdots & \cdots & \cdots \\ S_{943\_1}&S_{943\_2}&S_{943\_3}& \cdots &S_{943\_943} \end{array} \right]$$
\(S_{1\_943}\) is the similarity value based on the rating between the \(1\text{st}\) user and the \(943\text{rd}\) user.
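As a minimal sketch of the \(S_r\) computation described above, the following Python snippet applies the standard cosine similarity to the rows of a toy rating matrix. It assumes that (1) is the usual cosine formula applied to full rating rows, with unrated items kept as 0. The function names and the small 3-user, 4-item matrix are illustrative and are not taken from the original implementation.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A user with no ratings has no defined direction; return 0 by convention.
        return 0.0
    return dot / (norm_a * norm_b)

def similarity_matrix(rows):
    """Build the square user-by-user similarity matrix from per-user rows."""
    n = len(rows)
    return [[cosine_similarity(rows[u], rows[v]) for v in range(n)]
            for u in range(n)]

# Toy rating matrix (hypothetical values): 3 users x 4 items, 0 = unrated.
ratings = [
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [0, 0, 5, 4],
]
s_r = similarity_matrix(ratings)
```

For the full dataset, the same procedure would take 943 rows of length 1682 and return a \(943 \times 943\) matrix that is symmetric with ones on the diagonal.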
\(S_b\) similarity
The \(S_b\) similarity component is the similarity calculation based on the user behavior value. This behavior value is obtained by finding the relationship between rating data and item data. In the MovieLens 100K dataset, the rating data describes the user rating value of each movie. Meanwhile, the item data represents the movie title data containing the genre information of each movie. Each movie title can include several genres. For example, the movie “Toy Story” has animation, children, and comedy genres.
After formulating the user behavior value, the next step is to remove unused attributes from the relationship between the rating data and the item data, and then to aggregate the data using the sum function grouped by user. The aggregation results form the user behavior value matrix of order \(943\times 19\), where 943 is the number of users and 19 is the number of genres, as follows.
$$B= \left[ \begin{array}{ccccc} B_{11}& B_{12}& B_{13}& \cdots & B_{1\_19}\\ B_{21}& B_{22}& B_{23}& \cdots & B_{2\_19}\\ \cdots & \cdots & \cdots & \cdots & \cdots \\ B_{943\_1}& B_{943\_2}& B_{943\_3}& \cdots & B_{943\_19} \end{array} \right]$$
\(B_{943\_19}\) is the behavior value of the \(943\text{rd}\) user for the \(19\text{th}\) genre, representing the total number of movies of the \(19\text{th}\) genre watched by the \(943\text{rd}\) user. After forming the user behavior value matrix, the next stage is to calculate the probability of genre occurrence from the user behavior value matrix to produce a probability matrix of user behavior value using (5).
B(g) is the user behavior value for the target genre g, and N is the total number of users who rated the target genre g. The probability matrix of user behavior value is as follows.
$$P= \left[ \begin{array}{ccccc} P_{11}& P_{12}& P_{13}& \cdots & P_{1\_19}\\ P_{21}& P_{22}& P_{23}& \cdots & P_{2\_19}\\ \cdots & \cdots & \cdots & \cdots & \cdots \\ P_{943\_1}& P_{943\_2}& P_{943\_3}& \cdots & P_{943\_19} \end{array} \right]$$
\(P_{943\_19}\) is the probability value of the \(943\text{rd}\) user behavior for the \(19\text{th}\) genre. The probability matrix of user behavior value is used to calculate the \(S_b\) similarity according to (4). The result of the \(S_b\) similarity calculation is a matrix of order \(943\times 943\), as follows.
$$S_b= \left[ \begin{array}{ccccc} S_{11}& S_{12}& S_{13}& \cdots & S_{1\_943}\\ S_{21}& S_{22}& S_{23}& \cdots & S_{2\_943}\\ \cdots & \cdots & \cdots & \cdots & \cdots \\ S_{943\_1}& S_{943\_2}& S_{943\_3}& \cdots & S_{943\_943} \end{array} \right]$$
\(S_{1\_943}\) is the similarity based on the user behavior value between the \(1\text{st}\) user and the \(943\text{rd}\) user.
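The aggregation step that produces the user behavior value matrix B can be sketched as follows. This is a hedged illustration only: it assumes that the "sum grouped by user" adds up the 0/1 genre flags of every movie a user rated, so that each entry counts rated movies per genre. The four-genre subset, the movies other than “Toy Story”, and all names are hypothetical. The later probability step (5) and similarity step (4) are not reproduced here.

```python
GENRES = ["Animation", "Children", "Comedy", "Drama"]  # toy subset of the 19 genres

# movie-id -> 0/1 genre flags, in the order of GENRES.
item_genres = {
    1: [1, 1, 1, 0],  # "Toy Story": animation, children, comedy (as in the text)
    2: [0, 0, 1, 1],  # hypothetical movie
    3: [0, 0, 0, 1],  # hypothetical movie
}

# Rating records as (user-id, movie-id, rating); values are hypothetical.
ratings = [(1, 1, 5), (1, 2, 3), (2, 2, 4), (2, 3, 2), (2, 1, 1)]

def behavior_matrix(ratings, item_genres, n_genres):
    """Join ratings with genre flags and sum the flags per user."""
    totals = {}
    for user_id, movie_id, _rating in ratings:
        row = totals.setdefault(user_id, [0] * n_genres)
        for g, flag in enumerate(item_genres[movie_id]):
            row[g] += flag
    return totals

b = behavior_matrix(ratings, item_genres, len(GENRES))
```

Each row of `b` corresponds to one row of B. With the full dataset there would be 943 such rows of length 19.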
UPCSim
The UPCSim component applies the UPCSim algorithm, which calculates the weights of both similarities (\(S_r\) and \(S_b\)) based on the user profile attributes, namely age, gender, occupation, and location, as provided in the MovieLens 100K dataset. The weights of these two similarities are calculated from the correlation coefficient (R) obtained by multiple linear regression.
The general formulas for multiple linear regression and the correlation coefficient (R) are defined in (6) and (7), respectively.
$$Y=a+b_1X_1+b_2X_2+b_3X_3+ \cdots +b_nX_n$$
(6)
$$R=\sqrt{\frac{b_1\sum X_1Y+b_2 \sum X_2Y+b_3\sum X_3Y+ \cdots +b_n\sum X_nY}{\sum Y^2}}$$
(7)
Y is the dependent variable, \(X_1, \ldots, X_n\) are the independent variables, a is a constant, and \(b_1, \ldots, b_n\) are the regression coefficients of the independent variables. In our study, Y represents the user rating value or the user behavior value, and X denotes the user profile data with four independent variables (namely age, gender, occupation, and location).
The weight of the \(S_r\) similarity, denoted by \(\alpha\), is obtained by calculating the correlation coefficient between the user profile data (age, gender, occupation, and location) and the user rating value. The weight of the \(S_b\) similarity, denoted by \(\beta\), is obtained by calculating the correlation coefficient between the user profile data (age, gender, occupation, and location) and the user behavior value.
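The following Python sketch shows one way the correlation coefficient in (7) could be computed from a least-squares fit of (6). It rests on several assumptions that the text does not state: the sums in (7) are taken over mean-centered values (the usual textbook convention), categorical profile attributes have already been encoded as numbers, and Y is a single per-user value. All function names and data values are hypothetical.

```python
import math

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.
    Raises ZeroDivisionError if the system is singular."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def multiple_correlation(xs, y):
    """Multiple correlation coefficient R of y regressed on the columns of xs."""
    n, p = len(y), len(xs[0])
    y_mean = sum(y) / n
    x_means = [sum(row[j] for row in xs) / n for j in range(p)]
    yc = [v - y_mean for v in y]
    xc = [[row[j] - x_means[j] for j in range(p)] for row in xs]
    # Centered cross-product sums: S_xx (p x p) and S_xy (length p).
    sxx = [[sum(r[j] * r[k] for r in xc) for k in range(p)] for j in range(p)]
    sxy = [sum(r[j] * v for r, v in zip(xc, yc)) for j in range(p)]
    coeffs = solve_linear_system(sxx, sxy)  # b_1 ... b_n of (6)
    syy = sum(v * v for v in yc)
    r_squared = sum(bj * s for bj, s in zip(coeffs, sxy)) / syy
    # Clamp to [0, 1] to guard against floating-point rounding.
    return math.sqrt(max(0.0, min(1.0, r_squared)))

# Hypothetical profiles (age, encoded gender) and per-user rating values.
profiles = [[24, 0], [53, 1], [23, 0], [33, 1], [42, 0]]
rating_values = [3.6, 3.7, 2.8, 4.3, 2.9]
alpha = multiple_correlation(profiles, rating_values)
```

Under these assumptions, \(\beta\) would be obtained in the same way by passing the user behavior value as `y`. With a single predictor, the result reduces to the absolute value of the Pearson correlation.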
After weighting both similarities, the next stage is to calculate the final similarity matrix by combining the weighted \(S_r\) and \(S_b\) similarities. The final similarity matrix S of order \(943\times 943\) between user u and user v is defined in (8).
$$S(u,v)=\alpha S_r(u,v)+\beta S_b(u,v)$$
(8)
S(u, v) is the final similarity between user u and user v. \(S_r (u,v)\) is the similarity based on user rating value between user u and user v. \(S_b (u,v)\) is the similarity based on user behavior value between user u and user v. \(\alpha\) is the weight of the similarity \(S_r\), and \(\beta\) is the weight of the similarity \(S_b\).
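A short sketch of the weighted combination in (8) is given below. The weights and the two \(2 \times 2\) similarity matrices are made-up values chosen only to make the arithmetic easy to follow, and the function name is illustrative.

```python
def combine_similarities(s_r, s_b, alpha, beta):
    """Element-wise S(u, v) = alpha * S_r(u, v) + beta * S_b(u, v)."""
    return [
        [alpha * r + beta * b for r, b in zip(row_r, row_b)]
        for row_r, row_b in zip(s_r, s_b)
    ]

# Hypothetical inputs: two users, rating-based and behavior-based similarities.
s_r = [[1.0, 0.2], [0.2, 1.0]]
s_b = [[1.0, 0.8], [0.8, 1.0]]
s = combine_similarities(s_r, s_b, alpha=0.3, beta=0.6)
```

With these example values, the off-diagonal entry is \(0.3 \times 0.2 + 0.6 \times 0.8 = 0.54\). Because \(\alpha\) and \(\beta\) are not required to sum to 1, the diagonal of S is generally not 1.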
The developed MBCF system
Based on the illustration shown in Fig. 1, this subsection describes each block of the developed MBCF system.
Input
The first block of the developed MBCF system is the input dataset. In this paper, we used the MovieLens dataset collected by the “GroupLens Study Group of the University of Minnesota” [35]. The dataset consists of several versions, including ml-100K, ml-1M, ml-10M, ml-20M, etc. In this experiment, we chose the dataset used in a previous study [31], namely ml-100K (MovieLens 100K). This ml-100K dataset contains several data files. Our study used 3 data files: rating data, item data, and user data.
The rating data consists of 100,000 ratings given by 943 users to 1682 movies. Each user has rated at least 20 movies. The rating values given by the users range from 1 to 5. A score of 1 means that the user strongly dislikes the movie, while a score of 5 means that the user strongly likes the movie. This rating data has a sparsity of 93.7% and a density of 6.3%. The rating data structure consists of user-id, movie-id, rating, and timestamp.
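The density and sparsity figures quoted above follow directly from the dataset dimensions, taking density as the fraction of user-item pairs that have a rating:

$$\text {density}=\frac{100{,}000}{943\times 1682}=\frac{100{,}000}{1{,}586{,}126}\approx 0.063, \qquad \text {sparsity}=1-\text {density}\approx 0.937$$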
Item data contains information about items (movies). This item data structure comprises 24 attributes: movie-id, movie title, release date, video release date, IMDb URL, and 19 attributes of movie type (genre). Each item/movie can have several genres.
User data contains information about the user profile. This user data structure comprises five attributes: user-id, age, gender, occupation, and zip code (which describes the user’s location).
Data preparation
The second block is data preparation, which pre-processes the data by removing irrelevant attributes. These irrelevant attributes are the timestamp of the rating data and the movie title, release date, video release date, and IMDb URL of the item data.
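The attribute removal described above can be sketched as a pair of line parsers. This sketch assumes the file layout documented for ml-100K, namely tab-separated rating records and pipe-separated item records with the 19 genre flags in the last 19 fields. The function names, the sample lines, and the placeholder URL are illustrative.

```python
def parse_rating_line(line):
    """Keep user-id, movie-id, and rating; drop the timestamp."""
    user_id, movie_id, rating, _timestamp = line.rstrip("\n").split("\t")
    return int(user_id), int(movie_id), int(rating)

def parse_item_line(line):
    """Keep movie-id and the 19 genre flags; drop title, dates, and URL."""
    fields = line.rstrip("\n").split("|")
    movie_id = int(fields[0])
    genre_flags = [int(flag) for flag in fields[5:24]]
    return movie_id, genre_flags

# Illustrative sample records (flag positions are for demonstration only).
flags = ["0"] * 19
for position in (3, 4, 5):
    flags[position] = "1"
item_line = "|".join(
    ["1", "Toy Story (1995)", "01-Jan-1995", "", "http://example.org/toy-story"] + flags
)
rating_line = "196\t242\t3\t881250949\n"

rating_record = parse_rating_line(rating_line)
movie_id, genre_flags = parse_item_line(item_line)
```

After this step, the rating records carry only user-id, movie-id, and rating, and each item record carries only its movie-id and genre flags.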
MBCF process
The third block of the MBCF system is the MBCF process. The MBCF process contains two sub-blocks, namely the similarity calculation and the prediction.
The similarity calculation is the initial process used in the information filtering process using the MBCF approach. In this study, the similarity calculation consists of three components: \(S_r\) similarity calculation, \(S_b\) similarity calculation, and the UPCSim algorithm. Subsection “Similarity calculation” already details these three components.
The prediction step provides a predicted rating for items that have not been rated by the active user. Its initial stage is to determine the number (k) of the active user's nearest neighbors. k is an integer representing the number of neighbors, ranging from 10 to 100 [2, 29, 30, 31]. After determining the k value, the next stage is to predict the ratings of unrated items.
The formula to predict the rating for an item (i) unrated by an active user (u) is shown in (9) [2, 16].
$$p_{ui}= {{\bar{r}}}_u+\frac{\sum _{v\in NNu} S(u,v)\cdot (r_{vi}-{{\bar{r}}}_v)}{\sum _{v\in NNu}|S(u,v)|}, \quad v \ne u$$
(9)
\(p_{ui}\) represents the predicted rating value of user u to item i. \({{\bar{r}}}_u\) and \({{\bar{r}}}_v\) are the average ratings of user u and user v, respectively. \(r_{vi}\) is the rating value given by user v to item i, S(u, v) is the final similarity between user u and user v, and NNu is the set of nearest neighbors to user u.
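The prediction formula (9) can be illustrated with the following minimal sketch. It assumes that the nearest neighbors are the k most similar users among those who have rated item i, and that each user's average is taken over the items that user has rated. Neither detail is specified in the text. The data structures, names, and values are hypothetical.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)

def predict_rating(u, i, ratings, sim, k):
    """Mean-centered weighted average of the neighbors' ratings, as in (9)."""
    # Candidate neighbors: other users who have rated item i (assumption).
    candidates = [v for v in ratings if v != u and i in ratings[v]]
    neighbors = sorted(candidates, key=lambda v: sim[u][v], reverse=True)[:k]
    r_u = mean(ratings[u].values())
    denominator = sum(abs(sim[u][v]) for v in neighbors)
    if denominator == 0:
        # No usable neighbors: fall back to the user's own average.
        return r_u
    numerator = sum(
        sim[u][v] * (ratings[v][i] - mean(ratings[v].values()))
        for v in neighbors
    )
    return r_u + numerator / denominator

# Hypothetical data: user -> {item: rating}, and final similarities S(u, v).
ratings = {
    "u": {"a": 4, "b": 2},
    "v": {"a": 5, "b": 3, "c": 4},
    "w": {"a": 2, "c": 1},
}
sim = {"u": {"v": 0.9, "w": 0.3}}
p_uc = predict_rating("u", "c", ratings, sim, k=2)
```

In this example, user u has an average rating of 3, neighbor v contributes a deviation of 0, and neighbor w contributes a deviation of -0.5, so the prediction is \(3 + (0.3 \times -0.5)/1.2 = 2.875\).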
Output
The fourth block of the MBCF system is the output block. This block evaluates the UPCSim algorithm's performance in predicting ratings for items not yet rated by an active user.
The most prevalent MBCF system measures include Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), precision, and recall. Jalili et al. [36] classified the metrics for evaluating recommendation systems into two categories: prediction and classification metrics. The MAE and RMSE primarily evaluate prediction [37, 38], whereas precision and recall evaluate classification, for example evaluating top-N recommendations [3].
In this study, we adopt the MAE and RMSE to measure the UPCSim algorithm's prediction accuracy. The MAE is the most widely used metric in recommendation systems using a collaborative filtering approach. It estimates the average absolute deviation between the actual and the predicted rating values. A lower MAE indicates better recommendation quality [39]. The formula for calculating MAE is defined in (10).
$$MAE=\frac{1}{TN} \sum _{u\in U, i\in I} |p_{ui}-r_{ui}|.$$
(10)
RMSE reflects the degree of deviation between the predicted rating and the actual rating. A lower RMSE indicates more accurate prediction [40]. The RMSE formula is expressed in (11).
$$RMSE=\sqrt{\frac{1}{TN}\sum _{u\in U,i \in I}(p_{ui}-r_{ui})^2}$$
(11)
TN is the total number of predicted items. \(p_{ui}\) and \(r_{ui}\) represent the predicted rating and actual rating of the user u to item i, respectively.
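Both metrics in (10) and (11) can be computed in a few lines, as in the sketch below. The function names and the four predicted and actual ratings are hypothetical and serve only to show the calculation.

```python
import math

def mae(predicted, actual):
    """Mean Absolute Error over TN prediction/actual pairs, as in (10)."""
    return sum(abs(p - r) for p, r in zip(predicted, actual)) / len(predicted)

def rmse(predicted, actual):
    """Root Mean Squared Error over TN prediction/actual pairs, as in (11)."""
    squared = sum((p - r) ** 2 for p, r in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))

# Hypothetical predicted and actual ratings for TN = 4 items.
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4, 4, 1, 3]

mae_value = mae(predicted, actual)    # (0.5 + 0 + 1 + 2) / 4 = 0.875
rmse_value = rmse(predicted, actual)  # sqrt((0.25 + 0 + 1 + 4) / 4)
```

Because RMSE squares each error before averaging, the single error of 2 dominates it, which is why RMSE is larger than MAE on the same data.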