An analytics model for TelecoVAS customers’ basket clustering using ensemble learning approach
Journal of Big Data volume 8, Article number: 36 (2021)
Abstract
Value-Added Services (VAS) at a mobile telecommunication company provide customers with a variety of services and generate significant revenue annually. Offering customers relevant and engaging services has become a major challenge in this field. Numerous methods have been proposed to analyze the customer basket and recommend related services. Although these methods have many applications, they still struggle to improve the accuracy of the services offered. This paper combines the XMeans algorithm, an ensemble learning system, and the NList structure to analyze the customer portfolio of a mobile telecommunication company and provide value-added services. The XMeans algorithm is used to determine the optimal number of clusters and to cluster the customers. The ensemble learning algorithm is then used to assign categories to new customers, and finally the NList structure is used for customer basket analysis. In simulations comparing the proposed method with other methods, including KNN, SVM, and deep neural networks, accuracy improved by about 7%.
Introduction
Value-Added Services are an important capability of mobile telecommunication companies that lets customers receive services in exchange for payment. These services can be very useful in analyzing customer behavior [1, 2]. Customer behavior [3] in many online systems consists of the activities that a customer performs on a regular basis. These operations and transactions may be repeated within a few days [4]. Customer basket analysis is one of the most widely used data mining methods for analyzing the goods in one or more baskets that a customer purchases at a particular moment [5]. A basket analysis program can be designed and run in a supermarket not only because it helps with the design of sales promotions but also because it can serve as a reference for re-managing items in stock [6].
In recent years, customer-generated transactions have commonly been used as information for analysis. This article also examines customer transactions to gain valuable information, for example which item is top-selling. Such information can be used to restock that item. Transactions and customer behavior can also be used to evaluate every item present in the customer basket and to present the right product to attract customers. One of the most important uses of these transactions is data analysis, transaction analysis and customer basket analysis [7].
Customer basket analysis is one of the modes of analysis based on customer behavior. Shopping at the supermarket happens through the customer identifying and directly linking different items [2]. In analyzing customer baskets and identifying the items that customers often purchase, there are challenges today that can be attributed to not recognizing customer behavior, the groups of products that are repeatedly purchased, and the alignment of sales and products. Using the customer basket analysis approach, we can identify items that are often purchased by customers at the same time and provide an opportunity to enhance the performance of the value-added service system of telecommunication companies.
There are still challenges and difficulties in providing services in value-added telecommunication systems, such as inadequate accuracy and a high error rate when offering related services to customers. Until now, various methods have been proposed for analyzing customers' portfolios, such as customer basket analysis based on transaction records [8], customer basket analysis by process category [9], portfolio analysis and customer acquisition with the Apriori algorithm [10], customer basket analysis using a combination of artificial intelligence techniques, association rules and a minimal spanning tree [11], customer basket analysis with an advanced business strategy forecasting system [12], and an efficient improvement of customer basket analysis called utility mining [13].
Most of the approaches presented have challenges and problems such as inadequate consideration of metrics and factors related to customer behavior, inadequate quality of the services provided to specific and related customers, inaccuracies in the analysis of large-scale data, and so on [11]. So, some of the most important motivations of this paper are as follows:

Presenting a model for improving the analysis of Customer baskets and achieving the best use of the combination of algorithms in this case.

Identifying the components and dimensions of the TelecoVAS Customer Basket model in Mobile Communication Company

Extracting features of VAS Customer Basket

Mobile communication companies and digital stores can use the obtained results to choose the appropriate solution to increase revenue, improve sales of services and products, and optimize their advertising and marketing.

Universities and other scientific and research centers can use the results of this research to conduct more specialized research in this field and to develop scientific theories.

Researchers and students, as the future makers of society, should always have enough information about the analysis of the basket of VAS subscribers in the mobile telecommunication company. Such research will therefore help them to do other research.
In this paper, we used a technique based on the NList algorithm [14] to analyze the customer basket, and increased the accuracy of customer basket analysis using the proposed ensemble learning system. The proposed NList algorithm ensures that comprehensiveness is maintained and that the service execution speed is increased. The proposed ensemble learning system combines three machine learning algorithms: deep neural networks, the C4.5 decision tree and the SVMLib algorithm (the library core of the support vector machine algorithm) [15]. The proposed ensemble learning system is based on maximum votes and sends the best response to the output at each step.
The main contributions of this research can be summarized as follows:

Combining KMeans and XMeans clustering algorithms and generating an efficient clustering algorithm to determine the initial grouping of data

Combining ensemble learning method and NList algorithm to analyze the TelecoVAS Customer Basket in Mobile Communication Company

Combining machine learning methods in the ensemble learning system and improving the results obtained with the help of NList algorithm
Therefore, by applying the NList-based mining technique in the model proposed in this paper, more precise rules can be extracted by analyzing the TelecoVAS Customer Basket in the mobile telecommunication company. Providing an efficient NList-based, process-based model that can analyze customers' baskets is thus one of the most important innovations of this research.
The remainder of this paper is organized as follows: the “Related work” section reviews the work done in the past, and “The proposed method” section describes the proposed approach and architecture. In the “Results” and “Discussion and future suggestions” sections, the results are presented and the final conclusions are discussed.
Related work
Some research on customer basket analysis has been carried out. This section refers to some papers that discussed customer basket analysis and outlines their models and techniques.
In 2020, Seyedan et al. presented a classification of demand forecasting algorithms and their applications in supply chain management into time-series forecasting, clustering, K-nearest neighbors, neural networks, regression analysis, support vector machines and support vector regression [16]. This paper is a meta-research study of demand forecasting in supply chains. Demand data features are expanding and dispersing today, and global supply chains use big data analysis and machine learning.
In 2020, Yudhistyra et al. proposed a method for implementing big data analysis that combines the CRISP-DM framework and key steps for customer analysis. This paper aimed to discover meaningful patterns and ensure high quality of knowledge discovery from the big data available in a company [17].
In 2019, Jiang et al. proposed a new methodology for dynamic modeling of customer preferences on products based on their online reviews, which mainly focused on mining opinions from online reviews and using customer preferences to develop a dynamic model with the DENFIS approach. Unlike the conventional DENFIS approach, which only provides crisp outputs in its modeling, the proposed DENFIS approach is capable of providing fuzzy outputs as well as crisp ones. By predicting fuzzy outputs, companies can prepare for the worst-case and best-case scenarios of customer preferences while designing their new products and services [18].
In 2018, Musalem et al. presented a customer basket analysis model based on process categories. Their work was based on the similarity and distance between the existing samples, and one of its most important benefits was the speed of basket analysis. Major disadvantages of this model are insufficient accuracy for online portfolio analysis, a lack of comprehensiveness, and poor performance on large data sets. The methodology works at the supermarket level and performs poorly on a larger statistical population [9].
In 2018, Szymkowiak and his colleagues proposed an Apriori algorithm for customer basket analysis. The Apriori association algorithm is severely constrained on large statistical populations. In their research, they applied it to the data and items of a supermarket and achieved the desired accuracy. The most important advantages of this model are good basket analysis speed and medium accuracy, and its major disadvantages are a lack of comprehensiveness and of flexibility [10].
In 2018, Jain et al. presented a customer basket analysis model with the help of a business strategy forecasting system. They analyzed the customer basket based on business logic and business statistics, and used statistical methods to tailor the service more closely to the person concerned. One of the most important advantages of the method was the accuracy of the service provided to the customers. The implementation time of the method was moderate, but the method did not scale to large datasets [12].
In 2018, Srivastava et al. optimized customer basket analysis using an efficient approach called utility mining, which they proposed as an improved model of data mining. With this technique they were able to perform customer basket analysis quickly and accurately, but the approach did not have the potential for further development [13].
In 2017, Kurniawan et al. presented a customer basket analysis model based on transaction records. In their research they used association and data mining techniques such as neural networks and Apriori. One of the most important benefits of their work was the speed of basket analysis. However, the major disadvantages of this model were insufficient precision for online portfolio analysis, a lack of comprehensiveness, and poor performance on large data sets [8].
In 2016, Kaur et al. proposed a customer basket analysis model using a combination of data mining methods and association rules. In their methodology, they used data mining to improve the accuracy of customer basket analysis, applying techniques such as neural networks and other machine learning methods trained on purchase information and customer transactions. One of the most important advantages of their method was sufficient accuracy in analyzing the customer basket. However, the proposed analysis was very slow and the model was complex. It does not support a large statistical population, operates only at the supermarket level, and does not have the potential for future development [5].
In 2016, Venkatachari and his colleagues used a combination of association approaches such as Apriori and FP-Growth to analyze customer baskets. Their proposed strategy was based on sharing repeated transactions. One of the advantages of their approach was the improvement of accuracy and consistency of customer basket analysis. However, major disadvantages of their method are the increased runtime and the lack of potential for development in a larger statistical population [19].
In 2015, Sherly et al. used parallel and distributed techniques and association rules to analyze the customer basket. In their research, they sought to increase the speed, completeness, and accuracy of customer basket analysis. They enhanced the accuracy to some extent and significantly improved the speed and comprehensiveness through parallelization [20].
Reviewing the research that has been done so far, it can be seen that most of it suffers from inadequate accuracy, low speed of basket analysis, lack of comprehensiveness, and so on. Thus, considering the problems in the models proposed for value-added customer basket analysis, this paper presents a process-based approach and an algorithm for extracting frequent patterns, namely NList [14]. The proposed system increases the accuracy of customer basket extraction and analysis. The process approach, with the help of deep neural network algorithms, the C4.5 decision tree and the SVMLib algorithm, significantly enhances the quality of the value-added services provided [15]. Table 1 outlines the advantages and disadvantages of each method.
The review of the research conducted in the field of the TelecoVAS customer basket shows that the existing methods face many challenges, such as low accuracy, recall and precision and a high error rate. Therefore, in this paper we combined the XMeans algorithm, the ensemble learning system, and the NList structure to analyze the customer portfolio of a mobile telecommunication company and provide value-added services. The XMeans algorithm is used to determine the optimal number of clusters and to cluster the customers of a mobile telecommunication company. The ensemble learning algorithm is then used to assign categories to new customers, and finally the NList structure is used for customer basket analysis.
The proposed method
The proposed method in this paper is based on XMeans clustering algorithms [21], NList structure [14] for extracting frequent patterns, and ensemble learning system to provide attractive value added services to telecommunication customers. This section describes the stages of service delivery using the proposed hybrid approach. Important parts of the proposed method are as follows:
Data preprocessing
Once the data enters into the proposed system, inappropriate samples must be distinguished and removed as part of a preprocessing phase to keep data consistency. There are some popular ways to apply preprocessing to the data, such as:

Cleaning

Integration

Transformation

Reduction
In this paper, we use the data cleaning method. In the proposed strategy, we check the data, and if a row or column contains null or unused values, each such value is replaced with the mean of the previous and next values. Data cleaning eliminates outliers and produces more consistent data.
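The replacement rule described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation: the function name `fill_with_neighbor_mean` is hypothetical, and the handling of missing values at the start or end of a column (falling back to the single available neighbor) is an assumption, since the text does not specify it.

```python
import math


def is_missing(value):
    """Treat None and NaN as missing values."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_with_neighbor_mean(column):
    """Replace each missing entry with the mean of the nearest valid
    previous and next values of the original column.

    Assumption: when only one neighbor exists (start or end of the column),
    that neighbor is used; when none exists, the entry is left unchanged.
    """
    filled = list(column)
    n = len(column)
    for i, value in enumerate(column):
        if not is_missing(value):
            continue
        prev_val = next((column[j] for j in range(i - 1, -1, -1)
                         if not is_missing(column[j])), None)
        next_val = next((column[j] for j in range(i + 1, n)
                         if not is_missing(column[j])), None)
        neighbors = [v for v in (prev_val, next_val) if v is not None]
        if neighbors:
            filled[i] = sum(neighbors) / len(neighbors)
    return filled


# Hypothetical transaction amounts with one missing value.
print(fill_with_neighbor_mean([10.0, None, 20.0]))
```

Neighbors are always read from the original column, so a filled value never influences the next gap.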
Data normalization
Data normalization is used to increase clustering accuracy. At the preprocessing stage, in order to obtain better results, we normalize the behavior information of the telecommunication customers to the range [0,1]. In other words, all datasets are mapped into matrices, and the matrix rows are normalized. Normalization is done to achieve higher accuracy [22]. To normalize the values of each dataset, we use Eq. (1):
$$X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}$$(1)
where Xmax and Xmin are the maximum and minimum values in the range of property X. After normalizing the data, the values of all the attributes fall within the range [0,1]. Min–Max normalization is simpler than other normalization methods and performs the normalization process faster. For this reason, the Min–Max normalization algorithm is used in this paper.
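As a small illustration of Min–Max normalization with Xmin and Xmax as defined above, the following sketch maps one attribute into [0,1]. The function name and the treatment of a constant attribute (all values mapped to 0) are assumptions made for the example.

```python
def min_max_normalize(values):
    """Map values into [0, 1] using (x - x_min) / (x_max - x_min)."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        # Assumption: a constant attribute carries no information, map it to 0.
        return [0.0 for _ in values]
    return [(x - x_min) / (x_max - x_min) for x in values]


# Hypothetical transaction amounts.
print(min_max_normalize([2.0, 4.0, 6.0]))  # [0.0, 0.5, 1.0]
```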
Customers clustering using XKmeans algorithm
In this paper, we combine the KMeans [23] and XMeans algorithms to cluster customers based on their behavior information. The combination of the KMeans and XMeans clustering algorithms is called the XKMeans algorithm.
The second phase of the proposed method is customer clustering. A customer falls into one of two cases: a new customer who has just become active in the system, or a customer who is already registered in the system and has some activity. The XMeans clustering algorithm receives the customers' behavior information and directs each customer to a cluster based on it. One of the basic uses of the XMeans clustering algorithm in the proposed method is to apply cluster labels to customer information that is unlabeled. Figure 1 illustrates the application of the XMeans clustering algorithm to each customer's information.
As can be seen from Fig. 1, all customers of the telecommunication company are first passed to the XMeans algorithm in order to calculate the optimal value of K. Because the XMeans algorithm is slow and has a high time complexity, it runs in the background on powerful telecommunication servers. After the optimal number of clusters (K) is determined, the KMeans algorithm [23] is used for clustering with that K.
The KMeans algorithm [23] is a basic clustering algorithm that clusters samples into a given number of clusters, k. One of its most important disadvantages is that the number of clusters has to be determined by the researcher, and clustering is performed based on that number. Determining k in this way is highly error-prone and often does not provide optimal clustering. Unlike the KMeans algorithm, which is fast and receives k as input, the XMeans algorithm is relatively slow but obtains the optimal k and determines the clusters with the lowest error rate as the basic clusters.
This number of clusters is used as input to the KMeans algorithm, which clusters the customers' information. After the customers' information is clustered, outlier samples that do not behave similarly to the other samples are removed from the dataset. The steps of the KMeans algorithm are as follows:

1.
Select the number of k for the number of clusters.

2.
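Then k centers (µ_{1},…,µ_{k}) are randomly generated for the data.

3.
Then repeat the following steps until convergence:
For each sample i, calculate c^{(i)}, the index of the nearest center:
$$c^{(i)} = \arg\min_{j} \left\| x^{(i)} - \mu_{j} \right\|^{2}$$(2)
For each j, recalculate µ_{j} as the mean of the samples assigned to cluster j:
$$\mu_{j} = \frac{\sum_{i} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i} 1\{c^{(i)} = j\}}$$(3)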
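The steps above can be sketched in plain Python as follows. This is an illustrative implementation of the standard assignment and update loop, not the code used in the paper; the function names, the `seed` parameter and the choice of drawing the initial centers from the data points are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (centers, assignment)."""
    rng = random.Random(seed)
    # Step 2 (assumption): initial centers are k distinct data points chosen at random.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 3a: assign each sample to its nearest center.
        new_assignment = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no sample changed cluster
        assignment = new_assignment
        # Step 3b: move each center to the mean of its assigned samples.
        for j in range(k):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return centers, assignment


# Hypothetical normalized customer features forming two obvious groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = k_means(data, k=2)
print(labels)
```

The loop stops as soon as no sample changes cluster, which is one common way to read "until convergence".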
After the clustering operation is completed, all customers fall into their respective clusters. In Fig. 2, the internal structure of the XMeans algorithm is visible.
As can be seen from Fig. 2, the initial k number is determined first. Then the KMeans clustering algorithm [23] is repeated with the same number k. The error rate is calculated and then one unit is added to the number of clusters and the previous steps are executed again. This procedure will continue until the best value of k is calculated.
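The loop described for Fig. 2 (run KMeans, measure the error, add one cluster, repeat) can be sketched as below. Several parts are assumptions, because the text does not define them: the error is taken to be the within-cluster sum of squared distances, the search stops when adding a cluster reduces that error by less than `min_gain` of the error at the starting k, and a deterministic farthest-first initialization is used so the example is reproducible. This follows the procedure as described here rather than the original X-Means algorithm, which splits clusters and scores the candidates with the Bayesian information criterion.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def farthest_first_centers(points, k):
    """Deterministic initialization (assumption): start at the first point,
    then repeatedly add the point farthest from the chosen centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return [list(c) for c in centers]


def k_means_sse(points, k, max_iter=100):
    """Run KMeans with k clusters and return the sum of squared errors."""
    centers = farthest_first_centers(points, k)
    assignment = None
    for _ in range(max_iter):
        new_assignment = [min(range(k), key=lambda j: sq_dist(p, centers[j]))
                          for p in points]
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(k):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:
                centers[j] = [sum(dim) / len(members) for dim in zip(*members)]
    return sum(sq_dist(p, centers[c]) for p, c in zip(points, assignment))


def choose_k(points, k_min=1, k_max=10, min_gain=0.05):
    """Increase k one unit at a time while the error keeps dropping noticeably."""
    baseline = k_means_sse(points, k_min)
    if baseline == 0:
        return k_min
    best_k, prev_sse = k_min, baseline
    for k in range(k_min + 1, min(k_max, len(points)) + 1):
        sse = k_means_sse(points, k)
        if (prev_sse - sse) / baseline < min_gain:
            break  # the extra cluster no longer pays off
        best_k, prev_sse = k, sse
    return best_k


# Hypothetical data with three well-separated groups.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
print(choose_k(data))  # 3
```

The selected k would then be passed to the KMeans step to label the customers.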
This paper uses the XMeans clustering technique, an extended version of KMeans, to assign labels to new customers. The input of the XMeans algorithm is the customers of the telecommunication company, and its output is k. The number k is then applied to the KMeans clustering algorithm, whose input is the customers of the telecommunication company and whose output is the labeling of the customers. Table 2 shows a sample of the clustering and labeling of customers' information.
These clusters are used as a label for each customer. Each Ci is labeled after the customers of the telecommunication company are clustered using the XKMeans algorithm. Up to this point a set of customers clustered with specific labels is available. So, we used XMeans algorithm for finding optimal k for KMeans clustering algorithm.
The ensemble learning
Figure 3 shows the ensemble learning flowchart for classifying the customers. The procedure of Fig. 1 is implemented in the first rectangle of Fig. 2, which clusters customers' information using XKMeans. At the core of the ensemble learning system are popular classification algorithms, namely a deep neural network, the C4.5 decision tree with the Information Gain criterion, and the SVMLib algorithm [15], which classify new customers of mobile telecommunication companies. New customers are categorized based on their behavior information. Assigning a category to new entrants allows more accurate value-added services to be offered to them based on the services purchased by others. In the ensemble learning system, deep learning with 50 hidden layers, the C4.5 decision tree with the Information Gain criterion and the SVMLib algorithm are combined, and at each stage the best of the categories they propose is selected as the final result for the new customer.
The training data, which is 70% of the data, is fed into the algorithms and the corresponding models are generated. Test data are then fed into the generated models to determine a category based on behavior information. Consider Sample 1: a 35-year-old male customer who lives in X Province. This sample is entered into the deep learning model, which specifies category 1 for it. Sample 1 is also passed to the decision tree algorithm, which specifies category 2. Finally, the SVMLib algorithm [15] assigns category 1 to Sample 1. Outputs 1, 2 and 1 are passed to the Max system and, based on the maximum votes, output 1 is determined for Sample 1. In Category 1, for example, customers are between the ages of 20–30 and are male in X province. Thus the output of the ensemble learning system is as follows.
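The voting step in this example can be sketched as follows. Only the maximum-votes combination is shown; the three classifiers themselves are not implemented, and the tie-breaking rule (prefer the label that appears first in the list) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the category predicted by the most classifiers.

    Assumption: on a tie, the earliest tied prediction in the list wins.
    """
    counts = Counter(predictions)
    top = max(counts.values())
    for label in predictions:
        if counts[label] == top:
            return label


# Sample 1 from the text: deep network -> 1, C4.5 tree -> 2, SVM -> 1.
print(majority_vote([1, 2, 1]))  # 1
```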
In the ensemble phase, a category is selected for each new customer. Customers in the target category behave similarly to one another. After the ensemble learning system has been run and the category has been assigned to the new customers, the NList structure [14] is applied to all customer baskets in the selected category, and finally, based on the analysis, a set of services is provided for the new customers.
Basket analysis using Nlist algorithm
One of the most important steps in this paper is to analyze the basket of customers interested in receiving value-added services, based on their extracted behavior and their transaction records in the telecommunication system. In this study, the NList association algorithm is used to analyze the customer basket [14]. Based on its tree structure, the NList algorithm processes customer transactions and offers services to customers based on the extracted frequent rules and transactions. From the frequent transactions, a set of features that are effective in them is extracted and then used in the ensemble learning system. Suppose a database called DB has n transactions and these transactions have a number of items. For example, the following table shows a sample DB dataset with 6 transactions (n = 6).
This small data set is used to illustrate how the basket is analyzed in the proposed system. The support of a pattern X is denoted σ(X), where X ⊆ I, I is the set of all items in the DB dataset, and σ(X) is the number of transactions that contain all the items in X. A pattern with k items is called a k-pattern, and I_1 is the set of frequent 1-patterns arranged in descending order of frequency. For convenience, the pattern {A, C, W} is written as ACW. Let minSup be the minimum support threshold. Suppose that minSup = 50% for the DB dataset in Table 3. AW and ACW are two of the frequent patterns because σ(AW) = σ(ACW) = 4, that is, 4 of 6 transactions (about 67%), which exceeds 50%. Given these prerequisites, the structure of the NList algorithm for extracting frequent patterns is discussed next.
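The support computation can be illustrated with a short sketch. The six transactions below are hypothetical: they are chosen only so that σ(AW) = σ(ACW) = 4, as in the text, and are not taken from the paper's table. The function names are also illustrative.

```python
def support(pattern, transactions):
    """Number of transactions that contain every item of the pattern."""
    items = set(pattern)
    return sum(1 for t in transactions if items <= set(t))


def is_frequent(pattern, transactions, min_sup=0.5):
    """True when the pattern occurs in at least min_sup of the transactions."""
    return support(pattern, transactions) / len(transactions) >= min_sup


# Hypothetical DB with n = 6 transactions; each string lists the items of one basket.
db = ["ACTW", "CDW", "ACTW", "ACDW", "ACDTW", "CDT"]

print(support("AW", db), support("ACW", db))  # 4 4
print(is_frequent("ACW", db))                 # True, since 4/6 >= 0.5
```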
In 2012, Deng and Xu introduced a tree structure called the PPC tree. In the PPC tree, each tree node N_i has five values: n(N_i), f(N_i), child(N_i), pre(N_i) and post(N_i) [14]. The NList structure is based on the PPC tree and consists of a set of nodes, each represented as N_i. Each node N_i is described by a PP code of the form C_i = ⟨pre(N_i), post(N_i), f(N_i)⟩. The N-list associated with pattern A, written NL(A), is the set of PP codes of the PPC tree nodes associated with A, and is calculated based on relation (4):
$$NL(A) = \bigcup_{\{N_{i} \in N \mid n(N_{i}) = A\}} C_{i}$$(4)
where C_i is the PP code of a node N_i that registers A. The support σ(A) is calculated as follows:
$$\sigma(A) = \sum_{C_{i} \in NL(A)} f(C_{i})$$(5)
N-lists are likewise associated with k-patterns. Suppose XA and XB are two frequent (k−1)-patterns with the same prefix X (which can be an empty set) such that A precedes B in the order I_1, and let NL(XA) and NL(XB) be the N-lists associated with XA and XB, respectively. In the N-list method, NL(XB) ⊆ NL(XA) holds when, for every \(C_{j} \in NL(XB)\), there is a \(C_{i} \in NL(XA)\) such that \(C_{i}\) is an ancestor of \(C_{j}\). In that case, \(\sigma(XAB) = \sum_{C_{i} \in NL(XAB)} f(C_{i}) = \sum_{C_{i} \in NL(XB)} f(C_{i}) = \sigma(XB)\). Figure 4 illustrates the creation of a PPC tree [25] using the DB example with minSup = 50%.
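The ancestor test and the resulting N-list of a longer pattern can be sketched with PP codes written as `(pre, post, f)` tuples. The codes below are hypothetical and hand-built for a tiny tree; the function names are illustrative. The sketch relies on the usual property of pre-order and post-order numbering: a node is an ancestor of another exactly when its pre-order rank is smaller and its post-order rank is larger.

```python
def is_ancestor(ci, cj):
    """True when the node with PP code ci is an ancestor of the node with code cj."""
    return ci[0] < cj[0] and ci[1] > cj[1]


def intersect_n_lists(nl_a, nl_b):
    """N-list of the pattern AB, where A precedes B in the item order.

    Each code of B is matched with its ancestor code of A (if any); the result
    keeps A's pre/post ranks and accumulates B's frequencies.
    """
    merged = {}
    for cj in nl_b:
        for ci in nl_a:
            if is_ancestor(ci, cj):
                key = (ci[0], ci[1])
                merged[key] = merged.get(key, 0) + cj[2]
                break  # a node has at most one ancestor registering item A
    return [(pre, post, f) for (pre, post), f in sorted(merged.items())]


def n_list_support(n_list):
    """Support of a pattern: the sum of the frequencies in its N-list."""
    return sum(f for _, _, f in n_list)


# Hypothetical tree: root -> A(count 4) -> B(count 3), and root -> B(count 1).
nl_a = [(1, 1, 4)]
nl_b = [(2, 0, 3), (3, 2, 1)]
nl_ab = intersect_n_lists(nl_a, nl_b)
print(nl_ab, n_list_support(nl_ab))  # [(1, 1, 3)] 3
```

Here only one of the two B nodes lies below an A node, so σ(AB) = 3 while σ(B) = 4; when every code of NL(B) has an ancestor in NL(A), the two supports coincide.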
The NList algorithm first creates the PPC tree and then traverses it to generate the N-lists associated with the frequent 1-patterns. Then, a divide-and-conquer strategy is employed on the PPC tree. In the following, the NList structure implementation process is described with an example in order to find frequent patterns.
Consider the DB dataset example in Table 3, with minSup = 50%, to illustrate the operation of this algorithm. First, the NList algorithm removes all items that do not meet the minSup threshold and arranges the remaining items in descending order of frequency; the result is shown in Table 4. As shown in Fig. 5, the algorithm then inserts the remaining items of each transaction, in turn, into the PPC tree.
Figures 6 and 7 show that, after the NList structure is executed, a set of frequent transactions is extracted. Using the NList structure, a set of value-added services is therefore offered to new customers, based on the baskets in the category designated for them (Table 4).
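The construction just described (filter infrequent items, sort by descending frequency, insert each transaction into the tree, then number the nodes) can be sketched as follows. The six transactions are hypothetical and are not the paper's Table 3; the class and function names are illustrative, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter


class Node:
    """One PPC-tree node: item name, frequency, children, pre/post ranks."""

    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}
        self.pre = self.post = None


def build_ppc_tree(transactions, min_count):
    counts = Counter(item for t in transactions for item in set(t))
    # Keep frequent items, ordered by descending support (ties: alphabetical).
    order = sorted((i for i in counts if counts[i] >= min_count),
                   key=lambda i: (-counts[i], i))
    rank = {item: r for r, item in enumerate(order)}
    root = Node(None)
    for t in transactions:
        node = root
        for item in sorted((i for i in set(t) if i in rank), key=rank.get):
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    ranks = {"pre": 0, "post": 0}

    def number(node):
        # Pre-order rank on entry, post-order rank on exit.
        node.pre = ranks["pre"]
        ranks["pre"] += 1
        for child in node.children.values():
            number(child)
        node.post = ranks["post"]
        ranks["post"] += 1

    number(root)
    return root


def n_lists(root):
    """Collect the (pre, post, f) codes of every item, in pre-order."""
    result = {}

    def visit(node):
        if node.item is not None:
            result.setdefault(node.item, []).append((node.pre, node.post, node.count))
        for child in node.children.values():
            visit(child)

    visit(root)
    return result


db = ["ACTW", "CDW", "ACTW", "ACDW", "ACDTW", "CDT"]  # hypothetical baskets
codes = n_lists(build_ppc_tree(db, min_count=3))      # minSup = 50% of 6
print({item: sum(f for _, _, f in nl) for item, nl in codes.items()})
```

Summing the frequencies in each item's N-list reproduces that item's support, which is the property the later pattern-growth steps rely on.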
Results
The proposed method is implemented in MATLAB version 2015a. The operating system is 32-bit Windows 10. The machine has 4 GB of RAM, of which 3.06 GB is usable, and an Intel Core i7 Q 720 CPU with a base frequency of 1.60 GHz. Table 2 shows the settings and parameters of the network simulation. In this paper, we focus on customers of the Iranian telecommunication industry. Trials and simulations have been carried out on 10,000 telecommunication contacts. Table 5 shows the system on which the simulation was performed.
Evaluation criteria
This section reviews the evaluation metrics for the unsupervised and supervised algorithms. Eqs. (7) to (10) show how accuracy, precision, recall and classification error are calculated.
$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$(7)
In Eq. (7), TP (True Positive) represents the number of transactions that are positive and classified as positive. TN (True Negative) represents the number of transactions that are negative and classified as negative. FP (False Positive) indicates the number of transactions that are negative but classified as positive. Finally, FN (False Negative) shows the number of transactions that are positive but classified as negative. The equations for precision and recall are as follows.
$$Precision = \frac{TP}{TP + FP}$$(8)
$$Recall = \frac{TP}{TP + FN}$$(9)
Finally, the error rate is calculated by formula (10):
$$Error = \frac{FP + FN}{TP + TN + FP + FN}$$(10)
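For a quick numerical check of these four measures, the sketch below computes them from confusion-matrix counts using the standard definitions. The counts in the example are hypothetical, and returning 0 for an undefined ratio is an assumption of this sketch.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and error rate from confusion-matrix counts."""
    total = tp + tn + fp + fn

    def ratio(numerator, denominator):
        # Assumption: report 0.0 when a ratio is undefined (empty denominator).
        return numerator / denominator if denominator else 0.0

    return {
        "accuracy": ratio(tp + tn, total),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "error_rate": ratio(fp + fn, total),
    }


# Hypothetical counts for 100 test transactions.
print(classification_metrics(tp=40, tn=40, fp=10, fn=10))
```

Accuracy and error rate always sum to 1, which gives a simple sanity check on reported results.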
Validation indices are used to measure the goodness of clustering results, either to compare different clustering methods or to compare the results of a single method with different parameters. Indicators for evaluating unsupervised learning techniques differ from those of supervised techniques. In this section, we introduce important indicators for cluster validation based on internal and external validation indices.
Compactness (CP) relates to the inherent information of the dataset and is the first criterion for evaluating the goodness of the data separation based on the values and properties of the dataset. According to this criterion, data belonging to the same cluster should be as close to each other as possible. The common criterion for determining data density is the data variance, so a good clustering creates clusters of samples that are similar to each other. More precisely, this index calculates the average distance between each data pair according to relation (11). X is a dataset consisting of a stream of x_i, Ω is the set of x_i collected in a cluster, and W is the set of w_i that represent the centers of the clusters Ω. To measure the mean compactness over all clusters, we use relation (12), where k is the number of clusters obtained. Ideally, the members of each cluster should be as close as possible. Therefore, the lower the CP index, the more compact and the better the clustering [26].

1.
Separation Index (SP), which specifies the degree of separation between clusters. This index measures the Euclidean distance between the cluster centers using Eq. (13), where an SP close to zero indicates closeness between the clusters.
$$\overline{SP} = \frac{2}{k^{2} - k}\sum_{i = 1}^{k} \sum_{j = i + 1}^{k} \left\| w_{i} - w_{j} \right\|_{2}$$(13)
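The two indices can be computed with a few lines of Python. The separation function follows Eq. (13) directly. For compactness, the sketch assumes the common form in which each cluster's CP is the mean Euclidean distance of its members to the cluster center w_i, averaged over the k clusters; the function names and example clusters are hypothetical.

```python
import math


def centroid(cluster):
    """Center w_i of a cluster: the per-dimension mean of its points."""
    return [sum(dim) / len(cluster) for dim in zip(*cluster)]


def compactness(clusters):
    """Mean over clusters of the average member-to-center distance (assumed form)."""
    per_cluster = []
    for cluster in clusters:
        w = centroid(cluster)
        per_cluster.append(sum(math.dist(x, w) for x in cluster) / len(cluster))
    return sum(per_cluster) / len(per_cluster)


def separation(clusters):
    """Eq. (13): average pairwise Euclidean distance between cluster centers."""
    centers = [centroid(c) for c in clusters]
    k = len(centers)
    total = sum(math.dist(centers[i], centers[j])
                for i in range(k) for j in range(i + 1, k))
    return 2 * total / (k * k - k)


# Two hypothetical clusters on a line.
clusters = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
print(compactness(clusters), separation(clusters))  # 1.0 10.0
```

A good partition combines a low CP with a high SP, as in this example.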
The Davies–Bouldin index, or DB, introduced by Davies and Bouldin in 1979, does not depend on the number of clusters or on the clustering algorithm. This criterion uses the similarity between two clusters (R_ij), which is defined from the dispersion of a cluster \(\overline{CP}_{i}\) and the non-similarity between two clusters (d_ij). The similarity between two clusters can be defined in different ways but must satisfy the conditions in (14). The similarity of two clusters is measured using relation (15), where d_ij is given by relation (16).
Based on this definition of the similarity between two clusters, the Davies–Bouldin index is defined by relation (17), where R_i is calculated by relation (18). A DB value close to zero indicates that the clusters are compact and well separated.
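A minimal sketch of the Davies–Bouldin computation is given below. It assumes the standard formulation: the dispersion of a cluster is the mean distance of its members to its center, R_ij = (S_i + S_j) / d_ij with d_ij the distance between centers, R_i is the maximum of R_ij over j ≠ i, and DB is the mean of the R_i. The function names and example data are hypothetical.

```python
import math


def centroid(cluster):
    """Per-dimension mean of the points in a cluster."""
    return [sum(dim) / len(cluster) for dim in zip(*cluster)]


def davies_bouldin(clusters):
    """Davies-Bouldin index (standard form, assumed); lower is better."""
    centers = [centroid(c) for c in clusters]
    # Dispersion S_i: mean distance of the members to their own center.
    scatter = [sum(math.dist(x, w) for x in c) / len(c)
               for c, w in zip(clusters, centers)]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # R_i: the worst-case similarity of cluster i to any other cluster.
        total += max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                     for j in range(k) if j != i)
    return total / k


clusters = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]  # hypothetical clusters
print(davies_bouldin(clusters))
```

For these two clusters each dispersion is 1 and the centers are 10 apart, so the index is (1 + 1) / 10 = 0.2.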

2.
The Dunn Validity Index (DVI) is similar to the cross-validation process used in supervised learning techniques (cross-validation is a model evaluation method that determines how well the results of a statistical analysis generalize to an independent data set, that is, how independent they are of the training data). It measures not only the degree of compactness within the clusters but also the degree of separation between the clusters. Relation (19) defines this criterion.
$$DVI = \frac{\min_{0 < m \ne n < k} \left\{ \min_{\begin{array}{c} \forall x_{i} \in \Omega_{m} \\ \forall x_{j} \in \Omega_{n} \end{array}} \left\{ \left\| x_{i} - x_{j} \right\| \right\} \right\}}{\max_{0 < m \le k} \max_{\forall x_{i}, x_{j} \in \Omega_{m}} \left\{ \left\| x_{i} - x_{j} \right\| \right\}}$$(19)
If the dataset contains well-separated clusters, the distance between clusters (the numerator) is expected to be large and the diameter of the clusters (the denominator) is expected to be small. As a result, a larger value of this criterion is more desirable. Its disadvantages are its computation time and its sensitivity to noise (the diameter of a cluster can change greatly if noisy data are present).
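Relation (19) can be read directly as code: the smallest distance between points of different clusters divided by the largest distance between points of the same cluster. The sketch below is a brute-force illustration with hypothetical names and toy data; it compares every pair of points, which also makes the computation-time drawback visible.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """Dunn validity index: min inter-cluster distance / max cluster diameter.

    Assumption: at least two clusters, and at least one cluster with two points.
    """
    min_between = min(
        math.dist(a, b)
        for first, second in combinations(clusters, 2)
        for a in first
        for b in second
    )
    max_diameter = max(
        math.dist(a, b)
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    )
    return min_between / max_diameter

# Hypothetical toy data: closest inter-cluster pair is 10 apart, largest diameter is 4.
clusters = [
    [(0.0, 0.0), (0.0, 2.0)],
    [(10.0, 0.0), (10.0, 4.0)],
]
dvi = dunn_index(clusters)
print(dvi)
```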
Dataset
In this paper, we used 10,000 records from the Iran Telecom contacts database. The database has 14 attributes. Table 6 shows the features of the value-added service data of Iranian telecommunication customers.
The following list describes each of the dataset fields:

msg_type: This property indicates the type of message.

mobile_no: This feature shows the shared mobile number.

txn_amount: This property shows the transaction amount.

pr_code: This property shows the process code.

rrn: This attribute indicates that the transaction was successful.

response: This property indicates the response time.

record_time: This feature shows the transaction registration time.

bank_id: This property shows the bank ID.

txn_type: This property indicates the type of transaction.

target_mobile: This feature shows the service code.

topup_type: This property indicates the type of product.

status: This feature shows the status of the transaction.

hour: This feature shows the transaction registration time.

capturedate: This property shows the transaction registration date.
The experiments were run on a 32-bit Windows 7 operating system with 4 GB of RAM (3.06 GB usable) and an Intel Core i7 Q 720 CPU @ 1.60 GHz.
Evaluation results
In this section, the accuracy, precision, recall and classification error for trusted customers of the telecommunication company are analyzed by examining their baskets, both without the NList algorithm and with it combined with the ensemble learning core. In this paper, we combine three algorithms, namely deep neural networks, the C4.5 decision tree and the SVMLib support vector machine, to analyze the basket and classify customers. Each of these algorithms has the properties shown in Tables 7, 8 and 9. Table 5 shows the details of the deep neural network algorithm for basket analysis and customer classification.
The table below shows the specifications of the C4.5 decision tree algorithm for basket analysis and customer classification.
The following table shows the specifications of the SVMLib algorithm for basket analysis and customer classification.
Therefore, the simulations are performed according to the settings of each algorithm given in the tables above. As described in the “Proposed method” section, the NList algorithm is used to select frequent features. Frequent features are those that are used repeatedly by previous customers of Iran Broadcasting Company. After simulating the proposed method and running the NList algorithm, the features shown in Fig. 8 are selected as frequent features.
As can be seen, the following features have been extracted by the NList algorithm:

msg_type attribute

mobile_no feature

txn_amount attribute

pr_code feature

response feature

record_time feature

bank_id feature

txn_type attribute

target_mobile feature

status feature
In other words, the effective features for basket analysis selected by the NList algorithm are as follows (Table 10).
Finally, the proposed hybrid algorithm is applied to the basket with these features and the results are discussed in the next section.
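The effect of this selection step can be sketched in Python as a projection of each record onto the ten selected fields. The field names come from the dataset description and the list above; the helper `project` and the sample record are hypothetical, and the sketch does not implement the NList algorithm itself.

```python
# The 14 dataset fields, in the order they are described in the Dataset section.
ALL_FIELDS = [
    "msg_type", "mobile_no", "txn_amount", "pr_code", "rrn", "response",
    "record_time", "bank_id", "txn_type", "target_mobile", "topup_type",
    "status", "hour", "capturedate",
]

# The 10 fields reported as selected by the NList step.
SELECTED_FIELDS = [
    "msg_type", "mobile_no", "txn_amount", "pr_code", "response",
    "record_time", "bank_id", "txn_type", "target_mobile", "status",
]

# Fields that are not carried forward to the classification stage.
DROPPED_FIELDS = [name for name in ALL_FIELDS if name not in SELECTED_FIELDS]

def project(record):
    """Keep only the selected fields of one transaction record (a dict)."""
    return {name: record[name] for name in SELECTED_FIELDS}

# Hypothetical record with placeholder values.
sample = {name: f"<{name}>" for name in ALL_FIELDS}
reduced = project(sample)
print(DROPPED_FIELDS)
```

Under this reading, rrn, topup_type, hour and capturedate are the four fields left out.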
Analysis of clustering results
Before describing the results of the proposed method for basket analysis, this section examines the results of the clustering methods. Some of the most important metrics for validating the KMeans clustering algorithm are:

1.
CP: The higher this criterion is, the more favorable the clustering will be.

2.
SP: The lower this criterion is, the better the clustering will be.

3.
DB: The higher this criterion is, the more favorable the clustering will be.

4.
DVI: The higher this criterion is, the more favorable the clustering will be.
These metrics are discussed by Fahad et al. [26]. To demonstrate the validity of the KMeans algorithm, the following section compares the mean values of the metrics obtained with this algorithm against those of other algorithms. We implemented the Birch [27], EM [28], OptiGrid [29] and Denclue [30] clustering algorithms and compared the results of the proposed method (XKMeans) with them.
Table 11 compares the compactness of the clustering methods with that of the KMeans algorithm.
The compactness value is 3.63 for Birch, 2.88 for EM, 3.85 for XKMeans (the proposed method), 1.79 for OptiGrid and 1.35 for Denclue. The KMeans-based algorithm outperforms Birch, EM, OptiGrid and Denclue by 0.22, 0.97, 2.06 and 2.5, respectively. The second column shows a comparison of the validity index of the clustering methods.
The validity index is 0.57 for Birch, 0.56 for EM, 0.61 for KMeans, 0.52 for OptiGrid and 0.501 for Denclue. The KMeans algorithm is higher than Birch, EM, OptiGrid and Denclue by 0.04, 0.05, 0.09 and 0.11, respectively. The third column compares the Davies–Bouldin index of the clustering methods. The DB value is 5.78 for Birch, 5.37 for EM, 6.103 for KMeans, 4.27 for OptiGrid and 4.29 for Denclue. The KMeans algorithm is higher than Birch, EM, OptiGrid and Denclue by 0.32, 0.73, 1.83 and 1.81, respectively. The fourth column compares the separation index of the clustering methods. The separation index is 4.11 for Birch, 3.28 for EM, 2.04 for KMeans, 2.16 for OptiGrid and 1.74 for Denclue. The improvement of the KMeans-based algorithm over Birch, EM and OptiGrid is 2.07, 1.24 and 0.12, respectively, while it performed worse than Denclue.
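The differences quoted above can be rechecked with a few lines of Python. The values are the ones stated in the text; the dictionary keys and the helper `gaps` are hypothetical names, and "proposed" stands for the KMeans-based method.

```python
# Mean index values as quoted in the text for Table 11.
TABLE_11 = {
    "compactness": {"Birch": 3.63, "EM": 2.88, "proposed": 3.85, "OptiGrid": 1.79, "Denclue": 1.35},
    "validity": {"Birch": 0.57, "EM": 0.56, "proposed": 0.61, "OptiGrid": 0.52, "Denclue": 0.501},
    "davies_bouldin": {"Birch": 5.78, "EM": 5.37, "proposed": 6.103, "OptiGrid": 4.27, "Denclue": 4.29},
    "separation": {"Birch": 4.11, "EM": 3.28, "proposed": 2.04, "OptiGrid": 2.16, "Denclue": 1.74},
}

def gaps(metric):
    """Difference (proposed minus other), rounded to two decimals."""
    row = TABLE_11[metric]
    return {
        name: round(row["proposed"] - value, 2)
        for name, value in row.items()
        if name != "proposed"
    }

for metric in TABLE_11:
    print(metric, gaps(metric))
```

For the separation index the differences come out negative against Birch, EM and OptiGrid because the proposed method has the smaller value; the text reports their magnitudes.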
Basket analysis results without the NLIST algorithm
This section analyzes the results of combining deep neural networks, the C4.5 decision tree and SVMLib in an ensemble learning system. The NList algorithm can have a significant impact on the accuracy of the basket analysis. Therefore, to demonstrate the effectiveness of the NList algorithm, we first analyze the results without this algorithm and then the results obtained with it.
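This section does not state how the outputs of the three base classifiers are combined. As one plausible illustration only, the Python sketch below uses simple majority voting over per-model predictions; the voting rule, the function name and the toy predictions are all assumptions and should not be read as the paper's method.

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label lists (all of equal length) by majority vote."""
    combined = []
    for votes in zip(*predictions):
        # most_common(1) returns the label with the highest vote count.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Hypothetical binary predictions of three base models for four customers.
dnn_pred = [1, 0, 1, 1]
c45_pred = [1, 1, 0, 1]
svm_pred = [0, 0, 1, 1]
ensemble_pred = majority_vote([dnn_pred, c45_pred, svm_pred])
print(ensemble_pred)
```

With three voters and two classes a tie cannot occur, which is one reason an odd number of base models is convenient for this rule.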
Accuracy, precision, recall and error are among the most important measures for demonstrating the quality of the proposed method. Table 12 shows the TP, TN, FP and FN results of the SVMLib, C4.5 and deep neural network algorithms for basket analysis with and without the NList algorithm.
Table 12 shows the number of correct and incorrect classifications. Based on these values, the accuracy, precision, recall and error criteria are calculated, as described below. Table 13 shows the accuracy, precision, recall and error rate of the SVMLib, C4.5 and deep neural network algorithms for basket analysis without the NList algorithm.
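The four criteria follow from the confusion-matrix counts by the usual definitions, sketched below in Python. The counts in the example are made up for illustration and are not the values of Table 12; the error rate is assumed to be one minus the accuracy.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and error rate from confusion-matrix counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "error": 1.0 - accuracy,  # assumption: error rate = 1 - accuracy
    }

# Hypothetical counts, not taken from the paper.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(metrics)
```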
As can be seen in Table 13, the accuracy of the ensemble learning algorithm without the NList algorithm is 90.6%. The average improvement in classification accuracy and basket analysis of the proposed method is 12.38% compared with the other algorithms. The precision, recall and error rate of the proposed method also improved by about 0.17%, 12.23% and 12.38%, respectively, compared with the other algorithms.
As can be seen in Table 14, the accuracy of the ensemble learning algorithm with the NList algorithm is 97.6%. The average improvement in classification accuracy and basket analysis of the proposed method is 14.23% compared with the other algorithms. The precision, recall and error rate of the proposed method also improved by about 0.66%, 14.61% and 14.23%, respectively, compared with the other algorithms.
Discussion and future suggestions
Value-added services can generate substantial profit for telecommunication companies. Some customers pay for VAS to enjoy these services. In this paper, customers' baskets are analyzed and the customers who can bring more profit to telecommunication companies are classified using an unsupervised machine learning algorithm (KMeans), an ensemble learning system that combines deep neural networks, SVMLib and the C4.5 decision tree, and the NList algorithm. Simulating the proposed method showed that using the NList technique to extract frequent features has a significant effect on customer basket analysis. One of the most important applications of the NList algorithm in the proposed method is therefore the extraction of frequent rules, which makes it possible to identify the top features in telecommunication services and use them to analyze the subscriber basket.
The most important advantages of this article for other individuals and organizations are:

Analyzing the TelecoVAS customer basket and providing attractive services for the targeted customers

Increasing revenues from the TelecoVAS customer basket of the telecommunication company by finding and providing more accurate services related to customer needs than before.

Increasing the number of users of the telecommunication company by offering services related to users' tastes.

Decreasing costs of ads broadcasting and replacing it with the targeted ads based on the customer’s taste.

Customer grouping based on their tastes and needs. It is not only for TelecoVAS but also can be used for any dataset in digital markets.

Extracting frequent rules and processes in customers' baskets, so that the telecommunication company can focus on the extracted rules.
Future research suggestions include using reinforcement learning methods, LSTM deep neural networks and GMDH instead of the ensemble learning system, and using other clustering algorithms such as DBSCAN instead of the KMeans clustering algorithm in the clustering process.
Availability of data and materials
The datasets generated and/or analysed during the current study are not publicly available but are available from the corresponding author on reasonable request.
References
Gb J, Maran K. Influence of the Value Added Services (VAS) consumer decision with the brand names. Int J Sup Chain Mgt. 2018;7(1):137.
Olya H, Altinay L, De Vita G. An exploratory study of value added services. J Serv Mark. 2018;32:334–45.
Chen MC, Chiu AL, Chang HH. Mining changes in customer behavior in retail marketing. Expert Syst Appl. 2005;28(4):773–81.
Liu J, Gu Y, Kamijo S. Customer behavior classification using surveillance camera for marketing. Multimed Tools Appl. 2017;76(5):6595–622.
Kaur M, Kang S. Market Basket Analysis: identify the changing trends of market data using association rule mining. Procedia Comput Sci. 2016;85:78–85. https://doi.org/10.1016/j.procs.2016.05.180.
Mansur A, Kuncoro T. Product inventory predictions at small medium enterprise using market basket analysis approach-neural networks. Procedia Econ Financ. 2012;4:312–20.
Haghighatnia S, Abdolvand N, Rajaee HS. Evaluating discounts as a dimension of customer behavior analysis. J Mark Commun. 2018;24(4):321–36.
Kurniawan F, Umayah B, Hammad J, Nugroho SM, Hariadi M. Market Basket Analysis to identify customer behaviours by way of transaction data. Knowl Eng Data Sci. 2018;1(1):20.
Musalem A, Aburto L, Bosch M. Market basket analysis insights to support category management. Eur J Mark. 2018. https://doi.org/10.1108/EJM-06-2017-0367.
Szymkowiak M, Klimanek T, Józefowski T. Applying market basket analysis to official statistical data. Econometrics. 2018;22(1):39–57.
Valle MA, Ruz GA, Morrás R. Market basket analysis: complementing association rules with minimum spanning trees. Expert Syst Appl. 2018;97:146–62.
Jain S, Sharma NK, Gupta S, Doohan N. Business strategy prediction system for market basket analysis. In: Kapur P, Kumar U, Verma A, editors. Quality, IT and business operations. Springer proceedings in business and economics. Singapore: Springer; 2018. p. 93–106.
Srivastava N, Stuti, Gupta K, Baliyan N. Improved market basket analysis with utility mining. In: Proceedings of 3rd international conference on internet of things and connected technologies (ICIoTCT); 2018. p. 26–7.
Deng Z, Wang Z, Jiang J. A new algorithm for fast mining frequent itemsets using N-lists. Sci China Inf Sci. 2012;55(9):2008–30.
Abdiansah A, Wardoyo R. Time complexity analysis of support vector machines (SVM) in LibSVM. Int J Comput Appl. 2015;128(3):28–34.
Seyedan M, Mafakheri F. Predictive big data analytics for supply chain demand forecasting: methods, applications, and research opportunities. J Big Data. 2020;7(1):1–22.
Yudhistyra WI, Risal EM, Raungratanaamporn IS, Ratanavaraha V. Using big data analytics for decision making: analyzing customer behavior using association rule mining in a gold, silver, and precious metal trading company in Indonesia. Int J Data Sci. 2020;1(2):57–71.
Jiang H, Kwong CK, Kremer GO, Park WY. Dynamic modelling of customer preferences for product design using DENFIS and opinion mining. Adv Eng Inform. 2019;42:100969.
Venkatachari K, Chandrasekaran ID. Market basket analysis using fp growth and apriori algorithm: a case study of mumbai retail store. BVIMSR’s J Manag Res. 2016;8(1):56–63.
Sherly KK, Nedunchezhian R. A improved incremental and interactive frequent pattern mining techniques for market basket analysis and fraud detection in distributed and parallel systems. Indian J Sci Technol. 2015;8(18):1–12.
Pelleg D, Moore AW. X-means: extending k-means with efficient estimation of the number of clusters. In: ICML, vol. 1; 2000. p. 727–34.
Kiran A, Vasumathi D. Data mining: min–max normalization based data perturbation technique for privacy preservation. In: Proceedings of the third international conference on computational intelligence and informatics. Singapore: Springer; 2020. p. 723–34.
Likas A, Vlassis N. The global k-means clustering algorithm. Pattern Recognit. 2003;36(2):451–61.
Ossama O, Mokhtar HMO, ElSharkawi ME. An extended k-means technique for clustering moving objects. Egypt Inf J. 2011;12(1):45–51.
Le T, Vo B. An N-list-based algorithm for mining frequent closed patterns. Expert Syst Appl. 2015;42(19):6648–57.
Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya AY, et al. A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Top Comput. 2014;2(3):267–79.
Lorbeer B, Kosareva A, Deva B, Softić D, Ruppel P, Küpper A. Variations on the clustering algorithm BIRCH. Big Data Res. 2018;11:44–53.
Do CB, Batzoglou S. What is the expectation maximization algorithm? Nat Biotechnol. 2008;26(8):897–9.
Hinneburg A, Keim DA. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering; 1999.
Rehioui H, Idrissi A, Abourezq M, Zegrari F. DENCLUE-IM: a new approach for big data clustering. Procedia Comput Sci. 2016;83:560–7.
Acknowledgement
Not applicable.
Funding
Not applicable.
Author information
Contributions
All authors contributed to developing the ideas, and writing and reviewing this manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Vahidi Farashah, M., Etebarian, A., Azmi, R. et al. An analytics model for TelecoVAS customers’ basket clustering using ensemble learning approach. J Big Data 8, 36 (2021). https://doi.org/10.1186/s40537-021-00421-1