A comparative dimensionality reduction study in telecom customer segmentation using deep learning and PCA

Telecom Companies logs customer’s actions which generate a huge amount of data that can bring important findings related to customer’s behavior and needs. The main characteristics of such data are the large number of features and the high sparsity that impose challenges to the analytics steps. This paper aims to explore dimensionality reduction on a real telecom dataset and evaluate customers’ clustering in reduced and latent space, compared to original space in order to achieve better quality clustering results. The original dataset contains 220 features that belonging to 100,000 customers. However, dimensionality reduction is an important data preprocessing step in the data mining process specially with the presence of curse of dimensionality. In particular, the aim of data reduction techniques is to filter out irrelevant features and noisy data samples. To reduce the high dimensional data, we projected it down to a subspace using well known Principal Component Analysis (PCA) decomposition and a novel approach based on Autoencoder Neural Network, performing in this way dimensionality reduction of original data. Then K-Means Clustering is applied on both-original and reduced data set. Different internal measures were performed to evaluate clustering for different numbers of dimensions and then we evaluated how the reduction method impacts the clustering task.

Clustering high dimensional data Data dimensionality in statistics refers to how many attributes a dataset has. Big data is one of the most common topics in present studies, and its techniques are applied in different fields such as telecom industry [1] to support strategic decisions. Clustering involves the grouping of similar objects into a set known as cluster. Objects in one cluster are likely to be different when compared to objects grouped under another cluster, and it's one of the most fundamental processes for analyzing unsupervised data, which is applied in a wide range of applications such as computer vision [2][3][4], natural language processing [5][6][7] and bioinformatics [8,9]. Many algorithms have been suggested over the previous eras to address clustering issues [10,11]. There are two famous featurebased techniques among the current algorithms, K-means [12] and Gaussian Mixture Models (GMM) [13]. K-means makes it difficult to cluster each sample to its closest cluster center. A Gaussian mixture model is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters and attempts to infer distribution of the data.
Learning from large amount of data is a very challenging issue that faces most of feature-based clustering and other data analysis techniques, as data analysis becomes more difficult due to the "curse of dimensionality" [14]. So, high-dimensional data processing requires using proper techniques and methods [15]. The objective of dimensionality reduction methods is to learn an appropriate and simplified data representation from the original dataset [16][17][18] in order to get more insight from huge data. While many approaches have been presented to handle the challenges of reducing dimensionality and clustering, standard clustering techniques generally have bad efficiency on high-dimensional data because of the inefficiency of similarity measures used in them. In addition, using these techniques with large-scale datasets usually suffer from high computational complexity. For this purpose, techniques for reducing dimensionality and transforming features have been widely researched to map the raw information into a new feature space where clustering algorithms and data analysis techniques can be applied on representative features instead of the entire feature space. Generally, current techniques of data transformation include linear transformation such as Principal Component Analysis (PCA) [19] and non-linear conversion such as kernel methods [20], spectral methods [11] and deep learning-based techniques which have shown an impressive result in eliminating irrelevant and redundant information. Encouraged by a serious interest in deep learning, many researchers use autoencoders [21] and clustering approaches to address the objective of data representation and clustering.

Deep Autoencoder: challenges and issues
Deep learning is a subset of machine learning in artificial intelligence (AI) that has networks capable of learning unsupervisedly from unlabeled data by working with multilevel learning of data representations [22] Recently, Deep learning has been applied successfully in many fields, such as image classification and speech recognition [23] because these deep architectures can automatically learn features from images, sound or text data. Deep neural networks (DNNs) can be used to transform the raw data into more cluster-friendly representation through high-level non-linear mapping [24]. Due to the advancement of deep learning [25] including autoencoder (AE) algorithm and its deep version (DAE), deep embedding clustering (DEC) [26], followed by other novel methods [27,28] are proposed. Which use deep representation to improve clustering performance. Autoencoders [21,29] are considered an unsupervised learning technique and a specific form of feedforward neural networks where the input is the same as the output. They compress the input into a lower-dimensional code and then reconstruct the output from this representation. The code is a compact "compression" of the input, also called the latent-space representation, the training process is still based on the optimization of a reconstruction error or loss function. An autoencoder consists of three components: encoder, code and decoder. The encoder compresses the input and produces the code, the decoder then reconstructs the input only using this code as shown in Fig. 1. An autoencoder can learn a representation or encodes the input features for the purpose of dimensionality reduction. If linear activations or a single hidden layer of sigmoid are used, then the ideal solution for an autoencoder is heavily linked to Principal Component Analysis (PCA). Autoencoders can learn data projections with suitable dimensionality and sparsity limitations that are more useful than other fundamental methods such as PCA which only allow linear data transformation. on the other hand, autoencoders are non-linear and can learn more complicated relations between visible and hidden units. They can also be stacked, which makes them even more powerful. A number of works have recently covered clustering approaches that combine classical clustering algorithms with techniques of deep learning. These methods either follow a sequential approach where a deep representation is learned using a deep autoencoder [7,[30][31][32][33][34] before clusters are obtained using clustering method (e.g. k-means). or a simultaneous approach which use deep neural networks to simultaneously learn features representations and cluster assignments by optimizing a single objective function [35,36]. Both approaches enhance efficiency in clustering. However, when dealing with real-world data, current clustering algorithms based on deep autoencoders have major drawbacks that restrict their reliability and ease of use, such as, initialization of the weights that is initializing the weights with random values clearly adds randomness to the obtained results [37] and the network architecture (or structure) (i.e. number and width of layers) which forces the network to obtain a different representation of the data while keeping important information. However, most of the recent researches on deep clustering [35,36,[38][39][40] proposed a different structure for each studied dataset and the clustering efficiency of the techniques proposed strongly depends on a specific DAE structure.

The contributions of our work and structure
The goal of this paper is to perform feature selection and high dimensional data reduction with different techniques to examine their performance in telecom customer segmentation as case study.
More in detail, three scenarios will be compared, which are the data set without performing any feature reduction, the data set preprocessed by feature reduction using PCA only, and the data set preprocessed by data reduction using Autoencoders to perform representation learning which enhances the customer segmentation model in terms of both accuracy and running time or mitigates constraints evaluated on high dimensional data. In this case study, we use sequential clustering approach starting with applying data reduction using a fully convolutional autoencoder to learn clustering-friendly representations of the data and it is trained with the standard mean squared error then perform clustering on the represented dataset then comparing the results with standard clustering after reducing the dimensionality linearly with PCA. The rest of this paper is structured as follows: the related work is presented in "Related work" section. "Research methodology" section we review our dataset for which no classical methods are well adapted and suggest our research methodology including the data transformation, correlation analysis and the use of Deep Clustering algorithm along with its results. In "Results and discussion" section, we discuss the evaluation of the suggested techniques and compare the efficiency of sequential and simultaneous clustering methods in our case, whereas the paper is concluded in "Conclusion and future work" section along with shedding light on future perspectives.

Related work
Companies using customer segmentation technique because each client is distinct and their marketing efforts would be better by building an efficient marketing and business strategies. Companies also hope to obtain a deeper understanding of the preferences and needs of their customers, with the concept of finding out what each segment finds most useful to tailor marketing products to that segment more correctly [41]. More recently, many techniques of customer segmentation in telecom sector have been proposed because of the increased amounts of data passing through various communication channels that make old techniques and methods no longer applicable. However, several researches in high dimensional data spaces have been studied. We highlight these works in this section briefly and review the clustering of high-dimensional data.
A Vodafone Case Study by Jansen proposed a customer segmentation and profiling model based on call usage behavior [42]. Twelve features are used as a summary description of customers and to build customer segments based on their incoming and outgoing calls over the time period P such as (average call duration, average calls received per day, average calls received per day, daytime calls… etc.). Each segment describes a certain behavior of customer's group. A customer's segment was estimated based on the customer's profile using the data mining technique called Support Vector Machines. Two solutions have resulted from customer segmentation, the first is four segments and the second is six segments. With the Support Vector Machine approach, 80.3% of the cases classified the segment of a customer correctly, based on its profile for the situation with four segments. With six segments, the correct classification of 78.5% is obtained. Aheleroff et al. [43] used data set containing Call Detail Record (CDRs) of a mobile operator, located in the Middle East that has about 35 million mobile subscribers to categorize the subscribers into four loyal groups based on the level of customer's loyalty during their life cycle. The usage patterns are obtained from detailed call/event records based on mobile services such as GPRS/MMS-SMS, including main usage information such as (incoming call number, outgoing call number, call time and date, call duration, call type, location…) which selected as the core items to monitor customer's behavior. Then they Applied k-means as a well-known segmentation algorithm. The suggested customer life cycle model considered the past contribution, potential value, and churn probability at the same time. After identification of subscriber's behavior and loyal group, the output customers' clusters are used to make an applicable strategic plan for each group to improve customer satisfaction. While Masood, Salar, et al. [44] used a real customer data of a telecommunications company of Pakistan to apply a two-step clustering algorithm based on the call usage, revenue and recharge analysis, they made five revenue segments, that each one was further segmented based on user's call usage data. The customers were segmented based on their revenue attributes such as Final Revenue, Calls Revenue, SMS Revenue and Value Added Services (VAS) Revenue. The number of clusters was manually selected as five, because the value for silhouette measure of clusters (which indicates the quality of clustering result as poor, fair or good) found to be more than 0.5 which shows that good clustering has been done by the algorithm. The two-step clustering algorithm was applied, first step all the cases are assigned to pre-clusters, During the second step, these pre-clusters are handled as individual cases and are further clustered using the hierarchical clustering algorithm, the results were thoroughly analyzed. It was shown that VAS usage was greater than the SMS usage for all customers' segments. Every cluster was analyzed to uncover its call as well as SMS usage behavior. Most of the previous studies built their models based on a few number of features. Classical approaches for reducing dimensionality, such as Principal Component Analysis (PCA), were widely considered for the task of embedding. The linear nature of such methods, however, makes it difficult to infer true depictions of real-world data, typically based on extremely non-linear manifolds. This motivates research into profound learning models (e.g. auto-encoders, convolutional neural networks) that have been shown to be effective in extracting highly non-linear characteristics from complex data such as text, images, or graphs [21,22,29]. Deep autoencoders (DAE) have been shown to be helpful in reducing dimensionality [21] and denotation images. The autoencoders (AE) in specific can represent data into a new space in a non-linear manner. Related to clustering, the general idea is to represent the data into a low dimensional space by encoding only the most relevant information that characterizes the original data, reducing noise and sparsity, and then perform clustering in this new space. Recently, several notable works have mixed embedding learning and clustering. The suggested techniques usually perform both clustering and deep embedding in two different ways. First, some works suggested a sequential combination of deep embedding and clustering. In [30] the researchers use a stacked autoencoder to learn a representation of the affinity graph and then apply K-means to get the clusters on the learned representations. More lately, [45] suggest to include an autoencoder within the context of Deep Embedded Clustering (DEC) [26]. Then, together with local structure conservation, the suggested framework can perform clustering and learn representative characteristics. Additionally, a new technique to perform non-linear reconstruction has been suggested in [46] to adopt deep neural networks for representation-based community detection. Several researches recommend performing both the embedding process and the clustering task jointly in order to allow clustering to control the features extraction and vice versa. As an example in [36] a general framework, known as DeepCluster was proposed, to combine traditional clustering methods [K-means and Gaussian Mixture Model (GMM)] into deep learning models and adopt Alternating Direction of Multiplier Method to optimize it, the Performance was compared on the three real datasets (MNIST, USPS and Reuters10K), The highest values of all metrics on the datasets were belonged to the proposed DeepCluster framework.

Research methodology
This section describes the data set used in this work including features extraction as well as steps of clustering high dimensional data process presented in Fig. 2.

Input data and preparation
Our segmentation model inputs are based on mobile users' actions or events over Syrian society. Four main sources of data are available in more detail: Call Detail Records (CDR), Database of Cell Towers, Customer's services and Customers' contract information.

Features extraction
The accuracy of segmentation relies mainly on the selection of features and the technique of segmentation. For customer segmentation problem, the features mainly come from the data gathered in the enterprise database, after being extracted from various resources. However, feature extraction [47] is the process of transforming the input data into a set of features which can very well represent the input data. The feature extraction process is done depending on our research goal using in memoryprocessing tools for processing and analytical purposes. The extracted features are categorized as follows: (some features may belong to more than one category): 1. The behavioral features which could be classified into individual, spatial, and social network features as follows: • Individual features describe the mobile phone usage and interactions of an individual with contacts. (e.g. average call duration per day, standard deviation of received SMS per day at worktime, the response rate of the user). Individual behavioral features are extracted from Call Detail Records (CDRs) generated by the Syrian Telecom cellular network. CDRs log the users' activities for billing purposes and network management. • Spatial features describe the mobility patterns of an individual during the usage of mobile phone (e.g. entropy of visited antennas, average of mobility, daytime antenna entropy). Spatial and mobility features are extracted based on the database of cell towers and their locations as Table 1. • Social network features describe the social network of an individual and compare their behavior with that of their peers. (e.g. clustering coefficient, assortativity).
The extracted behavioral features are about (200) features related to the previous three categories and the rest of features are extracted from another sources. Table 2 shows a sample of extracted behavioral features.

2.
Customer's subscribed services All services that the client enrolled in have been gathered and categorized manually on the basis of the type of service, such as services related to s bundles subscription, etc…, these categories are handled as customer features. As a result we got Table 3 of customer's service. 3 Customer's demographic data reflect some basic states of customer. Such as customer's name, address, age, sex and details of client subscriptions as number (two or more GSMs) and types of subscriptions: pre-paid, post-paid, 3G, 4G… are extracted from the customers' contract data and considered as additional features. We build and train the model using records related to 100,000 subscribers within about 6 months, with over 220 different features derived from CDRs, subscription information and registered services for each customer in our case study. Our goal is to take benefit of all features by reducing the dimensions, in order to filter out irrelevant features and noisy data samples.

The experimental process
Data extracted from the source location is often not usable in its original form. To overcome this obstacle, raw data must be transformed into a useful and efficient format, in our case study data is cleansed first before it can be transformed as follows: this work includes different stages (data exploration, features extraction and selection, and model validation), however different machine learning methodologies is used: • Data cleansing, Data is cleansed through processes such as filling in or removing missing values, smoothing the noisy data, or resolving the inconsistencies in the data. • Data transformation, standardization is the key pre-processing step in data mining to standardize features or attributes values from different dynamic ranges to a specific range. Applying standardization step leads to obtain better quality, efficient and accurate clustering results. It is also important to select a specific standardization procedure, according to the nature of the datasets for the analysis [48]. In our case, the result obtained by the z-score standardization method where it is more efficient than other standardization methods. Min-Max scaling is typically done via the following formula: In Z score normalization, we perform the following formula:  where is the mean of the population, is the standard deviation of the population. Min-Max scales the values closer to mean. But when data outliers are important and we don't want to lose their impact, we attempt to scale with Z score normalization.
• Dimensionality reduction, dimensionality is the number of characteristics, variables, or features that describe the dataset. These dimensions are represented as columns and the goal is to reduce their number. In most cases, those columns are correlated and therefore, there is some information that is redundant which increases the dataset's noise. This redundant data has a negative impact on the results of training our model and therefore it's important to use dimensionality reduction techniques such as Feature Selection and Feature Extraction as a very useful way to reduce the model's complexity and avoid overfitting. In our case study, we applied three different scenarios which lead to different clustering models, shown in Fig. 3, described as follows: 1. The baseline this process is to perform clustering on the original data set without performing any data reduction method. Therefore, the comparison between the results allows us to understand whether the reduced data sets can make the clustering accuracy better than the result without performing data reduction.

(PCA-based) customer segmentation
First, the PCA transformation is applied to the original data. Data is transformed into new features as a combination of the original features in a new space, compressed in a way that the most relevant information is retained. Then, k-means clustering algorithm was applied on the transformed dataset. However, applying PCA before clustering is a powerful method for clustering high dimensional data in the presence of the linear correlations in features. So, we would like to know which feature reduction method can enhance the clustering accuracy.
Step1: PCA implementation Using unlabeled training dataset, we actually have a data matrix X of 100,000 × 220. The matrix representation of the information not only makes the variables difficult to visualize and interpret but also computationally taxable when perform clustering. Therefore, we apply PCA to reduce the dimensionality of the data by finding projection direction(s) that minimizes the squared errors in reconstructing the original data or equivalently maximizes the variance of the projected data. Its performance depends on the distribution of the data set and the correlation of features. PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by some projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. The first step of PCA is to standardize the dataset matrix. That's important because features or columns with larger scales compared to other columns will ultimately dominate the final principal component matrix, i.e. attributing most of the explained variance to these features. This step is performed by Standard Scaler which subtracting each column mean from each column and dividing by their respective standard deviation. However, standardization has other advantages, such as faster learning for neural networks, which will later become important when autoencoder is implemented. The second step is to calculate the covariance matrix of the standardized matrix. The covariance between the variables can be found by performing matrix multiplication between the standardized matrix and its transposition, the result matrix should be a 220 × 220 symmetric matrix. where each diagonal element is the variance of each feature and every non-diagonal element is a particular co-variance between two features. Therefore, the covariance matrix, will have the following structure: The third step is to discover the covariance matrix's eigenvectors using eigenvalue decomposition to get a matrix whose columns are Cov-Matrix's eigenvectors. The eigenvectors matrix has columns that representing the principal components (the new dimensions), each component is orthogonal to each other and arranged in decreasing order of explained variance. The last step is to use the transformation matrix (eigenvectors matrix), by multiplying it with the scaled data matrix to compute the principal component scores in order to find the optimal number of components which capture the greatest amount of variance in the data. In our analysis, the applied method to determine the number of principal components is to look at a Scree Plot, which is the plot of the principal components ordered from largest to the smallest. The number of components is determined at the variance drop-off point where the gain in explained variances first decreased significantly, beyond which the remaining eigenvalues are all relatively small and of comparable size [49]. As shown in Fig. 4, a Scree plot with bars displaying the explained variance as a percentage of total variance by the first 20 PCs such that 90% of the variance is retained and showing variance drop-off after the third component account for roughly 28% of the total explained variance. We found that, the first three principal components explain 72% of the variation. This is an acceptably large percentage. Assume that the transformation matrix as TM-L, when reducing the dimensionality of our dataset, where L is the number of principal components such that L < 220 . We can then compress the data using TM-L, where TM-L still has the same number of rows but only L columns, leading to a reduced dataset on which the clustering Algorithm is applied in an unsupervised manner, since we don't have labels, but it's important to reveal the contributions of each feature on the components and then proceed to the last step in the process.  (10,20,30,50) that maintain the largest percentage of variance in the data and give a total variance ratio greater than 80%, we evaluated the projection loss by calculating the MSE loss function which (expresses the difference between the reconstructed data from the projected data and the original data), obtained in Table 4.

Cov-Matrix
Step 2: Reveal the most important variables/features after a PCA analysis, The important features are those that influence the Principle components more, and therefore have a high absolute value score on them. As a way to get how components are correlated with the original features we retrieve the indexes of the most important features on each component, the large value of the contribution means that the variable contributes more to the component and it is crucial in explaining the variety in the data set. Variables that don't correlate with any component or correlated with the last dimensions are variables with low contribution and might be eliminated to simplify the overall analysis. The importance of each feature is reflected by the magnitude of the corresponding values in the eigenvectors. First, we get the amount of variance that each PC explain using (pca explained variance ratio) then we find the most important features. To sum up, we look at the absolute values of the Eigenvectors' components corresponding to the k largest Eigenvalues. In sklearn the components are sorted by explained variance. The larger they are these absolute values, the more a specific feature contributes to that principal component. Table 5 describes the variables according to their contributions to the principal components expressed in percentage. So, for each principal component we can see which variables contribute most to that component as Table 6.
Step 3: Find the Clusters, K-Means clustering algorithm is applied in this step to cluster the reduced dataset represented by top 20 PCs as new feature space.  First, we applied the elbow method to determine the optimal number of clusters for k-means clustering.. for PCA, the k-means scree plot below in Fig. 5 The Elbow Method is one of the most popular methods to determine this optimal value of k. The idea of the elbow method is to run k-means clustering on the dataset for a range of values of k (say, k from 1 to 15 in our case), and for each value of k calculate the sum of squared errors (SSE). Then, plot a line chart of the SSE for each value of k. If the line chart looks like an arm, then the "elbow" on the arm is the value of k that is the best. The idea is that we want a small SSE, but that the SSE tends to decrease toward 0 as we increase k (the SSE is 0 when k is equal to the number of data points in the dataset, because then each data point is its own cluster, and there is no error between it and the center of its cluster). So our goal is to choose a small value of k that still has a low SSE, and the elbow usually represents where we start to have diminishing returns by increasing k. Figure 5 shows An elbow chart showing the SSE after running k-means clustering for k going from 1 to 15. We see a pretty clear elbow at k = 4, indicating that 4 is the best number of clusters. 3 (AutoEncoder-based) Customer Segmentation In addition to examine the model performance by linear transformation using PCA, dimensionality reduction based on Artificial Neural Network (ANN) is also considered. The result can be used to compare the performance of the autoencoder and PCA for the data reduction tasks. Autoencoders are generally a type of artificial neural network used for learning a representation or encoding of a set of unlabeled data as a first step towards reducing dimensionality by compressing (encoding) the input data matrix to a reduced set of dimensions and then reconstructing (decoding) the compressed data back to their original form. then the compressed data may be used in different applications such as clustering as our case. An autoencoder can learn nonlinear transformations using a nonlinear activation function and multiple hidden layers by taking an input layer (data matrix X), which is passed through hidden layers where their dimensions are lower than the input layer dimension to perform compression. The network aims to reconstruct the original data as close as possible to the original with minimum reconstruction error. Autoencoder training algorithm may be summarized in the following steps, for each input x:    Step 1: Autoencoder Implementation For our experiments, we trained fully connected autoencoder as a symmetric model that is symmetrical about how the input data is compressed and decompressed by exact opposite manners.
Before start training an autoencoder, it is essential to set our hyperparameters set including the model design components and training variables such as: We configure the Autoencoders using the Keras tensorflow with GPU support Python package, and compile the neural network with mean square error loss and Adam opti- mizer [50] with 0.001 learning rate and the default Keras parameters. We also set the number of training epochs to 200, While we set the batch size for the training cycle to 512.
The size of our input is set to 220 neurons, which corresponds to our dataset's number of features. We tried different hidden layer configurations as shown in Table 7; in order to minimize the reconstruction error computed by mean square error. All encoder/decoder pairs used rectified liner units (ReLU) activation function given in the following formula: Following the training of the autoencoder while reconstructing the original data by encoding and decoding samples from the validation set, allows us to monitor training by TensorBoard web interface by plotting the Reconstruction Error (MSE) or loss function as seen in Fig. 6.
Step 2: Clustering the Encoded dataset By training the autoencoder, we have its encoder part learned to compress each sample into 20 floating point values. We use k-means clustering Algorithm to perform clustering. To do this, we first fit the Encoded data to K-Means clustering algorithm and then determine the optimal number of clusters using Elbow method by measuring the sum of square distances to the nearest cluster center as Fig. 7.

Results and discussion
In this Chapter we introduce the obtained Clustering results for the reduced dataset with PCA transformation and Autoencoder Neural Network. After dimensionality reduction we performed K-Means clustering on the data with reduced set of features.

Clustering evaluation metrics
Different measures were performed for the final evaluation of the clustering algorithm performance, which can be categorized into 2 main types [51]: • External cluster validation: comparing the results we got from cluster analysis to an externally known result, such as externally provided class labels. Since external validation measures know the right cluster number in advance, they are mainly used for choosing an optimal clustering algorithm on a certain data set [52]. • Internal cluster validation: which exploit the internal information of the clustering process to evaluate the performance of a clustering structure without depending on external information. also It can be used to estimate the appropriate and the better clustering algorithm without any external data [52].
In our research, we deal with unlabeled data so we review some internal measures that we can deploy on clustering algorithms to evaluate the quality of clustering results such as: • Silhouette Coefficient The silhouette analysis measures how well the features are clustered and it estimates the average distance between clusters by measuring the similarity of an object is to its own cluster compared to other clusters. The values of the silhouette are ranged between − 1 and + 1, where the higher value of silhouette Fig. 7 Optimal number of clusters for Encoded data point out that the object is well matched to its own cluster and poorly matched to neighboring clusters and vice versa. The silhouette can be calculated with different distance metric, like the 1 Euclidean distance or the Manhattan distance. The silhouette analysis is calculated as below: • DB index The Davies-Bouldin index (DBI), an internal evaluation metric for evaluating clustering algorithms performance based on the average measure of similarity of each cluster with its most similar cluster where similarity is the ratio within-cluster distances to between-cluster distances. Lower the value of the DB index, clustering is better. It can be calculated for k number of clusters by the formula where Ci is the centroid of cluster i, is the average distance of all points in the cluster i to its centroid, d(Ci,Cj) is the distance between centroid Ci and Cj. To obtain results we empirically set different parameters for our model, and evaluate its influence on our result. The tested Autoencoder network Architectures for our dataset has minimized the reconstruction error with 200 epochs and a batch size of 512 using Adam optimizer. The PCA algorithm has saved 90% of the variance with just 20 features which is a good reduction from our original space. We Applied K-Means clustering experiments with data set reduced to a different number of dimensions 20,30,50 features and also evaluated performance without dimensionality reduction to see how this parameter impacts the clustering task besides the reduction method. The mentioned metrics were used in this research to evaluate clustering on both original and the reduced data set. Figure 8 shows the average Silhouette score results for the reduced data set with PCA transformation and Autoencoder Neural Network Applied on 20, 30, 50 dimensions. Also, Fig. 9 shows the Davies-Bouldin scores obtained after applying K-Means Clustering on Reduced data to 20, 30, 50 dimensions using both of PCA and Autoencoder.
There are several more ways to reduce the dimensionality, using PCA and autoencoders could be compared for a deeper understanding of their differences. Looking at the details of each approach's loss function and how it evolves over time can provide extra information in the superiority of the neural network that generalizes better with so little training. As a result, Figs. 8 and 9 demonstrated that the best evaluation scores for this type of data was obtained by using autoencoder neural network for dimensionality reduction and K-Mean Clustering Algorithm, Silhouette score reached to 0.682 with 3 clusters and 0.571 with 5 clusters while the score obtained on the original data with 220 dimensions was 0.581 while the reduced dataset with PCA transformation achieved best 0.476 Silhouette score with 2 clusters. The experimental results indicated that our method is effective in the customer segmentation task and gave an idea why there was a need to move from PCA to autoencoders. Presenting the case for non-linear functions as compared to linear transformations as in PCA.
Given the simplicity of the autoencoder model used here, autoencoder significantly improved both of silhouette and DBI scores in our case study. What is remarkable also is that the best Average silhouette scores is related to the small numbers of clusters like 2 or 3 as shown in Fig. 10 that's an indication to clustering algorithms works high performance whereas it produces more meaningful results with encoded data Fig. 10 Average Silhouette Score for different Clusters because the power of autoencoder to extract meaningful data,therefore most customers have similar behaviors and belong to 2 clusters.

Conclusion and future work
The process of dimensionality reduction was performed using both PCA decomposition and Autoencoder Neural Network built with Keras TensorFlow model to perform clustering analysis with unlabeled datasets. k-Means Clustering algorithm was applied and evaluated on data in original and reduced space of features. Finally, we measured the effects of dimensionality reduction and compared two approaches, PCA and Autoencoder. The Clustering Performance was evaluated using Internal indices such as Silhouette index and Davies-Bouldin index. The best results for this type of data were obtained by autoencoder neural network approach which played a significant role in the dimensional reduction, and K-Mean Clustering Algorithm, where the clustering performance was enhanced with reduced dimensions. For the original data set of 220 features we were able to reduce its dimension to 20 features and obtain results very close or better than original dataset clustering evaluation results. Finally, while a neural network-style encoder performed better with clustering customers in our case study, it may not always be the case that an autoencoder can outperform a PCA representation of the data. If a linear map can sufficiently explain the variance in the data without significant loss of detail, then a neural network may be overkill when a simpler model like PCA can be used effectively.
It's known that PCA and Autoencoders share architectural similarities. But despite this, an Autoencoder by itself does not have PCA properties, e.g. orthogonality. So the incorporating of the PCA properties will bring significant benefits to an Autoencoder, such as resolving vanishing and exploding gradient, and overfitting via regularization. Based on this, properties that we would like Autoencoders to inherit are (Tied weights, Orthogonal weights, Uncorrelated features, and Unit Norm). As a future work, we can Tune and Optimize Autoencoder using PCA principles by building custom constraints for our Autoencoder for tuning and optimization.