Distributed fuzzy clustering algorithm for mixed-mode data in Apache SPARK

Fuzzy clustering is an invaluable data mining technique that allows each data point to belong to more than one cluster with some degree of membership. It is widely employed in exploratory data mining to discover overlapping communities in social networks, find structure in spectral data, and capture user interests in recommendation systems. The variety and volume of data are increasing at a tremendous rate, and massive data, combined with an effective technique, can reveal valuable information. The existing fuzzy clustering algorithms do not perform well on massive heterogeneous datasets, and processing an enormous amount of data is beyond the capacity of a single processor. There is therefore a need for fuzzy clustering techniques that run on a distributed framework for Big Data processing and can handle heterogeneous data. In this research, we evaluate the performance of the recently proposed algorithm for the fuzzy clustering of mixed-mode data, FCMD-MD (D’Urso and Massari in Inf Sci 505:513–534, 2019), on different real-world datasets. We develop a distributed FCMD-MD, a fuzzy clustering algorithm for mixed-mode data in Apache Spark. The experimental results show that the algorithm is scalable, performs well in a distributed environment, and clusters enormous heterogeneous data with high accuracy. We also compare the performance of the distributed FCMD-MD with that of the distributed k-medoid algorithm.


Introduction
Clustering is one of the most widely used techniques in exploratory data mining for discovering groups of objects with similar behavior or traits. It is extensively used in data preprocessing, customer segmentation, data partitioning, outlier detection, and data analysis. It helps to learn useful information and extract interesting patterns from real-world data. Clustering is roughly divided into two major categories: hard and soft clustering. In hard clustering, each object belongs to only one cluster at a time, while in soft clustering, an object can belong to more than one cluster simultaneously. Soft clustering, also known as fuzzy clustering, is very useful and has widespread applications. Consider a movie recommendation system where we want to cluster users based on their interests. Here, fuzzy clustering is a better choice because users may be interested in more than one genre and can become frustrated if only one type of content is recommended. There are several other areas where fuzzy clustering is a more suitable way of clustering data.
Numerous clustering algorithms are designed to cater to different data types and distributions. However, the existing traditional clustering algorithms are incapable of dealing with the ever-changing demands and dynamics of Big Data. To extract value from massive data, a clustering technique needs to deal effectively with both the volume and the variety of Big Data. Real-world datasets are usually mixed-mode; they consist of different feature types such as continuous, categorical, textual, spatial, time series, and geometric. However, the most commonly used clustering algorithms (K-Means, K-Medoid, Gaussian Mixture Models, and DBSCAN) work with only one type of feature (continuous or categorical). K-means works with numeric features, while K-Modes (an extension of k-means) works with categorical data. To make these algorithms work with multiple types, data analysts transform all features into a single data type recognized by the algorithm using dummy encoding schemes [1], but this increases the data dimension and hence the computation cost. Moreover, it also results in the loss of valuable information.
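The dimensionality cost of dummy encoding can be illustrated with a minimal sketch. The function `one_hot_encode` and the toy records below are hypothetical and are not taken from [1]; they only show how each categorical column expands into one indicator column per distinct value.

```python
def one_hot_encode(rows, categorical_columns):
    """Replace each categorical column with one 0/1 column per distinct value."""
    # Collect the sorted distinct values (levels) of every categorical column.
    levels = {
        col: sorted({row[col] for row in rows}) for col in categorical_columns
    }
    encoded = []
    for row in rows:
        # Keep the non-categorical columns unchanged.
        new_row = {k: v for k, v in row.items() if k not in categorical_columns}
        for col, values in levels.items():
            for value in values:
                new_row[f"{col}={value}"] = 1 if row[col] == value else 0
        encoded.append(new_row)
    return encoded


# Toy data (assumed for illustration): one continuous and two categorical features.
rows = [
    {"age": 31, "genre": "drama", "city": "Oslo"},
    {"age": 45, "genre": "comedy", "city": "Lima"},
    {"age": 27, "genre": "horror", "city": "Oslo"},
]
encoded = one_hot_encode(rows, ["genre", "city"])
original_width = len(rows[0])
encoded_width = len(encoded[0])
```

In this toy case three original features become six encoded features, and the growth is larger for categorical variables with many levels.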
Gower [2] pioneered the work on mixed data, and since then, many algorithms have been proposed in the literature to cluster mixed-mode data. KAMILA and K-prototypes are the two most commonly used algorithms, but they work only for two types of features (continuous and categorical) and form crisp clusters. Furthermore, they require the user to explicitly define weights for each type of feature. So there is a need for an algorithm that can effectively generate fuzzy clusters of mixed-mode data without explicitly defined weights. Recently, D’Urso and Massari [3] proposed FCMD-MD, a Fuzzy C-Medoids clustering model for mixed data. It uses the idea of the PAM (Partitioning Around Medoids) algorithm for fuzzy clustering of mixed-mode data. The algorithm learns the weights of different feature types by optimizing the objective function. Hence, there is no need to define weights for different types of variables. The algorithm achieves good results but cannot handle Big Data.
Clustering and extracting valuable information from large volumes of data is not an easy task. It raises many issues: storing and processing enormous data, extracting patterns, and detecting similarities between data objects. The computing power of a single system is not enough to process such an enormous amount of data; we must either increase computing power, use supercomputers, or shift to a more suitable technology. Stand-alone servers offer limited computing resources, and these resources cannot meet current requirements. We need distributed computing technology that works on commodity hardware and performs parallel processing on gigantic volumes of data. Apache Spark is a big data framework that was introduced to overcome the limitations of traditional distributed frameworks. It is much faster, scalable, programmer-friendly, and provides a unified platform. It also provides a scalable machine learning library known as MLlib to fulfill the needs of large enterprises and help scholars in different research areas. Numerous machine learning algorithms are included in MLlib, but only a few clustering algorithms, such as K-means and a few of its variants, are currently provided.
The contribution of this research work is multi-fold. First, we rigorously evaluate the performance of the fuzzy clustering techniques Fuzzy C-means, Fuzzy C-medoids, and FCMD-MD on different real-world mixed-mode datasets. We observe that the recently proposed FCMD-MD [3] algorithm outperforms the other techniques, but it is very time-consuming and cannot handle large datasets. Second, we propose a distributed FCMD-MD algorithm to perform fuzzy clustering of mixed-mode Big Data. Our algorithm is scalable and can effectively handle massive datasets as it is designed in the Apache Spark framework. We compare the time performance of our algorithm with the sequential FCMD-MD algorithm. We also conduct experiments to compare the performance of our distributed FCMD-MD with the distributed fuzzy k-medoid algorithm in the Apache Spark framework. Furthermore, we show that because the FCMD-MD algorithm is designed for fuzzy clustering, it does not replace the KAMILA algorithm developed for crisp clusters.
The organization of this paper is as follows: "Related work" section presents the related work, and "Apache Spark" section briefly discusses the details of the Spark framework. "Fuzzy clustering of huge heterogeneous data using Apache Spark" section describes the proposed algorithm in detail. The description of datasets and results of computational experiments are given in "Datasets" and "Computational experiments and analysis" sections, respectively. Finally, the last section is the conclusion.

Related work
The work on mixed data started as early as 1971 when Gower [2] introduced a dissimilarity measure for continuous and categorical variables. Gower distance is computed as an average of different partial dissimilarities across the individual features. After the Gower distance, many algorithms were proposed for the hard clustering of mixed data; however, not much work was done for the fuzzy clustering of mixed data.
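The averaging of partial dissimilarities can be sketched as follows. This is a simplified illustration that assumes range-normalized absolute differences for continuous features and a 0/1 mismatch for categorical features, with equal weight on every feature; the function name, feature names, and ranges are hypothetical.

```python
def gower_distance(a, b, kinds, ranges):
    """Average of per-feature partial dissimilarities, each in [0, 1]."""
    partial = []
    for name, kind in kinds.items():
        if kind == "continuous":
            # Range-normalized absolute difference.
            partial.append(abs(a[name] - b[name]) / ranges[name])
        else:
            # Simple mismatch for categorical features.
            partial.append(0.0 if a[name] == b[name] else 1.0)
    return sum(partial) / len(partial)


# Assumed toy schema: two continuous features and one categorical feature.
kinds = {"income": "continuous", "age": "continuous", "job": "categorical"}
ranges = {"income": 100.0, "age": 50.0}  # observed range of each continuous feature
p = {"income": 40.0, "age": 30.0, "job": "nurse"}
q = {"income": 90.0, "age": 30.0, "job": "clerk"}

d_pq = gower_distance(p, q, kinds, ranges)  # partials are 0.5, 0.0 and 1.0
d_pp = gower_distance(p, p, kinds, ranges)
```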
Huang [4] proposed a variant of the K-means algorithm called K-prototypes for clustering datasets with continuous and categorical features while maintaining the time efficiency of K-means. The algorithm uses Euclidean and Hamming distances for continuous and categorical variables, respectively. Furthermore, it employs decision tree induction algorithms to describe the clusters, improving their interpretability. Bertrand et al. [5] derived an algorithm for clustering medical data with multiple feature types using a Gaussian Mixture Model. Ahmad et al. [6] proposed a K-harmonic type algorithm for clustering mixed data which normalizes and discretizes numerical features in a pre-processing step. Foss et al. [7] proposed the KAMILA algorithm for clustering mixed data. It is considered the state-of-the-art algorithm for clustering data having continuous and categorical features. The algorithm is based on K-means and achieves clustering by equitably balancing the contribution of continuous and categorical features. Skabar [8] proposed an algorithm that uses graph-based random walk clustering of data with mixed attributes without explicitly defining any distance measure.
The work in the field of fuzzy clustering started in 1984 with the development of the Fuzzy C-means algorithm [9], which is a variant of the K-means algorithm and produces fuzzy clusters. The Fuzzy C-medoids algorithm followed naturally from Fuzzy C-means. In the traditional K-means algorithm, the cluster centers are the means of the given clusters. In contrast, in the K-medoids algorithm, the cluster centers are actual data points that have the least total distance to all the points in a given cluster. Both algorithms minimize the same kind of objective function. The significant difference lies in the selection of medoids, which makes the K-medoid algorithm less sensitive to outliers. Krishnapuram et al. [10] provided a fuzzy implementation of K-medoids. Wang et al. [11] proposed an algorithm that reduces the sensitivity of Fuzzy C-Means to the initial cluster selection by selecting initial clusters using entropy. The algorithm works well on arbitrary-shaped clusters. Bezdek et al. [12] gave a probabilistic implementation of Fuzzy C-Means, and Ulutagay et al. [13] proposed a density-based fuzzy clustering algorithm from the DBSCAN family.
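The difference in outlier sensitivity between a mean and a medoid can be seen in a minimal one-dimensional sketch. The data and function names are illustrative assumptions, not part of any cited algorithm.

```python
def mean(points):
    """Arithmetic mean, the cluster center used by K-means."""
    return sum(points) / len(points)


def medoid(points):
    """Data point with the smallest total distance to all points in the cluster."""
    return min(points, key=lambda c: sum(abs(c - p) for p in points))


# One-dimensional toy cluster with a single outlier (100.0).
cluster = [1.0, 2.0, 3.0, 4.0, 100.0]

center_mean = mean(cluster)      # pulled towards the outlier
center_medoid = medoid(cluster)  # remains an actual, central data point
```

Here the mean is 22.0, far from every regular point, while the medoid is the data point 3.0.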
Most of the work conducted in the area of fuzzy clustering is for one type of attribute. The different algorithms proposed in the literature are tailored for one particular feature type and are incapable of handling real-world datasets with different features [14]. Nguyen et al. [15] proposed an algorithm for the fuzzy partitioning of categorical data. Wang's incremental fuzzy algorithm [16] handles only time-series data, while the algorithm of D’Urso et al. [14] forms fuzzy clusters of spatial and temporal data. Few researchers have attempted to handle mixed-mode data clustering. Doring et al. [17] proposed a fuzzy clustering approach based on a probabilistic distance measure that uses mixture models to cluster data having both continuous and categorical attributes. D’Urso and Massari [3] developed an algorithm based on the C-medoids clustering model for finding soft clusters in mixed data. The algorithm learns the weights of different feature types by optimizing the objective function.
Not much work is done on fuzzy clustering of massive mixed data using the latest distributed platforms like Spark. Jha et al. [18] proposed an Apache Spark-based fuzzy clustering algorithm that utilizes kernel Radial Basis Functions (RBF) to discover clusters in high-dimensional genomics data.

Apache Spark
Apache Spark is an open-source big-data processing framework [19]. We selected Spark because it can be 10 to 100 times faster than Hadoop. It reuses the best features of Hadoop, such as HDFS for distributing data across worker nodes, and eliminates its shortcomings. In an iterative task, Hadoop reads from and writes to disk in each iteration, while Spark's RDD (Resilient Distributed Dataset) enables it to perform multiple iterative tasks in memory. Apache Spark provides high-level functionality, unlike Hadoop, where a user must understand the low-level details of the system. Spark supports multiple languages and different libraries like MLlib and GraphX for machine learning and graph manipulation tasks.

Spark Core and data structures
Spark Core is the main engine of Apache Spark. It is responsible for cluster management, job scheduling, input-output operations, and fault recovery.
The RDD is the fundamental data structure in Apache Spark. It can be viewed as a distributed set of elements that makes parallel computation on data possible. An RDD is immutable, which means that the data inside it cannot be altered. RDDs provide fault tolerance using a lineage graph: an RDD records how it was created, so if it is lost, it can be recreated. Two types of operation are performed on RDDs: transformations and actions. A transformation is an operation that returns a new RDD derived from the existing one, while an action is an operation that performs a computation on the existing RDD and returns a result. Spark performs lazy computations; an RDD is not actually transformed until an action is called on it. The lazy approach lets Spark decide the best possible way to execute an action.
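The interplay of immutability, lineage, and lazy evaluation can be illustrated with a toy, single-machine stand-in. The class `ToyRDD` below is a hypothetical teaching aid written in plain Python; it is not Spark's API and does not distribute any data.

```python
class ToyRDD:
    """Immutable, lazily evaluated collection that remembers its lineage."""

    def __init__(self, source, lineage=()):
        self._source = tuple(source)    # original data, never modified
        self._lineage = tuple(lineage)  # recorded transformations

    def map(self, fn):
        # Transformation: returns a new ToyRDD and computes nothing yet.
        return ToyRDD(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        # Transformation: also only extends the lineage.
        return ToyRDD(self._source, self._lineage + (("filter", pred),))

    def collect(self):
        # Action: replay the lineage over the source data and return the result.
        data = list(self._source)
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data


calls = []


def square(x):
    calls.append(x)  # record every real invocation
    return x * x


base = ToyRDD([1, 2, 3, 4])
squares = base.map(square).filter(lambda x: x > 4)
executed_before_action = len(calls)  # still zero: nothing has run yet
result = squares.collect()           # the action triggers the computation
```

Because the lineage is stored rather than the transformed data, the result can be recomputed from the source at any time, which mirrors how a lost RDD is rebuilt.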
A DataFrame in Apache Spark is a distributed collection of data that is organized into columns. It serves the same purpose as Python's pandas DataFrames and R data frames. DataFrames support different SQL-like operations: aggregate, group, filter, and join. An existing RDD or an external table can easily be converted to a Spark DataFrame.

Fuzzy clustering of huge heterogeneous data using Apache Spark
This section presents a distributed fuzzy clustering algorithm to efficiently cluster large heterogeneous datasets using the distributed framework Apache Spark. Our algorithm is inspired by the FCMD-MD model recently proposed by D’Urso and Massari [3]. We briefly discuss the FCMD-MD model before presenting our Spark-based distributed FCMD-MD algorithm.

Fuzzy clustering model for mixed-mode data (FCMD-MD)
This technique is based on the K-medoids approach and creates a fuzzy partition of data with mixed features. The algorithm works by optimizing a cost function that is the weighted sum of the dissimilarities of each feature type. Figure 1 gives an example of the working of the FCMD-MD algorithm for data consisting of five different feature types. The figure presents the underlying idea of the FCMD-MD model: different distance measures and weights are applied to each feature type to compute dissimilarities and create fuzzy partitions. Algorithm 1 shows the complete pseudocode of the FCMD-MD algorithm as proposed in [3].
Let $X_i = \{X_1, \ldots, X_S\}$ be a data point which is a combination of $S$ types of features. Each $X_s$ is a set consisting of one or more variables of a particular type. For example, let us assume that we have two types of variables, hence $S = 2$ and $X_i = \{X_1, X_2\}$. Further, assume that $X_1$ is a set of two quantitative variables, $X_1 = \{X_{11}, X_{12}\}$, and $X_2$ is a set of three categorical variables, $X_2 = \{X_{21}, X_{22}, X_{23}\}$. Depending on the nature of the variables, $X_s$ can be a vector or can have a more complex structure. For example, if it is a combination of continuous variables, then it is a vector, and if $X_s$ represents time series data, then it can be a matrix.
The distance between two data points $i$ and $j$ based on the $s$th feature type is denoted ${}_s d_{ij}$. It can be the Euclidean distance in the case of continuous variables and a simple matching coefficient measure in the case of categorical variables.
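A minimal sketch of per-type distances is given below. It assumes two feature types only, and it assumes that the categorical dissimilarity is the share of mismatching attributes (one minus the simple matching coefficient); the function names and the toy points are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two tuples of continuous values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def simple_matching_distance(x, y):
    """Share of categorical attributes on which x and y disagree."""
    return sum(a != b for a, b in zip(x, y)) / len(x)


# One distance function per feature type (assumed mapping).
DISTANCES = {
    "continuous": euclidean,
    "categorical": simple_matching_distance,
}


def partial_distances(p, q):
    """Return the distance between p and q separately for each feature type."""
    return {kind: DISTANCES[kind](p[kind], q[kind]) for kind in p}


# Each point groups its variables by feature type, as in X_i = {X_1, X_2}.
point_i = {"continuous": (0.0, 3.0), "categorical": ("a", "b", "c")}
point_j = {"continuous": (4.0, 0.0), "categorical": ("a", "x", "c")}

d_ij = partial_distances(point_i, point_j)
```

Keeping one distance per feature type, rather than a single merged value, is what allows a separate weight to be attached to each type later.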

Objective function
The FCMD-MD algorithm minimizes the objective function defined in [3], subject to the constraint $\sum_{s=1}^{S} w_s = 1$. The expressions for $u_{ic}$ and $w_s$ are obtained from the Lagrangian of the objective function (for details, refer to [3]). The notation is as follows:
• $u_{ic}$ indicates the membership degree of the $i$th object in the $c$th cluster,
• $w_s$ indicates the weight associated with the $s$th feature type,
• $m$ represents the fuzziness parameter,
• ${}_s d_{ic}$ represents the distance between the $i$th observation and the $c$th cluster medoid according to the $s$th feature type,
• $d^2_{ic}$ is the overall weighted squared distance between the $i$th observation and the $c$th cluster medoid.
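The update steps can be sketched under explicit assumptions, since the exact expressions should be taken from [3]. The sketch assumes that the overall distance is $d^2_{ic} = \sum_s (w_s \cdot {}_s d_{ic})^2$ and that the objective is $\sum_i \sum_c u_{ic}^m d^2_{ic}$. Under these assumptions the Lagrangian gives $u_{ic} = 1 / \sum_{c'} (d^2_{ic} / d^2_{ic'})^{1/(m-1)}$ and $w_s = 1 / \sum_{s'} (A_s / A_{s'})$, where $A_s = \sum_i \sum_c u_{ic}^m \, {}_s d^2_{ic}$. All function names and the toy distances are hypothetical.

```python
def overall_sq_distance(partial, weights):
    """partial[s][i][c] -> d2[i][c] = sum_s (w_s * partial[s][i][c]) ** 2."""
    n, k = len(partial[0]), len(partial[0][0])
    return [
        [sum((w * d_s[i][c]) ** 2 for w, d_s in zip(weights, partial))
         for c in range(k)]
        for i in range(n)
    ]


def update_memberships(d2, m=2.0, eps=1e-12):
    """Fuzzy memberships from squared distances; each row sums to one."""
    u = []
    for row in d2:
        # Zero distances (the point is itself a medoid) are clamped to eps.
        inv = [(1.0 / max(d, eps)) ** (1.0 / (m - 1.0)) for d in row]
        total = sum(inv)
        u.append([v / total for v in inv])
    return u


def update_weights(partial, u, m=2.0, eps=1e-12):
    """Weight per feature type, inversely proportional to its weighted dispersion."""
    n, k = len(u), len(u[0])
    dispersion = [
        sum(u[i][c] ** m * d_s[i][c] ** 2 for i in range(n) for c in range(k))
        for d_s in partial
    ]
    inv = [1.0 / max(a, eps) for a in dispersion]
    total = sum(inv)
    return [v / total for v in inv]


# Toy distances to two medoids for three points and two feature types.
partial = [
    [[0.0, 2.0], [1.0, 1.0], [2.0, 0.0]],  # continuous distances
    [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]],  # categorical distances
]
weights = [0.5, 0.5]

d2 = overall_sq_distance(partial, weights)
u = update_memberships(d2)
new_weights = update_weights(partial, u)
```

In this toy run the categorical type has the smaller dispersion around the medoids, so it receives the larger weight.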

Distributed fuzzy clustering for mixed-mode data in Apache SPARK
In this section, we present the Spark-based distributed algorithm for fuzzy clustering of mixed-mode data. The input data is pre-processed, partitioned, and persisted in memory as an RDD, Spark's fundamental distributed data structure. Most of the computation is carried out on the persisted RDD to speed up the processing. Algorithm 4 shows the complete pseudocode of the distributed FCMD-MD implemented in Apache Spark. The dataset is pre-processed using Algorithm 2. Algorithm 2 creates a key-value RDD and performs different pre-processing tasks such as handling missing values, removing noise and outliers, and normalizing the continuous attributes. Furthermore, it arranges the features according to their type and creates a dictionary with the feature type as key and a list of features as value.
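Two of the pre-processing tasks, grouping features by type and normalizing continuous attributes, can be sketched on a single machine as follows. This is not Algorithm 2 itself: min-max scaling is an assumed choice of normalization, the feature names are hypothetical, and the list of (key, record) pairs only imitates a key-value RDD.

```python
def group_features_by_type(schema):
    """schema: feature name -> type. Returns type -> list of feature names."""
    groups = {}
    for name, kind in schema.items():
        groups.setdefault(kind, []).append(name)
    return groups


def min_max_normalize(records, columns):
    """Rescale each listed column to [0, 1]; constant columns map to 0.0."""
    out = [dict(r) for r in records]  # copy, so the input stays unchanged
    for col in columns:
        values = [r[col] for r in records]
        lo, hi = min(values), max(values)
        span = hi - lo
        for r in out:
            r[col] = (r[col] - lo) / span if span else 0.0
    return out


# Assumed toy schema and records.
schema = {"price": "continuous", "nights": "continuous", "room_type": "categorical"}
records = [
    {"price": 50.0, "nights": 1.0, "room_type": "private"},
    {"price": 150.0, "nights": 3.0, "room_type": "shared"},
    {"price": 100.0, "nights": 5.0, "room_type": "private"},
]

feature_groups = group_features_by_type(schema)
normalized = min_max_normalize(records, feature_groups["continuous"])
keyed = list(enumerate(normalized))  # (row id, record) pairs
```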
The initial set of medoids is selected randomly from the distributed dataRDD using the function "takeSample". The variables k, W, m, and medoids are broadcast across the cluster as all the executors require them for performing computations. The total cost for the current medoids is computed using Algorithm 3. Figure 2 shows one complete Spark job performed by the distributed version of FCMD-MD. Given the current medoids, the pre-processed data RDD is used to compute a distance RDD for each type of variable; these are later joined and transformed into a collective distance RDD. The collective distance RDD is joined with the membership matrix RDD and summed up to get the final cost for the current medoids. Depending on the final cost, the swapping of medoids is decided. Several such jobs run in one iteration of the algorithm, depending on the input data size and the number of clusters. The weights for each type of variable and the membership matrix are recomputed after each iteration.
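The cost computation can be sketched in plain Python, with dictionaries keyed by (point id, cluster id) standing in for the joined RDDs. This is a single-machine sketch, not Algorithm 3: it assumes the cost is $\sum_i \sum_c u_{ic}^m \sum_s (w_s \cdot {}_s d_{ic})^2$, and the names `total_cost` and `should_swap` are hypothetical.

```python
def total_cost(dist_by_type, memberships, weights, m=2.0):
    """Sum of u^m times the weighted squared distance over all (point, cluster) keys."""
    cost = 0.0
    for key, u in memberships.items():
        # "Join" the per-type distances on the shared key and combine them.
        d2 = sum(
            (weights[kind] * dists[key]) ** 2
            for kind, dists in dist_by_type.items()
        )
        cost += (u ** m) * d2
    return cost


def should_swap(current_cost, candidate_cost):
    """Accept a candidate medoid set only if it lowers the total cost."""
    return candidate_cost < current_cost


# Toy example: two points, two medoids, two feature types.
dist_by_type = {
    "continuous": {(0, 0): 0.0, (0, 1): 2.0, (1, 0): 2.0, (1, 1): 0.0},
    "categorical": {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 0.0},
}
memberships = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 0.5}
weights = {"continuous": 0.5, "categorical": 0.5}

cost = total_cost(dist_by_type, memberships, weights)
```

Only the key (1, 0) contributes here, giving 0.25 × (1.0 + 0.25) = 0.3125.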

Datasets
The datasets are obtained from the UCI Machine Learning Repository and Kaggle [20,21]. Table 1 shows the split of attribute types and classes in the different datasets. We consider datasets of various sizes as we wish to examine the performance of our distributed fuzzy clustering technique and compare it with the sequential FCMD-MD algorithm, which cannot handle large datasets.

Australian Credit
The Australian Credit dataset [22] concerns credit card applications; the original names and values of the attributes are replaced with meaningless values for confidentiality reasons. The dataset is interesting because it contains an almost equal proportion of continuous and categorical attributes.

Cylinder Bands
The Cylinder Bands dataset [23] contains information regarding delays in the printing process. The variables represent different parts of the printing process, such as paper size, ink color, ink type, and cylinder size. The class label describes whether banding occurred during the printing process or not. This dataset contains far fewer continuous features than categorical features. It also contains many missing values; hence it is preprocessed. A variable that contains more than eight missing values is dropped. If a variable contains one to eight missing values, the corresponding rows are removed from the dataset.
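The missing-value rule above can be sketched as follows. The sketch assumes that columns are dropped first and rows are filtered afterwards, and that missing values are represented by `None`; the function name and the toy rows are hypothetical and do not come from the Cylinder Bands data.

```python
MISSING = None


def clean_missing(rows, max_missing=8):
    """Drop columns with more than max_missing gaps, then rows with any remaining gap."""
    columns = list(rows[0].keys())
    missing_count = {c: sum(r[c] is MISSING for r in rows) for c in columns}
    # Step 1: keep only columns with at most max_missing missing values.
    kept = [c for c in columns if missing_count[c] <= max_missing]
    # Step 2: remove rows that still have a missing value in a kept column.
    cleaned = [
        {c: r[c] for c in kept}
        for r in rows
        if all(r[c] is not MISSING for c in kept)
    ]
    return cleaned, kept


# Toy data: "ink" is missing in 10 of 20 rows, "size" in 2 rows, "speed" in none.
rows = [
    {
        "speed": i,
        "size": MISSING if i < 2 else "A4",
        "ink": MISSING if i % 2 == 0 else "coated",
    }
    for i in range(20)
]

cleaned, kept_columns = clean_missing(rows)
```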

Online shoppers intention (OSI)
The OSI dataset [24] represents the user's purchase intention given a combination of mixed types of features; the label is a Boolean feature that represents whether the user checks out something or not. This dataset is used to compare the performance of the distributed FCMD-MD with its sequential version, as the previous two datasets are not large enough to show the effectiveness of the distributed FCMD-MD. The table shows the split of attribute types and classes.

AirBnB
The AirBnB dataset [25] is made publicly available by Airbnb. It contains the Airbnb rental listings in New York City for the year 2019. The dataset is a combination of numerical, categorical, and geometrical features and is preprocessed to drop some meaningless features.

Computational experiments and analysis
Rigorous computational experiments are conducted to identify the best technique for fuzzy clustering of a mixed-mode dataset in sequential and distributed modes. The performance of FCMD-MD on different datasets is reported in this section. First, we conduct experiments to evaluate the effectiveness of the FCMD-MD algorithm on the Cylinder Bands and Australian Credit datasets and compare it with state-of-the-art fuzzy clustering algorithms. We also compare the performance of FCMD-MD with KAMILA, a hard-clustering algorithm designed for mixed-mode data. Furthermore, we compare the sequential FCMD-MD algorithm with the proposed distributed version implemented in Apache Spark; the time taken by both versions of the algorithm is reported on the OSI and AirBnB datasets.

Experimental setup
This section gives details of the experimental setup used to run the distributed and sequential versions of FCMD-MD. Table 2 shows the specifications of the machine used to compute the results for the sequential version of FCMD-MD. Table 3 shows the specifications of the cluster acquired on Google DataProc; out of five worker nodes, one is used as the cluster master while the remaining four work as regular worker nodes.

Evaluation metrics
The paper uses two evaluation metrics: Purity and the Fuzzy Rand Index (FRI). Purity is an external evaluation measure for clustering; it measures how pure the clusters are with respect to the most common class in each cluster. To compute it, the number of points of the majority class is counted in each cluster, these counts are summed over all clusters, and the sum is divided by the total number of data points.
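The purity computation described above can be sketched as follows; the function name and the toy labels are illustrative assumptions.

```python
from collections import Counter


def purity(cluster_labels, class_labels):
    """Sum of majority-class counts over all clusters, divided by the number of points."""
    by_cluster = {}
    for cluster, cls in zip(cluster_labels, class_labels):
        by_cluster.setdefault(cluster, Counter())[cls] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(class_labels)


# Ten points in three clusters; majority counts are 2, 3 and 2.
clusters = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
classes = ["a", "a", "b", "b", "b", "b", "a", "c", "c", "a"]

score = purity(clusters, classes)
```

For these labels the majority counts add up to 7, so the purity is 7/10 = 0.7.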
The Fuzzy Rand Index is used as an evaluation metric to analyze the quality of the generated fuzzy clusters. The Rand Index is an external evaluation measure that requires true labels. It checks whether each pair of points belongs to the same cluster or not; one can see the Rand Index as accuracy:

$$RI = \frac{TP + TN}{TP + TN + FP + FN}$$

where:
• TP: Two points should belong to the same cluster and our algorithm assigns them to the same cluster
• TN: Two points should belong to different clusters and our algorithm assigns them to different clusters
• FP: Two points should not belong to the same cluster but our algorithm assigns them to the same cluster
• FN: Two points should belong to the same cluster, but our algorithm assigns them to different clusters
The Fuzzy Rand Index is a fuzzy variant of the Rand Index [26]. Its value ranges from 0 to 1.

Table 3 Cluster specifications
Specification            Value
Worker nodes             5
Cores per worker node    8
Memory per worker node   30 GB
Operating system         Linux

Comparison of FCMD-MD with Fuzzy clustering algorithms on mixed-mode data
This section compares the FCMD-MD algorithm with two state-of-the-art fuzzy clustering algorithms: Fuzzy C-Means and Fuzzy C-Medoids. The Fuzzy Rand Index is used as the performance metric. Both Fuzzy C-Means and Fuzzy C-Medoids work only with continuous data. Therefore, the categorical features are transformed into continuous features using dummy encoding, which increases the dimensionality of both datasets. Table 4 and Fig. 3 show the results of the comparison of fuzzy clustering techniques. The FCMD-MD algorithm outperformed both Fuzzy C-Means and Fuzzy C-Medoids on both datasets. It achieved 66.4% FRI on the Australian Credit dataset, while Fuzzy C-Means and Fuzzy C-Medoids scored lower, around 61%. Table 5 shows the time taken by the different techniques. The time consumption of the FCMD-MD algorithm is relatively high compared to the other techniques, which highlights the need for a distributed FCMD-MD algorithm that can speed up the computation while maintaining clustering quality.

Table 4 Average Fuzzy Rand Index achieved by FCMD-MD, Fuzzy C-Means and Fuzzy C-Medoids
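The pair-counting idea behind the Rand Index can be sketched for crisp labels as follows. This is the crisp index only; the fuzzy variant [26] used in the comparison is not reproduced here, and the function name and toy labels are hypothetical.

```python
from itertools import combinations


def rand_index(true_labels, predicted_labels):
    """Crisp Rand Index together with the pair counts (TP, TN, FP, FN)."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = predicted_labels[i] == predicted_labels[j]
        if same_true and same_pred:
            tp += 1  # together in both partitions
        elif not same_true and not same_pred:
            tn += 1  # apart in both partitions
        elif same_pred:
            fp += 1  # grouped by the algorithm but not in the truth
        else:
            fn += 1  # separated by the algorithm but together in the truth
    return (tp + tn) / (tp + tn + fp + fn), (tp, tn, fp, fn)


truth = [0, 0, 0, 1, 1]
predicted = [0, 0, 1, 1, 1]

ri, counts = rand_index(truth, predicted)
```

The five points form ten pairs, of which two are TP and four are TN, so the index is 6/10 = 0.6.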

Comparison of FCMD-MD with hard-based clustering algorithm (KAMILA)
This section compares the FCMD-MD algorithm with KAMILA, a state-of-the-art hard-clustering algorithm for continuous and categorical variables. The FCMD-MD technique produces fuzzy partitions of data, so to compare it with KAMILA, the fuzzy partitions are converted to hard partitions by assigning each data point to the cluster with the maximum value in the membership matrix. Table 6 shows the comparison of the average purity achieved by FCMD-MD and KAMILA. The results show that FCMD-MD could not outperform KAMILA for hard clustering of data. Table 7 shows the time consumed by FCMD-MD and KAMILA on both datasets. The time taken by FCMD-MD is greater than that taken by KAMILA. It is evident from this experiment that FCMD-MD is not ideal for crisp clusters, and it is not a replacement for a hard-clustering algorithm. It is designed for fuzzy clustering and should only be used when fuzzy clusters are required for mixed-mode data. Tables 8 and 9 show the performance of the FCMD-MD algorithm under six different experimental settings. The algorithm is executed ten times in each experiment, and the minimum, maximum, and average values of purity are reported. The experimental results show that the algorithm performs best when continuous features are normalized. In experiments 2 and 5, FCMD-MD performs best on the Australian Credit and Cylinder Bands datasets. Note that the continuous features are normalized in both of these best-performing experiments. Table 10 shows the weights assigned by FCMD-MD to each feature type. For the Australian Credit dataset, continuous features received more importance, while categorical features were assigned more weight for the Cylinder Bands dataset.
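The conversion from fuzzy to hard partitions used for the KAMILA comparison can be sketched as follows; the function name and the membership values are illustrative assumptions.

```python
def harden(memberships):
    """Assign every point to the cluster with its largest membership degree."""
    return [max(range(len(row)), key=row.__getitem__) for row in memberships]


# Membership matrix for three points and three clusters (each row sums to one).
u = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.4, 0.5, 0.1],
]

labels = harden(u)
```

The third point is assigned to cluster 1 although its membership in cluster 0 is almost as high, which shows the information that is discarded when a fuzzy partition is hardened.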

Distributed FCMD-MD results
We conducted experiments to compare the running time of the distributed and sequential FCMD-MD algorithms. The experiments were executed on the OSI and AirBnB datasets as they are relatively big. The parameters used to run the distributed FCMD-MD on the acquired cluster are shown in Table 11. Table 12 shows the time consumed by both versions of FCMD-MD. The distributed version outperformed the sequential version with regard to time; its running time could be improved further depending on the specifications of the cluster and the choice of parameters used to run the algorithm. Tables 13 and 14 show the convergence of the distributed FCMD-MD on the OSI and AirBnB datasets. The algorithm converges quickly: it converged in three iterations on the OSI dataset, while it took only two iterations to converge on the AirBnB dataset. Table 15 shows the weights assigned by the distributed FCMD-MD to each variable type on the OSI and AirBnB datasets. For the OSI dataset, the continuous features received more importance than the categorical features. For the AirBnB dataset, the algorithm assigned less than one percent of the weight to geometric features, and categorical features received the most importance.

Comparison of the distributed FCMD-MD and Fuzzy C-Medoid
This section compares the performance of the distributed FCMD-MD with the distributed Fuzzy C-Medoid algorithm on the OSI dataset. Table 16 shows the convergence time of both algorithms on the OSI dataset, while Table 17 shows the Fuzzy Rand Index attained by both algorithms. The distributed FCMD-MD algorithm outperformed the Fuzzy C-Medoid; however, it took a bit more time than the distributed fuzzy c-medoid.

Analysis
The FCMD-MD algorithm works well on data with mixed types of features. It outperformed the most commonly used algorithm for fuzzy clustering and proved more effective than Fuzzy C-Medoid. The performance of the FCMD-MD algorithm improves when continuous attributes are normalized, as shown in Tables 8 and 9. The FCMD-MD algorithm is based on the K-Medoid algorithm. Hence, it is somewhat slow, as it inherits the limitations of K-Medoid. Moreover, it has to perform additional work to compute the membership matrix of the data points and assign weights to each feature type. For this reason, there was a need to reduce its time consumption. Hence, the distributed version of FCMD-MD is proposed in this paper, and it shows promising results on different datasets. Even on small datasets, the distributed version reduced the time consumption of the sequential version by half, as shown in Table 12. Another notable aspect of this algorithm is its fast convergence. As Tables 13 and 14 show, the algorithm provides significant improvement with each iteration.

Conclusion and future work
A scalable Spark-based distributed version of the FCMD-MD algorithm is presented in this research work. The algorithm gave promising results in the fuzzy clustering of data with mixed types of features, and it outperformed commonly used fuzzy clustering algorithms such as Fuzzy C-means and Fuzzy C-medoid. The distributed version of FCMD-MD improves the computation time and can cluster enormous datasets effectively. Future work includes evaluating the performance and computation time of the distributed FCMD-MD on massive mixed-mode datasets using a large-capacity cluster.