 Research
 Open access
Distributed fuzzy clustering algorithm for mixed-mode data in Apache SPARK
Journal of Big Data volume 9, Article number: 121 (2022)
Abstract
Fuzzy clustering is an invaluable data mining technique that allows each data point to belong to more than one cluster with some degree of membership. It is widely employed in exploratory data mining to discover overlapping communities in social networks, find structure in spectral data, and capture user interests in recommendation systems. Nowadays, the variety and volume of data are increasing at a tremendous rate. Data is power; massive data, along with an effective technique, can unravel valuable information. The existing fuzzy clustering algorithms do not perform well on massive heterogeneous datasets, and processing an enormous amount of data is beyond the capacity of a single processor. The need of the hour is to develop fuzzy clustering techniques that can work on a distributed framework for Big Data processing and can handle heterogeneous data. In this research, we evaluate the performance of the recently proposed algorithm for fuzzy clustering of mixed-mode data, FCMD-MD (D’Urso and Massari in Inf Sci 505:513–534, 2019), on different real-world datasets. We develop distributed FCMD-MD, a fuzzy clustering algorithm for mixed-mode data in Apache SPARK. The experimental results show that the algorithm is scalable, performs well in a distributed environment, and clusters enormous heterogeneous data with high accuracy. We also compare the performance of distributed FCMD-MD with the distributed k-medoid algorithm.
Introduction
Clustering is one of the most widely used techniques in exploratory data mining for discovering groups of objects with similar behavior or traits. Currently, it is extensively used in data preprocessing, customer segmentation, data partitioning, outlier detection, and data analysis. It helps to learn useful information and extract interesting patterns from real-world data. Clustering is roughly divided into two major categories: hard and soft clustering. In hard clustering, each object belongs to only one cluster at a time, while in soft clustering, an object can belong to more than one cluster simultaneously. Soft clustering, also known as fuzzy clustering, is very useful and has widespread applications. Consider a movie recommendation system where we want to cluster users based on their interests. Here, fuzzy clustering is a better choice because users may be interested in more than one genre and can become frustrated if only one type of content is recommended. There are several areas where fuzzy clustering is a more suitable way of clustering data.
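The contrast can be made concrete with a toy sketch (hypothetical numbers, not from the paper): a user's distances to three genre clusters are turned into normalized membership degrees, so the user can partially belong to several genres rather than only the nearest one.

```python
# Hypothetical distances from one user to three genre "centroids";
# smaller distance means a closer match.
distances = {"action": 2.0, "comedy": 4.0, "drama": 8.0}

# Hard clustering: the user belongs only to the nearest cluster.
hard = min(distances, key=distances.get)

# Soft (fuzzy) clustering: membership degrees proportional to
# inverse distance, normalized to sum to 1.
inv = {g: 1.0 / d for g, d in distances.items()}
total = sum(inv.values())
memberships = {g: v / total for g, v in inv.items()}

print(hard)                             # action
print(round(memberships["action"], 3))  # 0.571 (= 4/7)
```

The hard assignment discards the user's partial interest in comedy and drama; the membership vector retains it.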
Numerous clustering algorithms have been designed to cater to different data types and distributions. However, the existing traditional clustering algorithms are incapable of dealing with the ever-changing demands and dynamics of Big Data. To extract value from massive data, a clustering technique needs to deal effectively with both the volume and the variety of Big Data. Real-world datasets are usually mixed-mode; they consist of different feature types such as continuous, categorical, textual, spatial, time series, and geometric. However, the most commonly used clustering algorithms (K-means, K-medoids, Gaussian Mixture Models, and DBSCAN) work with only one type of feature (continuous or categorical). K-means works with numeric features, while K-modes (an extension of K-means) works with categorical data. To make these algorithms work with multiple types, data analysts transform all features into a single data type recognized by the algorithm using dummy encoding schemes [1], but this increases the data dimensionality and, hence, the computation cost. Moreover, it also results in the loss of valuable information.
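The dimensionality cost of dummy encoding is easy to see in a small sketch (the feature names and values below are hypothetical):

```python
# One categorical feature with k levels becomes k binary columns.
levels = {"color": ["red", "green", "blue"], "size": ["S", "M", "L", "XL"]}
numeric_features = ["age", "income"]

row = {"age": 34, "income": 52000, "color": "green", "size": "M"}

encoded = [row[f] for f in numeric_features]
for feat, lvls in levels.items():
    encoded.extend(1 if row[feat] == lvl else 0 for lvl in lvls)

# 2 numeric + 3 + 4 dummy columns: 9 dimensions instead of the original 4.
print(len(encoded))  # 9
```

A 4-feature record becomes a 9-dimensional vector, and the growth is proportional to the number of category levels, which can be large in practice.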
Gower [2] pioneered the work on mixed data, and since then, many algorithms have been proposed in the literature to cluster mixed-mode data. KAMILA and K-prototypes are the two most commonly used algorithms, but they work with only two types of features (continuous and categorical) and form crisp clusters. Furthermore, they require the user to explicitly define weights for each type of feature. There is thus a need for an algorithm that can effectively generate fuzzy clusters of mixed-mode data without explicitly defined weights. Recently, D’Urso and Massari [3] proposed FCMD-MD, a Fuzzy C-Medoids clustering model for mixed data. It uses the idea of the PAM (Partitioning Around Medoids) algorithm for fuzzy clustering of mixed-mode data. The algorithm learns the weights of different features by optimizing the objective function; hence, there is no need to define weights for different types of variables. The algorithm achieves significantly good results but cannot handle Big Data.
Clustering and extracting valuable information from large datasets is not an easy task. It brings many issues to the fore: storing and processing enormous data, extracting patterns, and detecting similarities between data objects. The computing power of a single system is not enough to process such an enormous amount of data; we must either increase computing power, utilize supercomputers, or shift to another, more suitable technology. Standalone servers offer limited computing resources, and these resources cannot meet current-era requirements. We need distributed computing technology to work on commodity hardware and perform parallel processing on gigantic volumes of data. Apache Spark is a Big Data framework that was introduced to overcome the limitations of traditional distributed frameworks. It is much faster, scalable, programmer-friendly, and provides unification. It also provides a scalable machine learning library known as MLlib to fulfill the needs of large enterprises and help scholars in different research areas. Numerous machine learning algorithms are included in MLlib, but only a few clustering algorithms, such as K-means and some of its variants, are currently provided.
The contribution of this research work is multifold. First, we rigorously evaluated the performance of the fuzzy clustering techniques Fuzzy C-means, Fuzzy C-medoids, and FCMD-MD on different real-world mixed-mode datasets. We observed that the recently proposed FCMD-MD [3] algorithm outperformed the other techniques, but it is very time-consuming and cannot handle large datasets. In this research, we propose a distributed FCMD-MD algorithm to perform fuzzy clustering of mixed-mode Big Data. Our algorithm is scalable and can effectively handle massive datasets, as it is designed for the Apache Spark framework. We compared the time performance of our algorithm with the sequential FCMD-MD algorithm. We also conducted experiments to compare the performance of our distributed FCMD-MD with the distributed fuzzy k-medoid algorithm in the Apache Spark framework. Furthermore, we show that the FCMD-MD algorithm, being designed for fuzzy clustering, does not replace the KAMILA algorithm developed for crisp clusters.
The organization of this paper is as follows: “Related work” section presents the related work, and “Apache Spark” section briefly discusses the details of the Spark framework. “Fuzzy clustering of huge heterogeneous data using Apache Spark” section describes the proposed algorithm in detail. The description of datasets and results of computational experiments are given in “Datasets” and “Computational experiments and analysis” sections, respectively. Finally, the last section is the conclusion.
Related work
The work on mixed data started as early as 1971 when Gower [2] introduced a dissimilarity measure for continuous and categorical variables. Gower distance is computed as an average of different partial dissimilarities across the individual features. After the Gower distance, many algorithms were proposed for the hard clustering of mixed data; however, not much work was done for the fuzzy clustering of mixed data.
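Under this description, Gower's measure can be sketched as follows; this is an illustrative implementation (range-scaled absolute difference for numeric features, simple mismatch for categorical ones, averaged over features), not code from the cited papers:

```python
def gower_distance(x, y, ranges, categorical):
    """Average of per-feature partial dissimilarities, each in [0, 1].

    ranges: feature name -> (min, max) for numeric features.
    categorical: set of categorical feature names.
    """
    parts = []
    for feat in x:
        if feat in categorical:
            # Simple mismatch: 0 if equal, 1 otherwise.
            parts.append(0.0 if x[feat] == y[feat] else 1.0)
        else:
            # Absolute difference scaled by the feature's observed range.
            lo, hi = ranges[feat]
            parts.append(abs(x[feat] - y[feat]) / (hi - lo))
    return sum(parts) / len(parts)

a = {"age": 30, "colour": "red"}
b = {"age": 40, "colour": "blue"}
d = gower_distance(a, b, ranges={"age": (20, 60)}, categorical={"colour"})
print(d)  # (10/40 + 1) / 2 = 0.625
```

Each feature contributes a value in [0, 1], so numeric and categorical features are directly comparable before averaging.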
Huang [4] proposed a variant of the K-means algorithm called K-prototypes for clustering datasets with continuous and categorical features while maintaining the time efficiency of K-means. The algorithm uses Euclidean and Hamming distances for continuous and categorical variables, respectively. Furthermore, it employs decision tree induction algorithms to describe the clusters, improving their interpretability. Bertrand et al. [5] derived an algorithm for clustering medical data with multiple features using a Gaussian Mixture Model. Ahmad et al. [6] proposed a K-harmonic-type algorithm for clustering mixed data which normalizes and discretizes numerical features in a preprocessing step. Foss et al. [7] proposed the KAMILA algorithm for clustering mixed data. It is considered the state-of-the-art algorithm for clustering data having continuous and categorical features. The algorithm is based on K-means and achieves clustering by equitably balancing the contribution of continuous and categorical features. Skabar [8] proposed an algorithm that uses graph-based random-walk clustering of data with mixed attributes without explicitly defining any distance measure.
The work in the field of fuzzy clustering started in 1984 with the development of the Fuzzy C-means algorithm [9], which is a variant of the K-means algorithm and produces fuzzy clusters. After the Fuzzy C-means algorithm, developing the Fuzzy C-medoids algorithm was not difficult: in the traditional K-means algorithm, the centroids are the means of the given clusters, whereas in the K-medoids algorithm, the centroids are actual data points that have the least distance from all the points in a given cluster. Both algorithms tend to minimize the same objective function. The significant difference lies in the selection of medoids, which makes the K-medoids algorithm less sensitive to outliers. Krishnapuram et al. [10] provided a fuzzy implementation of K-medoids. Wang et al. [11] proposed an algorithm that tries to reduce the sensitivity of Fuzzy C-means to the selection of initial clusters by choosing them using entropy; the algorithm works well on arbitrary-shaped clusters. Bezdek et al. [12] gave a possibilistic implementation of fuzzy c-means, and Ulutagay et al. [13] proposed a density-based fuzzy clustering algorithm from the family of DBSCAN.
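The medoid selection step described above can be sketched in a few lines (an illustrative example with made-up one-dimensional points, not the authors' code); note how the medoid stays on a real data point while the mean is dragged toward the outlier:

```python
def medoid(points, dist):
    """Return the point with the smallest total distance to all others."""
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

pts = [1.0, 2.0, 3.0, 4.0, 20.0]
m = medoid(pts, dist=lambda a, b: abs(a - b))
mean = sum(pts) / len(pts)
print(m, mean)  # 3.0 6.0
```

The outlier 20.0 pulls the mean to 6.0, a value far from most of the data, while the medoid remains the representative observation 3.0; this is the robustness the text refers to.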
Most of the work conducted in the area of fuzzy clustering is for one type of attribute. The different algorithms proposed in the literature are tailored for one particular feature type and are incapable of handling real-world datasets with different features [14]. Nguyen et al. [15] proposed an algorithm for the fuzzy partitioning of categorical data. Wang’s incremental fuzzy algorithm [16] handles only time-series data, while the algorithm of D’Urso et al. [14] forms fuzzy clusters of spatial and temporal data. Few researchers have attempted to handle mixed-mode data clustering. Doring et al. [17] proposed a fuzzy clustering approach based on a probabilistic distance measure that uses mixture models to cluster data having both continuous and categorical attributes. D’Urso and Massari [3] developed an algorithm based on the C-medoids clustering model for finding soft clusters in mixed data. The algorithm learns the weights of different features by optimizing the objective function.
Not much work has been done on fuzzy clustering of massive mixed data using the latest distributed platforms like Spark. Jha et al. [18] proposed an Apache Spark-based fuzzy clustering algorithm that utilizes kernel Radial Basis Functions (RBF) to discover clusters in high-dimensional genomics data.
Apache Spark
Apache Spark is an open-source big data processing framework [19]. We selected SPARK because it is 10 to 100 times faster than Hadoop. It uses the best features of Hadoop, such as HDFS, for the distribution of data across worker nodes and eliminates its shortcomings; hence, it is much faster than Hadoop. In the case of an iterative task, Hadoop reads from and writes to the disk in each iteration, while Spark’s RDD (Resilient Distributed Dataset) enables it to perform multiple iterative tasks in memory. Apache Spark provides high-level functionality, unlike Hadoop, where a user must understand the low-level details of the system. Spark supports multiple languages and different libraries like MLlib and GraphX for machine learning and graph manipulation tasks.
Spark Core and data structures
Spark Core is the main engine of Apache Spark; it is responsible for cluster management, job scheduling, input–output operations, and fault recovery.
RDD is a fundamental data structure in Apache Spark. It is viewed as a distributed set of elements that makes parallel computation possible on data. RDDs are immutable, which means that the data inside an RDD cannot be altered. RDDs provide fault tolerance using a lineage graph: if an RDD is lost, it has information about how it was created, so it can be recreated. Two types of operation are performed on RDDs: transformations and actions. A transformation is an operation that returns a new RDD by modifying the existing one, while an action is an operation that performs the computation on the existing RDD and returns results. Spark performs lazy computation; an RDD is not transformed unless an action is called on it. Spark chooses a lazy approach so that it can decide the best possible way to execute an action.
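The transformation/action distinction can be mimicked in plain Python with generators. The class below is a hypothetical miniature, not Spark's API; it shows that chaining transformations does no work until an action such as collect() runs the pipeline:

```python
class MiniRDD:
    """Toy stand-in for an RDD: transformations are lazy, actions compute."""

    def __init__(self, source):
        self._source = source  # zero-argument callable yielding elements

    # Transformations: return a new MiniRDD, performing no work yet.
    def map(self, fn):
        return MiniRDD(lambda: (fn(x) for x in self._source()))

    def filter(self, pred):
        return MiniRDD(lambda: (x for x in self._source() if pred(x)))

    # Action: actually iterates the whole pipeline.
    def collect(self):
        return list(self._source())

data = MiniRDD(lambda: iter(range(5)))
pipeline = data.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
# Nothing has been computed yet; the action below triggers the work.
print(pipeline.collect())  # [0, 4, 16]
```

Because the pipeline is only a description until collect() is called, an engine like Spark can inspect the whole chain and choose an efficient execution plan, which is exactly the benefit of lazy evaluation described above.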
A DataFrame in Apache Spark is a distributed collection of data organized into columns. It carries the same objective as Python’s pandas DataFrames and R data frames. DataFrames support different SQL-like operations: aggregate, group, filter, and join. Existing RDDs or external tables can be easily converted to Spark DataFrames.
Fuzzy clustering of huge heterogeneous data using Apache Spark
This section presents a distributed fuzzy clustering algorithm to efficiently cluster large heterogeneous datasets using the distributed framework Apache Spark. Our algorithm is inspired by the FCMD-MD model recently proposed by D’Urso and Massari [3]. We briefly discuss the FCMD-MD model before presenting our Spark-based distributed FCMD-MD algorithm.
Fuzzy clustering model for mixed-mode data (FCMD-MD)
This technique is based on the K-medoids approach and creates a fuzzy partition of data with mixed features. The algorithm works by optimizing the cost function, which is the weighted sum of dissimilarities for each feature type. Figure 1 gives an example of the working of the FCMD-MD algorithm for data consisting of five different feature types. The figure presents the underlying idea of the FCMD-MD model; different distance measures and weights are applied to each feature type to compute similarities and create fuzzy partitions. Algorithm 1 shows the complete pseudocode of the FCMD-MD algorithm as proposed in [3].
Let \(X_{i}=\left\{ X_{1},\ldots,X_{S} \right\}\) be a data point which is a combination of \(S\) types of features. Each \(X_i\) is a set consisting of one or more variables of a particular type. For example, let us assume that we have two types of variables, hence \(S=2\) and \(X_{i}=\left\{ X_{1}, X_{2} \right\}\). Further, assume that \(X_{1}\) is a set of two quantitative variables \(X_{1}=\left\{ X_{11},X_{12} \right\}\) and \(X_{2}\) is a set of three categorical variables, \(X_{2}=\left\{ X_{21},X_{22},X_{23} \right\}\). Depending on the nature of variables, \(X_{s}\) can be a vector or can have a more complex structure. For example, if it is the combination of continuous variables, then it is a vector and if \(X_{s}\) represents a time series data, then it can be a matrix.
The distance between two data points \(i\) and \(j\) based on the \(s{\text{th}}\) feature type is denoted \(_{s}d_{ij}\); it can be the Euclidean distance in the case of continuous variables and the simple matching coefficient in the case of categorical variables.
Objective function
The FCMD-MD algorithm minimizes the following objective function, as defined in [3]:

\[\min \sum_{i=1}^{n}\sum_{c=1}^{k} u_{ic}^{m}\, d_{ic}^{2} \;=\; \min \sum_{i=1}^{n}\sum_{c=1}^{k} u_{ic}^{m} \sum_{s=1}^{S} w_{s}^{2}\; {_s}d_{ic}^{2}\]

subject to \(\sum_{c=1}^{k} u_{ic}=1\) with \(u_{ic}\ge 0\), and \(\sum_{s=1}^{S} w_{s}=1\) with \(w_{s}\ge 0\).

\(u_{ic}\) and \(w_{s}\) are computed by deriving the Lagrangian of the objective function and are given below (for details, refer to [3]):

\[u_{ic} = \left[\sum_{c'=1}^{k} \left(\frac{d_{ic}^{2}}{d_{ic'}^{2}}\right)^{\frac{1}{m-1}}\right]^{-1}, \qquad w_{s} = \frac{\left(\sum_{i=1}^{n}\sum_{c=1}^{k} u_{ic}^{m}\; {_s}d_{ic}^{2}\right)^{-1}}{\sum_{s'=1}^{S}\left(\sum_{i=1}^{n}\sum_{c=1}^{k} u_{ic}^{m}\; {_{s'}}d_{ic}^{2}\right)^{-1}}\]
where:

\(u_{ic}\) indicates the membership degree of the \(i{\text{th}}\) object in the \(c{\text{th}}\) cluster,

\(w_{s}\) indicates the weight associated with the \(s{\text{th}}\) feature type,

\(m\) represents the fuzziness parameter,

\(_sd_{ic}\) represents the distance between the \(i{\text{th}}\) observation and the \(c{\text{th}}\) cluster medoid according to the \(s{\text{th}}\) feature,

\(d_{ic}^2\) is the overall weighted squared distance between the \(i{\text{th}}\) observation and the \(c{\text{th}}\) cluster medoid.
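Under our reading of the update rules in [3], one iteration of the membership and weight computations can be sketched in plain Python (an illustrative sketch with made-up distances, not the authors' implementation):

```python
def update_memberships(d2, m=2.0):
    """u[i][c] from overall squared distances d2[i][c]; each row sums to 1."""
    u = []
    for row in d2:
        inv = [(1.0 / dc) ** (1.0 / (m - 1.0)) for dc in row]
        s = sum(inv)
        u.append([v / s for v in inv])
    return u

def update_weights(sd2, u, m=2.0):
    """w[s] inversely proportional to the fuzzy dispersion of feature type s."""
    disp = [sum(u[i][c] ** m * mat[i][c]
                for i in range(len(u)) for c in range(len(u[0])))
            for mat in sd2]
    inv = [1.0 / x for x in disp]
    return [v / sum(inv) for v in inv]

# Two points, two medoids, two feature types: sd2[s][i][c] is the squared
# distance of point i to medoid c according to feature type s.
sd2 = [[[0.1, 0.9], [0.8, 0.2]],
       [[0.2, 0.6], [0.5, 0.3]]]
w = [0.5, 0.5]
d2 = [[sum(w[s] ** 2 * sd2[s][i][c] for s in range(2)) for c in range(2)]
      for i in range(2)]
u = update_memberships(d2)
new_w = update_weights(sd2, u)
print(all(abs(sum(row) - 1.0) < 1e-9 for row in u))  # True
print(abs(sum(new_w) - 1.0) < 1e-9)                  # True
```

A feature type whose distances are small for the strongly-assigned point-medoid pairs accumulates little dispersion and therefore receives a larger weight, which is how the model learns feature-type importance instead of requiring user-supplied weights.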
Distributed fuzzy clustering for mixed-mode data in Apache SPARK
In this section, we present the Sparkbased distributed algorithm for fuzzy clustering of mixedmode data. The input data is preprocessed, partitioned, and persisted in memory as an RDD, Spark’s fundamental distributed data structure. Most of the computation is carried out on the persisted RDD to speed up the processing.
Algorithm 4 shows the complete pseudocode of distributed FCMD-MD implemented in Apache Spark. The dataset is preprocessed using Algorithm 2, which creates a key-value RDD and performs preprocessing tasks such as handling missing values, removing noise/outliers, and normalizing the continuous attributes. Furthermore, it arranges the different features according to their type and creates a dictionary with the feature type as key and a list of features as value.
The initial set of medoids is selected randomly from the distributed data RDD using the function “takeSample”. The variables \(k\), \(W\), \(m\), and \(medoids\) are broadcast across the cluster, as all the executors require them to perform computations. The total cost for the current medoids is computed using Algorithm 3. Figure 2 shows one complete Spark job performed by the distributed version of FCMD-MD. Given the current medoids, the preprocessed data RDD is used to compute distance RDDs for each type of variable; these are later joined and transformed into a collective distance RDD. The collective distance RDD is joined with the membership matrix RDD and summed up to get the final cost for the current medoids. Depending on the final cost, the swapping of medoids is decided. Several such jobs run in one iteration of the algorithm, depending on the input data size and the number of clusters; the weights for each type of variable and the membership matrix are recomputed after each iteration.
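Without Spark, the shape of one such job can be mimicked with Python's built-in map and reduce: each point's contribution to the fuzzy cost is computed independently (the map) and the contributions are summed (the reduce). This is an illustrative analogue of the RDD pipeline with made-up data, not the actual distributed code:

```python
from functools import reduce

def point_cost(point, medoids, weights, dist_fns, m=2.0):
    """Fuzzy cost contribution of one point: sum over medoids of u^m * d^2."""
    d2 = [sum(w ** 2 * fn(point, md) ** 2
              for w, fn in zip(weights, dist_fns)) for md in medoids]
    if min(d2) == 0.0:
        return 0.0  # point coincides with a medoid: full membership, zero cost
    inv = [(1.0 / d) ** (1.0 / (m - 1.0)) for d in d2]
    u = [v / sum(inv) for v in inv]
    return sum(uc ** m * dc for uc, dc in zip(u, d2))

# Toy mixed-mode data: each point is (continuous value, category).
data = [(1.0, "a"), (1.2, "a"), (5.0, "b"), (5.3, "b")]
medoids = [(1.0, "a"), (5.0, "b")]
weights = [0.7, 0.3]  # hypothetical weights for the two feature types
dist_fns = [lambda p, q: abs(p[0] - q[0]),              # continuous part
            lambda p, q: 0.0 if p[1] == q[1] else 1.0]  # categorical part

# "map" then "reduce", mirroring the structure of the distributed job.
total_cost = reduce(lambda a, b: a + b,
                    map(lambda p: point_cost(p, medoids, weights, dist_fns),
                        data))
print(total_cost > 0.0)  # True: only the non-medoid points contribute
```

Because each point's cost depends only on the broadcast medoids, weights, and fuzziness parameter, the map step is embarrassingly parallel, which is what makes the Spark formulation scale.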
Datasets
The datasets are obtained from the UCI Machine Learning Repository and Kaggle [20, 21]. Table 1 shows the split of attribute types and classes in different datasets. We have considered datasets of various sizes, as we wish to examine the performance of our distributed fuzzy clustering technique and compare it with the sequential FCMD-MD algorithm, which cannot handle large datasets.
Australian Credit
The Australian Credit dataset [22] concerns credit card applications; the original names and values of attributes are replaced with meaningless values for confidentiality. The dataset is interesting as it combines continuous and categorical attributes in an almost equal proportion.
Cylinder Bands
The Cylinder Bands dataset [23] contains information regarding delays in the printing process. The variables represent different parts of the printing process, such as paper size, ink color, ink type, and cylinder size. The class label describes whether banding occurred during the printing process or not. This dataset contains far fewer continuous features than categorical features. It also contains many missing values; hence, it is preprocessed: variables containing more than eight missing values are dropped, and if a variable contains one to eight missing values, the corresponding rows are removed from the dataset.
Online shoppers intention (OSI)
The OSI dataset [24] represents a user’s intention given a combination of mixed feature types; the label is a Boolean feature that represents whether the user checks out something or not. This dataset is used to compare the performance of distributed FCMD-MD with its sequential version, as the previous two datasets are not large enough to show the effectiveness of distributed FCMD-MD. Table 1 shows the split of attribute types and classes.
AirBnB
The Airbnb dataset [25] is made publicly available by Airbnb. It shows the rental listings of Airbnb in New York City for the year 2019. The dataset is a combination of numerical, categorical, and geometric features and is preprocessed to drop some meaningless features.
Computational experiments and analysis
Rigorous computational experiments are conducted to identify the best technique for fuzzy clustering of mixed-mode datasets in sequential and distributed modes.
The performance of FCMD-MD on different datasets is reported in this section. First, we conduct experiments to evaluate the FCMD-MD algorithm’s effectiveness on the Cylinder Bands and Australian Credit datasets and compare it with state-of-the-art fuzzy clustering algorithms. We also compare FCMD-MD’s performance with KAMILA, a hard clustering algorithm designed for mixed-mode data. Furthermore, we compare the sequential FCMD-MD algorithm with the proposed distributed version implemented in Apache SPARK; the time taken by both versions of the algorithm is reported on the OSI and AirBnB datasets.
Experimental setup
This section gives details of the experimental setup used to run the distributed and sequential versions of FCMD-MD. Table 2 shows the specifications of the machine used to compute the results for the sequential version of FCMD-MD. Table 3 shows the specifications of the cluster acquired on Google DataProc; out of five nodes, one is used as the cluster master while the remaining four work as regular worker nodes.
Evaluation metrics
The paper uses two types of evaluation metrics: Purity and Fuzzy Rand Index (FRI). Purity is an external evaluation measure for clustering; it calculates how pure the clusters are by identifying the most common class in each cluster. Purity can be defined per cluster: it counts the number of points of the majority class in each cluster, sums these counts over all clusters, and divides by the total number of data points.
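This computation can be sketched directly (an illustrative snippet, not the paper's evaluation code):

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists, each inner list holding the true labels
    of the points assigned to one cluster."""
    majority = sum(Counter(labels).most_common(1)[0][1] for labels in clusters)
    total = sum(len(labels) for labels in clusters)
    return majority / total

# Hypothetical clustering of 7 points with true labels "x" / "y".
c = [["x", "x", "y"], ["y", "y", "y", "x"]]
print(round(purity(c), 3))  # 0.714, i.e. (2 + 3) / 7
```

Purity is 1.0 only when every cluster is label-homogeneous; note that it rewards many small clusters, so it is usually reported alongside another measure, as this paper does with the FRI.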
The Fuzzy Rand Index is used as an evaluation metric to analyze the quality of the generated fuzzy clusters. The Rand Index is an external evaluation measure that requires true labels. It checks whether each pair of points belongs to the same cluster or not; one can see the Rand Index as accuracy:

TP: Two points should belong to the same cluster and our algorithm assigns them to the same cluster

TN: Two points should belong to different clusters and our algorithm assigns them to different clusters

FP: Two points should not belong to the same cluster but our algorithm assigns them to the same cluster

FN: Two points should belong to the same cluster, but our algorithm assigns them to different clusters
Fuzzy Rand Index is a fuzzy variant of the Rand Index [26]. Its value ranges from 0 to 1.
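The crisp Rand Index over all point pairs can be sketched as below; the fuzzy variant in [26] generalizes the hard pair agreements to membership-based ones (illustrative code, not the paper's):

```python
from itertools import combinations

def rand_index(true_labels, pred_labels):
    """(TP + TN) / all pairs: the fraction of point pairs on which the
    clustering and the ground truth agree."""
    pairs = list(combinations(range(len(true_labels)), 2))
    agree = 0
    for i, j in pairs:
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true == same_pred:  # TP (both same) or TN (both different)
            agree += 1
    return agree / len(pairs)

print(rand_index([0, 0, 1, 1], [0, 0, 1, 1]))  # 1.0
print(rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # 2/6 ≈ 0.333
```

Like accuracy, the value lies in [0, 1], with 1 meaning the clustering agrees with the ground truth on every pair of points.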
Comparison of FCMD-MD with fuzzy clustering algorithms on mixed-mode data
This section compares the FCMD-MD algorithm with two state-of-the-art fuzzy clustering algorithms: Fuzzy C-Means and Fuzzy C-Medoids. The Fuzzy Rand Index is used as the performance metric. Both Fuzzy C-Means and Fuzzy C-Medoids work only with continuous data; therefore, the categorical features are transformed into continuous features using dummy encoding, which increases the dimensionality of both datasets. Table 4 and Fig. 3 show the results of the comparison of the fuzzy clustering techniques. The FCMD-MD algorithm outperformed both Fuzzy C-Means and Fuzzy C-Medoids on both datasets. It achieved 66.4% FRI on the Australian Credit dataset, while Fuzzy C-Means and Fuzzy C-Medoids scored relatively lower, around 61%.
Table 5 shows the time taken by the different techniques. The time consumption of the FCMD-MD algorithm is relatively high compared to the other techniques, which highlights the need for a distributed FCMD-MD algorithm that can speed it up while maintaining its performance.
Comparison of FCMD-MD with a hard clustering algorithm (KAMILA)
This section compares the FCMD-MD algorithm with KAMILA, a state-of-the-art hard clustering algorithm for continuous and categorical variables. The FCMD-MD technique produces fuzzy partitions of data, so to compare it with KAMILA, the fuzzy partitions are converted to hard partitions by assigning each data point to the cluster with the maximum value in the membership matrix.
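This fuzzy-to-hard conversion is a row-wise argmax over the membership matrix; a minimal sketch with a hypothetical matrix:

```python
# Each row holds one data point's membership degrees over three clusters.
U = [[0.7, 0.2, 0.1],
     [0.1, 0.3, 0.6],
     [0.4, 0.5, 0.1]]

# Hard assignment: the index of the largest membership in each row.
hard = [row.index(max(row)) for row in U]
print(hard)  # [0, 2, 1]
```

The conversion discards the membership information, which is why a fuzzy method evaluated this way is not expected to match a purpose-built hard clustering algorithm.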
Table 6 shows the comparison of the average purity achieved by FCMD-MD and KAMILA. It is observed from the results that FCMD-MD could not outperform KAMILA in hard clustering of data. Table 7 shows the time consumed by FCMD-MD and KAMILA on both datasets; the time taken by FCMD-MD is greater than that of KAMILA. It is evident from this experiment that FCMD-MD is not ideal for crisp clusters and is not a replacement for a hard clustering algorithm. It is designed for fuzzy clustering and should only be used when fuzzy clusters of mixed-mode data are required.
Effect of normalization techniques on FCMD-MD clustering
Tables 8 and 9 show the performance of the FCMD-MD algorithm under six different experimental settings. The algorithm is executed ten times in each setting, and the minimum, maximum, and average values of purity are reported. The experimental results show that the algorithm performs best when continuous features are normalized. FCMD-MD performs best on the Australian Credit and Cylinder Bands datasets in experiments 2 and 5, respectively; note that the continuous features are normalized in both of these best-performing experiments. Table 10 shows the weights assigned by FCMD-MD to each feature type. For the Australian Credit dataset, continuous features got more importance, while categorical features were assigned more weight for the Cylinder Bands dataset.
Distributed FCMD-MD results
We conducted experiments to compare the running times of the distributed and sequential FCMD-MD algorithms. The experiments were executed on the OSI and AirBnB datasets, as they are relatively big. The parameters used to run distributed FCMD-MD on the acquired cluster are shown in Table 11. Table 12 shows the time consumed by both versions of FCMD-MD. The distributed version outperformed the sequential version with regard to time; it is much faster, and its speed can be further improved depending on the specifications of the cluster and the choice of parameters used to run the algorithm.
Tables 13 and 14 show the convergence of the distributed FCMD-MD on the OSI and AirBnB datasets. It is observed that the algorithm’s convergence is quite fast: it converged in three iterations on the OSI dataset, while it took only two iterations to converge on the AirBnB dataset. Table 15 shows the weights assigned by the distributed FCMD-MD to each variable type on the OSI and AirBnB datasets. For the OSI dataset, the continuous features got more importance than the categorical features, while for the AirBnB dataset the algorithm assigned less than one percent weight to geometric features, and categorical features got the most importance.
Comparison of the distributed FCMD-MD and Fuzzy C-Medoid
This section compares the performance of the distributed FCMD-MD with the distributed Fuzzy C-Medoid algorithm on the OSI dataset. Table 16 shows the convergence time of both algorithms on the OSI dataset, while Table 17 shows the Fuzzy Rand Index attained by both algorithms. The distributed FCMD-MD algorithm outperformed the Fuzzy C-Medoid in clustering quality; however, it took a bit more time than the distributed Fuzzy C-Medoid.
Analysis
The FCMD-MD algorithm works significantly well on data with mixed types of features. It outperformed the most commonly used fuzzy clustering algorithms and proved more effective than Fuzzy C-Medoid. The performance of the FCMD-MD algorithm improves when continuous attributes are normalized, as shown in Tables 8 and 9. The basic idea of the FCMD-MD algorithm is based on the K-medoid algorithm; hence, it is a bit slow, as it inherits all the limitations of K-medoid. Moreover, it has to perform extra work while computing the membership matrix of data points and assigning weights to each feature type. For this reason, there was a need to improve its time consumption. Hence, the distributed version of FCMD-MD is proposed in this paper, which shows promising results on different datasets. Even on small-sized datasets, the distributed version reduced the time consumption of the sequential version by half, as shown in Table 12. Another interesting aspect of this algorithm is its fast convergence; as we can see from Tables 13 and 14, the algorithm provides significant improvement with each iteration.
Conclusion and future work
A scalable SPARK-based distributed version of the FCMD-MD algorithm is presented in this research work. The algorithm gave promising results in the fuzzy clustering of data with mixed types of features, and it outperformed the most commonly used fuzzy clustering algorithms, such as Fuzzy C-means and Fuzzy C-medoid. The distributed version of FCMD-MD improves the computation time and can cluster enormous datasets effectively. Future work includes evaluating the performance and computation time of distributed FCMD-MD on massive mixed-mode datasets using a large-capacity cluster.
Availability of data and materials
The datasets used are publicly available, and links are included in the references.
Change history
22 March 2023
The typo in affiliation has been corrected.
References
Ahmad A, Hashmi S. Generalized Minkowski metrics for mixed feature-type data analysis. IEEE Trans Syst Man Cybern. 1994;24(4):698–708.
Gower JC. A general coefficient of similarity and some of its properties. Biometrics. 1971;27:857–71.
D’Urso P, Massari R. Fuzzy clustering of mixed data. Inf Sci. 2019;505:513–34.
Huang Z. Clustering large data sets with mixed numeric and categorical values. In: Proceedings of the 1st Pacific-Asia conference on knowledge discovery and data mining (PAKDD); 1997. p. 21–34.
Saâdaoui F, Bertrand PR, Boudet G, Rouffiac K, Chamoux A. A dimensionally reduced clustering methodology for heterogeneous occupational medicine data mining. IEEE Trans NanoBiosci. 2015;14(7):707–15.
Ahmad A, Hashmi S. K-harmonic means type clustering algorithm for mixed datasets. Appl Soft Comput. 2016;48:39–49.
Foss A, Markatou M, Ray B, Heching A. A semiparametric method for clustering mixed data. Mach Learn. 2016;105:419–58.
Skabar A. Clustering mixedattribute data using random walk. Procedia Comput Sci. 2017;108:988–97.
Bezdek J, Ehrlich R, Full W. FCM: the fuzzy c-means clustering algorithm. Comput Geosci. 1984;10:191–203.
Krishnapuram R, Joshi A, Nasraoui O, Yi L. Low-complexity fuzzy relational clustering algorithms for web mining. IEEE Trans Fuzzy Syst. 2001;9(4):595–607.
Su X, Wang X, Wang Z, Xiao Y. A new fuzzy clustering algorithm based on entropy weighting. J Comput Inf Syst. 2010;6(10):3319–26.
Pal NR, Pal K, Keller JM, Bezdek JC. A possibilistic fuzzy c-means clustering algorithm. IEEE Trans Fuzzy Syst. 2005;13(4):517–30.
Ulutagay G, Nasibov E. FN-DBSCAN: a novel density-based clustering method with fuzzy neighborhood relations. In: 8th international conference on application of fuzzy systems and soft computing (ICAFS-2008); 2008. p. 101–10.
D’Urso P, De Giovanni L, Disegna M, Massari R. Fuzzy clustering with spatial–temporal information. Spat Stat. 2019;30:71–102. https://doi.org/10.1016/j.spasta.2019.03.002.
Mau TN, Huynh VN. Kernel-based k-representatives algorithm for fuzzy clustering of categorical data. In: 2021 IEEE international conference on fuzzy systems (FUZZ-IEEE); 2021.
Wang L, Xu P, Ma Q. Incremental fuzzy clustering of time series. Fuzzy Sets Syst. 2021;421:62–76.
Doring C, Borgelt C, Kruse R. Fuzzy clustering of quantitative and qualitative data. In: IEEE annual meeting of the North American fuzzy information processing society (NAFIPS), Vol. 1. IEEE; 2004. p. 84–9.
Jha P, Tiwari A, Bharill N, Ratnaparkhe M, Mounika M, Nagendra N. Apache spark based kernelized fuzzy clustering framework for single nucleotide polymorphism sequence analysis. Comput Biol Chem. 2021;92:107454.
Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, Ghodsi A, Gonzalez J, Shenker S, Stoica I. Apache spark: a unified engine for big data processing. Commun ACM. 2016;59:56–65. https://doi.org/10.1145/2934664.
Dua D, Graff C. UCI machine learning repository; 2017. http://archive.ics.uci.edu/ml.
Kaggle. https://www.kaggle.com.
Australian credit dataset. http://archive.ics.uci.edu/ml/datasets/statlog+(australian+credit+approval).
Evans B. Cylinder bands dataset; 1995. https://archive.ics.uci.edu/ml/datasets/Cylinder+Bands.
Sakar CO, Kastro Y. Online shoppers purchasing intention dataset; 2018. http://archive.ics.uci.edu/ml/datasets/Online+Shoppers+Purchasing+Intention+Dataset.
Dhakar R. Airbnb dataset; 2018. https://www.kaggle.com/ronikdhakar/airbnbdataset#AirbnbDataset.
Hüllermeier E, Rifqi M, Henzgen S, Senge R. Comparing fuzzy partitions: a generalization of the Rand index and related measures. IEEE Trans Fuzzy Syst. 2012;20:546–56. https://doi.org/10.1109/TFUZZ.2011.2179303.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
The authors have equal contributions. ZA gave the vision and direction and worked on algorithm development. AWA implemented the code and conducted experiments. The manuscript was co-written by both authors. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Akram, A.W., Alamgir, Z. Distributed fuzzy clustering algorithm for mixed-mode data in Apache SPARK. J Big Data 9, 121 (2022). https://doi.org/10.1186/s40537-022-00671-7