As mentioned before, services and sources such as sensor networks, cloud storage, and social networks produce massive amounts of data whose management, reuse, and analysis are an indispensable need of today’s world. The main goal of clustering techniques on massive data is to improve the partitioning and segregation of these data; the resulting clusters can then be used with less complexity and at a higher pace.
In the following, different types of big data clustering algorithms are discussed. Generally, big data clustering algorithms are divided into two categories: (1) single-machine clustering algorithms and (2) multi-machine clustering algorithms. In recent years, multi-machine clustering algorithms have received more attention than single-machine algorithms because of their shorter execution time and their capability to handle larger volumes of data. As shown in Fig. 2, the two families comprise different techniques: single-machine clustering relies on sampling or dimension reduction techniques, while multi-machine clustering relies on parallel clustering or MapReduce [11] based clustering. The rest of this section follows this categorization.
Single-machine clustering techniques
Single-machine techniques consist of sampling and dimensionality reduction techniques, which are elaborated in the following. These algorithms were the first group of clustering algorithms used to improve performance and scalability, and their goal is to cope with the exponential state space of the clustering problem.
These algorithms are considered sampling-based because they perform the clustering process on a sample of the dataset and generalize the results to the entire dataset, rather than clustering the entire dataset directly. This makes the algorithms faster because the computations are done on a smaller volume of data; as a result, both the time and space complexity of these algorithms are lower.
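To make the idea concrete, the following minimal Python sketch (not taken from any of the surveyed papers; the function name and parameters are illustrative) clusters a random sample with K-Means and then generalizes the result by assigning every point to the nearest sample-derived center.

```python
import numpy as np
from sklearn.cluster import KMeans

def sample_and_cluster(X, k, sample_size, seed=0):
    """Cluster a random sample of X and generalize the labels to all of X."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=sample_size, replace=False)
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X[idx])
    # Assign every point (sampled or not) to its nearest sample-derived center.
    dists = np.linalg.norm(X[:, None, :] - model.cluster_centers_[None, :, :], axis=2)
    return dists.argmin(axis=1), model.cluster_centers_

X = np.random.default_rng(1).normal(size=(10_000, 5))
labels, centers = sample_and_cluster(X, k=3, sample_size=500)
```

Only the 500 sampled points participate in the expensive iterative clustering; the remaining points are labeled with a single nearest-center pass.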
CLARANS
Before addressing the CLARANS algorithm [12], its predecessor, CLARA [13], is discussed. Compared with PAM [14], which partitions around medoids, CLARA is more powerful and capable of handling big datasets. CLARA reduces the quadratic time and space complexity of the algorithm to a linear function of the number of data points. PAM computes the full dissimilarity matrix over the entire dataset and keeps it in RAM, so it has \(O(n^2)\) space complexity and cannot be used for large values of n.
To overcome this problem, CLARA does not compute the dissimilarity matrix for the entire dataset. Both PAM and CLARA can be viewed as graph search problems in which each vertex is a possible clustering solution (a set of k medoids) and two vertices are connected when they differ in exactly one medoid. PAM starts with a randomly chosen vertex and greedily moves to better neighboring vertices until no better neighbor can be found. CLARA performs this search on a subgraph, induced by a sample whose size is on the order of O(k), instead of on the whole graph. To improve the quality and scalability of CLARA, another algorithm, CLARANS, was suggested; it combines the sampling technique with the PAM algorithm. Unlike CLARA, this algorithm does not restrict itself to the fixed sample drawn at the beginning. Similar to PAM, CLARANS searches the whole graph for the optimal solution, but at each step it examines only a random subset of the neighbors of the current vertex. Both CLARA and CLARANS use sampling, but they differ in how they apply it: in CLARA, sampling happens once at the beginning and the search is confined to the resulting subgraph, whereas CLARANS applies sampling dynamically at each step of the execution. Results show that this dynamic sampling makes CLARANS more efficient than CLARA [12].
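The graph-search view can be illustrated with a simplified, single-restart sketch of a CLARANS-style search (the function names, the cost function, and the `max_neighbors` parameter are illustrative assumptions, not the exact procedure of [12]): the current solution is a set of k medoids, a random neighbor is produced by swapping one medoid, and the search moves whenever the neighbor has a lower total dissimilarity.

```python
import numpy as np

def total_cost(X, medoid_idx):
    """Sum of distances from each point to its nearest medoid."""
    d = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
    return d.min(axis=1).sum()

def clarans_search(X, k, max_neighbors=50, seed=0):
    """Simplified CLARANS-style search: examine random one-medoid swaps."""
    rng = np.random.default_rng(seed)
    current = rng.choice(len(X), size=k, replace=False)
    current_cost = total_cost(X, current)
    examined = 0
    while examined < max_neighbors:
        neighbor = current.copy()
        pos = rng.integers(k)              # which medoid to replace
        candidate = rng.integers(len(X))   # random replacement point
        if candidate in neighbor:
            continue
        neighbor[pos] = candidate
        cost = total_cost(X, neighbor)
        if cost < current_cost:            # move to the better neighbor
            current, current_cost = neighbor, cost
            examined = 0                   # reset the neighbor counter
        else:
            examined += 1
    return current, current_cost
```

Because the neighbors are sampled dynamically at every step, the search is not confined to one fixed sample of the data, which is exactly the point of difference from CLARA described above.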
BIRCH
If the data exceed the RAM capacity, the required I/O operations amplify both the computation cost and the execution time. The BIRCH algorithm [15] is a solution to this problem. BIRCH uses its own data structure, the clustering feature (CF), and a tree built over it, the CF-tree (clustering feature tree). A CF is, in effect, a summary of a cluster, and the CF metadata are kept in main memory. Each CF is a triple \(<N, LS, SS>\), which respectively stores the number of data points in the cluster, the linear sum of the cluster's data points, and the sum of their squares. To merge two clusters, their CF triples are simply added together. The advantage of this method is that merging requires only the addition of the CF triples, without any access to the points inside the clusters, so no extra computational complexity is imposed on the algorithm. BIRCH has two main phases: (1) the dataset is scanned and the CF-tree is built in main memory, and (2) clustering is performed on the CF-tree. Experiments show that BIRCH outperforms CLARANS in both time and space and is also more efficient when facing noisy data.
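A minimal sketch of the CF triple and its additive merge is shown below (the class and method names are illustrative; the CF-tree itself is not implemented here). It demonstrates the key property stated above: two clusters are merged by adding their triples, without revisiting the underlying points.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CF:
    """Clustering feature: point count N, linear sum LS, squared sum SS."""
    N: int
    LS: np.ndarray
    SS: float

    @classmethod
    def from_points(cls, pts):
        pts = np.asarray(pts, dtype=float)
        return cls(len(pts), pts.sum(axis=0), float((pts ** 2).sum()))

    def merge(self, other):
        # Merging two clusters only requires adding their CF triples;
        # the original data points never need to be touched again.
        return CF(self.N + other.N, self.LS + other.LS, self.SS + other.SS)

    def centroid(self):
        return self.LS / self.N

a = CF.from_points([[1, 2], [3, 4]])
b = CF.from_points([[5, 6]])
merged = a.merge(b)          # CF of the union of both clusters
```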
CURE
In the aforementioned algorithms, each cluster is represented by a single point, and clustering proceeds on that basis. As long as clusters are spherical this works well, but real-world data can be much more complicated. To address this issue, the CURE algorithm [16] uses more than one point to represent a cluster. The algorithm initially treats each data point as its own cluster and merges clusters step by step until the predefined number of clusters is reached; at each step, the two clusters whose representatives are closest to each other are merged. Two data structures, a heap and a k-d tree [17], are used in CURE to make the clustering efficient: the heap maintains the distances between clusters, and the k-d tree stores the representative points of the clusters. CURE also uses sampling to accelerate the computations: it first draws a sample of the dataset and then performs the above process on it, using the Chernoff bound to determine the sample size. When the data are massive the required sample is still large, and repeated sampling consumes much time, so CURE additionally uses partitioning to speed up the algorithm. If n denotes the full dataset and m the sample, CURE partitions m into p parts and clusters each part hierarchically until the termination condition of the algorithm is reached; then a further clustering pass is run over the results of all p partitions. Finally, all points of n not included in m are assigned to the closest cluster. The results show that the execution of CURE is slow, but the algorithm is more resistant to noisy data.
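The sketch below illustrates, under stated assumptions, how multiple representative points can be chosen for a cluster (a farthest-point heuristic shrunk toward the centroid, with illustrative parameters `c` and `alpha`) and how the cluster-to-cluster distance that CURE's heap is keyed on can be taken as the minimum distance between two clusters' representatives; it is not the exact implementation of [16].

```python
import numpy as np

def representative_points(points, c=4, alpha=0.5):
    """Pick up to c well-scattered points and shrink them toward the centroid."""
    points = np.asarray(points, dtype=float)
    centroid = points.mean(axis=0)
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(c, len(points)):
        # Farthest-point heuristic: the next representative maximizes its
        # distance to the representatives chosen so far.
        d = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(d)])
    reps = np.array(reps)
    return reps + alpha * (centroid - reps)   # shrink toward the centroid

def cluster_distance(reps_a, reps_b):
    """Distance between two clusters = min distance between their representatives."""
    return np.min(np.linalg.norm(reps_a[:, None, :] - reps_b[None, :, :], axis=2))
```

Using several shrunken representatives instead of a single centroid is what lets the merge criterion follow non-spherical cluster shapes.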
BKSK
[18] proposes a batch-based algorithm in which batches are clustered in parallel on a single machine. The proposed algorithm splits the dataset and consists of several stages: the input dataset is divided into batches, clustering is applied to each batch as a separate dataset, the initial centroids are selected randomly, and each batch is processed in parallel until the convergence condition is met. The algorithm minimizes the sum of squared errors over all clusters. As a result, partitioning the data into batches before clustering and processing them in parallel reduces the computation time.
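A rough single-machine sketch of this batch-parallel idea is given below, assuming K-Means as the per-batch clusterer and a process pool for parallelism (the function names, the number of batches, and the use of scikit-learn are assumptions for illustration; how [18] combines the per-batch results is not reproduced here). On platforms that spawn rather than fork processes, the call must stay under the `__main__` guard.

```python
import numpy as np
from concurrent.futures import ProcessPoolExecutor
from sklearn.cluster import KMeans

def cluster_batch(batch, k, seed=0):
    """Cluster one batch and return its centroids and sum of squared errors."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(batch)
    return model.cluster_centers_, model.inertia_   # inertia_ is the batch SSE

def batch_parallel_kmeans(X, k, n_batches=4):
    """Split X into batches and cluster each batch in parallel."""
    batches = np.array_split(X, n_batches)
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(cluster_batch, batches, [k] * n_batches))
    centers = np.vstack([c for c, _ in results])
    total_sse = sum(sse for _, sse in results)
    return centers, total_sse

if __name__ == "__main__":
    X = np.random.default_rng(0).normal(size=(8_000, 4))
    centers, sse = batch_parallel_kmeans(X, k=3)
```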
Although the time and space complexity of clustering algorithms depends on the number of data points in the dataset, the high dimensionality of the data is another crucial challenge. More dimensions mean more features, which results in a longer execution time. Sampling techniques decrease the volume of the dataset but have no effect on its dimensionality.
Locality-preserving projection
In the locality-preserving projection method, it is essential that after projecting high-dimensional data into fewer dimensions, the distances among points are still preserved, so that the dimension reduction does not distort the relationships between points or harm the generality of the data. Thus, the distances in the reduced-dimensional space should be a good approximation of the distances in the original space. Random projection is carried out as a linear transformation of the matrix that contains the original data of the dataset. If R is taken as a rotation matrix of size \(d \times t\) (\(t \ll d\)), where d is the number of original dimensions and t is the number of projected dimensions, and each cell R(i, j) of R is a random variable, then \(A^{'}=A \cdot R\) is the projection of A into t dimensions. The way the rotation matrix is created distinguishes different projection algorithms; one common choice is to fill it with random values drawn from a normal distribution with mean 0 and variance 1, although this is only one of the available methods. Clustering is applied after A has been transformed and projected to \(A^{'}\). [19] and [20] are examples of implementations of this approach, and method [21] was recently presented to reduce its execution time and improve its performance.
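The construction of \(A^{'}=A \cdot R\) described above can be sketched in a few lines; the \(1/\sqrt{t}\) scaling below is a common normalization used to approximately preserve pairwise distances (an assumption added for the sketch, not stated in the text).

```python
import numpy as np

def random_projection(A, t, seed=0):
    """Project the d-dimensional rows of A into t dimensions (t << d) using
    a Gaussian random matrix R with entries drawn from N(0, 1)."""
    d = A.shape[1]
    rng = np.random.default_rng(seed)
    R = rng.normal(loc=0.0, scale=1.0, size=(d, t)) / np.sqrt(t)
    return A @ R   # A' = A . R

A = np.random.default_rng(1).normal(size=(1_000, 500))
A_proj = random_projection(A, t=20)   # cluster A_proj instead of A
```

Clustering then runs on the 20-dimensional `A_proj`, where distance computations are far cheaper than in the original 500-dimensional space.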
Global projection
In global projection, the main goal is for the projected data to be very close to the original data, whereas in local projection the goal is that distances between points in the original space are well approximated in the projection space. In other words, if the original data matrix is A and the approximation matrix is \(A^{'}\), the goal of global projection is to minimize \(\left\| A^{'}-A\right\|\). Singular value decomposition (SVD) [22], CX/CUR [23], CMD [23], and Colibri [24] are among the global projection algorithms.
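As an example of the SVD route, the sketch below builds the rank-t approximation that minimizes \(\left\| A^{'}-A\right\|\) among all rank-t matrices (by the Eckart-Young theorem); the function name and the choice of t are illustrative.

```python
import numpy as np

def svd_low_rank(A, t):
    """Rank-t approximation A' of A via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    A_approx = (U[:, :t] * s[:t]) @ Vt[:t, :]   # best rank-t reconstruction of A
    A_reduced = U[:, :t] * s[:t]                # t-dimensional representation for clustering
    return A_approx, A_reduced

A = np.random.default_rng(0).normal(size=(200, 50))
A_approx, A_reduced = svd_low_rank(A, t=10)
err = np.linalg.norm(A - A_approx)              # the quantity global projection minimizes
```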
Multi-machine clustering techniques
Sampling and dimension reduction methods improve the performance of clustering algorithms on big datasets, but with ever-growing data volumes these methods are no longer sufficient. That is why algorithms that can be executed in a decentralized manner, following parallel and distributed paradigms, are needed.
Generally, decentralized clustering algorithms can be categorized into parallel paradigm-based and MapReduce-based algorithms.
In parallel clustering [25], developers must deal not only with parallelization challenges but also with data partitioning challenges. The difference between the parallel paradigm and the MapReduce paradigm lies in the convenience offered to developers: in the MapReduce paradigm, all data partitioning and data transfers among machines are handled by the system, which improves parallelization and reliability.
The overall procedure of decentralized clustering algorithms is shown in Fig. 3. In the first step the data are split among different machines; then each machine clusters its own portion of the data. The main challenges here are minimizing the data transfer traffic among machines and the lower accuracy compared to the sequential model. The lower accuracy arises for two reasons: (1) different machines may execute different clustering algorithms, and (2) the way the data are split may change the final result of the algorithm.
In parallel algorithms, the data reside in shared storage; different processors access this shared storage, pick up pieces of the data, and process them. In distributed algorithms, on the other hand, the data are spread across multiple physical machines, and each machine performs the necessary processing on its own data. In other words, in parallel algorithms only the processing is parallelized, whereas in distributed algorithms both the data and the processing are parallelized.
Parallel clustering methods
Although parallel clustering algorithms are complex for developers, they remain a viable solution when the data volume grows. Some of these algorithms are discussed in the following.
DBDC [26, 27] is a parallel density-based [28] algorithm. In density-based algorithms, the main goal is to identify clusters from the shape of the data: the density of points inside a cluster is much higher than the density outside it, and the density of noise points is much lower than the density of the clusters. In DBDC, clustering is performed both locally and globally. For local clustering an arbitrary (default) algorithm is used, while for global clustering the density-based algorithm DBSCAN is used to finalize the results [29]. The results show that DBDC preserves the clustering quality while running about 30 times faster than its sequential counterpart.
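The density-based idea itself can be shown with a small, hedged example using scikit-learn's DBSCAN on synthetic data (the data, `eps`, and `min_samples` values are illustrative and unrelated to DBDC's actual configuration): dense regions become clusters and sparse points are labeled as noise.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus sparse uniform noise: a density-based algorithm labels
# the dense regions as clusters and the sparse points as noise (label -1).
rng = np.random.default_rng(0)
dense = np.vstack([rng.normal(0, 0.2, (100, 2)), rng.normal(5, 0.2, (100, 2))])
noise = rng.uniform(-2, 7, (20, 2))
X = np.vstack([dense, noise])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)   # expected: 2
```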
parMETIS [30] is the parallelized version of METIS [31]; it is a graph clustering and partitioning algorithm. parMETIS is implemented in three general steps. In the first step, a coarse subgraph of the data is extracted, selected according to the degrees of the vertices. In the next step, the initial partitioning is performed and this subgraph is clustered. In the third step, the vertices of each subgraph are traversed using a breadth-first search. Together, these three phases partition the main graph into several clusters (partitions).
A recent trend in parallel computation is to use the GPU instead of the CPU to increase computation speed, because a GPU can perform millions or even billions of calculations per second. G-DBSCAN [29] is a GPU-based distributed clustering algorithm built on the density-based DBSCAN, and it is one of the newest methods of this kind. G-DBSCAN has two main phases, both of which are parallelized: (1) graph construction, where each data point is a vertex and an edge is created between two points whenever their distance is below a predefined threshold; and (2) cluster identification, performed with a breadth-first search (BFS) over the graph built in the previous step. The results show that G-DBSCAN is 112 times faster than its sequential version.
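A sequential CPU sketch of these two phases is given below for clarity; the GPU parallelization that gives G-DBSCAN its speedup is not shown, and the function names and the `eps` threshold are illustrative.

```python
import numpy as np
from collections import deque

def build_graph(X, eps):
    """Phase 1: vertices are points; an edge links points closer than eps."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    adj = d <= eps
    np.fill_diagonal(adj, False)
    return adj

def bfs_clusters(adj):
    """Phase 2: assign cluster ids by BFS over the connected components."""
    n = len(adj)
    labels = np.full(n, -1)
    cid = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        queue = deque([start])
        labels[start] = cid
        while queue:
            v = queue.popleft()
            for u in np.flatnonzero(adj[v]):
                if labels[u] == -1:
                    labels[u] = cid
                    queue.append(u)
        cid += 1
    return labels

X = np.random.default_rng(0).normal(size=(300, 2))
labels = bfs_clusters(build_graph(X, eps=0.3))
```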
MapReduce clustering algorithms
Although parallel clustering algorithms improve scalability and efficiency, the complexity and difficulty of distributing storage and processing among different machines remain the main challenge. To address this issue, the MapReduce framework was released. In this section, we review algorithms that use the MapReduce paradigm and architecture. Figure 4 illustrates the architecture of the MapReduce framework, which is used to solve massive computational problems on large-scale data in distributed computing environments.
The algorithms discussed in this section are assessed according to the following three aspects: (1) speed up, defined as the ratio of the application's execution time on a single processor to the execution time of the same workload on a system composed of P processors; (2) scale up, which measures whether a system X times larger can perform a job X times larger in the same execution time; and (3) size up, which measures how the execution time grows as the volume of data grows on a fixed system.
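The three metrics are simple ratios; the helper functions below show how they could be computed from measured run times (all timing numbers here are hypothetical and only for illustration).

```python
def speed_up(t_single, t_parallel):
    """Speed up = time on one processor / time of the same job on P processors."""
    return t_single / t_parallel

def scale_up(t_baseline, t_scaled):
    """Scale up = time of a 1x job on a 1x system / time of an Xx job on an Xx
    system; a value close to 1 means the system scales well."""
    return t_baseline / t_scaled

def size_up(t_small, t_large):
    """Size up = how much longer the same system takes on a larger dataset."""
    return t_large / t_small

# Hypothetical timings (seconds):
print(speed_up(t_single=400.0, t_parallel=120.0))   # ~3.3x with 4 machines
print(scale_up(t_baseline=400.0, t_scaled=500.0))   # 0.8: modest loss when scaling
print(size_up(t_small=100.0, t_large=210.0))        # 2.1x time for 2x data
```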
PKMeans [32] is the distributed version of the K-Means [33] clustering algorithm. The main goal of K-Means is to partition the dataset into K clusters such that data points within the same cluster are as similar as possible, while data points in different clusters are as different as possible. The algorithm first selects K data points at random and then repeats the following two phases: assigning each data point to the nearest cluster, and updating each cluster center with the mean of the data points in that cluster.
PKMeans distributes the computations among different machines to increase speed up and scale up. The initial clustering is done in the map phase and the final clustering in the reduce phase. PKMeans has nearly linear size up and speed up, achieves a scale up of 0.75 on four machines, and produces clustering of the same quality as its sequential counterpart.
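A single-process simulation of one MapReduce round of this K-Means scheme is sketched below (the function names are illustrative and the combiner stage of PKMeans is omitted): the map phase emits (nearest-center, point) pairs for each data split, and the reduce phase averages the points per center.

```python
import numpy as np

def map_phase(points, centers):
    """Map: emit (nearest-center index, point) pairs for one data split."""
    for p in points:
        key = int(np.argmin(np.linalg.norm(centers - p, axis=1)))
        yield key, p

def reduce_phase(pairs, k, dim):
    """Reduce: average the points assigned to each center to get new centers."""
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for key, p in pairs:
        sums[key] += p
        counts[key] += 1
    return sums / np.maximum(counts[:, None], 1)

def kmeans_mapreduce_iteration(splits, centers):
    """One MapReduce round: map every split, then reduce all emitted pairs."""
    k, dim = centers.shape
    pairs = [pair for split in splits for pair in map_phase(split, centers)]
    return reduce_phase(pairs, k, dim)

rng = np.random.default_rng(0)
splits = [rng.normal(size=(500, 2)) for _ in range(4)]   # data splits on 4 "machines"
centers = rng.normal(size=(3, 2))                        # K = 3 random initial centers
for _ in range(10):
    centers = kmeans_mapreduce_iteration(splits, centers)
```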
The MR-DBSCAN [34] algorithm is the MapReduce-based version of DBSCAN. The main challenges in parallelizing DBSCAN are achieving load balance among machines and achieving scale up, because most of its essential functions cannot easily be parallelized. In MR-DBSCAN, a new mechanism for partitioning the data and the computations is introduced so that most of the essential functions can be parallelized. Experiments on large datasets have shown its efficiency and scale up.
In [35], an algorithm is presented for clustering big datasets that reduces the data volume transferred among machines. The algorithm is divided into three phases: (1) on each machine, data clusters are found by the Canopy algorithm [36] to identify the best clusters; because the data shapes may not be spherical, several points, rather than a single point, are used to represent each cluster and capture its shape complexity, the Mahalanobis distance [37] is used to select these points, and only these points are transferred among machines for the final clustering; (2) a weighted clustering is performed over the points selected in the previous phase across the different machines; and (3) a Bayesian classification algorithm [37] is used to calculate the probability that a point belongs to a cluster.
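As a small aside on the distance used in phase (1), the Mahalanobis distance of a point from a cluster can be computed as below (the function name is illustrative, and how [35] selects representatives from these distances is not reproduced).

```python
import numpy as np

def mahalanobis(x, points):
    """Mahalanobis distance of x from the distribution estimated from `points`."""
    mu = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)      # needs more points than dimensions
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

pts = np.random.default_rng(0).normal(size=(200, 3))
d = mahalanobis(pts[0], pts)
```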
The possibilistic C-Means algorithm (PCM) [38] is one of the well-known clustering algorithms in data mining. A weighted PCM algorithm (wkPCM) is presented in [39] for clustering massive data in a distributed manner using MapReduce. Experiments showed that this algorithm is capable of clustering big data in an acceptable time. Both the time and space complexity of this algorithm are \(O(n^2)\).
In [40], the iterative K-Means clustering algorithm [41] (IKM) is presented using MapReduce. In the map phase, IKM runs on the loaded data split; then, in the reduce phase, the IKM algorithm runs again on the results produced by the map phase. It is claimed that the dataset needs only one scan, so the data volume transferred between the map and reduce phases is very low.
In [42], an improved map-shuffle-reduce scheme is used to present a clustering algorithm for spatial big datasets. In the map phase, the dataset is divided into very small segments and the closest neighbor of each data point is identified (in one scan). In the shuffle phase, the results of the previous phase are passed, in order, to the reduce phase. Finally, in the reduce phase, all points inside each cluster are averaged to calculate the new cluster centers.
[43] introduces the design and implementation of a density-based clustering algorithm. It presents a parallel Shared Nearest Neighbor (SNN) clustering algorithm that uses a k-dimensional tree (k-d tree) to reduce the search time and improve efficiency.
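To illustrate the role of the k-d tree in an SNN-style method, the hedged sketch below uses SciPy's `cKDTree` to find each point's k nearest neighbors and then counts shared neighbors between pairs of points (the function name, the value of k, and the pairwise similarity matrix are illustrative; [43] parallelizes this computation, which is not shown here).

```python
import numpy as np
from scipy.spatial import cKDTree

def snn_similarity(X, k=5):
    """Shared-nearest-neighbor similarity: for each pair of points, count how
    many of their k nearest neighbors (found via a k-d tree) they share."""
    tree = cKDTree(X)
    _, nbrs = tree.query(X, k=k + 1)           # first neighbor is the point itself
    nbr_sets = [set(row[1:]) for row in nbrs]
    n = len(X)
    sim = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            sim[i, j] = sim[j, i] = len(nbr_sets[i] & nbr_sets[j])
    return sim

X = np.random.default_rng(0).normal(size=(200, 3))
sim = snn_similarity(X, k=5)
```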
[44] presents a new meta-heuristic-based clustering method to solve the big data clustering problem. It leverages the search behavior of a military dog squad to find the optimal centroids and the MapReduce architecture to handle big datasets.