
An ensemble method for estimating the number of clusters in a big data set using multiple random samples

Abstract

Clustering a big dataset without knowing the number of clusters presents a big challenge to many existing clustering algorithms. In this paper, we propose a Random Sample Partition-based Centers Ensemble (RSPCE) algorithm to identify the number of clusters in a big dataset. In this algorithm, a set of disjoint random samples is selected from the big dataset, and the I-niceDP algorithm is used to identify the number of clusters and initial centers in each sample. Subsequently, a cluster ball model is proposed to merge two clusters in the random samples that are likely sampled from the same cluster in the big dataset. Finally, based on the ball model, the RSPCE ensemble method is used to ensemble the results of all samples into the final result as a set of initial cluster centers in the big dataset. Intensive experiments were conducted on both synthetic and real datasets to validate the feasibility and effectiveness of the proposed RSPCE algorithm. The experimental results show that the ensemble result from multiple random samples is a reliable approximation of the actual number of clusters, and the RSPCE algorithm is scalable to big data.

Introduction

In this paper, we propose an ensemble method for estimating the number of clusters in a big dataset, which is the most important parameter in many clustering algorithms such as k-means. This parameter is usually unknown in unlabeled data and is often guessed by the user, and incorrect guesses lead to inaccurate clustering results. Therefore, finding this number can improve the clustering result. However, automatic identification of the number of clusters in a big dataset is a challenge to the classical methods, e.g., Elbow [1], Silhouette [2], Gap statistic [3], and I-nice [4], due to the data size and the complexity of the inherent clusters in the data. The strategy we take here is to use multiple samples of the big dataset to estimate several possible values of the number of clusters and then ensemble the multiple results to improve the final estimate.

Since it is hard or impractical to investigate a big dataset entirely, using a random sample to compute an approximate result for the whole big dataset often becomes imperative [5, 6]. Random sampling is also a popular technique widely used by data scientists to quickly gain insights from a big dataset, despite theoretical and empirical evidence of the benefits of other sampling techniques [7]. However, sampling a big dataset is an error-prone and inefficient process when the dataset cannot be held in memory. Furthermore, to obtain accurate approximations, it is essential to efficiently select random samples from a big dataset and, at the same time, to guarantee the quality of the selected samples. As such, we adopt the random sample partition (RSP) [8] data model to represent a big dataset, which allows block-level sampling methods to be used to efficiently select multiple random samples from the big dataset. The critical technical challenges of partitioning and sampling big datasets are highlighted in [9].

The ensemble approach is widely used to tackle the problems of clustering complex data. For clustering big complex datasets, developing scalable and appropriate ensemble methods is the main challenge. The classical clustering ensemble methods combine the outcomes of different models or algorithms on the same dataset to produce an ensemble result, but they are only appropriate for small or moderate-sized datasets. For ensemble clustering of a big dataset, the results obtained from different disjoint random samples of the big dataset must be ensembled. In this case, each object appears in only one clustering result, and the object identities across different clustering results are lost, so the classical clustering ensemble methods are no longer applicable. Therefore, it is necessary to investigate an appropriate ensemble model with new integration functions to ensemble the clustering results from the disjoint random samples.

In response, in this paper we investigate a new clustering ensemble method that is data-adaptive and approximate in estimating the number of clusters. For large-scale data clustering, we aim to develop a feasible distributed clustering algorithm that (i) effectively incorporates a scalable serial algorithm, (ii) runs efficiently on a distributed platform, and (iii) does not require processing the entire dataset. To achieve this goal, we propose a new ensemble algorithm for estimating the number of clusters in a big dataset using multiple random samples. We name our algorithm RSPCE, short for RSP-based Centers Ensemble. The RSPCE algorithm includes the following steps: (1) division of a big dataset into subsets of random samples, called RSP data blocks, which form the RSP data model; (2) random selection of a subset of RSP data blocks and identification of the number of clusters and the initial cluster centers in each RSP data block by the I-niceDP [10] algorithm; and (3) generation of the final ensemble result as a set of initial centers of K clusters by the RSPCE ensemble method, which uses the cluster ball model to merge clusters in random samples that are likely sampled from the same cluster in the big dataset. Unlike the classical ensemble clustering methods, the RSPCE algorithm does not depend on common object ids to ensemble the component clustering results.

We conducted experiments on both synthetic and real-world datasets. The experiment results have shown that the new method is computationally effective and efficient in finding the number of clusters in a big dataset and their initial cluster centers. The experiment results also show that the RSPCE algorithm produces good approximations of the actual numbers of clusters in both synthetic and real-world datasets.

The rest of the paper is organized as follows: In Sect. Related work, we briefly discuss existing work on large-scale data clustering processes. In Sect. Ensemble method for estimating the number of clusters, we introduce preliminaries in the context of this work. In Sect. The proposed RSPCE algorithm, we present the proposed RSPCE scheme. In Sect. Experiments, we evaluate the performance of the proposed scheme through experimental results. Finally, Sect. Conclusions concludes the paper.

Related work

The number of clusters is an important parameter in many well-established clustering algorithms (e.g., k-means, k-medoids). This number is unknown in unlabeled datasets, and it is often guessed by the user. The Elbow [1], Silhouette coefficient [2], and Gap statistic [3] are well-known methods for finding the “right” number of clusters in a dataset. These methods identify the number of clusters in a dataset by measuring the quality of several clustering results with different numbers of clusters. However, these methods do not work well on big datasets with a large number of clusters, and they are computationally expensive to use because multiple clustering results must be generated.

I-nice [4] is a prominent density-based algorithm for estimating the number of clusters, and it can produce high-quality initial seeds on small datasets. In the I-nice algorithm, observation points are placed in the data space to observe the dense regions of clusters through the distance distributions between the observation points and the objects. Then, to find the number of peaks in a distance distribution, multiple gamma mixture models (GMMs) with different numbers of components are built, and each GMM is solved with the EM algorithm. The corrected Akaike information criterion (AICc) is used to select the best-fitted model, and the largest number of components observed is taken as the number of clusters.

Automatic clustering algorithms are attracting more attention from the academic community, e.g., density-based clustering and data-depth clustering [11, 12]. Density-based algorithms, such as DBSCAN, can cluster datasets with convex shapes and noisy objects, but the density threshold is difficult to determine [13]. The depth difference method [14] estimates the depth within clusters, the depth between clusters, and the depth difference to determine the optimal value of K. However, for datasets with complex decision graphs, it is difficult to correctly identify the clustering centers.

In practice, according to their data preprocessing strategies, big data clustering processes can be classified as sampling-based, incremental, condensation-based, and divide-and-conquer strategy-based. Sampling-based techniques typically select a subset of a given dataset, employ only the sampled subsets to find the number of clusters, and then extend the result to the remaining data to obtain the final outcome [15, 16]. The success of sample-based methods depends on the premise that the chosen representative samples carry sufficient information about the dataset. Incremental approaches, on the other hand, reduce the computation time by scanning the data points only once [17, 18]. For fast clustering, modified global k-means [19] and multiple medoids-based fuzzy clustering [17] methods have been developed based on the idea of incremental clustering. Condensation-based methods speed up the computation by encapsulating the data into a special data structure, such as a tree or graph [20, 21]. Divide-and-conquer strategies split the big dataset into several subsets or sub-spaces that can fit into memory. The clustering algorithms are then applied to these subsets or sub-spaces independently (see [22,23,24]). The final clustering results are obtained by merging the partial clusters of the subsets or sub-spaces.

For big data clustering, a bootstrap method [25] was proposed to estimate the number of clusters by minimizing the corresponding estimated clustering instability. The kluster [26] procedure takes randomly selected clusters as the initial seeds to determine the final number of clusters in a dataset iteratively, confirming the most frequent mean of the resulting clusters over the iterations as the optimal number of clusters. The X-means method [27], on the other hand, automatically determines the number of clusters based on Bayesian information criterion (BIC) scores. At each iteration, this method makes local decisions about splitting the current centers to better fit the data. In the same vein, coresets [28] have been constructed to scale to massive datasets following the distributed clustering idea.

Although many clustering ensemble techniques have been developed (such as [22, 23, 29,30,31]), they either produce incorrect results or are inefficient for use in big data applications. The traditional methods [1,2,3] for finding the K value run the clustering algorithm several times with a different K value for each run, which is not suitable for big data analysis. We observe that there are two limitations to the traditional clustering approaches. First, in a typical scenario, the algorithms require the number of clusters in advance for the clustering process. Second, traditional algorithms operate at the object level and are incapable of dealing with clustering ensembles on big data and large ensemble sizes. Furthermore, the classical ensemble technique combines the results of different models or algorithms on the same dataset to produce a robust result, whereas a scalable method is required to identify the number of clusters in a big dataset. We focus on the data-subset clustering ensemble technique, an approximate computing method that estimates the correct clustering outcome from subsets of a big dataset.

Ensemble method for estimating the number of clusters

In this section, we propose a new method that uses multiple random samples of a big dataset to identify the number of clusters. We first give the definition of a random sample from a big dataset and define the random sample partition data model to represent a big dataset as a partition of random samples. Then, we present the I-niceDP method for finding the number of clusters in a random sample. Finally, we propose a ball model for integrating the results of multiple random samples into the ensemble result as the number of clusters in the big dataset.

Multiple random samples of a big dataset

The existing methods are computationally infeasible for finding the number of clusters in a big dataset. Instead, using a random sample to estimate it is an acceptable choice. However, a large sample may yield an accurate estimate but is computationally expensive. An alternative is to use multiple random samples of smaller size for this purpose. In this case, we need to solve the problems of drawing multiple random samples from a big dataset effectively and efficiently, and of ensembling the results of those multiple random samples into the final result.

Definition 1

(Random sample of a big dataset) Let D be a subset of a big dataset \({\mathbb {D}}\), i.e., \(D \subset {\mathbb {D}}\). D is a random sample of \({\mathbb {D}}\) if

$$\begin{aligned} F(D) \approx F({\mathbb {D}}) \end{aligned}$$
(1)

where F() is the cumulative distribution function.

The simple random sampling process can be used on \({\mathbb {D}}\) to generate D, which satisfies this definition. However, sampling a distributed big data file to generate multiple independent random samples is a time-consuming process. In this work, we use the random sample partition data model to represent a big dataset as a set of random sample data blocks, so the block-level sampling method is used to efficiently select multiple random samples.

The random sample partition (RSP) [8] is defined as follows:

Definition 2

(Random sample partition of a big dataset) Let \({\mathbb {D}}\) be a big dataset and \(\{D_1,D_2,...,D_m\}\) be a set of m random samples of \({\mathbb {D}}\). \(\{D_1,D_2,...,D_m\}\) is a random sample partition of \({\mathbb {D}}\) if

  • \(D_i \ne \emptyset\)

  • \(D_i\cap D_j = \emptyset\) for \(1 \le i \ne j \le m\)

  • \(\bigcup _{i=1}^{m}D_i ={\mathbb {D}}\)

  • \(F(D_i) \approx F({\mathbb {D}}), \qquad 1 \le i \le m\)

The first three conditions define a partition of \({\mathbb {D}}\), whereas the last condition characterizes a random sample partition.

In the random sample partition, all RSP data blocks \(\{D_1,D_2,...,D_m\}\) satisfy the definition of a random sample of \({\mathbb {D}}\). To generate multiple random samples, we simply randomly select a few RSP data blocks from the RSP data model without going through all the records of \({\mathbb {D}}\) several times. As a result, the sampling process of multiple random samples is improved significantly.
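Because block-level selection only touches files rather than records, drawing b random samples reduces to picking b block files at random. The snippet below is a minimal sketch of this step; the directory layout and the one-CSV-file-per-block naming scheme are our own assumptions for illustration, not part of the original implementation.

```python
import glob
import random
from typing import List

import pandas as pd


def select_random_blocks(rsp_dir: str, b: int, seed: int = 0) -> List[pd.DataFrame]:
    """Randomly select b RSP data blocks and load them as data frames.

    Since every RSP block is itself a random sample of the big dataset,
    picking block files uniformly at random replaces record-level sampling.
    """
    block_files = sorted(glob.glob(f"{rsp_dir}/block_*.csv"))  # assumed naming scheme
    random.seed(seed)
    chosen = random.sample(block_files, b)
    return [pd.read_csv(f) for f in chosen]
```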

Finding the number of clusters in a random sample

Given a set of random samples as a set of RSP data blocks \(\{D_1, D_2,..., D_b\}\) where \((b < m)\), one important step is to find the number of clusters in each sample \(D_i\). In this work, we are not only interested in the number of clusters K, but also the initial centers of the K clusters in each subset \(D_i\). For this reason, we first use the density peak-based algorithm \(\text{ I-niceDP }\) [10] to compute K and the initial centers of clusters in a random sample. Then, we use the k-means algorithm to cluster the random sample and refine the cluster centers. The \(\text{ I-niceDP }\) operator is defined as follows:

$$\begin{aligned} (k_i, C_i)=\text{ I-niceDP }(D_i), \qquad 1 \le i \le b \end{aligned}$$
(2)

where \(\text{ I-niceDP }\) is an operator on random sample \(D_i\), and \((k_i, C_i)\) are the two return values of the function. The first term \(k_i\) is the number of clusters in \(D_i\), and the second term \(C_i= \{c_1, c_2,...,c_{k_i}\}\) is the set of centers of the \(k_i\) clusters.

After \((k_i, C_i)\) are obtained from \(D_i\), they are used as the input parameters to the k-means algorithm to compute k refined cluster centers in \(D_i\) as follows:

$$\begin{aligned} C_{i}^{*}=k\text{-means }(k_i, C_i, D_i), \qquad 1 \le i \le b \end{aligned}$$
(3)

where \(C_{i}^{*}=\{c_1^{*}, c_2^{*},...,c_{k_i}^{*}\}\) is the set of the refined centers of the \(k_i\) clusters in \(D_i\).

Applying the two operators \(\text{ I-niceDP }\) and \(k\text{-means }\) to all b random samples \(\{D_1, D_2,..., D_b\}\), we obtain b sets of refined centers and take the union of these sets to form a new set as

$$\begin{aligned} C^{*}=C_{1}^{*} \cup C_{2}^{*} \cup ... \cup C_{b}^{*}, \end{aligned}$$
(4)

where \(C_{i}^{*}\) is the result of \(k\text{-means }(k_i, C_i, D_i)\).

The set \(C^{*}\) contains a total of \(K=\sum _{i=1}^b{k_i}\) cluster centers from the b random samples. Since the random samples are taken from the same big dataset, they should have similar inherent clusters. Therefore, the numbers of clusters in them should be very close to each other, and the centers of clusters in different random samples should also be located close to each other. Considering these properties, in the next subsection we propose a method, called the ball model, to aggregate the nearby centers in \(C^{*}\) into an ensemble set of centers used as the initial cluster centers in the big dataset.
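As a concrete sketch of Eqs. (2)-(4), the snippet below runs the per-sample estimation and refinement and then pools the refined centers. Here `i_nice_dp` is a hypothetical stand-in for the I-niceDP operator of [10] (it is not a public library function), and scikit-learn's KMeans is used for the refinement step.

```python
from typing import Callable, List

import numpy as np
from sklearn.cluster import KMeans


def refine_sample_centers(sample: np.ndarray, i_nice_dp: Callable) -> np.ndarray:
    """Apply Eqs. (2) and (3) to one random sample D_i."""
    k_i, centers_i = i_nice_dp(sample)                        # Eq. (2): k_i and initial centers C_i
    km = KMeans(n_clusters=k_i, init=np.asarray(centers_i), n_init=1).fit(sample)
    return km.cluster_centers_                                # Eq. (3): refined centers C_i^*


def pool_centers(samples: List[np.ndarray], i_nice_dp: Callable) -> np.ndarray:
    """Union of the refined centers over all b samples (Eq. (4))."""
    return np.vstack([refine_sample_centers(s, i_nice_dp) for s in samples])
```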

Ball model for representing clusters

Given a random sample \(D_i\), the k-means operator of (3) generates a set of \(k_i\) clusters. The centers of clusters in the same random sample should be separate, but the centers of clusters in different random samples can be very close to each other. In this case, the two clusters in the two random samples may represent the same cluster in the big dataset. Therefore, the two centers should be merged into one, indicating the same cluster of the big dataset. In this subsection, we will determine whether two clusters in two different random samples represent the same cluster in the big dataset.

Since the k-means clustering process produces spherical clusters, we propose a ball model to represent a spherical cluster as a ball. The main features of this ball model are defined below.

Definition 3

(Radius of a cluster ball) Let \(C_i\) be a cluster of n points, and \(c_i\) the center of the cluster. The radius of the cluster ball, \(r_i\), is defined as the average of the distances between all points in \(C_i\) and its center \(c_i\), i.e.,

$$\begin{aligned} r_i=\frac{1}{n}\sum _{j=1}^{n} \Vert c_i -x_j \Vert , \end{aligned}$$
(5)

where \(x_j\) is an object in \(C_i\).

We use the average distance to define the radius of the cluster ball to reduce the impact of outliers, i.e., a few points far away from the cluster center.

Definition 4

(Cluster ball) Let \(C_i\) be a cluster. Its cluster ball, denoted as \(\text{ CB}_i\), is defined as a 3-tuple

$$\begin{aligned} \text{ CB}_i=(C_i,c_i,r_i), \end{aligned}$$
(6)

where \(c_i\) and \(r_i\) are the center and radius of cluster \(C_i\), respectively. Note that, in the literature, “centers” and “centroids” are used interchangeably with the same meaning.

We can use cluster balls to determine whether two clusters are well separated or overlapping.

Definition 5

(Well-separated and overlapping clusters) Let \(\text{ CB}_i\) and \(\text{ CB}_j\) be two balls of clusters \(C_i\) and \(C_j\), respectively. We say that \(C_i\) and \(C_j\) are well-separated if \(\text{ CB}_i\) and \(\text{ CB}_j\) are disjoint. If \(\text{ CB}_i\) and \(\text{ CB}_j\) intersect, we say that \(C_i\) and \(C_j\) are overlapping.

Fig. 1 Illustration of separate and overlapping cluster balls. a and b show the cluster balls of two RSP blocks, three cluster balls each; c the disjoint and intersecting cluster balls from the two RSP blocks

Figure 1 illustrates an example of merging the clusters from two random samples, RSP block i in Fig. 1a and RSP block j in Fig. 1b. Three clusters are found in each random sample and represented as three cluster balls. Figure 1c shows that clusters \(C_1\) and \(C_6\) intersect and clusters \(C_3\) and \(C_5\) intersect, whereas clusters \(C_2\) and \(C_4\) are separated from the others. Two overlapping clusters are likely to be the same cluster of the big dataset. The definition below gives a property used to decide whether to merge two overlapping clusters.

Definition 6

(\(\frac{1}{2}\)-ball intersection property) Let \(\text{ CB}_i\) and \(\text{ CB}_j\) be two cluster balls. We say that \(\text{ CB}_i\) and \(\text{ CB}_j\) have the \(\frac{1}{2}\)-ball intersection property if \(\Vert c_i-c_j \Vert \le \frac{1}{2}(r_i+r_j)\), where \(r_i, r_j > 0\). If two cluster balls have the \(\frac{1}{2}\)-ball intersection property, they are strongly proximal.

This property is used to determine whether two clusters found from different random samples indicate the same cluster of the big dataset. If so, they can be merged into one cluster. If two clusters indicate the same cluster in the big dataset, their cluster balls must intersect and satisfy the \(\frac{1}{2}\)-ball intersection property. In this case, the intersecting cluster balls from the random samples are merged to obtain the optimal number of clusters in the big dataset. For example, in Fig. 1c, cluster balls \(\text{ CB}_1\) and \(\text{ CB}_6\) satisfy this property, so they are likely sampled from the same cluster of the big dataset and need to be merged into one cluster.
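The ball model of Definitions 3-6 is straightforward to implement. The following is a minimal sketch, with our own class and function names; it is meant only to make the radius and the \(\frac{1}{2}\)-ball test concrete.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ClusterBall:
    points: np.ndarray   # objects assigned to the cluster, shape (n, d)
    center: np.ndarray   # cluster center c_i, shape (d,)
    radius: float        # average distance to the center (Definition 3)


def make_ball(points: np.ndarray, center: np.ndarray) -> ClusterBall:
    """Build the 3-tuple (C_i, c_i, r_i) of Definition 4."""
    radius = float(np.mean(np.linalg.norm(points - center, axis=1)))
    return ClusterBall(points, center, radius)


def half_ball_intersect(a: ClusterBall, b: ClusterBall) -> bool:
    """True if ||c_i - c_j|| <= (r_i + r_j) / 2, i.e., Definition 6 holds."""
    return float(np.linalg.norm(a.center - b.center)) <= 0.5 * (a.radius + b.radius)
```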

Ensemble method for merging clusters with ball model

Using Definition 6, we can integrate the set of cluster centers in \(C^{*}\) into the ensemble set of centers as the initial cluster centers of the big dataset. This process is carried out as follows:

  1. Randomly select a center \(c_p^{*}\) from \(C^{*}\). Take \(c_p^{*}\) as a candidate of the final centers in the set \(\text{ CF }\), i.e., the set of final centers. Compute the cluster ball \(\text{ CB}_p^{*}\) and remove \(c_p^{*}\) from \(C^{*}\).

  2. Randomly select a center \(c_q^{*}\) from \(C^{*}\) and compute the cluster ball \(\text{ CB}_q^{*}\).

  3. Compute the \(\frac{1}{2}\)-ball intersection property of the two cluster balls \(\text{ CB}_p^{*}\) and \(\text{ CB}_q^{*}\).

  4. If the two balls are disjoint, ignore the second ball \(\text{ CB}_q^{*}\) and go to Step 2; if the two balls intersect but do not satisfy the \(\frac{1}{2}\)-ball intersection property of Definition 6, also ignore \(\text{ CB}_q^{*}\) and go to Step 2; otherwise, add the cluster ball to the set \(\text{ CF }\). If all centers in \(C^{*}\) have been tested, go to the next step; otherwise, go to Step 2.

  5. Merge the centers in \(\text{ CF }\) by computing their mean as the center of a cluster in the big dataset, and go to Step 1 until the centers of all clusters in the big dataset are found.

Since we have included the radius of the cluster ball in the ball model, it is straightforward to test whether two cluster balls are disjoint or not. However, when two cluster balls intersect, Definition 6 plays an important role in determining the merge of two clusters. The intersection of two cluster balls satisfying the \(\frac{1}{2}\)-ball intersection property is a much stronger indication that the two clusters could be the same cluster in the big dataset.
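The following is a minimal sketch of this merging procedure, assuming the balls are given as (center, radius) pairs built from the pooled set \(C^{*}\) (e.g., with the ball-model sketch above). The greedy grouping order and the tie handling are our own simplifications of Steps 1-5, not the exact published procedure.

```python
import numpy as np


def ensemble_centers(balls):
    """Greedy merge of cluster balls following Steps 1-5.

    `balls` is a list of (center, radius) pairs; the function returns the
    list of ensemble centers, one per merged group of strongly proximal balls.
    """
    def strongly_proximal(p, q):
        # 1/2-ball intersection property of Definition 6
        return float(np.linalg.norm(p[0] - q[0])) <= 0.5 * (p[1] + q[1])

    remaining = list(balls)
    final_centers = []
    while remaining:
        seed = remaining.pop(0)                  # Step 1: pick a candidate ball
        group, kept = [seed], []
        for ball in remaining:                   # Steps 2-4: test every other ball
            (group if strongly_proximal(seed, ball) else kept).append(ball)
        remaining = kept
        # Step 5: the mean of the grouped centers becomes one center of the big dataset
        final_centers.append(np.mean([c for c, _ in group], axis=0))
    return final_centers
```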

The proposed RSPCE algorithm

In this section, we present the algorithms used in the basic steps of the RSPCE algorithm for estimating the number of clusters in a big dataset and finding the initial cluster centers. The basic steps are summarized as follows:

  1. Given a big dataset, generate its RSP data model for efficiently selecting multiple random samples.

  2. For each random sample, find the number of clusters and the centers of the clusters.

  3. Given two clusters, use the ball model to determine whether they could be the same cluster in the big dataset.

  4. Use the ball model to integrate the clusters from multiple random samples into an ensemble set of clusters and initial cluster centers.

In the following, we present the algorithms in each step and give a complexity analysis of the RSPCE algorithm.

Algorithm for generating multiple samples

In this work, we use the random sample partition (RSP) data model to convert a big dataset into a set of disjoint random sample data blocks, so that each data block is used as a random sample of the big dataset. Therefore, to identify the number of clusters in a big dataset, we use Algorithm 1 to convert it to a set of RSP data block files for random sample selection.

The inputs of the algorithm are a big dataset \({\mathbb {D}}\) and the size of each RSP data block n. The output is a set of m RSP data blocks, where \(m=N/n\), which are saved as a set of RSP data block files \(\{D_1, D_2,..., D_m\}\).

Algorithm 1 is executed as follows: Lines 2-3 compute the number of objects N and the number of RSP blocks m. Line 4 generates a sequence of N unique random numbers following a uniform distribution. Line 5 appends the sequence of random numbers as one additional id in \({\mathbb {D}}\). Line 6 randomizes the records of \({\mathbb {D}}\) by sorting the records on the random number id. Lines 8-11 cut the sequence of the randomized records of \({\mathbb {D}}\) sequentially into m sub-sequences, each being written as an RSP data block file.
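A minimal single-machine sketch of Algorithm 1 is given below, under the assumption that the dataset fits in one CSV file; the published implementation works on distributed files, so this only conveys the shuffle-and-split logic, and the file names and paths are illustrative.

```python
import os

import numpy as np
import pandas as pd


def generate_rsp_blocks(input_csv: str, out_dir: str, n: int, seed: int = 0) -> int:
    """Split a dataset into m = N // n RSP block files by random shuffling."""
    data = pd.read_csv(input_csv)
    N = len(data)                                    # Lines 2-3: N and m
    m = N // n
    rng = np.random.default_rng(seed)
    data = data.iloc[rng.permutation(N)]             # Lines 4-6: randomize the record order
                                                     # (equivalent to sorting on a random key)
    os.makedirs(out_dir, exist_ok=True)
    for i in range(m):                               # Lines 8-11: cut into m block files
        block = data.iloc[i * n:(i + 1) * n]
        block.to_csv(os.path.join(out_dir, f"block_{i:04d}.csv"), index=False)
    return m
```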


Algorithm for finding the number of clusters in a random sample

We model the clusters in a big dataset as normal distributions, where each cluster has a high-density area that appears as a density peak. Therefore, the number of clusters in a dataset corresponds to the number of density peaks in the dataset. The \(\text{ I-niceDP }\) algorithm [10] (an improved version of I-nice) was designed to identify the number of clusters in a dataset by finding the number of density peaks in the distance distribution of the objects with respect to an observation point. Therefore, \(\text{ I-niceDP }\) is chosen as the operator to identify the number of clusters in each random sample. The pseudo-code of the \(\text{ I-niceDP }\) algorithm is presented in Algorithm 2. The inputs to the algorithm are an RSP data representation of a big dataset and the number of RSP data blocks b. The output is a set of b values indicating the numbers of clusters found in the b random samples, together with the b sets of cluster centers.

The \(\text{ I-niceDP }\) algorithm is explained below. First, Line 2 randomly selects b RSP data blocks. Starting from Line 3, each RSP data block is computed separately as follows:

  1. Line 4 generates an observation point as a reference for computing the distance distribution of objects.

  2. Line 5 computes the distances between the observation point and the data points of the RSP sample to form the distance vector.

  3. Line 6 computes the maximal number of GMM components \({\mathcal {M}}_{\max }\) using the kernel density estimation (KDE) method, where \(\Delta _1\) and \(\Delta _2\) are two thresholds that control the potential number of components.

  4. Lines 7-10 compute a set of GMMs from the distance vector with numbers of components smaller than or equal to the maximal number \({\mathcal {M}}_{\max }\); each GMM is built with the EM algorithm.

  5. Lines 12-14 select the best-fitted model based on the AICc criterion.

  6. Lines 15-17 determine the high-density data points for each GMM component using the density peak mechanism; these high-density data points are used as the initial cluster centers.

  7. Finally, Lines 18-19 pass the initial cluster centers to the k-means algorithm to cluster the input data and refine the cluster centers as the output result for the random sample.

After all RSP data blocks are computed, the set of refined cluster centers as defined in Eq. (4) is generated.
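The sketch below captures the core of this procedure for a single RSP block: the distances to one observation point are modeled with mixture models of increasing size, and the best model is chosen by an information criterion. For brevity it uses scikit-learn's GaussianMixture and plain AIC as stand-ins for the gamma mixture models and AICc of I-niceDP, and it omits the KDE bound and the density-peak center search, so it is a simplified approximation rather than a faithful reimplementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def estimate_k_one_block(block: np.ndarray, max_components: int = 10, seed: int = 0) -> int:
    """Estimate the number of clusters in one RSP block from a distance distribution."""
    rng = np.random.default_rng(seed)
    lo, hi = block.min(axis=0), block.max(axis=0)
    obs_point = lo + rng.random(block.shape[1]) * (hi - lo)                # Line 4: observation point
    distances = np.linalg.norm(block - obs_point, axis=1).reshape(-1, 1)   # Line 5: distance vector
    best_k, best_score = 1, np.inf
    for k in range(1, max_components + 1):                                 # Lines 7-14: fit and compare
        gmm = GaussianMixture(n_components=k, random_state=seed).fit(distances)
        score = gmm.aic(distances)          # AIC as a simplified stand-in for AICc
        if score < best_score:
            best_k, best_score = k, score
    return best_k                           # number of components ~ number of density peaks
```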

Algorithm for identifying two clusters being one using ball model

The \(\text{ I-niceDP }\) algorithm generates a set of clusters from b random samples. Some of these clusters are likely sampled from the same cluster of the big dataset, so they have to be merged into one cluster as an approximation of the true cluster in the big dataset. Algorithm 3 is designed to use the ball model to identify the two clusters which are likely to be one cluster in the big dataset.

The inputs to the algorithm are two clusters \(C_i\) and \(C_j\) and their cluster centers \(c_i\) and \(c_j\). Line 2 computes the radii \(r_i\) and \(r_j\) of the two clusters. Lines 3-4 build the two cluster balls \(\text{ CB}_i\) and \(\text{ CB}_j\). Line 6 checks whether the two balls are disjoint; if so, \(\text{ Merge }\) is set to false. Lines 7-8 check whether the two balls intersect without satisfying the \(\frac{1}{2}\)-ball intersection property; if so, \(\text{ Merge }\) is set to false; otherwise, \(\text{ Merge }\) is set to true in Line 10. The algorithm outputs \(\text{ CB}_j\) if \(\text{ Merge }\) is true; otherwise, it outputs nothing.


Algorithm for ensembling the numbers of clusters in multiple samples

Finally, the pseudo code of the RSPCE algorithm is illustrated in Algorithm 4. The inputs are a big dataset \({\mathbb {D}}\) and the sample size n. First, in Line 2, Algorithm 1, shown as the operator RSP(), is called to convert \({\mathbb {D}}\) into a set of m RSP data blocks. Line 3 randomly selects b RSP blocks. Lines 4–8 call Algorithm 2 as operator \(\text{ I-niceDP }()\) and operator \(k\text{-means }()\) to compute the set of initial cluster centers \(C^{*}\). Lines 11–13 call Algorithm 3 to find the clusters which are likely to be the same cluster. Line 14 merges the clusters as one and adds it to the set of the final cluster centers. Line 16 counts the number of the final cluster centers, CF. Finally, the algorithm outputs the number of clusters and the set of cluster centers.
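Putting the pieces together, a top-level driver might look like the outline below. It reuses the hypothetical helpers sketched earlier in this paper (select_random_blocks, refine_sample_centers, ensemble_centers) and a user-supplied i_nice_dp callable, so it shows the control flow of Algorithm 4 rather than the published implementation.

```python
import numpy as np


def rspce(rsp_dir, b, i_nice_dp):
    """Outline of Algorithm 4: estimate K and the initial cluster centers."""
    samples = select_random_blocks(rsp_dir, b)        # Line 3: pick b RSP blocks
    balls = []                                        # (center, radius) pairs from all samples
    for sample in samples:                            # Lines 4-8: per-sample estimation
        data = sample.to_numpy()
        centers = refine_sample_centers(data, i_nice_dp)
        # assign each point to its nearest refined center to build the cluster balls
        labels = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2).argmin(axis=1)
        for j, c in enumerate(centers):
            members = data[labels == j]
            radius = float(np.mean(np.linalg.norm(members - c, axis=1))) if len(members) else 0.0
            balls.append((c, radius))
    final_centers = ensemble_centers(balls)           # Lines 11-14: merge with the ball model
    return len(final_centers), final_centers          # Line 16: K and the ensemble centers
```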

Fig. 2 Illustration of the results of three steps of the RSPCE algorithm from six random samples of dataset DS1. a Individual centers obtained from the 6 randomly chosen samples by I-niceDP, b Cluster balls of the centers, c Ensemble centers of the 6 samples by the RSPCE algorithm

Figure 2 illustrates the results of the three steps of the RSPCE algorithm. Figure 2a shows all refined centers found from 6 random samples of dataset DS1. We can see that the sets of centers from the 6 random samples are very similar. Figure 2b plots all cluster balls, and Fig. 2c shows the final set of cluster centers, which are close to the true centers.


Complexity analysis

Given a big dataset with N objects, the following major parts need to be considered: generating an RSP data representation, randomly selecting a subset of random samples, finding the number of clusters in each random sample, using the k-means algorithm to refine the initial cluster centers of each random sample, and finally using the ball model to ensemble the results of the multiple random samples.

Suppose the number of objects in each random sample is n, and b samples are randomly selected from the m blocks, where \(b < m\). The random sample generation operation has a complexity of \({\mathcal {O}}(n\log (N/n))\). The I-niceDP algorithm generates O one-dimensional distance distributions and finds their density peaks, and hence its time complexity is \({\mathcal {O}}(bnOK)\), where O is the number of observation points. The complexity of the k-means algorithm is \({\mathcal {O}}(bnTdK)\), where T is the maximum number of iterations and d is the number of dimensions. The time complexity of the cluster ball learning process is \({\mathcal {O}}(bK)\). Therefore, the overall complexity of the RSPCE algorithm is \({\mathcal {O}}(n \log (N/n) + bnOK + bnTdK + bK )= {\mathcal {O}}(n\log (N/n) + (nO+nTd+1)bK)\), which is linear in the number of data blocks b.

When the RSPCE algorithm is implemented on a distributed platform with Q nodes, its computational complexity can be reduced to \({\mathcal {O}}(n\log (N/n) + (nO+nTd+1)bK)/Q\). Therefore, the proposed RSPCE algorithm is efficient and scalable.

Experiments

A series of experiments were conducted on both synthetic and real-world datasets to demonstrate the performance of the proposed RSPCE algorithm and show its practical efficiency. In this section, the datasets and the experiment settings are presented. Evaluation measures are defined. The experiment results are analyzed, and the homogeneity of the results is discussed. Finally, the computational efficiency and scalability of the algorithm are demonstrated.

Datasets

The characteristics of the synthetic and real-world datasets used in the experiments are summarized in Table 1 and described below:

  • Synthetic datasets. Five synthetic datasets, named DS1 to DS5, were generated in dimensions of 2 and 10 with different numbers of clusters in multivariate normal distributions. The numbers of clusters, the sizes of each cluster, the dimensions and total objects in these datasets are given in Table 1.

  • Real-world datasets. The four real-world datasets used in the experiments are the following: the Covertype dataset (Footnote 1) with 581,012 objects describes 7 forest cover types with 54 different geographic measurements; about 84% of the objects are in 2 types (type-1 36.5% and type-2 48.7%), and the remaining objects are in the other 5 types. The KDD’99ID dataset (Footnote 2) with about 5 million objects describes the connection sequences of network intrusion detection; it has 23 classes, and 98.3% of the objects belong to 3 classes (normal 19.6%, neptune 21.6%, and smurf 56.8%). The PokerHand dataset (Footnote 3) has more than 1 million objects, each being an example of a hand consisting of five playing cards drawn from a standard deck of 52; the dataset has 10 predictive features and 10 classes, with two dominant classes accounting for over 90% of the samples (nothing in hand 49.9% and one pair 42.4%). The SUSY dataset (Footnote 4) was generated with Monte Carlo simulations; it has 5 million objects, 18 features and 2 classes.

Table 1 Characteristics of the datasets (d: dimensions, N: number of objects, K: number of clusters or classes)
Table 2 Parameter settings used in the experiments

Experiment settings

In the experiments, all datasets were converted to RSP data representations, i.e., each dataset being transformed into a set of RSP data blocks. The left column of Table 2 shows the percentages of the total RSP blocks used to estimate the number of clusters. In the experiments, six different sizes of subsets of RSP blocks were used. The right column shows the two block sizes used to partition the synthetic datasets and the real-world dataset Covertype. The other three real-world datasets were partitioned with the block sizes of {1, 2, 5, 10, 15, and 20}% of the whole datasets. Therefore, each dataset is transformed into more than one RSP representation.

Six existing methods were selected for comparison with the proposed RSPCE algorithm: nselectboot [25], kluster [26], X-means [27], Elbow [1], Silhouette [2], and Gap statistic [3]. The number of clusters K in the last four methods was varied from \(K=2\) to 100. For the bootstrap method of kluster, the number of bootstrap samples was set to 20.

The experiments were performed on three local nodes, each equipped with an x64-based Intel(R) Core i7-7700 processor (3.60 GHz), 8 GB of memory, and 1 TB of storage. The RSPCE algorithm was implemented in Python 3.7.3 with the py2r, fpc, densityClust and clValid R packages. Three observation points were used in the I-niceDP step of the RSPCE algorithm.

Evaluation metrics

The internal and stability measures below were used to evaluate the results of the comparison methods and the RSPCE algorithm.

Internal measures

The following internal measures were used to evaluate the compactness, connectivity, and separation of the cluster partitions.

Inertia or within-cluster sum-of-squares (SSE) measures the internal coherence of objects in a cluster [32]. The lower the inertia value, the better the cluster. Zero is optimal.

The silhouette coefficient (SC) [2] evaluates the clustering quality by combining how well the clusters are separated (i.e., separation) and how compact the clusters are (i.e., tightness). The SC is computed as follows:

$$\begin{aligned} \text{ SC } = \frac{b-a}{\max (a,b)} \end{aligned}$$
(7)

where a is the average distance between an object and all other objects in the same cluster, and b is the average distance between the object and all objects in the nearest neighboring cluster. The overall SC is the mean of the SC over all objects. A higher score, closer to 1, indicates a model with better-defined clusters.

Davies-Bouldin index (DBI) [33] is an internal evaluation metric, which is used to validate the clustering process using quantities and data points residing in the dataset. The DBI for K clusters is defined as

$$\begin{aligned} \text{ DBI } (K) = \frac{1}{K} \sum _{i=1}^{K} \underset{i \ne j}{\max }\ {\frac{\Delta (C_i)+ \Delta (C_j)}{\delta (C_i, C_j)}} \end{aligned}$$
(8)

where \(\delta (C_i, C_j)\) is the inter-cluster distance, i.e., the distance between clusters \(C_i\) and \(C_j\), \(\Delta (C_i)\) is the intra-cluster distance of cluster \(C_i\), i.e., distance within the cluster \(C_i\). The lower the DBI value, the better the clustering result.

When ground-truth class assignments are available, it is reasonable to define intuitive metrics using conditional entropy analysis. V-measure [34] is an entropy-based measure that explicitly measures how successfully the criteria of homogeneity and completeness are satisfied. V-measure is computed as the harmonic mean of the homogeneity and completeness scores. Homogeneity captures only the information of the members of a single class in each cluster, whereas completeness captures the information of all members of a given class assigned to the same cluster. V-measure is equivalent to the normalized mutual information (NMI) metric.

The adjusted rand index (ARI) [35] and the adjusted mutual information (AMI) [36] are also used to evaluate the performance of the RSPCE algorithm. These two measures are defined as follows:

Given the ground truth result \(P=\{C_1, C_2,..., C_k\}\) with K clusters and the predicted result \(P'=\{C'_1, C'_2,..., {C'}_{k'}\}\) with \(K'\) clusters, the adjusted rand index (ARI) [35] measures the similarity of the two assignments defined as

$$\begin{aligned} \text{ ARI }(P, P')=\frac{\sum _{i=1}^{K} \sum _{j=1}^{K'} \left( \begin{array}{c}|C_i \cap C'_j|\\ 2\end{array}\right) - X_3}{\frac{1}{2}(X_1+X_2)-X_3} \end{aligned}$$
(9)

where

$$\begin{aligned} X_1=\sum _{i=1}^{K}\left( \begin{array}{c}|C_i|\\ 2\end{array}\right) , X_2=\sum _{j=1}^{K'}\left( \begin{array}{c}|C'_j|\\ 2\end{array}\right) , X_3=\frac{2X_1 X_2}{n(n-1)} \end{aligned}$$
(10)

where \(i\in \{1,..., K\}\), \(j\in \{1,..., K'\}\), n is the total number of data samples, and \(\left| \cdot \right|\) denotes the cardinality of a cluster. The ARI value varies between zero and one, and a higher value indicates that the resulting clustering outcome is closer to the actual one.

AMI [36] is defined as

$$\begin{aligned} \text{ AMI }(P, P')= \frac{\text{ NMI}_{max}(P, P')- {\textbf{E}}\{\text{ NMI}_{\max }(P, P')\}}{1-{\textbf{E}}\{\text{ NMI}_{\max }(P, P')\}} \end{aligned}$$
(11)

where

$$\begin{aligned} \text{ NMI }(P, P')= & {} \frac{2\varphi (P;P')}{\phi (P)+\phi (P')} \end{aligned}$$
(12)
$$\begin{aligned} \varphi (P;P')= & {} \sum _{i}\sum _{j}\frac{|C_i \cap C'_j |}{n} \log \frac{n\left| C_i \cap C'_j \right| }{|C_i||C'_j|} \end{aligned}$$
(13)
$$\begin{aligned} \phi (P)= & {} -\sum _{i}\frac{|C_i|}{n} \log \frac{|C_i|}{n} \end{aligned}$$
(14)
$$\begin{aligned} \phi (P')= & {} -\sum _{j}\frac{|C'_j|}{n} \log \frac{|C'_j|}{n} \end{aligned}$$
(15)

AMI values are between 0 and 1. The higher the AMI value, the better the quality of clusters.
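For reference, all of the above measures except inertia are available in scikit-learn (inertia is exposed on a fitted KMeans model); the short usage sketch below uses a toy dataset and illustrative variable names.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_mutual_info_score, adjusted_rand_score,
                             davies_bouldin_score, silhouette_score,
                             v_measure_score)

# toy data standing in for one random sample with known labels
X, y_true = make_blobs(n_samples=2000, centers=5, random_state=0)
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)
y_pred = km.labels_

print("SSE (inertia):", km.inertia_)                   # lower is better
print("SC:", silhouette_score(X, y_pred))              # closer to 1 is better
print("DBI:", davies_bouldin_score(X, y_pred))         # lower is better
print("V-measure:", v_measure_score(y_true, y_pred))   # needs ground-truth labels
print("ARI:", adjusted_rand_score(y_true, y_pred))
print("AMI:", adjusted_mutual_info_score(y_true, y_pred))
```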

Stability measures

The average proportion of nonoverlap (APN) and the average distance between means (ADM) [32] were used to measure the stability and consistency of the results by comparing the ground truth of the entire dataset with the obtained number of clusters in a sample.

Let \(C_i\) represent the true clusters via an ideal clustering process, and \(C'_j\) be the clusters on the b random samples. Given the total number of clusters K, the APN measure is defined as

$$\begin{aligned} \text{ APN } (P,P') = \frac{1}{KK'}\sum _{i=1}^{K}\sum _{j=1}^{K'} \left( 1- \frac{n(C'_j \cap C_i)}{n(C_i)} \right) . \end{aligned}$$
(16)

The APN resides in [0, 1], and the value close to zero corresponds to highly consistent clustering results.

The ADM computes the average distance between cluster centers determined based on the entire dataset and the random samples. It is defined as

$$\begin{aligned} \text{ ADM } (P,P') = \frac{1}{K K'}\sum _{i=1}^{K}\sum _{j=1}^{K'} \text{ dist }(C_i, C'_j). \end{aligned}$$
(17)

where \(C_i\) is the mean (center) of the cluster containing object i computed on the entire dataset, and \(C'_j\) is the corresponding center computed on the random samples. This metric is based on the Euclidean distance. It also takes a value between 0 and 1, and smaller values are preferred.
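Following Eqs. (16) and (17) literally, both measures can be computed from the two label assignments and the two sets of cluster centers; the sketch below is our direct reading of these formulas, not the clValid implementation used in the experiments.

```python
import numpy as np


def apn(labels_true, labels_pred):
    """Average proportion of non-overlap, following Eq. (16)."""
    labels_true, labels_pred = np.asarray(labels_true), np.asarray(labels_pred)
    true_ids, pred_ids = np.unique(labels_true), np.unique(labels_pred)
    total = 0.0
    for ci in true_ids:
        mask_i = labels_true == ci
        for cj in pred_ids:
            overlap = np.sum(mask_i & (labels_pred == cj))
            total += 1.0 - overlap / np.sum(mask_i)
    return total / (len(true_ids) * len(pred_ids))


def adm(centers_true, centers_pred):
    """Average distance between means, following Eq. (17)."""
    ct, cp = np.asarray(centers_true), np.asarray(centers_pred)
    dists = np.linalg.norm(ct[:, None, :] - cp[None, :, :], axis=2)
    return float(dists.mean())
```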

Experiment results and analysis

Results of the number of clusters

Fig. 3 Heatmaps of the results of the five methods on m random samples in the five synthetic datasets with two RSP representations n: 5000 and 10,000. The columns of each figure are the numbers of clusters, and the rows are the five methods. The dark color in a cell indicates that a high percentage of the m results were identified as the number of clusters by the column with the corresponding method

The first set of experiments used five existing methods to identify the number of clusters from random samples of the five synthetic datasets in Table 1. Two sample sizes of 5000 points and 10,000 points were used. For each random sample in a synthetic dataset, the number of clusters in the sample was discovered by the five methods. Since there are m random samples in one synthetic dataset for each sample size, m results of the number of clusters were found by each method. The heatmaps of the results of the five methods on the m random samples in the five synthetic datasets, with two RSP representations each, are shown in Fig. 3. The columns of each figure are the numbers of clusters, and the rows are the five methods. A dark cell indicates that a high percentage of the m results identified the number of clusters given by the corresponding column for that method. For example, in the right figure of Fig. 3a, the dark cell in the second column from the right in the row of Gap statistic implies that the number of clusters most frequently identified by Gap statistic from the 100 random samples of DS1 is 9. From Fig. 3, we can see that Gap statistic, Silhouette and X-means performed better than Elbow and I-niceDP in identifying the number of clusters from the random samples of a big dataset. Another observation is that a bigger sample size results in a more accurate result.

Fig. 4 Scatter plots of identified centers in random samples of different sizes, achieved by different algorithms from DS1 (a and b) and DS2 (c and d) datasets. The centers estimated by the algorithms are indicated by distinct symbols and colors

Using the two-dimensional synthetic datasets DS1 and DS2, we compared the true cluster centers with the cluster centers identified by the five methods from the random samples of the two datasets. The results are plotted in Fig. 4. We can see that the cluster centers identified by I-niceDP are closer to the true centers than those identified by the other four methods. These results indicate that I-niceDP is more capable of identifying good initial cluster centers than the other existing methods, so it is chosen as the operator to identify the number of clusters in a random sample in the RSPCE algorithm.

Fig. 5 Performance of the RSPCE algorithm on five synthetic datasets on the maximal, average and minimal numbers of clusters over 20 runs against the number of RSP blocks used

Figure 5 shows the performance of the RSPCE algorithm in identifying the number of clusters in the five synthetic datasets with two different sample sizes. The horizontal axis in each plot is the number of random samples used by the algorithm. The vertical axis shows the number of clusters identified. The horizontal straight line indicates the true number of clusters in each dataset. For the same number of random samples, the RSPCE algorithm was run 20 times on each RSP representation of a synthetic dataset. From Fig. 5, we can see that for the two-dimensional datasets DS1 and DS2 with fewer clusters, the RSPCE algorithm can easily identify the true number of clusters with a few random samples. For the high-dimensional dataset DS3 with fewer clusters, the RSPCE algorithm can also converge to the true number of clusters as the number of random samples increases. However, for the two high-dimensional datasets DS4 and DS5, the RSPCE algorithm needs more random samples to converge to the true number of clusters. Another observation is that smaller samples gave better ensemble results than bigger samples. The reason may be that more, smaller samples generate more diverse results, which can improve the final ensemble result. However, more investigations are required to give a firm conclusion on this observation.

Improvements of clustering results

Having obtained the number of clusters in each synthetic dataset and the initial cluster centers by the RSPCE algorithm, we used the k-means algorithm to cluster each random sample used in the RSPCE algorithm with the number of clusters and the initial cluster centers as input parameters. For each set of random samples, we used 8 measures (6 internal and 2 external) to validate the clustering results. Table 3 shows all validation results of the 5 synthetic datasets with 2 sample sizes and 6 different subsets of random samples listed in column es. We can see clearly that the clustering result improved as more random samples were used by the RSPCE algorithm. Again, smaller random sample sizes resulted in better clustering results. This observation is consistent with the one from Fig. 5.

We also investigated the stability and consistency of the RSPCE algorithm in detecting cluster centers. The APN and ADM measures were used to evaluate the clustering consistency by comparing the results obtained with different numbers of random samples. The APN and ADM scores are shown in Table 4. The APN and ADM scores tend to decrease as the number of random samples increases. Again, the RSPCE algorithm performed significantly better on the datasets with smaller numbers of clusters. The sets of random samples containing \(10\sim 20\%\) of the big dataset gave the better results. These results also show that the RSPCE algorithm can generate stable cluster centers that are close to the centers of the true clusters in the entire dataset.

Fig. 6 The KS-test plots for comparison of actual centers’ distance distribution versus the distance distribution of the centers by the RSPCE algorithm. The left plot uses sample size n = 5000, and the right plot uses n = 10,000

Fig. 7 Execution time (in minutes) of different methods against different data sizes. The plot is the average value of 20 random runs with the same data size

In the experiments, we observed that small samples (\(n<2000\)) often miss mini-clusters, or do not have enough points to characterize small clusters. Increasing the random sample size can solve this problem, but the computing cost also increases. A tradeoff on the sample size needs consideration in practice.

Table 3 Validations of 6 internal and 2 external measures on the clustering results of 5 synthetic datasets with 2 sample sizes (A = 5000; B = 10,000) and 6 different subsets of random samples listed in column es
Table 4 Stability validations on the results of the RSPCE algorithm on synthetic datasets with 2 sample sizes, A = 5000 and B = 10,000 (lower value is better)

Statistical homogeneity test

Homogeneity tests were conducted to verify the cluster centers discovered from the random samples. The distribution of the distances between these centers should be similar to the distribution of the distances between the true centers in the big dataset. We conducted the two-sample Kolmogorov-Smirnov (KS) test and the Z-test to compare the distance distributions of cluster centers between the entire dataset \({\mathcal {G}}\) and the random samples \({\mathcal {A}}\).

The null hypothesis is that \({\mathcal {G}}\) and \({\mathcal {A}}\) have the same distribution, whereas the alternative hypothesis implies that they have different distributions. We set \(h=1\) if we reject the null hypothesis (i.e., the distributions are not the same); otherwise, we set \(h=0\) and accept the null hypothesis. We tested at a significance level of 5%. The p-value is the probability of obtaining a result at least as extreme as the observed one when the null hypothesis is true. The corresponding test results are presented in Table 5. We can see that the null hypothesis is accepted in all cases. Figure 6 illustrates that the test CDFs (green, blue, cyan, magenta, yellow, and black) of the random samples match the empirical CDF (red) of the whole dataset closely, and the largest difference is small.

Table 5 Results of the two-sample KS-test and Z-test on the distance distributions among the actual cluster centers and the estimated cluster centers of the synthetic datasets by the RSPCE algorithm
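The KS part of this test is easy to reproduce with SciPy; the sketch below compares the pairwise center-to-center distance distributions of the true centers and the RSPCE centers (array names are illustrative, and the Z-test is omitted).

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import ks_2samp


def centers_ks_test(true_centers: np.ndarray, rspce_centers: np.ndarray, alpha: float = 0.05):
    """Two-sample KS test on the pairwise distance distributions of two center sets."""
    d_true = pdist(true_centers)        # distances among the actual centers (G)
    d_est = pdist(rspce_centers)        # distances among the estimated centers (A)
    stat, p_value = ks_2samp(d_true, d_est)
    h = int(p_value < alpha)            # h = 1 rejects the null hypothesis of equal distributions
    return h, p_value, stat
```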

Comparisons of RSPCE with other methods

We compared the results of the RSPCE algorithm in identifying the number of clusters from multiple random samples with the results of the other methods in identifying the number of clusters from one random sample. Table 6 shows the results on the synthetic datasets, and Table 7 shows the results on the real-world datasets. The sample size is 5,000 points. Different numbers of random samples, as shown in column es, were used in the RSPCE algorithm. For the same number of random samples, the RSPCE algorithm was run 20 times on different sets of random samples. The other methods were applied to each random sample to generate one result. The average values and the standard deviations over the multiple runs were calculated. We can see that the RSPCE algorithm performed the best in general on both the synthetic datasets and the real-world datasets.

Specifically, Silhouette, Gap statistic and the RSPCE algorithm were more accurate in identifying the number of clusters in all five synthetic datasets, and among them the RSPCE algorithm performed best. kluster and Elbow performed well on DS1 but not on the other datasets. Neither nselectboot nor X-means performed well across the datasets; the bootstrap strategy in nselectboot did not appear to provide an advantage. nselectboot, X-means, and Gap statistic all overestimated the number of clusters, whereas Elbow and kluster underestimated the number of clusters in the last four datasets.

Moreover, RSPCE was able to identify the cluster centers more accurately. Apart from the Silhouette and Gap statistic methods, none of the other methods were able to identify the cluster centers of the five synthetic datasets. The Silhouette and Gap statistic methods are Euclidean distance-based and hence computationally expensive.

We also examined the effectiveness of the RSPCE algorithm on the four real-world datasets. The numbers of classes in these datasets were used as the “true” numbers of clusters. The corresponding results are displayed in Table 7. The numbers of clusters estimated by the RSPCE algorithm correlate well with the original numbers of classes.

Table 6 Comparison of the estimated number of clusters produced by different methods on the synthetic datasets
Table 7 Comparison of the estimated number of clusters produced by different methods on the real-world datasets

Computational efficiency

In this section, we compare the computational efficiency of the seven methods for identifying the number of clusters against different data sizes. The results are plotted in Fig. 7, with the execution time measured in minutes. We can see that, comparatively, the Gap statistic and Silhouette methods were inefficient. The other methods had similar execution times on these datasets.

It is noteworthy that we use a subset of random samples from the big dataset to compute approximate results as estimates for the entire dataset. Thus, the proposed RSPCE approach does not require analyzing the entire dataset.

Conclusions

In this paper, we proposed a multiple random sample-based ensemble method to estimate the number of clusters in a large dataset. We partitioned a big dataset into a set of RSP data blocks as random samples of the big dataset. Then, we randomly selected a subset of data blocks and identified the number of clusters in each block independently. Finally, we ensembled the results of the multiple random samples into an estimate for the entire dataset. Moreover, a cluster ball model was introduced to merge the clusters of the random samples that are likely sampled from the same cluster in the big dataset.

We conducted extensive experiments to investigate the effectiveness and stability of the RSPCE algorithm and further analyzed the impact of the sample size and the ensemble size. The experimental results demonstrated that the proposed algorithm is capable of generating good approximations of the actual cluster centers in the big dataset from a few random samples. The experimental results also demonstrated that the RSPCE algorithm is scalable to big data and flexible for clustering large-scale data on a single machine or a cluster.

One should note that the cluster ball model is only suitable for merging clusters with spherical shapes. This is a limitation of the RSPCE algorithm when it is applied to datasets with clusters of irregular shapes. In future work, we will address this issue by adopting a graph ensemble for high-dimensional and complex nonlinear manifold-structured datasets, such as moon-shaped and Swiss-roll data. Besides, we will investigate a statistical framework for designing an ensemble for distributed clustering that exploits both the weight information and the efficiency of multiple random samples.

Availability of data and materials

The code and experimental data are available at https://github.com/sultanszu/RSPCE.git.

Notes

  1. https://archive.ics.uci.edu/ml/datasets/covertype.

  2. https://www.kdd.org/kdd-cup/view/kdd-cup-1999/Data.

  3. https://archive.ics.uci.edu/ml/datasets/Poker+Hand.

  4. https://archive.ics.uci.edu/ml/datasets/SUSY.

References

  1. Thorndike RL. Who belongs in the family. Psychometrika. 1953. https://doi.org/10.1007/BF02289263.

  2. Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65. https://doi.org/10.1016/0377-0427(87)90125-7.

  3. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Series B. 2001;63(2):411–23. https://doi.org/10.1111/1467-9868.00293.

  4. Masud MA, Huang JZ, Wei C, Wang J, Khan I, Zhong M. I-nice: a new approach for identifying the number of clusters and initial cluster centres. Inf Sci. 2018;466:129–51. https://doi.org/10.1016/j.ins.2018.07.034.

  5. Nair R. Big data needs approximate computing: technical perspective. Commun ACM. 2014;58(1):104–104. https://doi.org/10.1145/2688072.

  6. Meng X-L. Statistical paradises and paradoxes in big data (i): law of large populations, big data paradox, and the 2016 US presidential election. Ann Appl Stat. 2018;12(2):685–726. https://doi.org/10.1214/18-AOAS1161SF.

  7. Rojas JAR, Beth Kery M, Rosenthal S, Dey A. Sampling techniques to improve big data exploration. In: 2017 IEEE 7th Symposium on Large Data Analysis and Visualization (LDAV); 2017. https://doi.org/10.1109/LDAV.2017.8231848.

  8. Salloum S, Huang JZ, He Y. Random sample partition: a distributed data model for big data analysis. IEEE Trans Ind Informat. 2019;15(11):5846–54. https://doi.org/10.1109/TII.2019.2912723.

  9. Mahmud MS, Huang JZ, Salloum S, Emara TZ, Sadatdiynov K. A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining Anal. 2020;3(2):85–101.

  10. He Y, Wu Y, Qin H, Huang JZ, Jin Y. Improved i-nice clustering algorithm based on density peaks mechanism. Inf Sci. 2021;548:177–90. https://doi.org/10.1016/j.ins.2020.09.068.

  11. Xu X, Ding S, Wang Y, Wang L, Jia W. A fast density peaks clustering algorithm with sparse search. Inf Sci. 2021;554:61–83. https://doi.org/10.1016/j.ins.2020.11.050.

  12. Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science. 2014;344(6191):1492–6. https://doi.org/10.1126/science.1242072.

  13. Schubert E, Sander J, Ester M, Kriegel HP, Xu X. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst. 2017. https://doi.org/10.1145/3068335.

  14. Patil C, Baidari I. Estimating the optimal number of clusters k in a dataset using data depth. Data Sci Eng. 2019;4:132–40.

  15. Zhao X, Liang J, Dang C. A stratified sampling based clustering algorithm for large-scale data. Know Based Syst. 2019;163:416–28. https://doi.org/10.1016/j.knosys.2018.09.007.

  16. Jia J, Xiao X, Liu B, Jiao L. Bagging-based spectral clustering ensemble selection. Pattern Recognit Lett. 2011;32(10):1456–67. https://doi.org/10.1016/j.patrec.2011.04.008.

  17. Wang Y, Chen L, Mei J. Incremental fuzzy clustering with multiple medoids for large data. IEEE Trans Fuzzy Syst. 2014;22(6):1557–68. https://doi.org/10.1109/TFUZZ.2014.2298244.

  18. Hu J, Li T, Luo C, Fujita H, Yang Y. Incremental fuzzy cluster ensemble learning based on rough set theory. Know Based Syst. 2017;132:144–55. https://doi.org/10.1016/j.knosys.2017.06.020.

  19. Bagirov AM, Ugon J, Webb D. Fast modified global k-means algorithm for incremental cluster construction. Pattern Recognit. 2011;44(4):866–76. https://doi.org/10.1016/j.patcog.2010.10.018.

  20. Mimaroglu S, Erdil E. Combining multiple clusterings using similarity graph. Pattern Recognit. 2011. https://doi.org/10.1016/j.patcog.2010.09.008.

  21. Huang D, Lai J, Wang CD. Ensemble clustering using factor graph. Pattern Recognit. 2016;50(C):131–42. https://doi.org/10.1016/j.patcog.2015.08.015.

  22. Ayad HG, Kamel MS. On voting-based consensus of cluster ensembles. Pattern Recognit. 2010;43(5):1943–53. https://doi.org/10.1016/j.patcog.2009.11.012.

  23. Iam-On N, Boongoen T, Garrett S, Price C. A link-based approach to the cluster ensemble problem. IEEE Trans Pattern Anal Mach Intell. 2011;33(12):2396–409. https://doi.org/10.1109/TPAMI.2011.84.

  24. Yang J, Liang J, Wang K, Rosin PL, Yang M. Subspace clustering via good neighbors. IEEE Trans Pattern Anal. 2020;42(6):1537–44. https://doi.org/10.1109/TPAMI.2019.2913863.

  25. Fang Y, Wang J. Selection of the number of clusters via the bootstrap method. Comput Stat Data Anal. 2012;56(3):468–77. https://doi.org/10.1016/j.csda.2011.09.003.

  26. Estiri H, Abounia Omran B, Murphy SN. kluster: an efficient scalable procedure for approximating the number of clusters in unsupervised learning. Big Data Res. 2018;13:38–51. https://doi.org/10.1016/j.bdr.2018.05.003.

  27. Pelleg D, Moore AW. X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conference on Machine Learning (ICML '00). Morgan Kaufmann Publishers Inc., CA, USA; 2000. p. 727–34.

  28. Bachem O, Lucic M, Krause A. Scalable k-means clustering via lightweight coresets. In: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '18), NY, USA; 2018. p. 1119–27. https://doi.org/10.1145/3219819.3219973.

  29. Wu J, Liu H, Xiong H, Cao J, Chen J. K-means-based consensus clustering: a unified view. IEEE Trans Knowl Data Eng. 2015;27(1):155–69. https://doi.org/10.1109/TKDE.2014.2316512.

  30. Iam-On N, Boongeon T, Garrett S, Price C. A link-based cluster ensemble approach for categorical data clustering. IEEE Trans Knowl Data Eng. 2012;24(3):413–25.

  31. Ren Y, Domeniconi C, Zhang G, Yu G. Weighted-object ensemble clustering: Methods and analysis. Knowl Inf Syst. 2017;51(2):661–89. https://doi.org/10.1007/s10115-016-0988-y.

  32. Brock G, Pihur V, Datta S, Datta S. clValid: an R package for cluster validation. J Stat Softw. 2008;25(4):1–22. https://doi.org/10.18637/jss.v025.i04.

  33. Davies DL, Bouldin DW. A cluster separation measure. IEEE Trans Pattern Anal Mach Intell. 1979;1(2):224–7. https://doi.org/10.1109/TPAMI.1979.4766909.

  34. Rosenberg A, Hirschberg J. V-measure: a conditional entropy-based external cluster evaluation measure. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL). Prague: Association for Computational Linguistics; 2007. p. 410–20.

  35. Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218. https://doi.org/10.1007/BF01908075.

  36. Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J Mach Learn Res. 2010;11:2837–54. https://doi.org/10.5555/1756006.1953024.

Acknowledgements

Not applicable.

Funding

This research was supported by the National Natural Science Foundation of China under Grant 61972261.

Author information

Authors and Affiliations

Authors

Contributions

MS Mahmud: Conceptualization, Investigation, Software, Methodology, Validation, Writing - original draft; JZ Huang: Supervision, Methodology, Funding acquisition, Writing - review & editing; RR: Project administration, Validation, Writing - review & editing; KW: Funding acquisition, Resources. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Joshua Zhexue Huang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Mahmud, M.S., Huang, J.Z., Ruby, R. et al. An ensemble method for estimating the number of clusters in a big data set using multiple random samples. J Big Data 10, 40 (2023). https://doi.org/10.1186/s40537-023-00709-4

Keywords