 Research
 Open Access
An ensemble method for estimating the number of clusters in a big data set using multiple random samples
Journal of Big Data volume 10, Article number: 40 (2023)
Abstract
Clustering a big dataset without knowing the number of clusters presents a big challenge to many existing clustering algorithms. In this paper, we propose a Random Sample Partition-based Centers Ensemble (RSPCE) algorithm to identify the number of clusters in a big dataset. In this algorithm, a set of disjoint random samples is selected from the big dataset, and the I-niceDP algorithm is used to identify the number of clusters and initial centers in each sample. Subsequently, a cluster ball model is proposed to merge two clusters in the random samples that are likely sampled from the same cluster in the big dataset. Finally, based on the ball model, the RSPCE ensemble method is used to ensemble the results of all samples into the final result as a set of initial cluster centers in the big dataset. Intensive experiments were conducted on both synthetic and real datasets to validate the feasibility and effectiveness of the proposed RSPCE algorithm. The experimental results show that the ensemble result from multiple random samples is a reliable approximation of the actual number of clusters, and that the RSPCE algorithm is scalable to big data.
Introduction
In this paper, we propose an ensemble method for estimating the number of clusters in a big dataset, the most important parameter in many clustering algorithms such as k-means. This parameter is usually unknown in unlabeled data and is often guessed by the user, and incorrect guesses result in inaccurate clustering results. Therefore, finding this number can improve the clustering result. However, automatic identification of the number of clusters in a big dataset is a challenge to the classical methods, e.g., Elbow [1], Silhouette [2], Gap statistic [3], and I-nice [4], due to the data size and the complexity of the inherent clusters in the data. The strategy we take here is to use multiple samples of the big dataset to estimate several possible values of the number of clusters and then ensemble the multiple results to improve the final estimate.
Since it is hard or impractical to investigate a big dataset entirely, using a random sample to compute an approximate result for the whole big dataset often becomes imperative [5, 6]. Random sampling is also a popular technique widely used by data scientists to quickly gain insights from a big dataset, despite theoretical and empirical evidence of the benefits of other sampling techniques [7]. However, sampling a big dataset is an error-prone and inefficient process when the dataset cannot be held in memory. Furthermore, to obtain accurate approximations, it is essential to select random samples from a big dataset efficiently while guaranteeing the quality of the selected samples. As such, we adopt the random sample partition (RSP) [8] data model to represent a big dataset, which allows block-level sampling methods to be used to efficiently select multiple random samples from the big dataset. The critical technical challenges of partitioning and sampling big datasets are highlighted in [9].
The ensemble approach is widely used to tackle the problems of clustering complex data. For clustering big complex datasets, developing scalable and appropriate ensemble methods is the main challenge. The classical clustering ensemble methods combine the outcomes of different models or algorithms on the same dataset to produce an ensemble result, but they are only appropriate for small or moderate-sized datasets. For ensemble clustering of a big dataset, the results obtained from different disjoint random samples of the big dataset must be ensembled. In this case, each object can only appear in one clustering result, and the object identities across different clustering results are lost, so the classical clustering ensemble methods are no longer applicable. Therefore, it is necessary to investigate an appropriate ensemble model with new integration functions to ensemble the clustering results from the disjoint random samples.
In response, we investigate in this paper a new clustering ensemble method that is data-adaptive and approximate in estimating the number of clusters. For large-scale data clustering, we aim to develop a feasible distributed clustering algorithm that (i) incorporates a scalable serial algorithm effectively, (ii) runs efficiently on a distributed platform, and (iii) does not require processing the entire dataset. To achieve this goal, we propose a new ensemble algorithm for estimating the number of clusters in a big dataset using multiple random samples. We name our algorithm RSPCE, an abbreviation of RSP-based Centers Ensemble. The RSPCE algorithm includes the following steps: (1) division of a big dataset into subsets of random samples, called RSP data blocks, which form the RSP data model; (2) random selection of a subset of RSP data blocks and identification of the number of clusters and the initial cluster centers in each RSP data block by the I-niceDP [10] algorithm; and (3) generation of the final ensemble result as a set of initial centers of K clusters by the RSPCE ensemble method, which uses the cluster ball model to merge clusters in random samples that are likely sampled from the same cluster in the big dataset. Unlike the classical ensemble clustering methods, the RSPCE algorithm does not depend on common object ids to ensemble the component clustering results.
We conducted experiments on both synthetic and real-world datasets. The results show that the new method is computationally effective and efficient in finding the number of clusters in a big dataset and the initial cluster centers, and that the RSPCE algorithm produces good approximations of the actual numbers of clusters in both synthetic and real-world datasets.
The rest of the paper is organized as follows: In Sect. Related work, we briefly discuss existing work on large-scale data clustering. In Sect. Ensemble method for estimating the number of clusters, we introduce the preliminaries of this work. In Sect. The proposed RSPCE algorithm, we present the proposed RSPCE scheme. In Sect. Experiments, we evaluate the performance of the proposed scheme through experimental results. Finally, Sect. Conclusions concludes the paper.
Related work
The number of clusters is an important parameter in many well-established clustering algorithms (e.g., k-means, k-medoids). This number is unknown in unlabeled datasets and is often guessed by the user. The Elbow [1], Silhouette coefficient [2], and Gap statistic [3] are well-known methods for finding the “right” number of clusters in a dataset. These methods identify the number of clusters by measuring the quality of several clustering results generated with different numbers of clusters. However, they do not work well on big datasets with a large number of clusters, and they are computationally expensive because multiple clustering results must be generated.
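To make the elbow heuristic concrete, the sketch below runs a tiny, quantile-initialized Lloyd's k-means on one-dimensional toy data for several candidate K values and inspects where the SSE curve flattens. This is a minimal illustration under assumed names (`kmeans_1d` is hypothetical), not any of the cited implementations:

```python
import random

def kmeans_1d(xs, k, iters=25):
    """Tiny Lloyd's k-means on 1-D data with quantile initialization; returns
    the within-cluster sum of squared errors (SSE) for the final centers."""
    srt = sorted(xs)
    centers = [srt[i * len(xs) // k] for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in xs:
            # Assign each point to its nearest center.
            groups[min(range(k), key=lambda j: abs(x - centers[j]))].append(x)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [sum(g) / len(g) if g else centers[j]
                   for j, g in enumerate(groups)]
    return sum(min((x - c) ** 2 for c in centers) for x in xs)

random.seed(0)
# Three well-separated 1-D Gaussian clusters: the true K is 3.
xs = ([random.gauss(0, 0.5) for _ in range(200)] +
      [random.gauss(10, 0.5) for _ in range(200)] +
      [random.gauss(20, 0.5) for _ in range(200)])

# SSE falls sharply until K reaches the true number of clusters, then flattens;
# the "elbow" in this curve is the estimated K.
sse = {k: kmeans_1d(xs, k) for k in (1, 2, 3, 4)}
print(sse[2] - sse[3] > 10 * (sse[3] - sse[4]))  # True: the elbow sits at K = 3
```

The drawback discussed above is visible even in this sketch: estimating one K value requires a full clustering run per candidate, which is what makes these methods expensive on big data.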
I-nice [4] is a prominent density-based algorithm for estimating the number of clusters that can also produce high-quality initial seeds on small datasets. In the I-nice algorithm, observation points are placed in the data space to observe the dense regions of clusters through the distance distributions between the observation points and the objects. Then, to find the number of peaks in a distance distribution, multiple gamma mixture models (GMMs) with different numbers of components are built, and each GMM is solved via the EM algorithm. The model with the minimum corrected Akaike information criterion (AICc) is selected as the best-fitted model, and the largest observed number of components is taken as the number of clusters.
Automatic clustering algorithms are attracting increasing attention from the academic community, e.g., density-based clustering and data depth clustering [11, 12]. Density-based algorithms, such as DBSCAN, can cluster datasets with convex shapes and noisy objects, but the density threshold is difficult to determine [13]. The depth difference method [14] estimates the depth within clusters, the depth between clusters, and the depth difference to determine the optimal value of K. However, for datasets with complex decision graphs, it is difficult to identify the clustering centers correctly.
In practice, according to their data preprocessing strategies, big data clustering processes can be classified as sampling-based, incremental, condensation-based, and divide-and-conquer based. Sampling-based techniques typically select a subset of a given dataset, employ only the sampled subset to find the number of clusters, and then assign the remaining data to obtain the final outcome [15, 16]. The success of sample-based methods rests on the premise that the chosen representative samples carry significant information about the dataset. Incremental approaches, on the other hand, reduce the computation time by scanning the data points only once [17, 18]. For fast clustering, modified global k-means [19] and multiple-medoids-based fuzzy clustering [17] methods have been developed based on the idea of incremental clustering. Condensation-based methods speed up the computation by encapsulating the data in a special data structure, such as a tree or graph [20, 21]. Divide-and-conquer strategies split the big dataset into several subsets or subspaces that can fit into memory; the clustering algorithms are then applied to these subsets or subspaces independently (see [22,23,24]), and the final clustering results are obtained by merging the partial clusters of the subsets or subspaces.
For big data clustering, a bootstrap method [25] was proposed to estimate the number of clusters by minimizing the corresponding estimated clustering instability. The kluster [26] procedure takes randomly selected clusters as the initial seeds to determine the final number of clusters in a dataset iteratively, confirming the most frequent number of resulting clusters across the iterations as the optimal number of clusters. The X-means method [27], on the other hand, automatically determines the number of clusters based on Bayesian information criterion (BIC) scores; at each iteration, it makes local decisions about splitting the current centers to better fit the data. In the same vein, coresets [28] have been constructed to scale clustering to massive datasets following the distributed clustering idea.
Although many clustering ensemble techniques have been developed (e.g., [22, 23, 29,30,31]), they either produce incorrect results or are inefficient for big data applications. The traditional methods [1,2,3] for finding the value of K run the clustering algorithm several times with a different K for each run, which is not suitable for big data analysis. We observe two limitations of the traditional clustering approaches. First, in a typical scenario, the algorithms require the number of clusters in advance of the clustering process. Second, traditional algorithms operate at the object level and are incapable of dealing with clustering ensembles of big data and large ensemble sizes. Furthermore, the classical ensemble technique combines the results of different models or algorithms on the same dataset to produce a robust result, whereas a scalable method is required to identify the number of clusters in a big dataset. We focus on the data-subset clustering ensemble technique, an approximate computing method that estimates the correct clustering outcome from the subsets of a big dataset.
Ensemble method for estimating the number of clusters
In this section, we propose a new method that uses multiple random samples of a big dataset to identify the number of clusters. We first give the definition of a random sample from a big dataset and define the random sample partition data model to represent a big dataset as a partition of random samples. Then, we present the IniceDP method for finding the number of clusters in a random sample. Finally, we propose a ball model for integrating the results of multiple random samples into the ensemble result as the number of clusters in the big dataset.
Multiple random samples of a big dataset
The existing methods are computationally infeasible for finding the number of clusters in a big dataset, so using a random sample for estimation is an acceptable choice. A large sample can yield an accurate estimate but is computationally expensive; an alternative is to use multiple random samples of smaller size. In this case, we need to solve the problems of drawing multiple random samples from a big dataset effectively and efficiently, and of ensembling the results of those multiple random samples into the final result.
Definition 1
(Random sample of a big dataset) Let D be a subset of big dataset \({\mathbb {D}}\), i.e., \(D \subset {\mathbb {D}}\). D is a random sample of \({\mathbb {D}}\) if
\(F(D) \approx F({\mathbb {D}}), \qquad (1)\)
where \(F(\cdot)\) is the cumulative distribution function.
The simple random sampling process can be used on \({\mathbb {D}}\) to generate D, which satisfies this definition. However, sampling a distributed big data file to generate multiple independent random samples is a time-consuming process. In this work, we use the random sample partition data model to represent a big dataset as a set of random sample data blocks, so that the block-level sampling method can be used to efficiently select multiple random samples.
The random sample partition (RSP) [8] is defined as follows:
Definition 2
(Random sample partition of a big dataset) Let \({\mathbb {D}}\) be a big dataset and \(\{D_1,D_2,...,D_m\}\) be a set of m random samples of \({\mathbb {D}}\). \(\{D_1,D_2,...,D_m\}\) is a random sample partition of \({\mathbb {D}}\) if
1. \(D_i \ne \emptyset , \qquad 1 \le i \le m\);
2. \(D_i\cap D_j = \emptyset , \qquad 1 \le i \ne j \le m\);
3. \(\bigcup _{i=1}^{m}D_i ={\mathbb {D}}\);
4. \(F(D_i) \approx F({\mathbb {D}}), \qquad 1 \le i \le m\).
The first three conditions define a partition of \({\mathbb {D}}\), whereas the last condition characterizes it as a random sample partition.
In the random sample partition, all RSP data blocks \(\{D_1,D_2,...,D_m\}\) satisfy the definition of a random sample of \({\mathbb {D}}\). To generate multiple random samples, we simply select a few RSP data blocks at random from the RSP data model, without going through all the records of \({\mathbb {D}}\) several times. As a result, the selection of multiple random samples becomes significantly more efficient.
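The RSP idea and its defining property can be illustrated with a small in-memory sketch (a toy stand-in — the real data model partitions distributed block files): the record order is randomized once, the sequence is cut into blocks, and each block's distribution then tracks the full data, as required by condition 4 of Definition 2.

```python
import random
import statistics

random.seed(42)
# Toy one-dimensional "big dataset": a balanced mixture of two Gaussian modes.
D = ([random.gauss(0, 1) for _ in range(5000)] +
     [random.gauss(10, 1) for _ in range(5000)])

# RSP construction: randomize the record order once, then cut sequentially.
random.shuffle(D)
m, n = 10, 1000                              # m blocks of n records each
blocks = [D[i * n:(i + 1) * n] for i in range(m)]

# Every block's sample mean should approximate the full-data mean, reflecting
# the condition F(D_i) ≈ F(D) of Definition 2.
full_mean = statistics.fmean(D)
max_dev = max(abs(statistics.fmean(b) - full_mean) for b in blocks)
print(max_dev < 0.6)  # True
```

Once the blocks exist, drawing b random samples is just picking b block indices, with no further pass over the data.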
Finding the number of clusters in a random sample
Given a set of random samples as a set of RSP data blocks \(\{D_1, D_2,..., D_b\}\), where \(b < m\), one important step is to find the number of clusters in each sample \(D_i\). In this work, we are interested not only in the number of clusters K, but also in the initial centers of the K clusters in each subset \(D_i\). For this reason, we first use the density-peak-based algorithm \(\text{ I-niceDP }\) [10] to compute K and the initial centers of clusters in a random sample. Then, we use the k-means algorithm to cluster the random sample and refine the cluster centers. The \(\text{ I-niceDP }\) operator is defined as follows:
\((k_i, C_i) = \text{ I-niceDP }(D_i), \qquad 1 \le i \le b, \qquad (2)\)
where \(\text{ I-niceDP }\) is an operator on random sample \(D_i\), and \((k_i, C_i)\) are its two return values: \(k_i\) is the number of clusters in \(D_i\), and \(C_i= \{c_1, c_2,...,c_{k_i}\}\) is the set of centers of the \(k_i\) clusters.
After \((k_i, C_i)\) are obtained from \(D_i\), they are used as the input parameters to the k-means algorithm to compute \(k_i\) refined cluster centers in \(D_i\) as follows:
\(C_{i}^{*} = k\text{-means }(k_i, C_i, D_i), \qquad 1 \le i \le b, \qquad (3)\)
where \(C_{i}^{*}=\{c_1^{*}, c_2^{*},...,c_{k_i}^{*}\}\) is the set of the refined centers of the \(k_i\) clusters in \(D_i\).
Applying the two operators \(\text{ I-niceDP }\) and \(k\text{-means }\) to all b random samples \(\{D_1, D_2,..., D_b\}\), we obtain b sets of refined centers and take the union of these sets to form a new set
\(C^{*} = \bigcup _{i=1}^{b} C_{i}^{*}, \qquad (4)\)
where \(C_{i}^{*}\) is the result of \(k\text{-means }(k_i, C_i, D_i)\).
The set \(C^{*}\) contains in total \(K=\sum _{i=1}^b{k_i}\) cluster centers from the b random samples. Since the random samples are taken from the same big dataset, they should have similar inherent clusters. Therefore, the numbers of clusters in them should be very close to each other, and the centers of clusters in different random samples should be located close to one another. Considering these properties, in the next subsection we propose a method, called the ball model, to aggregate the nearby centers in \(C^{*}\) into an ensemble set of centers as the initial cluster centers in the big dataset.
Ball model for representing clusters
Given a random sample \(D_i\), the k-means operator of (3) generates a set of \(k_i\) clusters. The centers of clusters in the same random sample should be well separated, but the centers of clusters in different random samples can be very close to each other. In that case, the two clusters in the two random samples may represent the same cluster in the big dataset, and their two centers should be merged into one, indicating the same cluster of the big dataset. In this subsection, we describe how to determine whether two clusters in two different random samples represent the same cluster in the big dataset.
Since the k-means clustering process produces spherical clusters, we propose a ball model to represent a spherical cluster as a ball. The main features of this ball model are defined below.
Definition 3
(Radius of a cluster ball) Let \(C_i\) be a cluster of n points, and \(c_i\) the center of the cluster. The radius of the cluster ball, \(r_i\), is defined as the average of the distances between all points in \(C_i\) and its center \(c_i\):
\(r_i = \frac{1}{n}\sum _{x \in C_i}\Vert x - c_i \Vert , \qquad (5)\)
where \(x\) denotes an object in \(C_i\).
We use the average distance to define the radius of the cluster ball to reduce the impact of outliers, i.e., the few points far away from the cluster center.
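Definition 3 is a one-line computation. The sketch below uses a hypothetical `ball_radius` helper; note how the single outlier at distance 9 raises the average only moderately, which is the motivation for using the mean rather than the maximum distance:

```python
import math

def ball_radius(points, center):
    """Average Euclidean distance from the points of a cluster to its center
    (Definition 3); the mean, rather than the max, damps the effect of outliers."""
    return sum(math.dist(p, center) for p in points) / len(points)

# Four points at distance 1 from the origin, plus one outlier at distance 9.
cluster = [(1, 0), (-1, 0), (0, 1), (0, -1), (9, 0)]
r = ball_radius(cluster, (0, 0))
print(r)  # (1 + 1 + 1 + 1 + 9) / 5 = 2.6
```

Had the radius been defined as the maximum distance, the same outlier would have inflated it to 9 and made the ball cover most of the data space.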
Definition 4
(Cluster ball) Let \(C_i\) be a cluster. Its cluster ball, denoted as \(\text{ CB}_i\), is defined as a 3-tuple
\(\text{ CB}_i = (C_i, c_i, r_i), \qquad (6)\)
where \(c_i\) and \(r_i\) are the center and radius of cluster \(C_i\), respectively. Note that, in the literature, the terms “center” and “centroid” are used interchangeably with the same meaning.
We can use cluster balls to determine whether two clusters are well separated or overlapping.
Definition 5
(Well-separated and overlapping clusters) Let \(\text{ CB}_i\) and \(\text{ CB}_j\) be the two balls of clusters \(C_i\) and \(C_j\), respectively. We say that \(C_i\) and \(C_j\) are well-separated if \(\text{ CB}_i\) and \(\text{ CB}_j\) are disjoint. If \(\text{ CB}_i\) and \(\text{ CB}_j\) intersect, we say that \(C_i\) and \(C_j\) are overlapping.
Figure 1 illustrates an example of merging the clusters from two random samples, RSP block i in Fig. 1a and RSP block j in Fig. 1b. Three clusters are found in each random sample and represented as three cluster balls. Figure 1c shows that clusters \(C_1\) and \(C_6\) intersect, and clusters \(C_3\) and \(C_5\) intersect, whereas clusters \(C_2\) and \(C_4\) are separated from the others. Two overlapping clusters are likely to be the same cluster of the big dataset. The following definition gives a property for merging two overlapping clusters.
Definition 6
(\(\frac{1}{2}\)-ball intersection property) Let \(\text{ CB}_i\) and \(\text{ CB}_j\) be two cluster balls. We say that \(\text{ CB}_i\) and \(\text{ CB}_j\) have the \(\frac{1}{2}\)-ball intersection property if \(\Vert c_i - c_j \Vert \le \frac{1}{2}(r_i+r_j)\), where \(r_i, r_j > 0\). If two cluster balls have the \(\frac{1}{2}\)-ball intersection property, they are strongly proximal.
This property is used to determine whether two clusters found in different random samples indicate the same cluster of the big dataset; if so, they can be merged into one cluster. If two clusters indicate the same cluster in the big dataset, their cluster balls must intersect and satisfy the \(\frac{1}{2}\)-ball intersection property. In this case, the intersecting cluster balls in the random samples are merged to obtain the optimal number of clusters in the big dataset. For example, in Fig. 1c, cluster balls \(\text{ CB}_1\) and \(\text{ CB}_6\) satisfy this property, so they are likely sampled from the same cluster of the big dataset and need to be merged into one cluster.
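Definition 6 reduces to a single comparison. The sketch below (with an assumed helper name) shows why the property is stronger than mere overlap: balls that merely touch intersect but are not strongly proximal.

```python
import math

def half_ball_intersect(c_i, r_i, c_j, r_j):
    """Definition 6: two cluster balls are 'strongly proximal' when the distance
    between their centers is at most half the sum of their radii."""
    return math.dist(c_i, c_j) <= 0.5 * (r_i + r_j)

# Balls that just touch (center distance 4 = r_i + r_j) overlap at a point but
# do NOT satisfy the 1/2-ball property, so their clusters would not be merged.
print(half_ball_intersect((0, 0), 2.0, (4, 0), 2.0))  # False
# Heavily overlapping balls satisfy the property and trigger a merge.
print(half_ball_intersect((0, 0), 2.0, (1.5, 0), 2.0))  # True
```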
Ensemble method for merging clusters with ball model
Using Definition 6, we can integrate the set of cluster centers in \(C^{*}\) into the ensemble set of centers as the initial cluster centers of the big dataset. This process is carried out as follows:
1. Randomly select a center \(c_p^{*}\) from \(C^{*}\). Make \(c_p^{*}\) a candidate of the final centers in the set \(\text{ CF }\), i.e., the final set of centers. Compute the cluster ball \(\text{ CB}_p^{*}\) and remove \(c_p^{*}\) from \(C^{*}\).
2. Randomly select a center \(c_q^{*}\) from \(C^{*}\) and compute the cluster ball \(\text{ CB}_q^{*}\).
3. Test the \(\frac{1}{2}\)-ball intersection property of the two cluster balls \(\text{ CB}_p^{*}\) and \(\text{ CB}_q^{*}\).
4. If the two balls are disjoint, or if they intersect but do not satisfy the \(\frac{1}{2}\)-ball intersection property of Definition 6, ignore the second ball \(\text{ CB}_q^{*}\) and go to Step 2; otherwise, add the cluster ball to the set \(\text{ CF }\). If all centers in \(C^{*}\) have been tested, go to the next step; otherwise, go to Step 2.
5. Merge the centers in \(\text{ CF }\) by computing the mean of the centers as the center of a cluster in the big dataset, and go to Step 1 until the centers of all clusters in the big dataset are found.
Since the radius of the cluster ball is included in the ball model, it is straightforward to test whether two cluster balls are disjoint. However, when two cluster balls intersect, Definition 6 plays an important role in determining whether to merge the two clusters: an intersection of two cluster balls that satisfies the \(\frac{1}{2}\)-ball intersection property is a much stronger indication that the two clusters could be the same cluster in the big dataset.
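The merging procedure can be sketched as follows. This is a simplified greedy reading of Steps 1–5 with hypothetical names (`ensemble_centers`, plain `(center, radius)` pairs), not the authors' exact implementation — in particular, it compares every candidate only against the current seed ball.

```python
import math

def ensemble_centers(balls):
    """Greedy sketch of the center-merging step: pick a seed ball (Step 1),
    collect every remaining ball that satisfies the 1/2-ball property with it
    (Steps 2-4), and replace the group by the mean of its centers (Step 5)."""
    balls = list(balls)
    final_centers = []
    while balls:
        c_p, r_p = balls.pop(0)                   # seed candidate
        group, rest = [c_p], []
        for c_q, r_q in balls:
            if math.dist(c_p, c_q) <= 0.5 * (r_p + r_q):
                group.append(c_q)                 # same big-data cluster
            else:
                rest.append((c_q, r_q))           # test against a later seed
        balls = rest
        # The merged cluster center is the mean of the grouped centers.
        final_centers.append(tuple(sum(x) / len(group) for x in zip(*group)))
    return final_centers

# Six sample-level centers from two random samples; the big dataset has two
# true clusters, around (0, 0) and (5, 5).
balls = [((0.0, 0.0), 1.0), ((0.2, 0.1), 1.0), ((5.0, 5.0), 1.0),
         ((0.1, -0.1), 1.0), ((5.1, 4.9), 1.0), ((4.9, 5.1), 1.0)]
centers = ensemble_centers(balls)
print(len(centers))  # 2 estimated clusters
```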
The proposed RSPCE algorithm
In this section, we present the algorithms used in the basic steps of the RSPCE algorithm for estimating the number of clusters in a big dataset and finding the initial cluster centers. The basic steps are summarized as follows:
1. Given a big dataset, generate its RSP data model for efficiently selecting multiple random samples.
2. For each random sample, find the number of clusters and the centers of the clusters.
3. Given two clusters, use the ball model to determine whether they could be the same cluster in the big dataset.
4. Use the ball model to integrate the clusters from multiple random samples into an ensemble set of clusters and initial cluster centers.
In the following, we present the algorithms in each step and give a complexity analysis of the RSPCE algorithm.
Algorithm for generating multiple samples
In this work, we use the random sample partition (RSP) data model to convert a big dataset into a set of disjoint random sample data blocks, so that each data block is used as a random sample of the big dataset. Therefore, to identify the number of clusters in a big dataset, we use Algorithm 1 to convert it to a set of RSP data block files for random sample selection.
The inputs of the algorithm are a big dataset \({\mathbb {D}}\) and the size of each RSP data block n. The output is a set of m RSP data blocks, where \(m=N/n\), which are saved as a set of RSP data block files \(\{D_1, D_2,..., D_m\}\).
Algorithm 1 is executed as follows: Lines 2–3 compute the number of objects N and the number of RSP blocks m. Line 4 generates a sequence of N unique random numbers following a uniform distribution. Line 5 appends the sequence of random numbers as one additional id in \({\mathbb {D}}\). Line 6 randomizes the records of \({\mathbb {D}}\) by sorting the records on the random number id. Lines 8–11 cut the sequence of the randomized records of \({\mathbb {D}}\) sequentially into m subsequences, each being written as an RSP data block file.
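A minimal in-memory sketch of these steps is given below, assuming data that fits in memory and a hypothetical `make_rsp_blocks` helper (the actual algorithm operates on a distributed big data file):

```python
import csv
import random
import tempfile
from pathlib import Path

def make_rsp_blocks(records, n, out_dir):
    """Sketch of Algorithm 1: attach a uniform random id to every record,
    sort on that id to randomize the record order, then cut the sequence
    into blocks of n records and write each block to its own CSV file."""
    N = len(records)
    m = N // n                                           # number of RSP blocks
    keyed = [(random.random(), rec) for rec in records]  # Lines 4-5: random ids
    keyed.sort(key=lambda kr: kr[0])                     # Line 6: randomize order
    paths = []
    for i in range(m):                                   # Lines 8-11: cut + write
        path = Path(out_dir) / f"rsp_block_{i}.csv"
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rec for _, rec in keyed[i * n:(i + 1) * n])
        paths.append(path)
    return paths

random.seed(7)
records = [(i, i % 5) for i in range(1000)]
with tempfile.TemporaryDirectory() as d:
    files = make_rsp_blocks(records, n=100, out_dir=d)
    print(len(files))  # 10 block files of 100 records each
```

Sorting on a uniform random key is equivalent to a full shuffle, which is what makes every resulting block a random sample of the input.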
Algorithm for finding the number of clusters in a random sample
We model the clusters in a big dataset as normal distributions, where each cluster has a high-density area reflected as the density peak of its distribution. Therefore, the number of clusters in a dataset corresponds to the number of density peaks. The \(\text{ I-niceDP }\) algorithm [10] (an improved version of I-nice) was designed to identify the number of clusters in a dataset by finding the number of density peaks in the distance distribution of the objects with respect to an observation point. Therefore, \(\text{ I-niceDP }\) is chosen as the operator to identify the number of clusters in each random sample. The pseudocode of the \(\text{ I-niceDP }\) algorithm is presented in Algorithm 2. The inputs to the algorithm are an RSP data representation of a big dataset and the number of RSP data blocks b. The outputs are a set of b values indicating the numbers of clusters found in the b random samples and b sets of cluster centers.
The \(\text{ I-niceDP }\) algorithm is explained below. First, Line 2 randomly selects b RSP data blocks. Starting from Line 3, each RSP data block is processed separately as follows:
1. Line 4 generates an observation point as a reference for computing the distance distribution of objects.
2. Line 5 computes the set of distances between the observation point and the data points of an RSP sample to form the distance vector.
3. Line 6 computes the maximal number of GMM components \({\mathcal {M}}_{\max }\) using the kernel density estimation (KDE) method, where \(\Delta _1\) and \(\Delta _2\) are two thresholds that control the potential number of components.
4. Lines 7–10 compute a set of GMMs from the distance vector with the number of components smaller than or equal to the maximal number \({\mathcal {M}}_{\max }\); each GMM is built using the EM algorithm.
5. Lines 12–14 select the best-fitted model based on the AICc criterion.
6. Lines 15–17 determine the high-density data points for each GMM component using the density peak mechanism; these high-density data points are used as the initial cluster centers.
7. Finally, Lines 18–19 pass the initial cluster centers to the k-means algorithm to cluster the input data and refine the cluster centers as the output result of the random sample.
After all RSP data blocks are processed, the set of refined cluster centers defined in Eq. (4) is generated.
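The core intuition behind Steps 1–2 can be demonstrated with a heavily simplified stand-in: distances from a single observation point to the data form one mode per visible cluster. The sketch below merely counts well-separated groups in the sorted distance vector (a gap larger than `gap` separates two modes), whereas the actual algorithm fits mixture models with EM and selects the component count by AICc; the helper name and threshold are illustrative assumptions.

```python
import math
import random

def count_distance_modes(points, obs_point, gap=1.0):
    """Toy stand-in for the I-niceDP idea: the distances from one observation
    point to the data fall into one mode per well-separated cluster, so the
    number of groups in the sorted distance vector, split wherever two
    consecutive distances differ by more than `gap`, approximates the number
    of clusters visible from that observation point."""
    dists = sorted(math.dist(p, obs_point) for p in points)
    return 1 + sum(1 for a, b in zip(dists, dists[1:]) if b - a > gap)

random.seed(1)
# Two tight, well-separated 2-D clusters, around (0, 0) and (6, 0).
data = ([(random.gauss(0, 0.3), random.gauss(0, 0.3)) for _ in range(300)] +
        [(random.gauss(6, 0.3), random.gauss(0, 0.3)) for _ in range(300)])
print(count_distance_modes(data, obs_point=(-3, 0)))  # 2
```

This also hints at why multiple observation points are used in practice: an unluckily placed single point can see two clusters at the same distance and merge their modes.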
Algorithm for identifying two clusters as one using the ball model
The \(\text{ I-niceDP }\) algorithm generates a set of clusters from the b random samples. Some of these clusters are likely sampled from the same cluster of the big dataset, so they have to be merged into one cluster as an approximation of the true cluster in the big dataset. Algorithm 3 uses the ball model to identify two clusters that are likely to be one cluster in the big dataset.
The inputs to the algorithm are two clusters \(C_i\) and \(C_j\) and their cluster centers \(c_i\) and \(c_j\). Line 2 computes the radii \(r_i\) and \(r_j\) of the two clusters. Lines 3–4 build the two cluster balls \(\text{ CB}_i\) and \(\text{ CB}_j\). Line 6 checks whether the two balls are disjoint; if so, \(\text{ Merge }\) is set to false. Lines 7–8 check whether the two balls intersect without satisfying the \(\frac{1}{2}\)-ball intersection property; if so, \(\text{ Merge }\) is set to false; otherwise, \(\text{ Merge }\) is set to true in Line 10. The algorithm outputs \(\text{ CB}_j\) if \(\text{ Merge }\) is true; otherwise, it outputs nothing.
Algorithm for ensembling the numbers of clusters in multiple samples
Finally, the pseudocode of the RSPCE algorithm is given in Algorithm 4. The inputs are a big dataset \({\mathbb {D}}\) and the sample size n. First, in Line 2, Algorithm 1, shown as the operator RSP(), is called to convert \({\mathbb {D}}\) into a set of m RSP data blocks. Line 3 randomly selects b RSP blocks. Lines 4–8 call Algorithm 2 as operator \(\text{ I-niceDP }()\) and operator \(k\text{-means }()\) to compute the set of initial cluster centers \(C^{*}\). Lines 11–13 call Algorithm 3 to find the clusters that are likely to be the same cluster. Line 14 merges these clusters into one and adds it to the set of final cluster centers. Line 16 counts the number of final cluster centers in \(\text{ CF }\). Finally, the algorithm outputs the number of clusters and the set of cluster centers.
Figure 2 illustrates the results of the three steps of the RSPCE algorithm. Figure 2a shows all refined centers found from 6 random samples of dataset DS1; we can see that the sets of clusters from the 6 random samples are very similar. Figure 2b plots all cluster balls, and Fig. 2c shows the final set of cluster centers, which are close to the true centers.
Complexity analysis
Given a big dataset with N objects, the following major parts need to be considered: generating an RSP data representation; randomly selecting a subset of random samples; finding the number of clusters in each random sample; using the k-means algorithm to refine the initial cluster centers of each random sample; and finally, using the ball model to ensemble the results of the multiple random samples.
Suppose each random sample contains n objects, and b samples are randomly selected from the m blocks, where \(b < m\). The random sample generation operation has a complexity of \({\mathcal {O}}(n\log (N/n))\). The I-niceDP algorithm transforms the data into O one-dimensional distance vectors and finds their density peaks, so its time complexity is \({\mathcal {O}}(bnOK)\), where O is the number of observation points. The complexity of the k-means algorithm is \({\mathcal {O}}(bnTdK)\), where T is the maximum number of iterations and d the number of dimensions. The time complexity of the cluster ball learning process is \({\mathcal {O}}(bK)\). Therefore, the overall complexity of the RSPCE algorithm is \({\mathcal {O}}(n \log (N/n) + bnOK + bnTdK + bK )= {\mathcal {O}}(n\log (N/n) + (nO+nTd+1)bK)\), which is linear in the number of data blocks b.
When the RSPCE algorithm is implemented on a distributed platform with Q nodes, its computational complexity can be reduced to \(({\mathcal {O}}(n\log (N/n) + (nO+nTd+1)bK)) / Q\). Therefore, the proposed RSPCE algorithm is efficient and scalable.
Experiments
A series of experiments were conducted on both synthetic and realworld datasets to demonstrate the performance of the proposed RSPCE algorithm and show its practical efficiency. In this section, the datasets and the experiment settings are presented. Evaluation measures are defined. The experiment results are analyzed, and the homogeneity of the results is discussed. Finally, the computational efficiency and scalability of the algorithm are demonstrated.
Datasets
The characteristics of the synthetic and real-world datasets used in the experiments are summarized in Table 1 and described below:
Synthetic datasets. Five synthetic datasets, named DS1 to DS5, were generated in 2 and 10 dimensions with different numbers of clusters drawn from multivariate normal distributions. The number of clusters, the size of each cluster, the dimensions, and the total number of objects in these datasets are given in Table 1.
Real-world datasets. Four real-world datasets were used in the experiments. The Covertype dataset with 581,012 objects describes 7 forest cover types in 54 geographic measurements; 84% of the objects belong to 2 types (type 1, 36.5% and type 2, 48.7%), and the remaining 16% belong to the other 5 types. The KDD'99-ID dataset with about 5 million objects describes connection sequences for network intrusion detection; it has 23 classes, and 98.3% of the objects belong to 3 classes (normal 19.6%, neptune 21.6%, and smurf 56.8%). The Poker Hand dataset has more than 1 million objects, each an example of a hand of five playing cards drawn from a standard deck of 52; the dataset has 10 predictive features and 10 classes, with two dominant classes accounting for over 90% of the samples (nothing in hand, 49.9% and one pair, 42.4%). The SUSY dataset was generated with Monte Carlo simulations; it has 5 million objects, 18 features, and 2 classes.
Experiment settings
In the experiments, all datasets were converted to RSP data representations, i.e., each dataset was transformed into a set of RSP data blocks. The left column of Table 2 shows the percentages of the total RSP blocks used to estimate the number of clusters; six different sizes of subsets of RSP blocks were used. The right column shows the two block sizes used to partition the synthetic datasets and the real-world dataset Covertype. The other three real-world datasets were partitioned with block sizes of {1, 2, 5, 10, 15, 20}% of the whole dataset. Therefore, each dataset was transformed into more than one RSP representation.
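A minimal stand-in for building an RSP representation is a global random shuffle followed by sequential chunking, which yields pairwise disjoint blocks that are each a random sample of the dataset. The function name `rsp_partition` is an assumption for illustration, not the authors' implementation.

```python
import numpy as np

def rsp_partition(X, block_size, seed=0):
    """Split X into disjoint random blocks of roughly block_size rows.

    A minimal sketch of the RSP idea: shuffle once globally, then chunk
    sequentially, so each block is a random sample and blocks are disjoint.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_blocks = max(1, len(X) // block_size)
    return [X[chunk] for chunk in np.array_split(idx, n_blocks)]

X = np.arange(100_000, dtype=float).reshape(-1, 2)  # toy 2-D dataset
blocks = rsp_partition(X, block_size=5_000)
print(len(blocks), blocks[0].shape)
```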
Six existing methods were selected for comparison with the proposed RSPCE algorithm: nselectboot [25], kluster [26], X-means [27], Elbow [1], Silhouette [2], and Gap statistic [3]. The number of clusters K in the last four methods was searched in the range \(K=2\) to 100. For the bootstrap procedure of kluster, the number of bootstrap samples was set to 20.
The experiments were performed on three local nodes, each equipped with an x64-based Intel(R) Core i7-7700 processor (3.60 GHz), 8 GB of memory, and 1 TB of storage. The RSPCE algorithm was implemented in Python 3.7.3 with the py2r bridge and the fpc, densityClust, and clValid R packages. Three observation points were used in the I-niceDP step of the RSPCE algorithm.
Evaluation metrics
The internal and stability measures below were used to evaluate the results of the comparison methods and the RSPCE algorithm.
Internal measures
The following internal measures were used to evaluate the compactness, connectivity, and separation of the cluster partitions.
Inertia, or the within-cluster sum of squared errors (SSE), measures the internal coherence of the objects in a cluster [32]. The lower the inertia value, the better the cluster; zero is optimal.
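The inertia defined above can be computed directly; scikit-learn also exposes it as the `inertia_` attribute of a fitted k-means model. The toy data here are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Within-cluster SSE (inertia) for a k-means partition on toy data.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Manual SSE: squared distance of each point to its assigned center.
sse = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(sse, km.inertia_)  # the two values agree
```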
The silhouette coefficient (SC) [2] evaluates the clustering quality by combining how well the clusters are separated (i.e., separation) and how compact the clusters are (i.e., tightness). The SC of a single sample is computed as
\(SC = \frac{b-a}{\max (a,b)}\)
where a is the average distance between a sample and all other data points in the same cluster, and b is the average distance between the sample and all data points in the nearest neighboring cluster. The average SC is the mean of the SC over all samples. A score closer to 1 corresponds to a model with better-defined clusters.
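The average silhouette coefficient is available in scikit-learn as `silhouette_score`; a short sketch on toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Average silhouette coefficient over all samples; values near 1
# indicate well-separated, compact clusters. Toy data for illustration.
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.6, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
sc = silhouette_score(X, labels)
print(round(sc, 3))
```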
The Davies-Bouldin index (DBI) [33] is an internal evaluation metric that validates the clustering using quantities and features inherent to the dataset. The DBI for K clusters is defined as
\(DBI = \frac{1}{K}\sum _{i=1}^{K}\max _{j\ne i}\frac{\Delta (C_i)+\Delta (C_j)}{\delta (C_i, C_j)}\)
where \(\delta (C_i, C_j)\) is the inter-cluster distance, i.e., the distance between clusters \(C_i\) and \(C_j\), and \(\Delta (C_i)\) is the intra-cluster distance of cluster \(C_i\), i.e., the distance within cluster \(C_i\). The lower the DBI value, the better the clustering result.
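scikit-learn implements this index as `davies_bouldin_score`; a minimal sketch on toy data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Davies-Bouldin index: lower is better, 0 is the ideal.
X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.5, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)
dbi = davies_bouldin_score(X, labels)
print(round(dbi, 3))
```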
Given ground-truth class assignments, it is natural to define intuitive metrics using conditional entropy analysis. V-measure [34] is an entropy-based measure that explicitly measures how successfully the criteria of homogeneity and completeness are satisfied. V-measure is computed as the harmonic mean of the homogeneity and completeness scores. Homogeneity captures whether each cluster contains only members of a single class, whereas completeness captures whether all members of a given class are assigned to the same cluster. V-measure is equivalent to the normalized mutual information (NMI) metric.
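The homogeneity/completeness/V-measure triple is available in scikit-learn; the toy labellings below illustrate that V-measure is invariant to label permutation while splitting a true class hurts completeness but not homogeneity.

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# V-measure on toy labellings: harmonic mean of homogeneity and
# completeness against the ground-truth classes.
truth   = [0, 0, 0, 1, 1, 1]
perfect = [1, 1, 1, 0, 0, 0]   # same partition, labels permuted
split   = [0, 0, 1, 2, 2, 2]   # one true class split into two clusters

h1, c1, v1 = homogeneity_completeness_v_measure(truth, perfect)
h2, c2, v2 = homogeneity_completeness_v_measure(truth, split)
print(v1)        # label permutation does not matter
print(h2, c2)    # pure clusters keep homogeneity high, completeness drops
```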
The adjusted rand index (ARI) [35] and the adjusted mutual information (AMI) [36] are also used to evaluate the performance of the RSPCE algorithm. These two measures are defined as follows:
Given the ground-truth result \(P=\{C_1, C_2, \ldots , C_K\}\) with K clusters and the predicted result \(P'=\{C'_1, C'_2, \ldots , C'_{K'}\}\) with \(K'\) clusters, the adjusted Rand index (ARI) [35] measures the similarity of the two assignments, defined as
\(ARI = \frac{\sum _{ij}\binom{n_{ij}}{2} - \left[ \sum _i \binom{a_i}{2}\sum _j \binom{b_j}{2}\right] \big / \binom{n}{2}}{\frac{1}{2}\left[ \sum _i \binom{a_i}{2} + \sum _j \binom{b_j}{2}\right] - \left[ \sum _i \binom{a_i}{2}\sum _j \binom{b_j}{2}\right] \big / \binom{n}{2}}\)
where \(n_{ij} = |C_i \cap C'_j|\), \(a_i = \sum _j n_{ij}\), \(b_j = \sum _i n_{ij}\), \(i\in \{1, \ldots , K\}\), \(j\in \{1, \ldots , K'\}\), n is the total number of data samples, and \(|\cdot |\) denotes the cardinality of a cluster. The maximum ARI value is one, attained only by identical partitions; a higher value indicates that the clustering result is closer to the ground truth.
AMI [36] is defined as
\(AMI(P, P') = \frac{MI(P, P') - E\left[ MI(P, P')\right] }{\mathrm{mean}\left( H(P), H(P')\right) - E\left[ MI(P, P')\right] }\)
where \(MI(P, P')\) is the mutual information between the two partitions, \(H(\cdot )\) denotes the entropy of a partition, and \(E\left[ MI(P, P')\right] \) is the expected mutual information under random labelling. AMI values are between 0 and 1; the higher the AMI value, the better the quality of the clusters.
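Both scores are available in scikit-learn, and both are invariant to the labels used to name the clusters, as the toy example below shows.

```python
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

truth = [0, 0, 0, 1, 1, 1]
pred  = [1, 1, 1, 0, 0, 0]   # same partition with permuted labels

ari = adjusted_rand_score(truth, pred)
ami = adjusted_mutual_info_score(truth, pred)
print(ari, ami)  # both 1.0: identical partitions regardless of labels
```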
Stability measures
The average proportion of non-overlap (APN) and the average distance between means (ADM) [32] were used to measure the stability and consistency of the results by comparing the ground-truth clustering of the entire dataset with the clusters obtained on the random samples.
Let \(C^{i,0}\) denote the cluster containing object i in the ground-truth clustering of the entire dataset, and \(C^{i,\ell }\) the cluster containing i obtained on the \(\ell \)-th of the b random samples. Given the total number of clusters K, the APN measure is defined as
\(APN(K) = \frac{1}{nb}\sum _{\ell =1}^{b}\sum _{i=1}^{n}\left( 1 - \frac{|C^{i,\ell } \cap C^{i,0}|}{|C^{i,0}|}\right) \)
The APN resides in [0, 1], and values close to zero correspond to highly consistent clustering results.
The ADM computes the average distance between the cluster centers determined on the entire dataset and on the random samples. It is defined as
\(ADM(K) = \frac{1}{nb}\sum _{\ell =1}^{b}\sum _{i=1}^{n} d\left( \bar{x}_{C^{i,0}},\, \bar{x}_{C^{i,\ell }}\right) \)
where \(\bar{x}_{C^{i,0}}\) is the mean of the objects in the cluster containing object i on the entire dataset, and \(\bar{x}_{C^{i,\ell }}\) is the corresponding mean on the \(\ell \)-th random sample. The distance d is Euclidean. It also has a value between 0 and 1, and smaller values are preferred.
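A minimal numpy sketch of the APN idea, following the clValid-style definition of comparing, per object, its cluster in one clustering with its cluster in another; the function name and the exact sample-based adaptation used in the paper are assumptions for illustration.

```python
import numpy as np

def apn(labels_full, labels_sample):
    """Average proportion of non-overlap between two clusterings of the
    same objects (clValid-style sketch; 0 means perfectly consistent)."""
    labels_full = np.asarray(labels_full)
    labels_sample = np.asarray(labels_sample)
    props = []
    for i in range(len(labels_full)):
        full_cluster = labels_full == labels_full[i]        # C^{i,0}
        sample_cluster = labels_sample == labels_sample[i]  # C^{i,l}
        overlap = np.sum(full_cluster & sample_cluster)
        props.append(1.0 - overlap / np.sum(full_cluster))
    return float(np.mean(props))

full  = [0, 0, 0, 1, 1, 1]
same  = [1, 1, 1, 0, 0, 0]   # identical partition, permuted labels
moved = [0, 0, 1, 1, 1, 1]   # one object moved to the other cluster
print(apn(full, same), apn(full, moved))
```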
Experiment results and analysis
Results of the number of clusters
The first set of experiments used five existing methods to identify the number of clusters from random samples of the five synthetic datasets in Table 1. Two sample sizes of 5000 points and 10,000 points were used. For each random sample in a synthetic dataset, the number of clusters in the sample was discovered by the five methods. Since there are m random samples in one synthetic dataset for each sample size, m results of the number of clusters were found by each method. The heatmaps of the results of the five methods on the m random samples in the five synthetic datasets, with two RSP representations each, are shown in Fig. 3. The columns of each figure are the numbers of clusters, and the rows are the five methods. A dark cell indicates that a high percentage of the m results produced by the corresponding method equals the number of clusters given by the column. For example, in the right figure of Fig. 3a, the dark cell in the second column from the right in the row of Gap statistic implies that the majority of the numbers of clusters identified by Gap statistic from 100 random samples of dataset DS1 is 9. From Fig. 3, we can see that Gap statistic, Silhouette, and X-means performed better than Elbow and I-niceDP in identifying the number of clusters from random samples of a big dataset. Another observation is that a bigger sample size results in a more accurate result.
Using the two-dimensional synthetic datasets DS1 and DS2, we compared the true cluster centers with the cluster centers identified by the five methods from the random samples of the two datasets. The results are plotted in Fig. 4. We can see that the cluster centers identified by I-niceDP are closer to the true centers than those identified by the other four methods. These results indicate that I-niceDP is more capable of identifying good initial cluster centers than the other existing methods, so it is chosen as the operator that identifies the number of clusters from a random sample in the algorithm.
Figure 5 shows the performance of the RSPCE algorithm in identifying the number of clusters in the five synthetic datasets with two different sample sizes. The horizontal axis in each plot is the number of random samples used by the algorithm, and the vertical axis shows the number of clusters identified. The horizontal straight line indicates the true number of clusters in each dataset. For the same number of random samples, the RSPCE algorithm was run 20 times on each RSP representation of a synthetic dataset. From Fig. 5, we can see that for the two-dimensional datasets DS1 and DS2 with fewer clusters, the RSPCE algorithm easily identified the true number of clusters with a few random samples. For the high-dimensional dataset DS3 with fewer clusters, the RSPCE algorithm also converged to the true number of clusters as the number of random samples increased. However, for the two high-dimensional datasets DS4 and DS5, the RSPCE algorithm needed more random samples to converge to the true number of clusters. Another observation on the ensemble method is that smaller samples gave better results than bigger ones. The reason may be that a larger number of smaller samples generates more diverse results, which can improve the final ensemble result. However, more investigation is required before a firm conclusion can be drawn on this observation.
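The ensemble idea above can be sketched in a simplified form: estimate K on each disjoint random sample and take a majority vote over the per-sample estimates. The per-sample estimator below uses a silhouette sweep as a stand-in for the I-niceDP step, and `estimate_k` is a hypothetical helper, not the authors' implementation.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

def estimate_k(sample, k_range=range(2, 8), seed=0):
    """Pick K by the best silhouette score (a simple stand-in for the
    per-sample I-niceDP step of RSPCE)."""
    scores = {}
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=5,
                        random_state=seed).fit_predict(sample)
        scores[k] = silhouette_score(sample, labels)
    return max(scores, key=scores.get)

# Toy "big" dataset split into six disjoint random samples.
X, _ = make_blobs(n_samples=6000, centers=4, cluster_std=0.7, random_state=0)
rng = np.random.default_rng(0)
samples = np.array_split(rng.permutation(len(X)), 6)

estimates = [estimate_k(X[s]) for s in samples]
k_hat = int(np.bincount(estimates).argmax())  # majority vote as the ensemble
print(estimates, k_hat)
```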
Improvements of clustering results
Having obtained the number of clusters and the initial cluster centers in each synthetic dataset with the RSPCE algorithm, we used the k-means algorithm to cluster each random sample used in the RSPCE algorithm, with the number of clusters and the initial cluster centers as input parameters. For each set of random samples, we used 8 internal measures to validate the clustering results. Table 3 shows the validation results for the 5 synthetic datasets with 2 sample sizes and 6 different subsets of random samples, listed in column es. We can see clearly that the clustering result improved as more random samples were used by the RSPCE algorithm. Again, smaller random sample sizes resulted in better clustering results, which is consistent with the observation from Fig. 5.
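Seeding k-means with a given K and given initial centers is direct in scikit-learn: pass the center array as `init` and set `n_init=1` to skip random restarts. The centers below are illustrative placeholders, not output of the actual RSPCE algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical ensemble output: K = 3 and three initial centers.
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
X, _ = make_blobs(n_samples=900, centers=centers, cluster_std=0.5,
                  random_state=0)

# k-means seeded with the estimated centers; no random restarts needed.
km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
print(km.cluster_centers_.round(1))
```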
We also investigated the stability and consistency of the RSPCE algorithm in detecting cluster centers. The APN and ADM measures were used to evaluate the clustering consistency by comparing the results obtained with different numbers of random samples. The APN and ADM scores are shown in Table 4. The APN and ADM scores tend to decrease as the number of random samples increases. Again, the RSPCE algorithm performed significantly better on the datasets with smaller numbers of clusters. The sets of random samples containing \(10\sim 20\%\) of the big dataset gave the best results. These results also show that the RSPCE algorithm can generate stable cluster centers that are close to the centers of the true clusters in the entire dataset.
In the experiments, we observed that small samples (\(n<2000\)) often miss mini-clusters, or do not contain enough points to characterize small clusters. Increasing the random sample size solves this problem, but the computing cost also increases. A trade-off on the sample size needs consideration in practice.
Statistical homogeneity test
Homogeneity tests were conducted to verify the cluster centers discovered from the random samples: the distribution of the distances between these centers should be similar to the distribution of the distances between the true centers in the big dataset. We conducted the two-sample Kolmogorov-Smirnov (KS) test and the Z-test to compare the distance distributions of cluster centers between the entire dataset \({\mathcal {G}}\) and the random samples \({\mathcal {A}}\).
The null hypothesis is that \({\mathcal {G}}\) and \({\mathcal {A}}\) have the same distribution, and the alternative hypothesis is that they have different distributions. We set \(h=1\) if the null hypothesis is rejected (i.e., the distributions are not the same); otherwise, we set \(h=0\) and the null hypothesis is not rejected. The tests were conducted at a significance level of 5%. The p-value is the probability of observing a test statistic at least as extreme as the one obtained, assuming the null hypothesis is true. The test results are presented in Table 5. We can see that the null hypothesis is accepted in all cases. Figure 6 illustrates that the test CDFs (green, blue, cyan, magenta, yellow, and black) of the random samples closely match the empirical CDF (red) of the whole dataset, and the largest difference is small.
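The two-sample KS test described above is available as `scipy.stats.ks_2samp`. The sketch below uses synthetic stand-ins for the two center-distance samples; the variable names `g` and `a` mirror \({\mathcal {G}}\) and \({\mathcal {A}}\) and are assumptions for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

# Two-sample KS test at the 5% level, comparing two distance samples.
rng = np.random.default_rng(0)
g = rng.normal(loc=5.0, scale=1.0, size=2000)  # stand-in for distances in G
a = rng.normal(loc=5.0, scale=1.0, size=500)   # stand-in for distances in A

stat, p_value = ks_2samp(g, a)
h = 1 if p_value < 0.05 else 0   # h = 0 means we fail to reject the null
print(round(stat, 3), round(p_value, 3), h)
```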
Comparisons of RSPCE with other methods
We compared the results of the RSPCE algorithm in identifying the number of clusters from multiple random samples with the results of the other methods in identifying the number of clusters from one random sample. Table 6 shows the results on the synthetic datasets, and Table 7 shows the results on the real-world datasets. The sample size is 5000 points. Different numbers of random samples, shown in column es, were used in the RSPCE algorithm. For each number of random samples, the RSPCE algorithm was run 20 times on different sets of random samples, while the other methods were applied to each random sample to generate one result. The average values and the standard deviations over the multiple runs were calculated. We can see that the RSPCE algorithm performed best in general on both the synthetic and the real-world datasets.
Specifically, Silhouette, Gap statistic, and the RSPCE algorithm are more accurate in identifying the number of clusters in all five synthetic datasets, and among them, the RSPCE algorithm performed best. kluster and Elbow performed well on DS1 but not on the other datasets. Neither nselectboot nor X-means performed well on any dataset; it seems that the bootstrap procedure in nselectboot did not provide an advantage here. nselectboot, X-means, and Gap statistic all overestimated the number of clusters, whereas Elbow and kluster underestimated it in the last four datasets.
Moreover, RSPCE was able to identify the cluster centers more accurately. Among the compared methods, only Silhouette and Gap statistic were able to identify the cluster centers of the five synthetic datasets, and both are Euclidean distance-based and hence computationally expensive.
We also examined the effectiveness of the RSPCE algorithm on the four real-world datasets, using the number of classes in each dataset as the "true" number of clusters. The results are displayed in Table 7. The numbers of clusters estimated by the RSPCE algorithm correlate well with the original numbers of classes.
Computational efficiency
In this section, we compare the computational efficiency of the seven methods for identifying the number of clusters against different data sizes. The results are plotted in Fig. 7, with the execution time measured in minutes. Comparatively, the Gap statistic and Silhouette methods were inefficient, while the other methods had similar execution times on these datasets.
It is noteworthy that we use a subset of random samples drawn from the big dataset to approximate the result on the entire dataset. Thus, the proposed RSPCE approach does not require analyzing the entire dataset at once.
Conclusions
In this paper, we proposed a multiple random sample-based ensemble method to estimate the number of clusters in a large dataset. We partitioned a big dataset into a set of RSP data blocks that serve as random samples of the big dataset. Then, we randomly selected a subset of data blocks and identified the number of clusters in each block independently. Finally, we ensembled the results of the multiple random samples into an estimate for the entire dataset. Moreover, a cluster ball model was introduced to merge the clusters of the random samples that are likely sampled from the same cluster in the big dataset.
We conducted extensive experiments to investigate the effectiveness and stability of the RSPCE algorithm and further analyzed the impact of the sample size and the ensemble size. The experimental results demonstrated that the proposed algorithm is capable of generating good approximations of the actual cluster centers in the big dataset from a few random samples. They also demonstrated that the RSPCE algorithm is scalable to big data and flexible for clustering large-scale data on a single machine or a cluster.
One should note that our cluster ball model is only suitable for merging clusters of spherical shape. This is a limitation of the RSPCE algorithm when it is applied to datasets with clusters of irregular shapes. In future work, we will address this issue by adopting graph ensembles for high-dimensional datasets with complex nonlinear manifold structures, such as moon-shaped and Swiss-roll data. Besides, we will investigate a statistical framework to design an ensemble for distributed clustering that exploits both the weight information and the efficiency of multiple random samples.
Availability of data and materials
The code and experimental data are available at https://github.com/sultanszu/RSPCE.git.
References
Thorndike RL. Who belongs in the family. Psychometrika. 1953. https://doi.org/10.1007/BF02289263.
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65. https://doi.org/10.1016/0377-0427(87)90125-7.
Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Series B. 2001;63(2):411–23. https://doi.org/10.1111/1467-9868.00293.
Masud MA, Huang JZ, Wei C, Wang J, Khan I, Zhong M. I-nice: a new approach for identifying the number of clusters and initial cluster centres. Inf Sci. 2018;466:129–51. https://doi.org/10.1016/j.ins.2018.07.034.
Nair R. Big data needs approximate computing: technical perspective. Commun ACM. 2014;58(1):104–104. https://doi.org/10.1145/2688072.
Meng XL. Statistical paradises and paradoxes in big data (I): law of large populations, big data paradox, and the 2016 US presidential election. Ann Appl Stat. 2018;12(2):685–726. https://doi.org/10.1214/18-AOAS1161SF.
Rojas JAR, Beth Kery M, Rosenthal S, Dey A. Sampling techniques to improve big data exploration. In: 2017 IEEE 7th Symp. Large Data Anal. Vis. (LDAV); 2017. https://doi.org/10.1109/LDAV.2017.8231848.
Salloum S, Huang JZ, He Y. Random sample partition: a distributed data model for big data analysis. IEEE Trans Ind Informat. 2019;15(11):5846–54. https://doi.org/10.1109/TII.2019.2912723.
Mahmud MS, Huang JZ, Salloum S, Emara TZ, Sadatdiynov K. A survey of data partitioning and sampling methods to support big data analysis. Big Data Mining Anal. 2020;3(2):85–101.
He Y, Wu Y, Qin H, Huang JZ, Jin Y. Improved I-nice clustering algorithm based on density peaks mechanism. Inf Sci. 2021;548:177–90. https://doi.org/10.1016/j.ins.2020.09.068.
Xu X, Ding S, Wang Y, Wang L, Jia W. A fast density peaks clustering algorithm with sparse search. Inform Sci. 2021;554:61–83. https://doi.org/10.1016/j.ins.2020.11.050.
Rodriguez A, Laio A. Clustering by fast search and find of density peaks. Science. 2014;344(6191):1492–6. https://doi.org/10.1126/science.1242072.
Schubert E, Sander J, Ester M, Kriegel HP, Xu X. DBSCAN revisited, revisited: why and how you should (still) use DBSCAN. ACM Trans Database Syst. 2017. https://doi.org/10.1145/3068335.
Patil C, Baidari I. Estimating the optimal number of clusters k in a dataset using data depth. Data Sci Eng. 2019;4:132–40.
Zhao X, Liang J, Dang C. A stratified sampling based clustering algorithm for large-scale data. Knowl Based Syst. 2019;163:416–28. https://doi.org/10.1016/j.knosys.2018.09.007.
Jia J, Xiao X, Liu B, Jiao L. Bagging-based spectral clustering ensemble selection. Pattern Recognit Lett. 2011;32(10):1456–67. https://doi.org/10.1016/j.patrec.2011.04.008.
Wang Y, Chen L, Mei J. Incremental fuzzy clustering with multiple medoids for large data. IEEE Trans Fuzzy Syst. 2014;22(6):1557–68. https://doi.org/10.1109/TFUZZ.2014.2298244.
Hu J, Li T, Luo C, Fujita H, Yang Y. Incremental fuzzy cluster ensemble learning based on rough set theory. Knowl Based Syst. 2017;132:144–55. https://doi.org/10.1016/j.knosys.2017.06.020.
Bagirov AM, Ugon J, Webb D. Fast modified global kmeans algorithm for incremental cluster construction. Pattern Recognit. 2011;44(4):866–76. https://doi.org/10.1016/j.patcog.2010.10.018.
Mimaroglu S, Erdil E. Combining multiple clusterings using similarity graph. Pattern Recognit. 2011. https://doi.org/10.1016/j.patcog.2010.09.008.
Huang D, Lai J, Wang CD. Ensemble clustering using factor graph. Pattern Recognit. 2016;50(C):131–42. https://doi.org/10.1016/j.patcog.2015.08.015.
Ayad HG, Kamel MS. On voting-based consensus of cluster ensembles. Pattern Recognit. 2010;43(5):1943–53. https://doi.org/10.1016/j.patcog.2009.11.012.
Iam-On N, Boongoen T, Garrett S, Price C. A link-based approach to the cluster ensemble problem. IEEE Trans Pattern Anal Mach Intell. 2011;33(12):2396–409. https://doi.org/10.1109/TPAMI.2011.84.
Yang J, Liang J, Wang K, Rosin PL, Yang M. Subspace clustering via good neighbors. IEEE Trans Pattern Anal. 2020;42(6):1537–44. https://doi.org/10.1109/TPAMI.2019.2913863.
Fang Y, Wang J. Selection of the number of clusters via the bootstrap method. Comput Stat Data Anal. 2012;56(3):468–77. https://doi.org/10.1016/j.csda.2011.09.003.
Estiri H, Abounia Omran B, Murphy SN. kluster: an efficient scalable procedure for approximating the number of clusters in unsupervised learning. Big Data Res. 2018;13:38–51. https://doi.org/10.1016/j.bdr.2018.05.003.
Pelleg D, Moore AW. X-means: extending k-means with efficient estimation of the number of clusters. In: Proc. 17th Int. Conf. Mach. Learn. (ICML '00). CA, USA: Morgan Kaufmann Publishers Inc.; 2000. p. 727–34.
Bachem O, Lucic M, Krause A. Scalable k-means clustering via lightweight coresets. In: Proc. 24th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min. (KDD '18), NY, USA; 2018. p. 1119–27. https://doi.org/10.1145/3219819.3219973.
Wu J, Liu H, Xiong H, Cao J, Chen J. K-means-based consensus clustering: a unified view. IEEE Trans Knowl Data Eng. 2015;27(1):155–69. https://doi.org/10.1109/TKDE.2014.2316512.
Iam-On N, Boongeon T, Garrett S, Price C. A link-based cluster ensemble approach for categorical data clustering. IEEE Trans Knowl Data Eng. 2012;24(3):413–25.
Ren Y, Domeniconi C, Zhang G, Yu G. Weighted-object ensemble clustering: methods and analysis. Knowl Inf Syst. 2017;51(2):661–89. https://doi.org/10.1007/s10115-016-0988-y.
Brock G, Pihur V, Datta S, Datta S. clValid: an R package for cluster validation. J Stat Softw. 2008;25(4):1–22. https://doi.org/10.18637/jss.v025.i04.
Davies DL, Bouldin DW. A cluster separation measure. IEEE Trans Pattern Anal Mach Intell. 1979;1(2):224–7. https://doi.org/10.1109/TPAMI.1979.4766909.
Rosenberg A, Hirschberg J. V-measure: a conditional entropy-based external cluster evaluation measure. In: Proc. 2007 Joint Conf. Empir. Methods Nat. Lang. Process. Comput. Nat. Lang. Learn. (EMNLP-CoNLL). Prague, Czech Republic: Association for Computational Linguistics; 2007. p. 410–20.
Hubert L, Arabie P. Comparing partitions. J Classif. 1985;2(1):193–218. https://doi.org/10.1007/BF01908075.
Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. J Mach Learn Res. 2010;11:2837–54. https://doi.org/10.5555/1756006.1953024.
Acknowledgements
Not applicable.
Funding
This research was supported by the National Natural Science Foundation of China under Grant 61972261.
Author information
Authors and Affiliations
Contributions
MS Mahmud: Conceptualization, Investigation, Software, Methodology, Validation, Writing  original draft; JZ Huang: Supervision, Methodology, Funding acquisition, Writing  review & editing; RR: Project administration, Validation, Writing  review & editing; KW: Funding acquisition, Resources. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Mahmud, M.S., Huang, J.Z., Ruby, R. et al. An ensemble method for estimating the number of clusters in a big data set using multiple random samples. J Big Data 10, 40 (2023). https://doi.org/10.1186/s40537023007094
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s40537023007094
Keywords
 Ensemble learning
 Number of clusters
 Random sample partition
 Cluster ball model
 Approximate computing