Big data: an optimized approach for cluster initialization
Journal of Big Data volume 10, Article number: 120 (2023)
Abstract
The k-means algorithm, one of the most widely used clustering algorithms, is fast to compute and produces comparatively good clusters. However, it has two major downsides: first, it is sensitive to the initialization of the k cluster centers, and second, especially for larger datasets, the number of iterations can be very large, making it computationally expensive. To address these issues, we propose a scalable and cost-effective algorithm, called Rkmeans, which provides an optimized solution for clustering large-scale high-dimensional datasets. The algorithm first selects O(R) initial points, then reselects O(l) better initial points from the dataset using a distance-based probability. These points are then clustered again into k initial points. An empirical study in a controlled environment was conducted using both simulated and real datasets. Experimental results show that the proposed approach outperforms previous approaches as the size of the data and the number of dimensions increase.
Introduction
Clustering plays a crucial role in unsupervised learning, with a wide array of applications across various domains. The primary goal of clustering is to organize data so that similar data points are grouped within the same cluster while dissimilar clusters are well separated. It is presumed that the number of clusters and the initial points are known. Several clustering techniques have been developed to find patterns in unlabeled data, such as partition-based clustering [1], Hierarchical Agglomerative Clustering (HAC) [2], DBSCAN [3], Gaussian Mixture Models (GMM) [4], and Spectral Clustering [5].
The clustering problem is commonly defined as the problem of minimizing an objective function while clustering the data points. The most commonly used objective function in clustering is the sum of squared errors (SSE) [6], which is computed as the sum of the squared distances between the data points and their respective cluster centroids. Thus, the aim of the objective function is to find the clusters with minimum internal variance. Generally, the selection of an appropriate objective function depends on the specific problem, the data characteristics, and the desired outcome of the clustering task.
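As a small illustration of the SSE objective described above, the following sketch computes it for a toy two-dimensional example. The function name `sse` and the sample points are illustrative assumptions and do not come from the paper's experiments.

```python
def sse(points, centroids, labels):
    """Sum of squared Euclidean distances from each point to its assigned centroid."""
    total = 0.0
    for p, lab in zip(points, labels):
        c = centroids[lab]
        total += sum((pi - ci) ** 2 for pi, ci in zip(p, c))
    return total


# Hypothetical toy data: two clusters of two points each.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centroids = [(0.0, 1.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
print(sse(points, centroids, labels))  # 4.0
```

Each point lies at distance 1 from its centroid, so the four squared distances sum to 4.0.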
The k-means algorithm [7] is one of the simplest and most familiar clustering algorithms, based on Lloyd’s algorithm [8]. It works by dividing the data points into k clusters, where k is specified by the user. The algorithm assigns each data point to the cluster whose mean (centroid) is closest to it. It is a popular and easy-to-implement algorithm that can be used for a variety of applications, such as image processing [9], image segmentation [10], data summarization [11], text clustering [12], and sound source angle estimation [13]. However, it has some limitations, including the need to specify the number of clusters in advance and the sensitivity of the results to the initial placement of the centroids.
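The assignment and update steps of Lloyd's algorithm can be sketched as follows. This is a minimal pure-Python illustration under simplifying assumptions (points as tuples, a fixed iteration cap, empty clusters keep their previous centroid); the function name `lloyd` and the sample data are hypothetical.

```python
def lloyd(points, centroids, max_iter=100):
    """Plain Lloyd iterations: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: index of the closest centroid for every point.
        labels = [
            min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: recompute each centroid as the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centroids.append(old)  # assumption: an empty cluster keeps its centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, labels


points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centers, labels = lloyd(points, [(0, 0), (10, 10)])
print(centers)  # [(1.0, 1.5), (8.5, 8.0)]
print(labels)   # [0, 0, 1, 1]
```

The result depends entirely on the starting centroids passed in, which is the sensitivity to initialization noted above.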
Although k-means is widely used in clustering, its non-probabilistic nature and its use of a simple radial distance metric to assign cluster membership limit its performance, especially for high-dimensional large-scale datasets [14]. It is therefore challenging for existing k-means-based clustering algorithms in the big data domain [15] to cluster the data optimally [16]. One of the main problems with k-means is cluster initialization, as the initial selection of centroids greatly influences its performance. Different initializations can lead to different final clustering results. If the initial centroids are chosen randomly, the algorithm, owing to its iterative nature, may converge to a local optimum rather than the global optimum. Poor initialization can also lead to clusters of unequal size. Recent research has placed significant emphasis on cluster initialization methods specifically designed for large-scale, high-dimensional datasets, as better initialization dramatically improves the performance of Lloyd iterations in terms of convergence and quality [17].
A significant advancement in this direction was made by the k-means++ algorithm [18], in which the initial point is chosen randomly and subsequent points are selected using a probability distribution that ensures each selected center is dissimilar to the ones already chosen. The downside of k-means++ is that the initialization phase requires k sequential passes over the data. This is because the selection of new points relies on the previously selected points, making it challenging to parallelize. Another approach, known as k-means|| [19], is a variant of k-means++ specifically designed for parallel initialization of cluster centroids. The algorithm speeds up k-means++ by sampling l times more points in each round independently. Independent sampling speeds up initialization, but the quality of the selected centers is not as good as that of k-means++, leading to an increase in the number of Lloyd iterations needed for convergence.
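The sequential nature of k-means++ seeding can be seen in a short sketch. The version below assumes the usual D² weighting from [18], in which each new center is drawn with probability proportional to its squared distance from the nearest center chosen so far; the helper names and sample points are illustrative assumptions.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeanspp_seed(points, k, rng):
    """D^2 seeding: every new center depends on all centers chosen before it."""
    centers = [rng.choice(points)]  # first center: uniform at random
    d2 = [sq_dist(p, centers[0]) for p in points]
    while len(centers) < k:
        if sum(d2) == 0:
            # Every point coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(points))
        else:
            centers.append(rng.choices(points, weights=d2, k=1)[0])
        # One full pass over the data per new center.
        d2 = [min(old, sq_dist(p, centers[-1])) for old, p in zip(d2, points)]
    return centers


rng = random.Random(0)
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0), (5.0, 9.0), (5.0, 10.0)]
print(kmeanspp_seed(points, 3, rng))
```

Each pass through the `while` loop needs the distances updated by the previous pass, which is why the k passes cannot simply be run in parallel.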
In this work, we propose a variant of the k-means++ clustering algorithm that is comparatively scalable and cost-effective for clustering large-scale high-dimensional datasets. Keeping k-means|| as a baseline, the proposed algorithm introduces one more optimization factor, R, which ensures that the selected cluster centers are far away from each other.
The proposed algorithm aims to reduce the problem of centroids converging to local minima by introducing an optimization factor R = θk. The factor R causes more centers to be selected than the desired number, i.e., R ≥ k. The main idea behind our algorithm is to select O(k) points in each round and pass these points to k-means++ to select O(l) points that are far away from each other. The process repeats O(log n) times and finally leaves O(l log k) points, which are then reclustered into k points. The algorithm guarantees a reduction in the cost of Lloyd’s step.
To evaluate the proposed algorithm, an experimental evaluation is conducted using real and artificial datasets to compare the clustering quality of Rkmeans with state-of-the-art clustering techniques. The SSE, a very popular internal evaluation metric, is calculated before and after the initialization process. For statistical evaluation, one-way analysis of variance (ANOVA) is used to determine whether there is any significant difference between the performance of the proposed algorithm and that of k-means++ and k-means||.
The main contributions of this study are the following:

In “The proposed algorithm: Rkmeans” section, we introduce a scalable algorithm named Rkmeans, specifically designed for clustering large-scale datasets. Our algorithm incorporates a novel factor R, which enhances the initialization process in k-means, leading to improved clustering results.

In “Algorithm evaluation” section, we present an empirical evaluation of our proposed algorithm, showcasing its effectiveness in the context of clustering large datasets. The evaluation encompasses various datasets, highlighting the algorithm’s performance and its capability to handle big data.

In “Result validation” section, we provide a statistical evaluation of the performance of our proposed algorithm.
The rest of the paper is organized as follows. In “Related work” section, we present a detailed discussion of related work to show the research gap. In “The proposed algorithm: Rkmeans” section, we present our proposed algorithm with a detailed illustration. In “Empirical evaluation of proposed algorithm” section, we discuss the results of our empirical experiments. Finally, “Conclusion” section concludes the paper.
Related work
The problem of clustering has been addressed in a variety of contexts. Several variants of the k-means algorithm are available in the literature, ranging from selection of the initial k parameter to generating proper “seeding” with different objective functions and data reduction schemes to reduce the number of iterations.
The k-means algorithm has emerged as one of the top-rated data mining algorithms [20] due to its simplicity and ease of use. However, the algorithm is usually influenced by the number of clusters and how each cluster is initialized. In general, validity indices, which are divided into two categories, internal and external [21], can be used to find the optimal number of clusters. External indices use the prior structure or reference result labels to find the number of clusters [22, 23]. Internal indices use the internal data to assess the goodness of the cluster structure. Silhouette Width (SW), Dunn’s index, the Davies–Bouldin index (DB), the Bayesian information criterion, the Calinski and Harabasz index (CH), and the Gap statistic are popular internal validity indices for the k-means clustering algorithm. Other approaches proposed in the literature, such as MM [24], U-k-means [25], X-means [26], and G-means [27], use validity indices as a model together with a range of cluster numbers.
Clustering with different initial values of k produces different clustering results, especially for large-scale datasets. In many k-means investigations, the best initial values of k are determined using a subsample of the data. The continuous k-means algorithm [28] selects reference points as a random sample from the whole dataset and in each iteration examines only a random sample of data points. The algorithm is faster, but the result is not necessarily a global minimum. In [29], the k-means Mod algorithm is applied to J random subsamples of the dataset to choose k centroids, and k-means is run again on the selected points for each subsample. The final k centers are chosen based on the minimal distortion value. The algorithm performs better for small datasets. Similarly, the algorithm in [30] divides each attribute of the dataset into a fixed number k of clusters and computes the percentiles. The attribute values calculated using the mean and standard deviation serve as the seeds for that attribute. A density-based data condensation method is used to merge the resulting centroids into k clusters. The algorithm reduces the cost of Lloyd’s step; however, handling high-dimensional data is challenging. In [31], a greedy deletion procedure is used to select k centroids from a large set of random points in the dataset. The ball k-means seeding step, which considers a ball of some radius around each center and moves the center to the centroid of the ball, is used to obtain the final centers. The authors also showed that Lloyd’s algorithm performs well and returns an optimal clustering if the data satisfy a natural separation condition. The algorithm provides optimal initial centers that require no or minimal Lloyd iterations, but the overall running time is O(nkd+3kd). In the same context, in [32] the input dataset is divided into m groups and k-means++ is run in each group. The algorithm selects 3log(k) points in each iteration and, at the end, reclusters these 3mlog(k) points into k using s scheme or any of the k-means algorithms. The advantage of this algorithm over k-means++ is that it can be implemented in parallel, as each group of the input is assigned to a different machine, but the running time does not improve once the number of machines surpasses a threshold. In [33], the data are sampled from a t-mixture distribution, a heavy-tailed counterpart of the Gaussian mixture distribution. The t-mixture-distributed data are then analyzed from the perspective of the loss function. The proposed method is stable in terms of the variance across multiple results.
Some approaches also focus on the overall computational complexity of Lloyd’s step in the k-means algorithm. QuicK-means [34] is based on fast transforms: it reduces the complexity of applying linear operators in high dimensions by approximately factorizing the corresponding matrix into a few sparse factors. The approach focuses on fast convergence of clusters and hence optimizes Lloyd’s steps, but it ignores cluster initialization. The Ball k-means algorithm [35] represents each cluster as a ball and divides it into active, stable, and annular areas. The distance calculation is performed only on the annular areas of neighboring clusters. Another notable approach in the literature is I-k-means−+ [36], which iteratively removes and divides pairs of clusters and performs reclustering. The solution is used to minimize the objective function of clustering. PkCIA [37] computes initial cluster centers by using eigenvectors as indexes. The approach makes it possible to identify meaningful clusters.
To accelerate the clustering process, many parallelization techniques have been employed. The parallel k-means clustering algorithm [38] uses the MapReduce framework to handle large-scale data clustering. The map function assigns each point to the closest center and the reduce function updates the centroids. To demonstrate the effectiveness of the algorithm, experiments were performed on scalable datasets. Another MapReduce-based method [39] reduces the number of MapReduce jobs, as it uses a single MapReduce job to select the k initial centers. The k-means++ initialization runs in the mapper phase and the weighted k-means++ runs in the reducer phase. It overcomes the need to run multiple MapReduce jobs for initialization. Parallel batch k-means [40] divides the dataset into equal partitions while preserving the characteristics of the data. The k-means algorithm is applied to each partition to reduce the computational complexity for big datasets, but it does not provide an accurate number of clusters. In the same setting, two initialization methods for k-means [41] were proposed. The first method applies a divide-and-conquer strategy to the k-means|| approach using subset sampling. In the second approach, random projection is used along with subsampling to project the high-dimensional space into a lower-dimensional space before initialization is performed. The algorithm is reported to perform better than state-of-the-art methods. In [42], a recent entropy-based initialization method is proposed. The algorithm uses an objective function based on Shannon’s entropy as a similarity measure. It also aims to detect the optimal number k of clusters for faster convergence. In [43], random initialization is combined with the bootstrap technique. First, the algorithm applies k-means to B bootstrap replications of the data and selects k initial centers from each bootstrap dataset. Then clustering is performed on the set of B∗k centers to obtain k new clusters. Instead of selecting the average point, the deepest point is taken as the cluster center. The algorithm aims to perform better than previously proposed initialization algorithms. In [44], an algorithm for pattern-based clustering of categorical datasets uses the MFIM (maximal frequent itemset mining) algorithm to find a list of MFIs for the initial clusters. It then uses a kernel density estimation (KDE) method to estimate the local density of data points for cluster formation. Another technique [45] employs KDE to create a balance between majority and minority clusters by estimating a better approximation of the distribution. In general, KDE-based clustering techniques perform well for data with complex distributions but require high computation. In [46], the density-based clustering algorithm DBSCAN is used as a preprocessing step to find the initial cluster centers before applying the k-means algorithm. In [47], the Kmeans9+ model improves the comparison steps that follow the random choice of centroids by comparing only with the current and eight nearest neighboring cluster partitions. The algorithm improves efficiency by reducing unnecessary comparisons. In [48], the FC-Kmeans algorithm improves clustering performance by preventing some cluster centroids from being updated in any iteration, fixing them according to real-world conditions. In [49], power k-means++, a combination of power k-means and k-means++, is presented to improve clustering performance. The algorithm uses k-means++ to obtain good initial starting points and then finds the final cluster centers using the power k-means algorithm.
In summary, the aforementioned approaches highlight the strengths and limitations of different variants of k-means that aim to enhance the overall performance of clustering. However, no single approach is universally applicable to all situations, and numerous research gaps and challenges remain. One significant challenge is the cluster initialization step, as the selection of initial centroids profoundly impacts the performance of k-means. Different initialization methods can lead to varying clustering outcomes, with random initialization often resulting in suboptimal solutions or convergence to local optima. Inadequate initialization can also lead to clusters of unequal sizes. Recent research has placed considerable emphasis on developing cluster initialization methods tailored for large-scale, high-dimensional datasets. Improved initialization techniques, such as k-means|| [19], have shown good results in terms of convergence and clustering quality. This paper proposes an algorithm that aims to reduce the problem of centroids converging to local minima, providing cost-effective and comparatively scalable improvements in the initialization phase to achieve better clustering results with comparatively good performance, especially for large-scale high-dimensional datasets.
The proposed algorithm: Rkmeans
The algorithm, denoted Rkmeans(k, l, R), is a variant of k-means++ inspired by k-means|| for initializing the centers. Like k-means||, it uses an oversampling factor l, and it adds the proposed optimization factor R. In step (1), the algorithm chooses the constants l and R and the desired number of clusters k. It then picks an initial center (say, uniformly at random) and computes ψ ← φ_X(C), i.e., the sum of the smallest 2-norm (Euclidean) distances from the points of X to the points of C. In other words, for each point in X, the algorithm finds the distance to the closest point in C and then sums these minimal distances, one for each point in X. It then runs log(ψ) iterations, as given in step (3). In each iteration, it selects l∗R center points using a distance-based probability, then runs log(l∗R) times and reclusters the selected points C′ into l points C″ using k-means++, to ensure that the selected points are far away from each other. In each iteration, the algorithm adds the selected points from C″ to C. After the iterations are complete, the algorithm reclusters the selected weighted points into k clusters. The k-means++ algorithm is used for the reclustering in step (8).
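The description above can be read as the following rough sketch. It is not the authors' implementation and it makes several assumptions that the text does not fix: squared Euclidean distances are used for the cost, as in k-means++ [18] and k-means|| [19]; each point is sampled independently with probability proportional to l∗R times its share of the cost; the inner repetition of log(l∗R) runs is omitted; and candidates are weighted by the number of points closest to them. All function and parameter names are hypothetical.

```python
import math
import random


def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest_d2(p, centers):
    """Squared distance from p to its closest center."""
    return min(sq_dist(p, c) for c in centers)


def weighted_kmeanspp(points, weights, k, rng):
    """k-means++ seeding on weighted points (D^2 sampling scaled by weight)."""
    k = min(k, len(points))
    centers = [rng.choices(points, weights=weights, k=1)[0]]
    while len(centers) < k:
        probs = [w * nearest_d2(p, centers) for p, w in zip(points, weights)]
        if sum(probs) == 0:
            break  # every remaining candidate coincides with a chosen center
        centers.append(rng.choices(points, weights=probs, k=1)[0])
    return centers


def rkmeans_init(X, k, l, R, rng, rounds=None):
    C = [rng.choice(X)]                        # one initial center, uniform at random
    psi = sum(nearest_d2(x, C) for x in X)     # initial cost (assumed squared distances)
    if rounds is None:
        rounds = max(1, math.ceil(math.log(psi))) if psi > 1 else 1
    for _ in range(rounds):
        d2 = [nearest_d2(x, C) for x in X]
        phi = sum(d2)
        if phi == 0:
            break
        # Oversample: about l*R candidates per round, far points more likely.
        C1 = [x for x, d in zip(X, d2) if rng.random() < min(1.0, l * R * d / phi)]
        if not C1:
            continue
        # Reduce the candidates to at most l spread-out points with k-means++.
        C2 = weighted_kmeanspp(C1, [1.0] * len(C1), l, rng)
        C += [c for c in C2 if c not in C]
    # Weight each candidate by the number of points closest to it.
    weights = [0] * len(C)
    for x in X:
        j = min(range(len(C)), key=lambda i: sq_dist(x, C[i]))
        weights[j] += 1
    # Final reclustering of the weighted candidates into (at most) k centers.
    return weighted_kmeanspp(C, weights, k, rng)


rng = random.Random(42)
X = [(float(x), float(y))
     for cx, cy in [(0, 0), (50, 0), (0, 50)]
     for x in range(cx, cx + 3) for y in range(cy, cy + 3)]
print(rkmeans_init(X, k=3, l=2, R=3, rng=rng))
```

In this reading, R only controls how many candidates are drawn before k-means++ thins them to l per round; the returned centers would then be passed to Lloyd's iterations as the starting centroids.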
Empirical evaluation of proposed algorithm
In this section, the results of Rkmeans, k-means++, and k-means|| are analyzed on eight different datasets in the same controlled environment.
Experimental setup
The sequential version of the k-means algorithm is evaluated on a single machine with a quad-core 2.5 GHz processor and 16 GB of memory. The parallel version of the algorithm is run on a Hadoop cluster of 40 nodes, created on Microsoft Azure, each with 16 GB of memory. The datasets used in the experiments are discussed in the next section.
Datasets
The datasets used in the experiments are the same as those used for the algorithms in [18, 19, 29]. Because a number of the datasets analyzed in previous work were not particularly large, and the main objective of the proposed algorithm is to cluster large datasets that are difficult to fit in main memory, three large datasets (ActivityRecognition, 3DRoadNetwork, and AlltheNews) are used for the experiments along with three benchmark datasets. These datasets are taken from the UCI Machine Learning Repository. Two synthetic datasets are also used. A summary of all datasets is presented in Table 1.
Optimal number of k
As discussed before, in k-means clustering the number of clusters k is fixed prior to running the algorithm. There are different ways to determine the right value of k. To demonstrate the performance and quality of the proposed algorithm in a transparent manner, we select an initial value of k that fits the data. To determine which number of clusters k is best for each dataset, two well-known techniques are applied to a random subset (sample) of the data: the Silhouette Score and the Elbow Method using SSE. These are standard methods for choosing the number of clusters.
An Elbow analysis is used to visually determine the number of clusters in each dataset. Fig. 1 plots the computed SSE values against the number of clusters k (x-axis) for each dataset. A suitable k is one beyond which the SSE no longer decreases substantially. The range of k is set from 2 to 14 according to empirical rules, and for each value of k the WCSS (Within-Cluster Sum of Squares), i.e., the sum of the squared distances between each point and the centroid of its cluster, is calculated. When the WCSS is plotted against the number of clusters, the resulting graph exhibits an elbow shape. The point where the slope of the graph changes abruptly indicates the optimal number of clusters. For example, for the iris dataset, k = 3 is the optimal number of clusters, while for the News dataset, k = 5 is deemed optimal.
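The paper reads the elbow visually from the plot. One simple way to approximate that reading in code, offered here only as an assumption, is to pick the k at which the drop in WCSS slows down the most (the largest second difference). The WCSS values below are invented for illustration and are not taken from Fig. 1.

```python
def elbow_k(ks, wcss):
    """Return the k with the sharpest bend in the WCSS curve.

    Assumes ks is sorted and evenly spaced; the bend is measured as the
    difference between the decrease before k and the decrease after k.
    """
    best_k, best_bend = None, float("-inf")
    for i in range(1, len(ks) - 1):
        bend = (wcss[i - 1] - wcss[i]) - (wcss[i] - wcss[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k


ks = [2, 3, 4, 5, 6]
wcss = [150.0, 80.0, 60.0, 48.0, 40.0]  # hypothetical values
print(elbow_k(ks, wcss))  # 3
```

Here the decrease falls from 70 (k = 2 to 3) to 20 (k = 3 to 4), so the bend is largest at k = 3.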
Table 2 presents the Silhouette analysis used to cross-validate the result of the Elbow method. This technique measures how well each data point fits within its assigned cluster and aids in determining the number of clusters. Table 2 shows the Silhouette score for each dataset. A high silhouette coefficient suggests that a point is well matched to its own cluster and poorly matched to other clusters; for instance, the ‘iris’ dataset yields values of 0.84 (k = 2) and 0.75 (k = 3). In our analysis, we examined the results of both techniques and selected the value of k that best satisfies both approaches.
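For reference, the silhouette coefficient of a point is commonly defined as s = (b − a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below computes the mean of s over all points; it assumes Euclidean distance, at least two clusters, and a score of 0 for singleton clusters. The function name and sample data are illustrative.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient s = (b - a) / max(a, b) over all points."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # convention assumed for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items() if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
print(silhouette_score(points, [0, 0, 1, 1]))  # close to 0.90: compact, well-separated clusters
print(silhouette_score(points, [0, 1, 0, 1]))  # negative: points sit closer to the other cluster
```

Comparing the score across candidate values of k, as done in Table 2, then amounts to clustering once per k and keeping the k with the highest mean coefficient.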
Algorithm evaluation
For a comparative evaluation, all eight datasets were used to compute and compare the objective functions, namely the inter-cluster and intra-cluster sums of squared errors (SSEs), for three algorithms: k-means++, k-means||, and the proposed algorithm Rkmeans, using various threshold values for the l and R factors (“The proposed algorithm: Rkmeans” section). The evaluation was conducted in two phases, the initialization phase and the final cluster formation phase, as better initialization leads to improved cluster formation. In the initialization phase of Rkmeans, multiple data points, say R, are drawn from the dataset in each iteration C_{i}, and l estimates of the true cluster locations are produced using k-means++. To find the best initial centroids, these l points (C solutions, each having l clusters) are weighted and reduced to k centroids in an “optimal” fashion.
Fig. 2 and Table 3 show the evaluation of the initialization phase using inter-cluster SSEs (y-axis) against the threshold value of l (x-axis), indicating dissimilarities between clusters. Among the smaller datasets, k-means|| performs well on the ‘iris’ dataset with SSE = 61.608 when l = 4. However, Rkmeans performs well with SSE = 84.63 when l = 6 and R = 10 on the same dataset. On the ‘sonar’ dataset, Rkmeans outperforms the other algorithms with SSE = 50.9349 at l = 5 and R = 15. The performance improvement is even more significant, with higher SSEs, on the larger document datasets (3D SN, News, and Activity), where Rkmeans consistently outperforms the k-means|| and k-means++ algorithms.
Figure 3 and Table 4 illustrate the evaluation results of the final cluster formation phase using the intra-cluster SSEs (y-axis) against the threshold value of l (x-axis), indicating similarities within clusters. On all datasets, Rkmeans outperforms the other algorithms by achieving the smallest SSE values when larger values of l and R are selected. This suggests that the Rkmeans algorithm consistently produces better-quality clusters than the other approaches.
Result validation
To statistically evaluate the results of the proposed algorithm against k-means++ and k-means||, a one-way ANOVA test is employed. The null hypothesis (H_{0}) in Eq. (4) assumes that there is no improvement in the clustering results and that all approaches perform equally. The alternative hypothesis (H_{A}) states that there is a statistically significant difference in performance between the proposed approach and the other two methods.
The one-way ANOVA test compares the means of multiple groups and determines whether there is a significant difference among them. In this case, the group means being compared are the results obtained by k-means++ (\(\mu_{1}\)), k-means|| (\(\mu_{2}\)), and the proposed approach (\(\mu_{3}\)).
Figure 4 shows the result of the ANOVA test. To perform the one-way ANOVA test, the performance metric (intra-cluster SSE) is collected for each algorithm across the multiple datasets presented in Table 3. The null hypothesis is then tested by analyzing the variance between the groups (algorithms) and the variance within each group. The test shows a statistically significant difference in favor of the proposed approach, as the p-value (0.019) is smaller than alpha (0.05), so H_{0} is rejected.
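The between-group and within-group variance comparison behind the one-way ANOVA can be sketched as follows. The three samples are invented for illustration and are not the SSE values from the paper's tables; the function name is also an assumption.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of numeric samples."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical scores for three algorithms (one list per algorithm).
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat, df_b, df_w = one_way_anova_f(groups)
print(f_stat, df_b, df_w)  # F is approximately 13.0 with (2, 6) degrees of freedom
```

The p-value is the upper-tail probability of this F statistic under an F distribution with (df_between, df_within) degrees of freedom; library routines such as `scipy.stats.f_oneway` return both the statistic and the p-value, and H_{0} is rejected when the p-value falls below the chosen alpha.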
Conclusion
Clustering large-scale data is a challenging task. This work addressed the problem of initializing the k-means clustering algorithm for large datasets, especially document data. The traditional k-means algorithm depends largely on the choice of initial cluster centers. The k-means++ algorithm is the most popular technique for dealing with initialization in k-means; however, due to its sequential nature, it is hard to apply in big data scenarios. A promising approach, known as k-means||, has emerged to address initialization in k-means for big datasets. This paper proposes an algorithm called Rkmeans, a variant of k-means++, to offer a comparatively better solution to the problem of cluster initialization for big datasets. Using inter- and intra-cluster SSEs as evaluation metrics, experimental results show that the proposed approach performs comparatively better. At each iteration of the cluster initialization process, the SSE of the proposed approach is greater than that of k-means++ and k-means||, which shows that the centers selected in each iteration are far away from each other, thus reducing the cost of convergence in Lloyd’s algorithm. The quality of the final clusters was also assessed using intra-cluster SSE, a very popular metric for cluster evaluation. The results show that the SSE of the proposed approach is much lower than that of the others, suggesting better clusters. Finally, to statistically validate the performance, a one-way ANOVA was performed. The test shows that the proposed approach performs significantly better, as the p-value (0.019) is smaller than alpha (0.05).
Availability of data and materials
Not applicable.
References
MacQueen J. Some methods for classification and analysis of multivariate observations. In: Fifth Berkeley symposium on mathematics. Statistics and probability. Berkeley: University of California Press; 1967. p. 281–97.
Ward JH Jr. Hierarchical grouping to optimize an objective function. J Am Stat Assoc. 1963;58(301):236–44.
Ester M, Kriegel HP, Sander J, Xu X. A densitybased algorithm for discovering clusters in large spatial databases with noise. In: kdd, vol. 96, No. 34; 1996. p. 226–31.
Dempster AP, Laird NM, Rubin DB. Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc: Ser B (Methodol). 1977;39(1):1–22.
Ng A, Jordan M, Weiss Y. On spectral clustering: analysis and an algorithm. In: Advances in neural information processing systems, 14; 2001.
Aloise D, Deshpande A, Hansen P, Popat P. Nphardness of Euclidean sumofsquares clustering. Mach Learn. 2009;75(2):245–8.
Jain AK. Data clustering: 50 years beyond kmeans. Pattern Recogn Lett. 2010;31(8):651–66.
Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28(2):129–37.
Kwedlo W, Czochanski PJ. A hybrid MPI/OpenMP parallelization of kmeans algorithms accelerated using the triangle inequality. IEEE Access. 2019;7:42280–97.
He L, Zhang H. Kernel kmeans sampling for Nyström approximation. IEEE Trans Image Process. 2018;27(5):2108–20.
Ahmed M. Data summarization: a survey. Knowl Inf Syst. 2019;58(2):249–73.
Alhawarat M, Hegazi M. Revisiting kmeans and topic modeling, a comparison study to cluster Arabic documents. IEEE Access. 2018;6:42740–9.
Yang X, Li Y, Sun Y, Long T, Sarkar TK. Fast and robust RBF neural network based on global kmeans clustering with adaptive selection radius for sound source angle estimation. IEEE Trans Antennas Propag. 2018;66(6):3097–107.
McCallum A, Nigam K, Ungar LH. Efficient clustering of highdimensional data sets with application to reference matching. In: Proceedings of the sixth ACM SIGKDD international conference on knowledge discovery and data mining. 2000. p. 169–78.
Oussous A, Benjelloun FZ, Lahcen AA, Belfkih S. Big data technologies: a survey. J King Saud Univ Comput Inf Sci. 2018;30(4):431–48.
Sreedhar C, Kasiviswanath N, Reddy PC. Clustering large datasets using kmeans modified inter and intra clustering (KMI2C) in Hadoop. J Big Data. 2017;4(1):1–19.
Fränti P, Sieranoja S. How much can kmeans be improved by using better initialization and repeats? Pattern Recognit. 2019;93:95–112.
Arthur D, Vassilvitskii S. kmeans++: the advantages of careful seeding. Technical report, Stanford; 2006.
Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S. Scalable kmeans++. arXiv preprint. 2012. arXiv:1203.6402.
Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, McLachlan GJ, Ng A, Liu B, Philip SY, et al. Top 10 algorithms in data mining. Knowl Inf Syst. 2008;14(1):1–37.
Rendón E, Abundez I, Arizmendi A, Quiroz EM. Internal versus external cluster validation indexes. Int J Comput Commun. 2011;5(1):27–34.
Lei Y, Bezdek JC, Romano S, Vinh NX, Chan J, Bailey J. Ground truth bias in external cluster validity indices. Pattern Recogn. 2017;65:58–70.
Wu J, Chen J, Xiong H, Xie M. External validation measures for kmeans clustering: a data distribution perspective. Expert Syst Appl. 2009;36(3):6050–61.
Jahan M, Hasan M. A robust fuzzy approach for gene expression data clustering. Soft Comput. 2021;25(23):14583–96.
Sinaga KP, Yang MS. Unsupervised kmeans clustering algorithm. IEEE Access. 2020;8:80716–27.
Pelleg D, Moore AW, et al. Xmeans: extending kmeans with efficient estimation of the number of clusters. In: Icml. 2000. p. 727–34.
Hamerly G, Elkan C. Learning the k in kmeans. In: Advances in neural information processing systems; 2003. p. 16.
Faber V. Clustering and the continuous kmeans algorithm. Los Alamos Sci. 1994;22(138144.21):67.
Bradley PS, Fayyad UM. Refining initial points for kmeans clustering. In: ICML. 1998. p. 91–9.
Khan SS, Ahmad A. Cluster center initialization algorithm for kmeans clustering. Pattern Recogn Lett. 2004;25(11):1293–302.
Ostrovsky R, Rabani Y, Schulman LJ, Swamy C. The effectiveness of Lloydtype methods for the kmeans problem. J ACM. 2013;59(6):1–22.
Ailon N, Jaiswal R, Monteleoni C. Streaming kmeans approximation. In: NIPS. 2009. p. 10–8.
Li Y, Zhang Y, Tang Q, Huang W, Jiang Y, Xia ST. tkmeans: a robust and stable kmeans variant. In: ICASSP 2021–2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). 2021. p. 3120–4.
Giffon L, Emiya V, Kadri H, Ralaivola L. QuicKmeans: accelerating inference for Kmeans by learning fast transforms. Mach Learn. 2021;110:881–905.
Xia S, Peng D, Meng D, Zhang C, Wang G, Giem E, Wei W, Chen Z. Ball kmeans: fast adaptive clustering with no bounds. IEEE Trans Pattern Anal Mach Intell. 2020;44(1):87–99.
Ismkhan H. Ikmeans−+: an iterative clustering algorithm based on an enhanced version of the kmeans. Pattern Recogn. 2018;79:402–13.
Manochandar S, Punniyamoorthy M, Jeyachitra RK. Development of new seed with modified validity measures for kmeans clustering. Comput Ind Eng. 2020;141: 106290.
Zhao W, Ma H, He Q. Parallel kmeans clustering based on MapReduce. In: IEEE international conference on cloud computing. 2009. p. 674–9.
Xu Y, Qu W, Li Z, Min G, Li K, Liu Z. Efficient kmeans++ approximation with MapReduce. IEEE Trans Parallel Distrib Syst. 2014;25(12):3135–44.
Alguliyev RM, Aliguliyev RM, Sukhostat LV. Parallel batch kmeans for Big data clustering. Comput Ind Eng. 2021;152: 107023.
Hämäläinen J, Kärkkäinen T, Rossi T. Scalable initialization methods for largescale clustering. arXiv preprint. 2020. arXiv:2007.11937.
Chowdhury K, Chaudhuri D, Pal AK. An entropybased initialization method of kmeans clustering on the optimal number of clusters. Neural Comput Appl. 2021;33(12):6965–82.
Torrente A, Romo J. Initializing kmeans clustering by bootstrap and data depth. J Classif. 2020;38:1–25.
DuyTai D, VanNam H. kPbC: an improved cluster center initialization for categorical data clustering. Appl Intell. 2020;50(8):2610–32.
Bortoloti FD, de Oliveira E, Ciarelli PM. Supervised kernel density estimation Kmeans. Expert Syst Appl. 2021;168: 114350.
Fahim A. K and starting means for kmeans algorithm. J Comput Sci. 2021;55: 101445.
Abdulnassar AA, Nair LR. Performance analysis of Kmeans with modified initial centroid selection algorithms and developed Kmeans9+ model. Meas Sens. 2023;25: 100666.
Ay M, Özbakır L, Kulluk S, Gülmez B, Öztürk G, Özer S. FCKmeans: fixedcentered Kmeans algorithm. Expert Syst Appl. 2023;211: 118656.
Li H, Wang J. Collaborative annealing power kmeans++ clustering. KnowlBased Syst. 2022;255: 109593.
Acknowledgements
Not applicable.
Funding
The authors received no funding for this research study.
Author information
Contributions
Both authors contributed equally to this work. MAR formulated and designed the solution, reviewed the manuscript, and supervised the whole work. MG contributed to the implementation of the research, processed the experimental data, drafted the manuscript and designed the figures.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Yes.
Competing interests
The authors declare that they have no competing interests.
Appendix
This appendix gives an overview of the basic composition of the existing clustering algorithms.
kmeans
The kmeans clustering method, depicted in Algorithm 2, is based on the expectation maximization (EM) algorithm. It divides a set of n objects into k clusters and measures similarity by the distance between each object and the mean of its cluster, i.e. the centroid, so that every object belongs to the cluster with the nearest mean. The value of k must be chosen before clustering begins, whereas the centroids are computed from the data. The aim is to produce exactly k clusters of the greatest possible distinction by minimizing the objective function O and maximizing the inter-cluster distance L.
Definition 1
The objective function O is defined in terms of k, the number of centroids, and m, the number of objects assigned to a particular cluster centroid.
Here O is the sum of squared distances between the points within each cluster and their centroid. We use the Euclidean metric to measure distance.
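Written out under these definitions, the objective and the metric take the standard form sketched below. The symbols \(x_i^{(j)}\) (the i-th object assigned to cluster j), \(c_j\) (the centroid of cluster j), and the coordinate index t are notation assumed here for illustration, and m may differ from one cluster to another.

```latex
O = \sum_{j=1}^{k} \sum_{i=1}^{m} \left\lVert x_i^{(j)} - c_j \right\rVert^{2},
\qquad
\lVert x - y \rVert = \sqrt{\sum_{t=1}^{d} \left( x_t - y_t \right)^{2}}
```

Here d is the number of dimensions of each data point.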
Definition 2
Let X = {x_{1}, x_{2}, x_{3}, …, x_{n}} be the set of data points and assume that the n objects are clustered into k clusters. The inter-cluster distance L is defined as the sum of the distances between the mean of each cluster and the mean of the entire sample.
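One possible reading of this definition can be sketched in LaTeX as follows. It assumes the plain Euclidean norm, since the text does not state whether the distance is squared, and it writes \(|c_i|\) for the number of points in cluster \(c_i\), which is notation introduced here.

```latex
L = \sum_{i=1}^{k} \left\lVert m_i - m \right\rVert,
\qquad
m_i = \frac{1}{|c_i|} \sum_{x \in c_i} x,
\qquad
m = \frac{1}{n} \sum_{j=1}^{n} x_j
```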
Here L is calculated using m_{i}, the mean of cluster c_{i}, and m, the mean of all n data points. The kmeans algorithm proceeds as follows.
Initially, the algorithm chooses k centroids at random from the data space and assigns each object to the centroid at the minimum distance from it. Once all objects have been assigned, the position of each of the k centroids is recomputed as the mean of the points assigned to it. These two steps are repeated until convergence, that is, until the centroids no longer move. The kmeans algorithm is based on the well-known Lloyd's algorithm. It is scalable, computationally fast, and produces tight clusters for small-scale datasets, but two major issues arise when it is applied to big data: it is sensitive to the initial choice of the k centroids, and for large datasets the number of iterations can be very large, making it computationally expensive. Moreover, each iteration requires computing the distance between every data point and every centroid and comparing these distances.
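The assignment and update steps described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the listing of Algorithm 2: the function name `lloyd_kmeans`, the `max_iter` and `seed` parameters, and the handling of empty clusters are assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, max_iter=100, seed=0):
    """Plain Lloyd iterations starting from k random data points (illustrative)."""
    rng = np.random.default_rng(seed)
    # Choose k distinct data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Squared Euclidean distance from every point to every centroid, shape (n, k).
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points;
        # an empty cluster keeps its previous centroid (an assumption of this sketch).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids no longer move
        centroids = new_centroids
    # Final assignment and objective value O (sum of squared errors).
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sse = d2[np.arange(len(X)), labels].sum()
    return centroids, labels, sse
```

Each pass over the loop costs one distance computation per point and centroid, which is the per-iteration cost that becomes expensive on large datasets.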
kmeans++
Various methods have been devised to address the problems of kmeans. One of them, kmeans++, depicted in Algorithm 3, improves the seeding of kmeans with D^{2} weighting. The algorithm focuses on the initialization of the k cluster centers in order to improve the quality of the clusters and to reduce the number of iterations. It chooses the centers one by one in a controlled fashion: the first center is selected at random from the dataset, and each subsequent center is selected with probability proportional to its contribution to the overall SSE given the previously selected centers. In this way the algorithm favors centers that are far away from the previously selected points, which tends to yield a good clustering.
The algorithm chooses a center x_{i} uniformly at random from the dataset; the other k − 1 centers are then chosen one by one from the dataset with probability \(D(x)^{2} / \sum_{x' \in X} D(x')^{2}\), where D(x) denotes the distance from x to the nearest center already chosen.
The process is sequential and repeats until k points have been selected as centers, giving a complexity of O(nkd) for n points in d dimensions, the same as that of a single Lloyd iteration. From a scalability point of view, the central downside of kmeans++ is its inherently sequential nature: the choice of the next center depends on the current set of centers.
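The D^{2} seeding procedure can be sketched as follows. This is an illustrative NumPy version, not the listing of Algorithm 3; the function name `kmeanspp_seeds` and the `seed` parameter are assumptions, and the sketch presumes the data contain at least k distinct points.

```python
import numpy as np

def kmeanspp_seeds(X, k, seed=0):
    """Choose k initial centers by D^2 weighting (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # First center: uniformly at random from the data.
    centers = [X[rng.integers(n)]]
    # D(x)^2: squared distance from each point to its nearest chosen center.
    d2 = ((X - centers[0]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        # Sample the next center with probability D(x)^2 / sum of D(x')^2.
        probs = d2 / d2.sum()
        idx = rng.choice(n, p=probs)
        centers.append(X[idx])
        # Update D(x)^2 with the newly added center.
        d2 = np.minimum(d2, ((X - X[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```

The loop makes the sequential dependency visible: every draw needs the distances updated by the previous draw, so the k passes over the data cannot be run in parallel.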
kmeans||
Another improved form of kmeans is kmeans||, designed to overcome the scalability drawback of kmeans++. The kmeans|| algorithm, depicted in Algorithm 4, uses an oversampling factor l = Ω(k) for choosing the k centers. Like kmeans++, the algorithm selects the first center at random from the dataset; each other point x is then sampled as a center with probability \(l \cdot D(x)^{2} / \sum_{x' \in X} D(x')^{2}\).
The algorithm selects the first point x_{i} at random and then computes the initial cost ψ of this selection. It then iterates log(ψ) times, and in each iteration it samples l centroids in expectation from the dataset X. The expected number of selected centroids is therefore l log(ψ) + 1, which is more than k. To reduce the selected centroids, the algorithm assigns a weight to each of them and then reclusters the weighted points into k clusters. kmeans|| runs in fewer iterations and has a better running time, and its clustering cost is expected to be much better than that of random initialization. However, it is not guaranteed that the selected centroids are far away from each other, which increases the cost of reclustering.
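The oversampling and reclustering steps can be sketched as below. This is a single-machine illustration, not the listing of Algorithm 4. The function names, the `seed` parameter, the use of the natural logarithm for the number of rounds, weighting each candidate by the number of points nearest to it, and the use of weighted D^{2} seeding for the reclustering step are all assumptions made for the example.

```python
import numpy as np

def sq_dist_to_nearest(X, C):
    """Squared distance from every row of X to its nearest row of C."""
    return ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)

def oversample_candidates(X, l, seed=0):
    """Collect roughly l * log(psi) + 1 weighted candidate centers."""
    rng = np.random.default_rng(seed)
    n = len(X)
    C = X[rng.integers(n)][None, :]            # first center, chosen at random
    d2 = sq_dist_to_nearest(X, C)
    psi = d2.sum()                             # initial cost
    rounds = max(1, int(np.ceil(np.log(psi)))) if psi > 0 else 0
    for _ in range(rounds):
        cost = d2.sum()
        if cost == 0:
            break
        # Keep each point independently with probability min(1, l * D(x)^2 / cost).
        keep = rng.random(n) < np.minimum(1.0, l * d2 / cost)
        if keep.any():
            C = np.vstack([C, X[keep]])
            d2 = np.minimum(d2, sq_dist_to_nearest(X, X[keep]))
    # Weight of a candidate: number of data points whose nearest candidate it is.
    nearest = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
    weights = np.bincount(nearest, minlength=len(C))
    return C, weights

def recluster(C, weights, k, seed=0):
    """Reduce the weighted candidates to k centers by weighted D^2 seeding."""
    rng = np.random.default_rng(seed)
    w = weights.astype(float)
    first = rng.choice(len(C), p=w / w.sum())
    centers = [C[first]]
    d2 = ((C - C[first]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        p = w * d2                             # assumes at least k distinct candidates
        idx = rng.choice(len(C), p=p / p.sum())
        centers.append(C[idx])
        d2 = np.minimum(d2, ((C - C[idx]) ** 2).sum(axis=1))
    return np.array(centers)
```

Within a round, the keep-or-discard decision for each point depends only on the centers from earlier rounds, so the points can be processed independently; this is what removes the sequential bottleneck of kmeans++.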
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Gul, M., Rehman, M. Big data: an optimized approach for cluster initialization. J Big Data 10, 120 (2023). https://doi.org/10.1186/s40537023007981
DOI: https://doi.org/10.1186/s40537023007981
Keywords
 kmeans
 Cluster initialization
 Large scale data