A clustering algorithm for multivariate data streams with correlated components
 Giacomo Aletti^{1} and
 Alessandra Micheletti^{1}
Received: 7 August 2017
Accepted: 6 December 2017
Published: 20 December 2017
Abstract
Common clustering algorithms require multiple scans of all the data to achieve convergence, which is prohibitive when large databases, with data arriving in streams, must be processed. Algorithms extending the popular K-means method to the analysis of streaming data have been present in the literature since 1998 (Bradley et al. in Scaling clustering algorithms to large databases. In: KDD. p. 9–15, 1998; O’Callaghan et al. in Streaming-data algorithms for high-quality clustering. In: Proceedings of IEEE international conference on data engineering. p. 685, 2001). They are based on the memorization and recursive update of a small number of summary statistics, but they either do not take into account the specific variability of the clusters, or assume that the random vectors which are processed and grouped have uncorrelated components. Unfortunately, this is not the case in many practical situations. We propose here a new algorithm to process data streams whose data have correlated components and come from clusters with different covariance matrices. These covariance matrices are estimated via an optimal double shrinkage method, which provides positive definite estimates even in the presence of only a few data points, or of data having components with small variance. This is needed to invert the matrices and compute the Mahalanobis distances that we use to assign the data to the clusters. We also estimate the total number of clusters from the data.
Keywords
Introduction
Clustering is the (unsupervised) division of a collection of data into groups, or clusters, such that points in the same cluster are similar, while points in different clusters are different. When a large volume of (not very high dimensional) data arrives continuously, it is impossible and sometimes unnecessary to store all the data in memory, in particular if we are interested in providing real-time statistical analyses. In such cases we speak of data streams, and specific algorithms are needed to analyze the data progressively, store in memory only a small number of summary statistics, and then discard the already processed data and free the memory [7]. Data streams are, for example, collected and analyzed by telecommunication companies, banks, financial analysts, online marketing companies, private or public groups managing networks of sensors to monitor climate or environment, technological companies working in IoT, etc. In this framework there are many situations in which clustering plays a fundamental role, such as customer segmentation on big e-commerce web sites for personalized marketing solutions, image analysis of video frames for object recognition, recognition of human movements from data provided by sensors placed on the body or on a smartwatch, monitoring of hacker attacks on a telecommunication system, etc.
Related literature
The methods for cluster analysis present in the literature can be roughly classified into two main families. The first consists of probability-based methods (see e.g. [1]), which assume that the clusters come from a mixture of distributions from a given family. In this case the clustering problem reduces to parameter estimation. These algorithms are well suited to detecting non-spherical or nested clusters, but they rely on specific assumptions on the data distribution, the number K of clusters is fixed at the very beginning and, more importantly, they require multiple scans of the dataset to estimate the parameters of the model. Thus they cannot be applied to massive datasets or data streams.
The second family of clustering algorithms is composed of distance-based approaches. Given a dataset of size n, grouped into K clusters, the goal of such methods is usually to find the K cluster centers which minimize the mean squared distance between the data and their closest centers. These methods take different names depending on the type of distance considered. If the Euclidean distance is used, the corresponding method is the classical and very popular K-means method (see e.g. [11]), which is probably the most widely used clustering algorithm because of its simplicity. However, the exact solution of the minimization problem connected with K-means is NP-hard, and only local search approximations are implemented. The method is sensitive to the presence of outliers and to the initial guess for the centers, but improvements in both the speed and the accuracy of the algorithm have been implemented in K-means++ [2], which exploits a randomized seeding technique. Unfortunately, both the classical K-means and the K-means++ algorithms require multiple scans of the dataset, or a random selection from the entire dataset, in order to solve the minimization problem. Since data streams cannot be scanned several times and we cannot (randomly) access the entire dataset at once, these methods are also not suitable for clustering data streams.
When the elements to be clustered are not points in \(\mathbb {R}^d\) but more complex objects, like functions or polygons, other clustering algorithms are used, like PAM, CLARA, CLARANS [12, 15], which are based on non-Euclidean distances defined on suitable spaces. These methods look for medoids instead of means; the medoids are the “most central elements” of each cluster and are selected from the points in the dataset. These algorithms cannot be efficiently applied to data streams either, since they require either multiple scans of the sample, or the extraction of a subsample to identify the centroids or medoids, after which all data are scanned according to this identification and the medoids are no longer updated with the information coming from the whole dataset. Such popular methods are actually suited to data which are very high dimensional (e.g. functions) or to geometrical or spatial random objects, but not to datasets with a high number of (rather low dimensional) data.
In many situations the quality of the clustering is improved if a local metric is used. A local metric is a distance which takes into account the shape of the “cloud” of data points in each cluster to assign the new points (see Fig. 1).
The BFR algorithm [3] divides the data into three sets:
 (a) The retained set (RS): the set of data points which are not recognized to belong to any cluster, and need to be retained in the buffer;
 (b) The discard set (DS): the set of data points which can be discarded after updating the summary statistics;
 (c) The compression set (CS): the set of summary statistics which are representative of each cluster.
The main weakness of the BFR algorithm lies in the assumption that the covariance matrix of each cluster is diagonal, which means that the components of the analyzed multivariate data should be uncorrelated. Under this assumption, at each step of the algorithm only the means and variances of each component of the clusters must be retained, which reduces the computational costs. Further, in this setting the estimated covariance matrices are invertible even for clusters composed of just two p-dimensional Gaussian data points. However, this assumption geometrically implies that the level surfaces (ellipsoids) of the Gaussians including the data points in each cluster have their main axes parallel to the reference system.
Aims and overview of the paper
We propose here a method to cluster data streams, using a local metric which is estimated in real time from the data. This metric is based on the Mahalanobis distance of the data points from each cluster center \({\mathbf {c}}_i\), computed using an estimator of the covariance matrix of the corresponding i-th cluster. In the following we will always represent vectors as column vectors and we will assume that our data are vectors in \(\mathbb {R}^p\).
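To make the role of the local metric concrete, the following minimal sketch compares the squared Euclidean and the squared Mahalanobis distances of one point from two cluster centers. The two clusters, their covariance matrices and the helper name `squared_mahalanobis` are illustrative assumptions, not quantities taken from the paper; the covariance matrices are treated as known here, whereas the method estimates them from the stream.

```python
import numpy as np

def squared_mahalanobis(x, center, cov):
    """Squared Mahalanobis distance (x - c)^T cov^{-1} (x - c)."""
    diff = np.asarray(x, dtype=float) - np.asarray(center, dtype=float)
    # Solve cov @ z = diff instead of forming the inverse explicitly.
    z = np.linalg.solve(np.asarray(cov, dtype=float), diff)
    return float(diff @ z)

# Two hypothetical clusters in R^2: an elongated, rotated one (A)
# and a spherical one (B).
center_a = np.array([0.0, 0.0])
cov_a = np.array([[4.0, 3.6],
                  [3.6, 4.0]])   # strongly correlated components
center_b = np.array([3.0, 0.0])
cov_b = np.eye(2)

x = np.array([3.0, 3.0])         # lies along the long axis of cluster A

d_euclid_a = float(np.sum((x - center_a) ** 2))
d_euclid_b = float(np.sum((x - center_b) ** 2))
d_mahal_a = squared_mahalanobis(x, center_a, cov_a)
d_mahal_b = squared_mahalanobis(x, center_b, cov_b)

print("Euclidean:  ", d_euclid_a, d_euclid_b)
print("Mahalanobis:", d_mahal_a, d_mahal_b)
```

In this toy configuration the Euclidean distance places the point closer to cluster B, while the Mahalanobis distance places it closer to the elongated cluster A, because the point lies along the direction in which A has the largest variance.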
We divide the data into the same three sets defined in the BFR algorithm, but we do not fix a priori the number K of clusters; instead, we evaluate and update this number using a density condition. Thus, in our procedure new clusters will be formed from time to time, composed of only a few data points, which are not sufficient to obtain a positive definite estimate of the corresponding covariance matrix with the classical sample covariance estimator. We therefore use an optimal double shrinkage estimator of the covariance matrix, which always provides positive definite matrices that are then inverted to compute the Mahalanobis distance.
In our setting we slightly relax the assumption of Gaussianity stated in the BFR algorithm, assuming that the data come from a mixture of “bell shaped” distributions, possibly having a bigger multivariate kurtosis (i.e. fatter tails) than a Gaussian.
Our algorithm is thus an improvement of the BFR algorithm, relaxing some of its assumptions. Since our method must also retain the covariance terms of the clusters, the computational costs increase with respect to BFR, but this increase can be easily controlled and is affordable if the processed data are not extremely high dimensional. Our algorithm is therefore targeted at data streams composed of data points of “medium” dimension, i.e. a dimension too large for the visualization techniques that usually work better at identifying clusters in 2D or 3D problems, but much smaller than the number of available data.
The paper is structured as follows. In "The covariance matrices of the clusters" section we address the problem of estimating the covariance matrix of each cluster. We modify a Steinian linear shrinkage estimator in order to obtain a positive definite estimator of the covariance matrix, which can also be applied to non-Gaussian cases, and which can be incrementally updated during the data processing. In "Summary statistics and primary data compression" section we introduce the summary statistics that will be retained in memory for each cluster, and we show that they can easily be updated when new data streams are processed. We then describe how the data points are assigned to the three sets RS, CS, DS. In "Secondary data compression" section we describe the secondary compression, that is the way in which the points in RS and CS are merged into pre-existing clusters or are put together to form new clusters. In "Results on simulated and real data and discussion" section we apply our method first to synthetic data, and we compare its performance heuristically with the case in which the data points are assumed to have uncorrelated components, as in the BFR algorithm. We then apply our method to cluster the real dataset KDDCUP’99 (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html), a network intrusion detection dataset that was also used to test the STREAM algorithm. We apply our algorithm to all the variables in the dataset which are declared continuous. Some of these variables actually have a very small variance; nevertheless, our optimal double shrinkage estimator of the covariance matrices of the clusters guarantees positive definite estimates in this situation too, thus stabilizing the local Mahalanobis distances that we use in our procedure. The results are coherent with the structure of the dataset, whose data should be divided into five clusters, as we obtain.
In this paper we do not study the asymptotic properties of our algorithm; we limit ourselves to showing heuristically that it provides better results than other methods for clustering data streams present in the literature, with only a small increase in the computational costs.
The covariance matrices of the clusters
Our algorithm is based on the Mahalanobis distance, so it is crucial to estimate the covariance matrices of the clusters in an optimal way. Let us first observe that a newly formed cluster contains too few data points to yield a positive definite estimate of the covariance matrix via the sample covariance matrix, at least as long as \(N\le p,\) where N is the number of data points in the cluster and p is the dimension of the data points.
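The singularity issue can be checked numerically. The sketch below draws N = 3 points in dimension p = 5, verifies that the sample covariance matrix has rank at most N − 1, and then applies a generic linear shrinkage towards the diagonal of S and towards a scaled identity. The functional form of `shrink` and the weights 0.2 and 0.3 are assumptions chosen only for illustration; they are not the optimal double shrinkage weights derived in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, N = 5, 3                          # fewer points than dimensions
X = rng.normal(size=(N, p))          # rows are data points

S = np.cov(X, rowvar=False)          # unbiased sample covariance, p x p
rank_S = np.linalg.matrix_rank(S)    # at most N - 1, so S is singular here

def shrink(S, lam_identity, lam_diagonal):
    """Illustrative convex combination of S, its diagonal and a scaled identity.

    Assumed form, with hand-picked weights (not the optimal ones).
    """
    p = S.shape[0]
    target_identity = (np.trace(S) / p) * np.eye(p)
    target_diagonal = np.diag(np.diag(S))
    return ((1.0 - lam_identity - lam_diagonal) * S
            + lam_identity * target_identity
            + lam_diagonal * target_diagonal)

S_shrunk = shrink(S, lam_identity=0.2, lam_diagonal=0.3)
min_eig = float(np.linalg.eigvalsh(S_shrunk).min())

print("rank of S:", rank_S, "of", p)
print("smallest eigenvalue after shrinkage:", min_eig)
```

Because S and its diagonal are positive semidefinite and the identity target is positive definite whenever tr(S) > 0, any such combination with a positive identity weight is positive definite and can therefore be inverted.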
In the following we will describe the details of our method and the assumptions that must be satisfied to apply it.
A model for the estimate of the covariance matrices
Lemma 1
Lemma 2
Lemma 3
Optimal shrinkage estimation
We now use the previous results to solve the problem of finding the optimal estimates of \(\hat{\lambda }_I,\hat{\lambda }_D\) in (1), as a function of the statistics S (sample covariance matrix of the data in the same cluster), of \(Q_N\) given in (4), and of two quantities \(\mathbb {S}_N\) and \(\mathbb {T}_N\) that can be updated inductively. As can be seen in (2), the problem here is the unbiased estimation of the terms \({\mathrm {tr}}(\Sigma ^2)\) and \({\mathrm {tr}}( \Sigma ^2 ) - {\mathrm {tr}}( D_\Sigma ^2) .\) The derivation of this estimate is given in the next section, after a technical result given hereafter.
Lemma 4
See Appendix for the proof.
Unbiased estimators of \({\mathrm {tr}}(\Sigma ^2)\) and \({\mathrm {tr}}( \Sigma ^2 ) - {\mathrm {tr}}( D_\Sigma ^2)\)
Summary statistics and primary data compression
In this section we define the summary statistics that will be retained in memory for each cluster and we describe the first phase of our clustering procedure. As in the BFR algorithm, we first perform the primary data compression, that is the identification of items which can be assigned to a cluster and then discarded (Discard Set, DS), after updating the corresponding summary statistics contained in the Compression Set (CS). Data compression thus refers to representing groups of points by their summary statistics and purging these points from RAM. In our algorithm, as in BFR, primary data compression is followed by a secondary data compression, which takes place over the data points in the Retained Set (RS) that were not compressed in the primary phase.
In particular, the statistics \({\mathbf {s}}_N\) are needed to compute the sample means \(\bar{\mathbf {x}}_N=\frac{1}{N}\sum _{i=1}^N\mathbf {x}_i\), which are used as cluster centers, while the matrices \(\Sigma _N\) are used to compute the unbiased sample covariance matrices of the clusters \(S=\frac{1}{N-1}\sum _{i=1}^N(\mathbf {x}_i-\bar{\mathbf {x}}_N)(\mathbf {x}_i-\bar{\mathbf {x}}_N)^\top\), which are needed, together with \(Q_N, \mathbb {S}_N, \mathbb {T}_N\), to compute the optimal double shrinkage estimators described in the previous section.
Note that the matrix \(\Sigma _N\) is symmetric, thus at each step of the algorithm we have to retain in memory only \(\frac{p(p + 1)}{2} + p + 4 = \frac{p^2}{2} + \frac{3}{2}p + 4\) summary statistics for each cluster, where p is the dimension of the data points. Thus, in case of K clusters, our computational costs are of the order of \(Kp^2\). In addition, note that we should simply sum the corresponding statistics if we want to merge two clusters.
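The bookkeeping described above can be sketched as follows. The sketch assumes that \({\mathbf {s}}_N\) is the running sum of the points and that \(\Sigma _N\) is the running sum of the outer products \(\mathbf {x}_i\mathbf {x}_i^\top\); the class name `ClusterSummary` and its methods are hypothetical, and the scalar statistics \(Q_N, \mathbb {S}_N, \mathbb {T}_N\) are left out because their definitions are not reproduced here.

```python
import numpy as np

class ClusterSummary:
    """Running sums for one cluster (assumed layout, see the text above)."""

    def __init__(self, p):
        self.n = 0
        self.s = np.zeros(p)            # assumed: sum of the points
        self.sigma = np.zeros((p, p))   # assumed: sum of outer products x x^T

    def add(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.s += x
        self.sigma += np.outer(x, x)

    def merge(self, other):
        """Merging two clusters only requires summing their statistics."""
        merged = ClusterSummary(self.s.shape[0])
        merged.n = self.n + other.n
        merged.s = self.s + other.s
        merged.sigma = self.sigma + other.sigma
        return merged

    def mean(self):
        return self.s / self.n

    def sample_covariance(self):
        """Unbiased S = (Sigma_N - N * mean mean^T) / (N - 1); needs n >= 2."""
        m = self.mean()
        return (self.sigma - self.n * np.outer(m, m)) / (self.n - 1)

def stored_values_per_cluster(p):
    """p(p+1)/2 entries of the symmetric matrix, p sums, and 4 scalars."""
    return p * (p + 1) // 2 + p + 4

# Usage: two partial clusters merged into one reproduce the batch estimates.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
part_a, part_b = ClusterSummary(4), ClusterSummary(4)
for x in X[:120]:
    part_a.add(x)
for x in X[120:]:
    part_b.add(x)
merged = part_a.merge(part_b)

print(np.allclose(merged.mean(), X.mean(axis=0)))
print(np.allclose(merged.sample_covariance(), np.cov(X, rowvar=False)))
print(stored_values_per_cluster(10))
```

The points themselves are never stored: once `add` has been called, a point can be discarded, which is what makes the approach usable on a stream.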
We then locally approximate the distribution of the clusters with a p-variate Gaussian and we build confidence regions around the centers of the clusters (see [9]). Following the approach stated in [3], which is motivated by the assumption that the mean is unlikely to move outside of the computed confidence interval, we perturb \(\bar{\mathbf {x}}_h\) by moving it to the farthest position from \(\mathbf {x}\) in its confidence region, while we perturb the centers of the other clusters by moving them to the closest positions with respect to \(\mathbf {x}\), and we check whether the cluster center closest to \(\mathbf {x}\) is still \(\bar{\mathbf {x}}_h\). If so, we assign \(\mathbf {x}\) to cluster h, we update the corresponding summary statistics and we put \(\mathbf {x}\) in the discard set; otherwise, we put \(\mathbf {x}\) in the retained set (RS) (see Fig. 2). If in the first comparisons the point \(\mathbf {x}\) is closer to a point \(\mathbf {x}_o\) of the retained set than to any cluster, we form a new secondary cluster with the two points if \(\mathbf {x}_o\) remains the closest to \(\mathbf {x}\) after the centers’ perturbation. In this case we add the corresponding summary statistics to the compression set CS, and we put \(\mathbf {x}\) and \(\mathbf {x}_o\) in the discard set. Otherwise we put \(\mathbf {x}\) and \(\mathbf {x}_o\) in RS (see Fig. 3).
Let us see the procedure of centers’ perturbation in deeper detail.
Confidence regions
Secondary data compression
The purpose of secondary data compression is to identify “tight” subclusters of points among the data that we cannot discard in the primary phase. In [3] this is done using the Euclidean metric.
We adopt a similar idea, but we use a local metric based on the Mahalanobis distance. We exploit a technique based on hierarchical clustering, mimicking Ward’s method [6, 13].
The distance between a single retained point \(\mathbf {x}\) and a cluster h is computed by the squared Mahalanobis distance \(\Delta ^2_{\hat{S}_h}(\mathbf {x},\bar{\mathbf {x}}_h)\) between the point and the cluster centroid, based on the estimated covariance matrix of the cluster, while the distance between two retained points \(\mathbf {x}_1,\mathbf {x}_2\) is computed by their squared Mahalanobis distance \(\Delta _{\hat{S}_{P}}^2({\mathbf {x}}_1,{\mathbf {x}}_2)\) based on the pooled covariance matrix (13) of all the clusters.

We merge two clusters \(h_1\) and \(h_2\) if \(\Delta _{h_1,h_2}^2 <\theta _0 \Delta ^2_{\hat{S}_{h_1h_2}}(\bar{\mathbf {x}}_{h_1},\bar{\mathbf {x}}_{h_2})\);

We merge a retained point \({\mathbf {x}}\) and a cluster h if \(\Delta ^2_{\hat{S}_h}(\mathbf {x},\bar{\mathbf {x}}_h)<\theta _1(tr(S_{h}))\);

We merge two retained points \({\mathbf {x}}_1\) and \({\mathbf {x}}_2\) if \(\Delta _{\hat{S}_{P}}^2({\mathbf {x}}_1,{\mathbf {x}}_2)<\theta _2\).
\(H_0\): the retained points come from a Gaussian distribution with covariance matrix given by the pooled covariance matrix (13) of all the clusters.
Results on simulated and real data and discussion
Results on synthetic data
Synthetic data were created for the cases of 5 and 20 clusters. Data were sampled from 5 or 20 independent p-variate Gaussians, with the elements of their mean vectors (the true means) uniformly distributed on [−5, 5]. The covariance matrices were generated by computing products of the type \(\Sigma =UHU^T\), where H is a diagonal matrix with elements on the diagonal distributed as a Beta (0.5, 0.5) rescaled to the interval [0.5, 2.5], and U is the orthonormal matrix obtained by the singular value decomposition of a symmetric matrix \(MM^T,\) where the elements of the \(p\,\times \,p\) matrix M are uniformly distributed on [−2, 2]. In both the cases of 5 and 20 clusters, we generated 10,000 vectors for each cluster, having dimensions \(p=5,\,10,\, 20.\)
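The generation scheme just described can be sketched in a few lines of NumPy. The function names, the random seed and the symmetrization step are assumptions introduced for this illustration; this is a sketch of the recipe, not the code used for the experiments reported here.

```python
import numpy as np

def random_covariance(p, rng):
    """Sigma = U H U^T with the ingredients described above."""
    M = rng.uniform(-2.0, 2.0, size=(p, p))
    U, _, _ = np.linalg.svd(M @ M.T)              # orthonormal columns
    h = 0.5 + 2.0 * rng.beta(0.5, 0.5, size=p)    # Beta(0.5, 0.5) rescaled to [0.5, 2.5]
    sigma = U @ np.diag(h) @ U.T
    return (sigma + sigma.T) / 2.0                # remove round-off asymmetry

def sample_clusters(n_clusters, p, n_per_cluster, rng):
    """Draw labelled points from n_clusters independent p-variate Gaussians."""
    points, labels = [], []
    for k in range(n_clusters):
        mean = rng.uniform(-5.0, 5.0, size=p)     # true mean of cluster k
        cov = random_covariance(p, rng)
        points.append(rng.multivariate_normal(mean, cov, size=n_per_cluster))
        labels.append(np.full(n_per_cluster, k))
    return np.vstack(points), np.concatenate(labels)

rng = np.random.default_rng(42)   # seed chosen only for reproducibility
X, y = sample_clusters(n_clusters=5, p=5, n_per_cluster=10_000, rng=rng)
print(X.shape, y.shape)
```

Since U is orthonormal, the eigenvalues of each generated matrix are exactly the diagonal entries of H, so every cluster has variances between 0.5 and 2.5 along randomly rotated axes.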
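Table 1 Results of the application of our proposed algorithm (PA) and of the BFR algorithm to synthetic data

| N. of true clusters | Algorithm | Dimension p of data points | N. of data in each chunk | N. of estimated clusters | N. of retained points (outliers) |
|---|---|---|---|---|---|
| 5 | BFR | 5 | 25 | 6 | 0 |
| 5 | PA | 5 | 25 | 5 | 0 |
| 5 | BFR | 5 | 50 | 6 | 0 |
| 5 | PA | 5 | 50 | 5 | 0 |
| 5 | BFR | 10 | 25 | 5 | 0 |
| 5 | PA | 10 | 25 | 5 | 0 |
| 5 | BFR | 10 | 50 | 5 | 0 |
| 5 | PA | 10 | 50 | 5 | 0 |
| 5 | BFR | 20 | 25 | 5 | 0 |
| 5 | PA | 20 | 25 | 5 | 0 |
| 5 | BFR | 20 | 50 | 5 | 0 |
| 5 | PA | 20 | 50 | 5 | 0 |
| 20 | BFR | 10 | 25 | 12 | 0 |
| 20 | PA | 10 | 25 | 17 | 0 |
| 20 | BFR | 10 | 50 | 13 | 0 |
| 20 | PA | 10 | 50 | 22 | 1 |
| 20 | BFR | 20 | 25 | 11 | 0 |
| 20 | PA | 20 | 25 | 19 | 0 |
| 20 | BFR | 20 | 50 | 20 | 0 |
| 20 | PA | 20 | 50 | 20 | 0 |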
We applied both our procedure and the BFR algorithm to these synthetic data, to compare the performance of the two methods. In both cases, we computed the secondary data compression once out of 25, or out of 50 data points. In the tests on data from 20 clusters we started from a lower number of initial clusters (equal to 10), in order to check the ability of our algorithm to detect the correct number of clusters. The results are reported in Table 1.
We note that the number of clusters is sometimes underestimated by our method, in particular in the case of 20 clusters. In such cases, if the point clouds of different clusters are gathered in rather close ellipsoids, the correct detection of the clusters may be more difficult. In all cases, however, the estimates provided by our algorithm are equal to or better than those obtained with the BFR algorithm.
We also point out that in the case of 20 clusters with \(p=10\), and secondary compression performed once every 50 processed data points, the overestimation of the number of clusters obtained with our algorithm is compensated by the presence of two small clusters, composed of a few hundred data points, which can then be revisited as groups of outliers. In this case too, our results are better than those obtained with BFR.
The method seems to be sensitive to the frequency of the secondary compression only in presence of many clusters.
We tested also cases with bigger values of p, but in such cases both algorithms are able to detect the correct number of clusters, in an equivalent way, since a few clusters in high dimensional spaces are almost always well separated, because of “curse of dimensionality” reasons.
Results on a real dataset
We applied our algorithm to a real dataset to detect network intrusions. Detecting intrusions is a typical data streaming problem, since it is essential to identify the event while it is happening. In our experiments we used the KDDCUP’99 (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) intrusion detection dataset, which consists of 2 weeks of raw TCP dump data. This dataset is related to a local area network simulating a true Air Force environment with occasional attacks. The variables collected for each connection include the duration of the connection, the number of bytes transmitted from source to destination (and vice versa), the number of failed login attempts, etc. We applied our algorithm to the 34 variables that are declared to be continuous.
Some of these variables actually are almost constant, giving thus an estimated zero sample variance in many clusters. In such situation, if the BFR algorithm is applied, singular covariance matrices are estimated for some clusters. Consequently the Mahalanobis distance becomes unstable. Our optimal double shrinkage estimators are thus necessary to overcome this instability, and as a byproduct, they can take into account the deviation of the kurtosis from the Gaussian case.
The same dataset was analysed in [16], via the STREAM algorithm, but they used the Euclidean distance, which is a global distance that gives the same importance to all the variables.
We obtained stable results. We applied the secondary compression every 100 data points, starting from 4 clusters composed of fewer than 20 points. We observed the presence of 6–8 big clusters starting from about 100,000 processed data points. We processed about 646,000 data points, ending with five big clusters, composed of the following numbers of points: 133,028; 121,661; 242,206; 53,235; 95,977. Note that we detected the correct final number of clusters, since in this dataset there are four possible types of attacks, plus no attack. The four types of attacks are denial-of-service; unauthorized access from a remote machine (e.g. guessing a password); unauthorized access to local superuser (root) privileges; surveillance and other probing (e.g., port scanning).
Conclusion
We have introduced a new algorithm to cluster data streams with correlated components. In some parts our algorithm imitates the BFR algorithm since, like BFR, it uses a local distance approach based on the computation of the Mahalanobis distance. In order to compute this distance, positive definite estimators of the covariance matrices of the clusters are needed, even when the clusters contain just a few data points. We obtained such estimators by considering a Steinian double shrinkage method, which leads to covariance matrix estimators that are non-singular, well-conditioned, expressed in a recursive way and thus computable on data streams. Further, these estimators provide positive definite estimates also when some components of the data points have a small variance, or when the data distribution has a kurtosis different from the Gaussian case.
We applied both our proposed method and the BFR algorithm to synthetic Gaussian data, and we compared their performance. From the numerical results we conclude that our method provides rather good clustering on synthetic data, and performs better than the BFR algorithm, in particular in the presence of few clusters in spaces of rather low dimension. This is reasonable, since the BFR algorithm approximates the “clouds” of data with ellipsoids having axes parallel to the reference system, and this leads to a wrong classification when the clusters are elongated, not well separated, and with axes rotated with respect to the reference system. In such situations our algorithm is able to capture the geometry of the clusters more properly, and thus improves the classification.
The secondary compression could possibly be improved by applying some incremental model-based technique (see [5]), modified in such a way as to avoid multiple scans of the sample.
We also applied our algorithm to a real dataset, obtaining good results in terms of correct identification of the number of clusters, and stability of our algorithm.
The advantage of our algorithm with respect to other methods present in the literature, like BFR or STREAM, is that it relaxes the assumptions on the processed data streams, and can thus be effectively applied to a wider class of cases, on which it performs better. In the cases where the assumptions of the other methods are satisfied, our algorithm provides equivalent results. It can therefore systematically replace other methods for analyzing data streams, in all cases in which the data points are not too high dimensional.
Declarations
Authors' contributions
The contribution of each author is equal. Both authors read and approved the final manuscript.
Acknowledgements
G. Aletti is a member of “Gruppo Nazionale per il Calcolo Scientifico (GNCS)” of the Italian “Istituto Nazionale di Alta Matematica (INdAM)”.
Competing interests
The authors declare that they have no competing interests.
Availability of data and materials
All data and materials are available upon request.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Funding
This work has been partially supported by the Università degli Studi di Milano Grant Project 2017 “Stochastic modelling, statistics and study of the invariance properties of stochastic processes with geometrical and space-time structure in applications”, and by ADAMSS Center funds for Big Data research.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Authors’ Affiliations
References
1. Aggarwal CC. Data clustering: algorithms and applications. 1st ed. Boca Raton: Chapman and Hall/CRC; 2013.
2. Arthur D, Vassilvitskii S. K-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, society for industrial and applied mathematics. Philadelphia: SODA ’07. 2007. p. 1027–35. http://dl.acm.org/citation.cfm?id=1283383.1283494.
3. Bradley PS, Fayyad UM, Reina C. Scaling clustering algorithms to large databases. In: KDD. 1998. p. 9–15.
4. Fisher TJ, Sun X. Improved Stein-type shrinkage estimators for the high-dimensional multivariate normal covariance matrix. Comput Stat Data Anal. 2011;55(5):1909–18. https://doi.org/10.1016/j.csda.2010.12.006.
5. Fraley C, Raftery A, Wehrens R. Incremental model-based clustering for large datasets with small clusters. J Comput Gr Stat. 2005;14(3):529–46. https://doi.org/10.1198/106186005X59603.
6. Gan G, Ma C, Wu J. Data clustering: theory, algorithms and applications. ASA-SIAM Ser Stat Appl Probab. 2007;20:1–466. https://doi.org/10.1137/1.9780898718348.
7. Garofalakis M, Gehrke J, Rastogi R, editors. Data stream management: processing high-speed data streams. Data-centric systems and applications. Cham: Springer; 2016. https://doi.org/10.1007/9783540286080.
8. Himeno T, Yamada T. Estimations for some functions of covariance matrix in high dimension under non-normality and its applications. J Multivar Anal. 2014;130:27–44. https://doi.org/10.1016/j.jmva.2014.04.020.
9. Hotelling H. The generalization of Student’s ratio. Ann Math Stat. 1931;2(3):360–78. https://doi.org/10.1214/aoms/1177732979.
10. Ikeda Y, Kubokawa T, Srivastava MS. Comparison of linear shrinkage estimators of a large covariance matrix in normal and non-normal distributions. Comput Stat Data Anal. 2016;95:95–108. https://doi.org/10.1016/j.csda.2015.09.011.
11. Jain AK. Data clustering: 50 years beyond K-means. Pattern Recogn Lett. 2010;31(8):651–66. https://doi.org/10.1016/j.patrec.2009.09.011.
12. Kaufman L, Rousseeuw P. Finding groups in data: an introduction to cluster analysis. New York: Wiley; 1990.
13. Legendre P, Legendre LF. Numerical ecology, vol. 24. London: Elsevier; 2012.
14. Leskovec J, Rajaraman A, Ullman J. Mining of massive datasets. 2nd ed. Cambridge: Cambridge University Press; 2014.
15. Ng RT, Han J. Clarans: a method for clustering objects for spatial data mining. IEEE Trans Knowl Data Eng. 2002;14(5):1003–16. https://doi.org/10.1109/TKDE.2002.1033770.
16. O’Callaghan L, Mishra N, Meyerson A, Guha S, Motwani R. Streaming-data algorithms for high-quality clustering. In: Proceedings of IEEE international conference on data engineering. 2001. p. 685.
17. Rencher A. Multivariate statistical inference and applications. Wiley series in probability and statistics: texts and references section. New York: Wiley; 1998.
18. Touloumis A. Nonparametric Stein-type shrinkage covariance matrix estimators in high-dimensional settings. Comput Stat Data Anal. 2015;83:251–61. https://doi.org/10.1016/j.csda.2014.10.018.
19. Zhang T, Ramakrishnan R, Livny M. Birch: an efficient data clustering method for very large databases. SIGMOD Rec. 1996;25(2):103–14. https://doi.org/10.1145/235968.233324.