
Unsupervised outlier detection in multidimensional data

Abstract

Detection and removal of outliers in a dataset is a fundamental preprocessing task without which the analysis of the data can be misleading. Furthermore, the existence of anomalies in the data can heavily degrade the performance of machine learning algorithms. To detect anomalies in a dataset in an unsupervised manner, some novel statistical techniques are proposed in this paper. The proposed techniques are based on statistical methods that consider data compactness and other properties. The newly proposed ideas are efficient in terms of performance, ease of implementation, and computational complexity. Furthermore, two of the proposed techniques transform the data into a unidimensional distance space to detect the outliers, so irrespective of the data’s high dimensionality, the techniques remain computationally inexpensive and feasible. A comprehensive performance analysis of the proposed anomaly detection schemes is presented in the paper, and the newly proposed schemes outperform state-of-the-art methods when tested on several benchmark datasets.

Introduction

An observation in a dataset is considered an outlier if it differs significantly from the rest of the observations. The problem of finding patterns in data that deviate from the expected behavior is called the anomaly detection or outlier detection problem. Outliers can occur due to variability in measurements, experimental errors, or noise [1], and their existence makes the analysis of data misleading and degrades the performance of machine learning algorithms [2, 3].

Several techniques have been developed in the past to detect outliers in data [4,5,6]. The techniques for outlier detection can be broadly classified as methods based on: (i) Clustering [7], (ii) Classification [8], (iii) Neighbor based [9], (iv) Statistical [10], (v) Information-Theoretic [11], and (vi) Spectral methods [12]. Classification-based methods mostly rely on a confidence score, which is calculated by the classifier while making a prediction for a test observation; if the score is not high enough, the observation is not assigned any label and is considered an outlier. Some clustering-based methods identify the outliers by not forcing every observation to belong to a cluster, and the observations that are not assigned to any cluster are identified as outliers. The nearest-neighbor techniques are mostly based on the calculation of a distance or similarity measure between an observation and its neighboring observations; if this measure exceeds a certain threshold, the observation lies far from the rest of the observations and is considered an outlier. Statistical methods usually fit a statistical distribution (mostly the normal distribution) to the data and conduct a statistical inference test to see whether the observation belongs to the same distribution or not; if not, the observation is marked as an outlier. Information-theoretic techniques use information-theoretic measures, for example entropy and relative entropy, to analyze the information content of the data; these techniques assume that outliers or anomalies in the data induce irregularities in the information content. Spectral methods transform the data into a new space such that the outliers are easily identified and separated from the rest of the data. Furthermore, some outlier detection techniques are also based on geometric methods [13] and neural networks [14].

All the techniques mentioned above are based on some assumptions, and each has its pros and cons, as described in Table 1. The ideas proposed in this work are based on novel statistical methods that consider data properties such as compactness. The compactness of the data is estimated either through the interactions between the tails of kernel functions, for which new and adapted kernel functions are proposed, or through the variances of independent Gaussian distributions over different regions, captured by a new clustering idea. Moreover, contrary to the existing approaches, the statistical methods are modeled on a transformation of the data into a unidimensional distance space. The newly proposed methods are based on boxplot adjustment, kernel-based probability density estimation, and neighborhood information. The aim is to utilize the power of statistical methods to enhance the performance of unsupervised outlier detection algorithms while keeping their implementation easy and computationally efficient. The proposed methods are evaluated using both synthetic and real datasets and are found to be better in scenarios where the traditional approaches fail to perform well, especially when the data is contaminated with a mixture of different noise distributions.

Table 1 Assumptions, advantages, and disadvantages of different outlier detection approaches

The rest of the paper is organized as follows: the “Background” section reviews existing approaches; the “Proposed methods” section describes the new schemes; the “Evaluation using synthetic examples” and “Evaluation of boxplot adjustments using a real example” sections present the results on synthetic and real data, respectively; the “Comparison with State-of-Art” section compares the proposed schemes with existing algorithms; and the “Conclusions” section concludes the paper.

Background

After the initial pivotal works on distance-based outlier detection [15, 16], several new methods based on the distance measure were proposed in the literature [17, 18]. The difference between these later methods and the earlier studies is the use of nearest neighbors in the distance calculation. Among the variants of the original work, either a single distance based on the kth closest neighbor is calculated [19], or the aggregate distance of the k closest points is calculated [20]. Other unsupervised outlier detection algorithms include the local approaches originating from the concept of the Local Outlier Factor [21].

Furthermore, the boxplot is also one of the fundamental unsupervised outlier detection approaches, and the concept of univariate boxplot analysis was first proposed by Tukey [22]. In a univariate boxplot, there are five parameters: (i) the upper extreme bound (UE), (ii) the lower extreme bound (LE), (iii) the upper quartile Q3 (75th percentile), (iv) the lower quartile Q1 (25th percentile), and (v) the median Q2 (50th percentile). The extreme boundaries are best estimated by first estimating the Probability Density Function (PDF), \(f(x)\), from which the boundaries are defined as follows:

$$\left\{ {\begin{array}{*{20}c} {UE:\frac{\tau }{2} = P\left( {X > UE} \right) = \mathop \smallint \limits_{UE}^{ + \infty } f\left( x \right)dx} \\ {LE:\frac{\tau }{2} = P\left( {X < LE} \right) = \mathop \smallint \limits_{ - \infty }^{LE} f\left( x \right)dx} \\ \end{array} } \right..$$
(1)

where \(\tau\) is the significance level; the region of suspected outliers is defined for \(\tau =0.05\) and the region of extremely suspected outliers for \(\tau =0.01\). Equation (1) estimates the boundaries well only if the distribution is unimodal, i.e., has a single peak or at most one most-frequent value.
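As an illustration, the tail condition in Eq. (1) can be solved directly when a normal PDF is fitted to the data. The following sketch (our own example, not the authors’ code) uses the normal quantile function so that each tail holds \(\tau/2\) of the probability mass:

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(0)
x = [random.gauss(10.0, 2.0) for _ in range(5000)]

tau = 0.05                      # significance level: tau/2 mass in each tail
fit = NormalDist(mu=mean(x), sigma=stdev(x))

# Eq. (1): LE and UE are the tau/2 and 1 - tau/2 quantiles of the fitted PDF.
LE = fit.inv_cdf(tau / 2)
UE = fit.inv_cdf(1 - tau / 2)

outliers = [v for v in x if v < LE or v > UE]
```

By construction, roughly a fraction \(\tau\) of the sample falls outside \([LE, UE]\) when the normal fit is adequate.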

In a standard boxplot, however, the UE and LE values are computed, and well estimated only under the assumption that the PDF is symmetric, as:

$$\left\{ {\begin{array}{*{20}c} {LE = Q1 - 1.5\left( {IQR} \right), } \\ {UE = Q3 + 1.5\left( {IQR} \right).} \\ \end{array} } \right.$$
(2)

where the term IQR is defined as the Inter Quartile Range and is given by:

$$IQR = Q3 - Q1.$$
(3)

A common practice to identify the outliers in a dataset using a boxplot is to mark the points that lie outside the extreme values; that is, the points greater than UE or less than LE are identified as outliers. This version of the outlier detection scheme works well for symmetric data. For skewed data, however, several other schemes have been proposed in the literature. For example, some authors use the semi-interquartile ranges, i.e., \(Q3-Q2\) and \(Q2-Q1\), to define the extreme values as:

$$\left\{ {\begin{array}{*{20}c} {LE = Q1 - c_{1} \left( {Q2 - Q1} \right), } \\ {UE = Q3 + c_{2} \left( {Q3 - Q2} \right). } \\ \end{array} } \right.$$
(4)

where \({c}_{1}\) and \({c}_{2}\) are constants whose values different authors have adjusted differently, for example \({c}_{1}={c}_{2}=1.5\) [23], \({c}_{1}={c}_{2}=3\) [24], or values computed from the expected values of the quartiles [25]; a few further adjustments to the boxplot for outlier detection are also available, for example [26].

The traditional boxplot methods for detecting outliers sometimes fail in situations where the noise in the data follows a mixture of distributions or a multimodal distribution, or where small outlier clusters are present. In this paper, some novel statistical schemes based on (i) boxplot adjustments and (ii) a new probability density estimation using the k-nearest-neighbor distance vector are proposed to overcome the problems faced by the traditional methods. These proposed methods are described in detail in the next section.

Proposed methods

Boxplot adjustments using D-k-NN (BADk)

The traditional boxplot identifies outliers along individual dimensions. A useful alternative to using all the dimensions separately is to transform the data into a unidimensional distance space and to identify the outliers in the new space. This can be done by measuring the distance between data points across all dimensions and computing the resulting distance vector. The idea of using a single-dimension distance vector is useful not only for avoiding the problem of sorting data in high dimensions but also in terms of computational cost, and its performance can be further enhanced by considering k neighbors in the distance calculation. This idea of boxplot adjustment based on the Distance vector considering k Nearest Neighbors (D-k-NN) is presented here, and the resulting extreme-value estimates from the modified boxplot are found to be quite useful in identifying the right outliers. Furthermore, the proposed scheme is useful in cases where the distribution of the noise is not normal or is a mixture of different distributions, and it can identify small outlier clusters in the data.

Consider a dataset in \({\mathbb{R}}^{N}\). This dataset is transformed from the N-dimensional space to a unidimensional distance space using a distance metric such as the Euclidean distance. This is done by computing the distance of each observation in the N-dimensional space to its kth closest neighbor. The transformation results in a set containing the distance of each observation to its kth closest neighbor, represented as \({d}_{k}\in {\mathbb{R}}\). This transformation can be written as:

$$d_{k} :{\mathbb{R}}^{N} \to {\mathbb{R}}$$
(5)
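The transformation in Eq. (5) can be sketched as follows; this is an illustrative brute-force implementation (the function name and sample data are ours, not the authors’):

```python
import numpy as np

def dk_transform(X, k=1):
    """Eq. (5): map N-dimensional data to a 1-D distance space by replacing
    each observation with the Euclidean distance to its k-th nearest
    neighbour (self excluded). Brute force, O(n^2) time and memory."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))   # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)             # a point is not its own neighbour
    D.sort(axis=1)                          # ascending distances per row
    return D[:, k - 1]

X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])
dk = dk_transform(X, k=1)   # the isolated fourth point gets a large distance
```

For large datasets a k-d tree or ball tree would replace the quadratic pairwise computation, but the resulting set \(d_k\) is the same.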

The set \({d}_{k}\) is used for computing the extreme value of the boxplot as follows:

$$\left\{ {\begin{array}{*{20}c} {LE_{{d_{k} }} = Q1_{{d_{k} }} - c_{1} \left( {Q2_{{d_{k} }} - Q1_{{d_{k} }} } \right), } \\ {UE_{{d_{k} }} = Q3_{{d_{k} }} + c_{2} \left( {Q3_{{d_{k} }} - Q2_{{d_{k} }} } \right).} \\ \end{array} } \right.$$
(6)

From the extreme values defined in (6), the outliers are identified as the points that lie outside the boundaries \({LE}_{{d}_{k}}\) and \({UE}_{{d}_{k}}\). The two constants \({c}_{1}\) and \({c}_{2}\) are adjustable with respect to the dataset under consideration, and selecting smaller values results in more points being marked as outliers. The values suggested in the literature are \({c}_{1}={c}_{2}=1.5\) or \({c}_{1}={c}_{2}=3\).
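A minimal sketch of the BADk rule in Eq. (6), assuming the distance set \(d_k\) has already been computed (the function name and the sample distances are hypothetical):

```python
import numpy as np

def badk_outliers(dk, c1=1.5, c2=1.5):
    """Adjusted boxplot of Eq. (6) applied to the 1-D distance set d_k.
    Returns a boolean mask, True where a point falls outside [LE, UE]."""
    q1, q2, q3 = np.percentile(dk, [25, 50, 75])
    le = q1 - c1 * (q2 - q1)   # lower extreme from the lower semi-IQR
    ue = q3 + c2 * (q3 - q2)   # upper extreme from the upper semi-IQR
    return (dk < le) | (dk > ue)

# Hypothetical k-NN distances: one observation is far from its neighbours.
dk = np.array([1.0, 1.1, 0.95, 1.05, 1.0, 0.98, 8.0])
mask = badk_outliers(dk)   # the far-away distance 8.0 is flagged
```

Because the rule operates on the one-dimensional set \(d_k\), its cost is independent of the original data dimension once the distances are available.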

Furthermore, another useful idea to identify the outliers in a data is to adjust the UE and LE values of a boxplot as follows:

$$\left\{ {\begin{array}{*{20}c} {LE = Q1_{{d_{k} }} - c_{1} \times \sqrt {var\left( {X.1_{{X < Q2_{{d_{k} }} }} } \right)} , } \\ {UE = Q3_{{d_{k} }} + c_{2} \times \sqrt {var\left( {X.1_{{X \ge Q2_{{d_{k} }} }} } \right).{ }} } \\ \end{array} } \right.$$
(7a)

or

$$\left\{ {\begin{array}{*{20}c} {LE = Q1_{{d_{k} }} - c_{1} \times \sqrt {var\left( {X.1_{{X < Q1_{{d_{k} }} }} } \right),{ }} } \\ {UE = Q3_{{d_{k} }} + c_{2} \times \sqrt {var\left( {X.1_{{X \ge Q3_{{d_{k} }} }} } \right).} } \\ \end{array} } \right.$$
(7b)

where \(var\) denotes the variance and the quartiles are computed from the set \({d}_{k}\in {\mathbb{R}}\). The extreme values can also be estimated from a separation threshold between the centers of two variances, see Eq. (9), as:

$$\left\{ {\begin{array}{*{20}c} {LE = M - c_{1} \times var\left( {X.1_{X < M} } \right), } \\ {UE = M + c_{2} \times var\left( {X.1_{X \ge M} } \right). } \\ \end{array} } \right.$$
(8)

where \(M\) is a value that separates the one-dimensional region in order to calculate the variance around two centers. Let \(x \in {\mathbb{R}}\) be any random variable with PDF \(f(x)\); the values of \({\mu }_{1}\) and \({\mu }_{2}\) are calculated such that:

$$\left\{ {\begin{array}{*{20}c} {\left( {\mu_{1}^{*} ,\mu_{2}^{*} } \right) = \mathop {\arg \min }\limits_{{M,\mu_{1} ,\mu_{2} }} \left[ {\mathop \smallint \limits_{ - \infty }^{M} \left( {x - \mu_{1} } \right)^{2} f\left( x \right)dx + \mathop \smallint \limits_{M}^{\infty } \left( {x - \mu_{2} } \right)^{2} f\left( x \right)dx} \right],} \\ {Var_{1} = \mathop \smallint \limits_{ - \infty }^{M} \left( {x - \mu_{1} } \right)^{2} f\left( x \right)dx,} \\ {Var_{2} = \mathop \smallint \limits_{M}^{\infty } \left( {x - \mu_{2} } \right)^{2} f\left( x \right)dx. } \\ \end{array} } \right.$$
(9)

Both \({Var}_{1}\) and \({Var}_{2}\) can be partially differentiated with respect to \({\mu }_{1}\) and \({\mu }_{2}\), respectively, to find the minimum. After simplification, the minimum occurs when:

$$\left\{ {\begin{array}{*{20}c} {\mu_{1} = \frac{{E\left( {X_{ - } } \right)}}{{P\left( {X_{ - } } \right)}},} \\ {\mu_{2} = \frac{{E\left( {X_{ + } } \right)}}{{P\left( {X_{ + } } \right)}}, } \\ \end{array} } \right.$$
(10)

where \({X}_{-}= X\cdot{1}_{X<M}\) and \({X}_{+}= X\cdot{1}_{X\ge M}\), and the value of \(M\) is calculated as:

$$M = \left[ {\frac{{E\left( {X_{ - } } \right)}}{{P\left( {X_{ - } } \right)}} + \frac{{E\left( {X_{ + } } \right)}}{{P\left( {X_{ + } } \right)}}} \right].\frac{1}{2}$$
(11)

For further details on the idea proposed in (8)–(11), the readers are referred to [27].
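Empirically, the fixed point described by Eqs. (10)–(11) can be found by a simple alternating iteration on the sample, since \(E(X_-)/P(X_-)\) is just the conditional mean of the points below \(M\). The sketch below uses our own naming and is essentially a one-dimensional two-means split:

```python
import numpy as np

def split_threshold(x, iters=100):
    """Empirical fixed point of Eqs. (10)-(11): M is the midpoint of the
    conditional means of the two sides it induces."""
    x = np.asarray(x, dtype=float)
    M = x.mean()                                # initial guess
    for _ in range(iters):
        lo, hi = x[x < M], x[x >= M]
        if len(lo) == 0 or len(hi) == 0:
            break
        new_M = 0.5 * (lo.mean() + hi.mean())   # Eq. (11)
        if np.isclose(new_M, M):
            break
        M = new_M
    return M

x = np.concatenate([np.full(50, 0.0), np.full(50, 10.0)])
M = split_threshold(x)   # settles at the midpoint between the two groups
```

The two conditional variances of Eq. (9) can then be computed on the subsets `x[x < M]` and `x[x >= M]` to form the extreme values of Eq. (8).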

Detecting outliers based on the boxplot is efficient only if the data follows a unimodal distribution. To overcome the drawbacks of the boxplot estimation, some other statistical methods for outlier detection, based on a probability density estimation computed from either the set \({d}_{k}\in {\mathbb{R}}\) or the actual data \(D\in {\mathbb{R}}^{N}\), are also proposed and discussed below.

Joint probability density estimation using D-k-NN

The methods proposed in this section compute the set \({d}_{k}\) from the actual data and utilize it for estimating some parameters of the joint distribution function. Three different schemes are proposed here which are described as follows:

Scheme 1: Normal distributions are often used to represent real-valued random variables with unknown distributions [28, 29]. The joint probability density function of independent and identically distributed normal random variables is given as:

$$f\left( {x_{1} , \ldots ,x_{N} } \right) = \frac{1}{{\left( {\zeta \sqrt {2\pi } } \right)^{N} }}e^{{\mathop \sum \limits_{i = 1}^{N} - \frac{1}{2}\left( {\frac{{x_{i} - \mu_{i} }}{\zeta }} \right)^{2} }}$$
(12)

where \(\zeta\) is the standard deviation, modeled differently in (15) and (17), \(\mu\) is the mean of the random variable, and N is the dimension of the data. Here, some functions based on the normal distribution are proposed to identify the outliers in a dataset. For a two-dimensional dataset \(D(x,y)\), a separation threshold \(T\) based on the normal distribution can be defined for detecting the outliers such that:

$$\left\{ {\begin{array}{*{20}c} {Z = \mathop \sum \limits_{i = 1}^{n} f\left( {x_{i} ,y_{i} } \right),} \\ {T = \alpha max\left( Z \right). } \\ \end{array} } \right.$$
(13)

where Z is the joint probability distribution function after normalization, \(n\) is the total number of observations, and the function \(f\left(x,y\right)\) can be defined as:

$$f\left( {x,y} \right) = \frac{1}{{2\pi \zeta^{I} }}e^{{ - \left( {\frac{{\left( {x - x_{i} } \right)^{2} + \left( {y - y_{i} } \right)^{2} }}{{2\zeta^{2} }}} \right)}} ;i = 1,2, \ldots ,{ }n.;I = 0,1,2.$$
(14)

The \(\zeta\) in Eq. (14) can be computed as:

$$\zeta = \beta Q3_{{d_{k} }} .$$
(15)

where \({Q3}_{{d}_{k}}\) is the third quartile computed from the set \({d}_{k}\) defined in Eq. (5) and \(\beta\) is a constant. The points below the threshold value \(T\) defined in Eq. (13) are considered outliers, and the points above \(T\) are considered normal inlier data points. The \(\alpha\) used in Eq. (13) is the significance value; it controls the percentage of data removed as outliers.
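Putting Eqs. (13)–(15) together, scheme 1 can be sketched as below. This is an illustrative brute-force implementation with hypothetical parameter values and naming, not the authors’ code:

```python
import numpy as np

def scheme1_outliers(X, k=2, alpha=0.2, beta=1.0):
    """Sketch of scheme 1: a Gaussian kernel of bandwidth
    zeta = beta * Q3(d_k) (Eq. 15) is placed at every point, the density Z
    at each point is the sum over all kernels (Eqs. 13-14), and points with
    Z below T = alpha * max(Z) are flagged as outliers."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    sq = (diff ** 2).sum(axis=-1)                  # squared pairwise distances
    d = np.where(np.eye(len(X), dtype=bool), np.inf, np.sqrt(sq))
    dk = np.sort(d, axis=1)[:, k - 1]              # k-th NN distances (Eq. 5)
    zeta = beta * np.percentile(dk, 75)            # Eq. (15)
    Z = np.exp(-sq / (2 * zeta ** 2)).sum(axis=1)  # unnormalised density sum
    return Z < alpha * Z.max()                     # threshold T of Eq. (13)

# A tight 5x5 grid of inliers plus one far-away point.
xs = np.arange(5) * 0.1
grid = np.array([[a, b] for a in xs for b in xs])
X = np.vstack([grid, [[10.0, 10.0]]])
mask = scheme1_outliers(X, k=2, alpha=0.2, beta=1.0)  # flags only the last point
```

The isolated point’s density reduces to its own kernel’s contribution, which falls below the threshold as long as the inliers are mutually close.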

Scheme 2: To better detect the outliers, a function \(f\left(x,y\right)\) can be constructed that weakens an outlier’s contribution in terms of both the support and the amplitude of its kernel. Accordingly, another scheme for detecting the outliers based on the threshold defined in (13) uses the function:

$$f\left( {x,y} \right) = \frac{{\zeta^{2} }}{\pi }e^{{ - \left( {\frac{{\left( {x - x_{i} } \right)^{2} + \left( {y - y_{i} } \right)^{2} }}{{\zeta^{2} }}} \right)}} ;i = 1,2, \ldots ,{ }n.$$
(16)
$$\zeta = \frac{\gamma }{{\left( {1 + d_{k} } \right)^{2} }}$$
(17)

where \(k\) specifies the kth closest neighbor used in the distance metric and \(\gamma\) is a constant whose value can be adjusted to control the smoothness of the Gaussian distribution. The concept is demonstrated in Fig. 1, where (a) shows the effect of the traditional Gaussian approach on compression and (b) shows the effect of the proposed scheme 2.

Fig. 1
figure1

a Effect of traditional gaussian approach on compression. b Effect of proposed scheme 2 approach on compression in x and y axes
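A sketch of scheme 2 following Eqs. (16)–(17): every point \(i\) carries its own bandwidth \(\zeta_i = \gamma/(1+d_k(i))^2\), so both the support and the amplitude \(\zeta_i^2/\pi\) of an isolated point’s kernel shrink. Function names, parameter values, and the sample data are ours:

```python
import numpy as np

def scheme2_outliers(X, k=2, alpha=0.05, gamma=1.0):
    """Sketch of scheme 2 (Eqs. 16-17): per-point bandwidths make isolated
    points contribute almost nothing to the density at their own location,
    so they fall below the threshold of Eq. (13)."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    sq = (diff ** 2).sum(axis=-1)
    d = np.where(np.eye(len(X), dtype=bool), np.inf, np.sqrt(sq))
    dk = np.sort(d, axis=1)[:, k - 1]        # k-th NN distances (Eq. 5)
    zeta = gamma / (1.0 + dk) ** 2           # Eq. (17), one bandwidth per point
    # Density at point j: sum over kernels i of (zeta_i^2/pi) e^{-r^2/zeta_i^2}
    K = (zeta[:, None] ** 2 / np.pi) * np.exp(-sq / zeta[:, None] ** 2)
    Z = K.sum(axis=0)
    return Z < alpha * Z.max()

# A tight 5x5 grid of inliers plus one far-away point.
xs = np.arange(5) * 0.1
grid = np.array([[a, b] for a in xs for b in xs])
X = np.vstack([grid, [[10.0, 10.0]]])
mask = scheme2_outliers(X)   # only the isolated point falls below the threshold
```

Compared with scheme 1, no bandwidth is shared across points, which is what weakens an outlier’s own kernel rather than merely its neighborhood density.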

Both schemes proposed above are based on a single Gaussian distribution and are expected to work well for datasets that can be well approximated by a single Gaussian. However, if a dataset is better approximated by multiple Gaussians, a better idea is to use a model with a variable number of Gaussians. A new and robust estimation of multiple Gaussian distributions is proposed in the next subsection.

Scheme 3: In scenarios where the data is estimated using a single Gaussian distribution, the outliers are identified as the points lying on the extreme tails of the Gaussian, as shown in Fig. 2a. However, if a better estimation of the underlying data is possible through multiple Gaussians, the outliers located at the connecting points of the different Gaussians might remain unidentified by a single Gaussian estimation. To identify such outliers, an idea based on multiple Gaussian estimation is proposed, where a Rejection Area (RA) is defined and computed as:

$$\left\{ {\begin{array}{*{20}c} {RA = \left\{ {\vec{x}:f\left( {\vec{x}} \right) \le Cv} \right\},} \\ {f\left( {x \in RA} \right) = \tau . } \\ \end{array} } \right.$$
(18)

where \(Cv\) is a critical value, the threshold below which lies the rejection area where the outliers are identified, and \(\tau\) is the significance level. The concept is shown in Fig. 2b, where, as an example, one-dimensional data is estimated using two Gaussians and the outliers are identified as the points below \(Cv.\)

Fig. 2
figure2

An Example of Gaussian estimation and marking of critical value for outlier detection. a Points those lie outside the red boundaries are considered outliers. b The points below the critical value are identified as outliers

To find the optimum number of Gaussians that best approximate the joint probability distribution of a given dataset, the sorted values of the vector \({d}_{k}\) can be utilized. For example, Fig. 3 shows the graph of the sorted values of \({d}_{k}\); the best number of Gaussians can be estimated by taking the value where the graph takes off sharply.

Fig. 3
figure3

Plot of sorted values of the vector \({{\varvec{d}}}_{{\varvec{k}}}\)

Each estimated Gaussian represents a region, and for each region the mean and variance can be computed as:

$$\mu_{j} = \frac{{\mathop \sum \nolimits_{{i \in R_{j} }} x_{i} }}{{n_{j} }},\quad j = 1,2, \ldots ,m$$
(19)
$$Var_{j} = \frac{{\mathop \sum \nolimits_{{i \in R_{j} }} \left( {x_{i} - \mu_{j} } \right)^{2} }}{{n_{j} - 1}},\quad j = 1,2, \ldots ,m$$
(20)

where \({R}_{j}\) represents the jth region, m is the total number of estimated Gaussians, and \({n}_{j}\) is the number of elements in the respective region. The combined multiple-Gaussian model is then estimated by:

$$x\sim \mathop \sum \limits_{i = 1}^{m} \alpha_{i} N\left( {\mu_{i} ,C_{i} } \right)$$
(21)

where

$$\alpha_{i} = \frac{{Card\left( {R_{i} } \right)}}{N}\;{\text{and}}\;\sum \alpha_{i} = 1.$$
(22)

To determine the regions, let us define a map \({S}_{U}\) that sorts any given sequence \({U}_{i}, i=1,\dots ,N\) such that \({U}_{{S}_{U}(1)}\le {U}_{{S}_{U}\left(2\right)}\le \dots \le {U}_{{S}_{U}(N)}\). For any given data \(\overrightarrow{X}\), the sorted data can be represented as \({\overrightarrow{X}}_{{S}_{U}(i)}\), and let \({\overrightarrow{\Delta X}}_{{S}_{U}(i)}\) denote the difference between two consecutive elements of \({\overrightarrow{X}}_{{S}_{U}(i)}.\) Similarly, the sorted differences can be represented as \({\overrightarrow{\Delta X}}_{{S}_{\Delta X}({S}_{U}(i))}\). To define the regions, the elements are grouped together sequentially as long as \({\Delta X}_{{S}_{\Delta X}({S}_{U}(i))}\le {\overrightarrow{\Delta X}}_{{S}_{\Delta X}({S}_{U}(N-m+1))}\); once this condition fails, the remaining elements are grouped as a new region, and the process repeats until all the elements are assigned to a region.
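The region-forming rule above amounts to cutting the sorted one-dimensional data at its \(m-1\) largest consecutive gaps. A compact sketch, with our own naming:

```python
import numpy as np

def split_into_regions(x, m):
    """Cut sorted 1-D data at its (m-1) largest consecutive gaps,
    yielding the m regions used in Eqs. (19)-(22)."""
    x = np.sort(np.asarray(x, dtype=float))
    if m <= 1:
        return [x]
    gaps = np.diff(x)                              # consecutive differences
    cuts = np.sort(np.argsort(gaps)[-(m - 1):])    # indices of the largest gaps
    return np.split(x, cuts + 1)

x = np.array([0.1, 0.2, 0.15, 5.0, 5.1, 9.9, 10.0, 10.2])
regions = split_into_regions(x, m=3)   # three well-separated groups
```

Each region \(R_j\) then yields its own mean and variance via Eqs. (19)–(20) and its mixture weight \(\alpha_j = Card(R_j)/N\) as in Eq. (22).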

Evaluation using synthetic examples

The ability of the proposed methods is demonstrated here using some two-dimensional synthetic datasets. The results for each of the proposed methods are discussed in the following subsections.

Evaluation of boxplot adjustments using D-k-NN (BADk)

Figure 4a shows an example of the data used for evaluating the proposed outlier detection methods. The actual data is composed of clusters of different shapes and is contaminated by a mixture of sinusoidal and Gaussian noise. The aim is to detect the noise as outliers and the different-shaped clusters as inliers. The traditional boxplot applied to this data is shown in Fig. 4b; it is unable to identify any outliers.

Fig. 4
figure4

a Original data representing clusters with different shapes and a mixture of sinusoidal and gaussian noise. b Traditional boxplot showing no outliers/anomalies

The same dataset is used to evaluate the proposed adjusted boxplot with the extreme values defined in Eq. (6), and the results are shown in Fig. 5. Different values of \(k\) are used to see how they affect the outcome in identifying the outliers. It can be observed from Fig. 5 that for smaller values of \(k\) only the Gaussian noise is identified, while as the value of \(k\) increases, the outliers with the sinusoidal distribution are also identified.

Fig. 5
figure5

Outliers detection from the data shown in Fig. 4a, using the proposed boxplot with extreme values defined in Eq. (6) for different values of \(k\). The data in red is identified as the inliers while the data in green is identified as the outliers

However, beyond a certain value of \(k\), data points from the actual clusters (inliers) are also marked as outliers, while outliers start to reappear as inliers. This shows that although the selection of \(k\) is flexible in this case, an optimum value still has to be selected for the best performance on the given data. Another example, shown in Fig. 6a with a different distribution of noise, is also used to evaluate the ability of the proposed method in (6) for outlier detection. The results for this example, shown in Fig. 7 for three different values of \(k\), confirm that the selection of \(k\) is very flexible and the proposed method still performs well in terms of outlier detection. In all the experiments, the constants are fixed to \({c}_{1}={c}_{2}=1.5\).

Fig. 6
figure6

a An example of a dataset having inlier clusters of different shapes contaminated with noise. b Traditional boxplot showing no outliers/anomalies

Fig. 7
figure7

Outliers detection from the data shown in Fig. 6a, using the proposed boxplot with extreme values defined in Eq. (6) for different values of \(k\). The data in red is identified as the inliers while the data in green is identified as the outliers

Evaluation (joint probability density estimation) scheme 1

Density estimation refers to the estimation of an unobservable Probability Density Function (PDF) associated with observable data. The PDF gives an estimate of the density according to which a large population in a dataset is distributed. In the proposed method, the PDF is computed by placing a Gaussian density function at each data point and summing the density functions over the range of the data, and the threshold \(T\) of Eq. (13) defines the margin between the inlier data and the outliers. The significance value α sets this threshold as a percentage of the maximum value of the PDF. The value of \(\zeta\) in Eq. (14) is computed from the \({d}_{k}\) vector as defined in Eq. (15).

The results for scheme 1 when evaluated on the example data of Fig. 4a are given in Fig. 8. The example is evaluated using two different values of α and a fixed value of \(\beta\). The outliers are shown in green and the inlier data in red. Figure 8 also shows the associated 3D plots of the probability density estimates computed using Eq. (14). With α = 0.1 the proposed method identifies only the outliers having a Gaussian distribution, while with α = 0.3 it identifies both the Gaussian and the sinusoidal outliers in the data. Figure 9 shows the results for the second example, with a different noise distribution, for fixed values of α and \(\beta\) using Eq. (14).

Fig. 8
figure8

Outlier Detection results using scheme 1 with two different values of α = 0.1 and α = 0.3. Outliers are shown in green and the inlier data in red

Fig. 9
figure9

Second example of Outlier Detection results using scheme 1 with α = 0.3, β = 3 and k = 1. Outliers are shown in green and the inlier data in red

Evaluation (joint probability density estimation) scheme 2

The results for scheme 2, proposed in Eqs. (16)–(17), with different values of the parameters are shown in Fig. 10. It can be observed that a small value of γ produces a sharp density distribution while a larger value produces a smoother one. For a small value of γ some inlier data points are also identified as outliers, which is not the case with a comparatively larger value of γ for this particular dataset. However, the parameter values need to be tuned to obtain optimum results with this scheme.

Fig. 10
figure10

Outlier detection results using scheme 2 with different values of α, γ and k. Green points represent outliers and red points represent the normal data points

Evaluation (joint probability density estimation) scheme 3

The idea proposed in Eq. (18), based on the estimation of Gaussians from the distance vector \({d}_{k}\), is evaluated on three synthetic examples having different distributions of noise, and the results are shown in Fig. 11. The visual results depicted in Fig. 11 show that this scheme succeeds in identifying outliers of different distributions, and even noisy data lying in close proximity to the inlier data. The value of G represents the number of Gaussians estimated from \({d}_{k}\).

Fig. 11
figure11

Results achieved on three different datasets with different distribution of noise using the proposed GMM estimation of \({d}_{k}.\) The data points in red are the inliers and the green data points are identified as the outliers

Evaluation of boxplot adjustments using a real example

The ideas proposed for boxplot adjustments in Eqs. (6) and (7) are also evaluated on a real dataset. The dataset used is a subset of the original KDD Cup 1999 dataset from the UCI machine learning repository; the subset is still large, containing 95,156 observations and three attributes, and is publicly available online. The ground truth of the dataset is shown in Fig. 12a, where the blue data points represent the actual inliers and the yellow points the actual outliers. The results for the boxplot using the extreme values defined in Eq. (6) are shown in Fig. 12b, and the achieved value of the Area Under Curve (AUC) evaluation parameter is 0.83 for this dataset.

Fig. 12
figure12

a Ground truth of the real data used to evaluate the proposed methods, blue data points are inliers and yellow are the outliers. b Results achieved using the boxplot with extreme values proposed in Eq. 6. Red points are inliers and green are outliers

The results achieved for the proposed idea in (7) are shown in Fig. 13a, b, respectively for Eqs. 7a and 7b. The detected outliers are shown in green color and the inliers are shown in red color. The achieved value of AUC using both Eq. 7a and 7b is 0.833.

Fig. 13
figure13

a Results achieved using the boxplot with extreme values proposed in Eq. 7(a). Red points are inliers and green are outliers. b Results achieved using the boxplot with extreme values proposed in Eq. 7(b). Red points are inliers and green are outliers

Comparison with State-of-Art

The proposed schemes are compared with several state of the art unsupervised outlier detection algorithms of similar kind, using a variety of benchmark datasets reported in [30]. The details of the benchmark datasets used for comparison are given in Table 2. The algorithms used for comparison include kNN [19], kNN-weight (kNNW) [20, 31], Outlier detection using Indegree Number (ODIN) [32], Local Outlier Factor (LOF) [21], Simplified LOF (SLOF) [33], Connectivity based Outlier Factor (COF) [34], Influenced Outlierness (INFLO) [35], Local Outlier Probabilities (LoOP) [36], Local Distance-based Outlier Factor (LDOF) [37], Local Density Factor (LDF) [38], Kernel Density Estimation Outlier Score (KDEOS) [39], Multi-Objective Generative Adversarial Active Learning (MO-GAAL) [40], Single-Objective Generative Adversarial Active Learning (SO-GAAL) [40], Artificially Generating Potential Outliers (AGPO), Active-Outlier method (AO) [41], Gaussian mixture model (GMM) [42], Parzen [43], One-Class Support Vector Machine (OC_SVM) [44], and Fast Angle-Based Outlier Detection (FastABOD) [45].

Table 2 Details of the benchmark datasets used for evaluation and comparison with the state of the art

Initially, some fundamental outlier detection algorithms are compared with the proposed algorithms using the same three synthetic datasets. The most popular evaluation measure for unsupervised outlier detection proposed in the literature is based on the Receiver Operating Characteristic (ROC) and is computed as the Area Under Curve (AUC) [30]. The ROC AUC is computed for these three datasets using the proposed schemes and compared with some fundamental state-of-the-art algorithms in Table 3. The hyperparameters of all methods are tuned and the best results are reported. The results in Table 3 show that the proposed schemes perform better than the existing algorithms. As these three datasets are only two-dimensional, a visual comparison is also possible and is provided in Fig. 14. Visual inspection makes it clear that the newly proposed methods are better than the existing ones at identifying outliers lying in close proximity to the inliers. Although all the proposed schemes outperformed the existing approaches on the two-dimensional synthetic datasets, only BADk and scheme 3 are recommended for large-scale datasets, because the computational complexity of scheme 1 and scheme 2 is relatively higher than that of BADk and scheme 3, as given in Table 4. Therefore, scheme 1 and scheme 2 are recommended only for small datasets with dimensions less than or equal to 3, while scheme 3 and BADk are recommended for large-scale datasets. However, at the cost of higher computational complexity, scheme 1 and scheme 2 can perform better in terms of ROC AUC on individual datasets. Although the best running-time complexity is achieved by LOF, the proposed methods perform much better than LOF in terms of ROC AUC values on individual datasets.
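The ROC AUC values reported throughout this comparison can be computed directly from outlier scores via the rank-based (Mann–Whitney) formulation, without tracing the full ROC curve. A minimal sketch, assuming higher scores mean "more outlying" and using hypothetical score and label arrays:

```python
def roc_auc(scores, labels):
    """Rank-based ROC AUC: the probability that a randomly chosen
    outlier scores higher than a randomly chosen inlier (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]  # true outliers
    neg = [s for s, y in zip(scores, labels) if y == 0]  # true inliers
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# hypothetical detector scores; label 1 marks a true outlier
scores = [0.1, 0.3, 0.2, 0.9, 0.8, 0.4]
labels = [0,   0,   0,   1,   1,   0]
print(roc_auc(scores, labels))  # 1.0: every outlier outranks every inlier
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why it is a convenient threshold-free measure for unsupervised detectors.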

Table 3 ROC AUC comparison using three synthetic datasets
Fig. 14

Visual comparison for outlier detection using three synthetic datasets. Rows 1–5: state-of-the-art methods; rows 6–9: the newly proposed methods

Table 4 Comparison based on computational time using three synthetic datasets

Furthermore, the proposed methods are evaluated and compared with eight existing approaches using five real benchmark datasets, and the results are reported in Table 5. The results show that the proposed schemes outperform the existing approaches on these datasets. As the proposed BADk and scheme 3 performed better than scheme 1 and scheme 2 on the synthetic datasets, both in average ROC AUC and in computational time, only these two methods are included in the comparison on the real datasets.

Table 5 Comparison with existing approaches using some real datasets

To perform a more comprehensive comparison, ten more benchmark datasets and twelve state-of-the-art methods reported in [30] are compared with the newly proposed unsupervised outlier detection methods. The results are given in Table 6, which shows that the newly proposed methods clearly outperform the existing algorithms in most cases. For two of the datasets (Hepatitis and Parkinson), the proposed scheme 3 performs as well as LDF, and both perform best among the compared approaches. For the Shuttle and WPBC datasets, the LDF approach outperforms all other methods; however, its margin over the proposed schemes on these two datasets is small, especially on WPBC. Furthermore, the two newly proposed methods, BADk and scheme 3, use the distance vector for outlier detection, so their computational complexity remains low irrespective of the increasing dimensions of the input data.

Table 6 ROC AUC values computed on different benchmark datasets using state-of-the-art algorithms and the proposed schemes

A visual comparison of the proposed methods with the existing state-of-the-art methods is provided in Fig. 15, which shows that the newly proposed scheme 3 outperforms the rest of the methods in terms of AUC. Furthermore, the computational cost of the proposed method is low because it uses the \({d}_{k}\) vector for outlier detection instead of the entire set of input data dimensions. Since only a single-dimension distance vector is used, the computational cost is independent of the dimensions of the input data, which makes the method more feasible for high-dimensional data.
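The d_k vector referred to above reduces each observation to a single number, its distance to its k-th nearest neighbour, after which any one-dimensional rule (such as the boxplot bounds) can be applied regardless of the input dimensionality. A brute-force sketch of this transformation, with hypothetical function name and sample points:

```python
import math

def dk_vector(points, k=2):
    """Distance of each point to its k-th nearest neighbour.

    Brute force, O(n^2 * d) in the number of points n and dimensions d;
    the output is one-dimensional whatever d is.
    """
    dk = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        dk.append(dists[k - 1])
    return dk

# small 2-D example: a tight unit-square cluster plus one far-away point
pts = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
print(dk_vector(pts, k=2))  # the isolated point has a much larger d_k
```

In practice a spatial index (e.g. a k-d tree) would replace the quadratic scan, but the key property, that downstream outlier rules see only one dimension, is unchanged.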

Fig. 15

Comparison of the proposed schemes with the state-of-the-art methods using 10 benchmark datasets for outlier detection. The y-axis represents the computed ROC AUC values

Conclusions

Outlier detection is one of the most important preprocessing steps in data analytics and is a vital step for the best performance of machine learning algorithms. Different methods are presented in this paper, keeping in view the need for a robust and easy-to-implement outlier detection algorithm. The newly proposed methods are based on novel statistical techniques considering data compactness, which results in the added advantages of easy implementation, improved accuracy, and low computational cost. To demonstrate the proposed ideas' performance, several benchmark multidimensional datasets and three complex synthetic two-dimensional datasets, containing clusters of different shapes contaminated with a mixture of varying noise distributions, are used. The proposed methods are found to be accurate and better at outlier detection than the state of the art. It is also observed that some fundamental state-of-the-art methods cannot detect outliers in scenarios where the outliers are a mixture of two different distributions. Moreover, two of the newly proposed schemes use only a single-dimension distance vector instead of the entire set of data dimensions for outlier detection, which makes them more feasible and computationally inexpensive, irrespective of the input data's large size and growing dimensions.

Moreover, the evaluation of the proposed unsupervised outlier detection methods on several real benchmark datasets reveals their usefulness in detecting multivariate outliers in real data. The work can be extended by optimizing the distance calculation for the proposed scheme 3 and BADk, which would further reduce the computational cost of these methods. Distance metrics other than the Euclidean can also be investigated in future work, as this metric suffers in high dimensions.

Availability of data and materials

The datasets analysed during the current study are publicly available at http://odds.cs.stonybrook.edu/smtp-kddcup99-dataset/; https://www.dbs.ifi.lmu.de/research/outlier-evaluation/DAMI/.

Notes

  1. http://odds.cs.stonybrook.edu/smtp-kddcup99-dataset/.

Abbreviations

ABOD: Angle-Based Outlier Detection

AUC: Area Under Curve

BADk: Boxplot adjustments using D-k-NN

COF: Connectivity-based Outlier Factor

D-k-NN: Distance vector considering k number of Nearest Neighbors

INFLO: Influenced Outlierness

KDEOS: Kernel Density Estimation Outlier Score

LDF: Local Density Factor

LDOF: Local Distance-based Outlier Factor

LE: Lower extreme bound

LOF: Local Outlier Factor

LoOP: Local Outlier Probabilities

ODIN: Outlier Detection using Indegree Number

PDF: Probability Density Function

ROC: Receiver Operating Characteristics

SLOF: Simplified LOF

UE: Upper extreme bound

References

  1. Zhu J, Ge Z, Song Z, Gao F. Review and big data perspectives on robust data mining approaches for industrial process modeling with outliers and missing data. Annu Rev Control. 2018;46:107–33.
  2. McClelland GH. Nasty data: unruly, ill-mannered observations can ruin your analysis. In: Handbook of research methods in social and personality psychology. Cambridge: Cambridge University Press; 2000.
  3. Frénay B, Verleysen M. Reinforced extreme learning machines for fast robust regression in the presence of outliers. IEEE Trans Cybern. 2015;46(12):3351–63.
  4. Wang X, Wang X, Wilkes M. Developments in unsupervised outlier detection research. In: New developments in unsupervised outlier detection. Singapore: Springer; 2021. p. 13–36.
  5. Zimek A, Filzmoser P. There and back again: outlier detection between statistical reasoning and data mining algorithms. Wiley Interdiscip Rev Data Min Knowl Discov. 2018;8(6):e1280.
  6. Chandola V, Banerjee A, Kumar V. Anomaly detection: a survey. ACM Comput Surv. 2009;41(3):1–58.
  7. Angelin B, Geetha A. Outlier detection using clustering techniques: K-means and K-median. In: Proceedings of the international conference on intelligent computing and control systems (ICICCS); 2020. p. 373–8.
  8. Bergman L, Hoshen Y. Classification-based anomaly detection for general data. arXiv; 2020.
  9. Wahid A, Annavarapu CSR. NaNOD: a natural neighbour-based outlier detection algorithm. Neural Comput Appl. 2020;33:2107–23.
  10. Domański PD. Study on statistical outlier detection and labelling. Int J Autom Comput. 2020;17:788–811.
  11. Dong Y, Hopkins SB, Li J. Quantum entropy scoring for fast robust mean estimation and improved outlier detection. arXiv; 2019.
  12. Shetta O, Niranjan M. Robust subspace methods for outlier detection in genomic data circumvents the curse of dimensionality. R Soc Open Sci. 2020;7(2):190714.
  13. Li P, Niggemann O. Non-convex hull based anomaly detection in CPPS. Eng Appl Artif Intell. 2020;87:103301.
  14. Borghesi A, Bartolini A, Lombardi M, Milano M, Benini L. Anomaly detection using autoencoders in high performance computing systems. CEUR Workshop Proc. 2019;2495:24–32.
  15. Knorr E, Ng R. A unified notion of outliers: properties and computation. In: Proceedings of the 3rd ACM international conference on knowledge discovery and data mining (KDD), Newport Beach; 1997. p. 219–22.
  16. Knorr E, Ng R. Algorithms for mining distance-based outliers in large datasets. In: Proceedings of the 24th international conference on very large data bases (VLDB), New York; 1998. p. 392–403.
  17. Wu G, et al. A fast kNN-based approach for time sensitive anomaly detection over data streams. In: International conference on computational science; 2019. p. 59–74.
  18. Zhu R, et al. KNN-based approximate outlier detection algorithm over IoT streaming data. IEEE Access. 2020;8:42749–59.
  19. Ramaswamy S, Rastogi R, Shim K. Efficient algorithms for mining outliers from large data sets. In: Proceedings of the ACM international conference on management of data (SIGMOD), Dallas; 2000. p. 427–38.
  20. Angiulli F, Pizzuti C. Outlier mining in large high-dimensional data sets. IEEE Trans Knowl Data Eng. 2005;17(2):203–15.
  21. Breunig M, Kriegel H, Ng R, Sander J. LOF: identifying density-based local outliers. In: Proceedings of the ACM international conference on management of data (SIGMOD), Dallas; 2000. p. 93–104.
  22. Tukey JW. Exploratory data analysis. Addison-Wesley Series in Behavioral Science; 1977.
  23. Kimber AC. Exploratory data analysis for possibly censored data from skewed distributions. Appl Stat. 1990;39:21–30.
  24. Aucremanne L, Brys G, Hubert M, Rousseeuw PJ, Struyf A. A study of Belgian inflation, relative prices and nominal rigidities using new robust measures of skewness and tail weight. In: Theory and applications of recent robust methods. Basel: Birkhäuser; 2004. p. 13–25.
  25. Schwertman NC, Owens MA, Adnan R. A simple more general boxplot method for identifying outliers. Comput Stat Data Anal. 2004;47:165–74.
  26. Hubert M, Vandervieren E. An adjusted boxplot for skewed distributions. Comput Stat Data Anal. 2008;52(12):5186–201.
  27. Belhaouari SB, Ahmed S, Mansour S. Optimized K-means algorithm. Math Probl Eng. 2014;2014.
  28. Normal distribution. In: Gale encyclopedia of psychology. Encyclopedia.com. https://www.encyclopedia.com/social-sciences/applied-and-social-sciences-magazines/distribution-normal.
  29. Casella G, Berger RL. Statistical inference. 2nd ed. Duxbury; 2001. ISBN 978-0-534-24312-8.
  30. Campos GO, et al. On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study. Data Min Knowl Discov. 2016;30:891–927.
  31. Angiulli F, Pizzuti C. Fast outlier detection in high dimensional spaces. In: Proceedings of the 6th European conference on principles of data mining and knowledge discovery (PKDD), Helsinki; 2002. p. 15–26.
  32. Hautamäki V, Kärkkäinen I, Fränti P. Outlier detection using k-nearest neighbor graph. In: Proceedings of the 17th international conference on pattern recognition (ICPR), Cambridge; 2004. p. 430–3.
  33. Schubert E, Zimek A, Kriegel H. Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection. Data Min Knowl Discov. 2014;28(1):190–237.
  34. Tang J, Chen Z, Fu A, Cheung D. Enhancing effectiveness of outlier detections for low density patterns. In: Proceedings of the 6th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), Taipei; 2002. p. 535–48.
  35. Jin W, Tung A, Han J, Wang W. Ranking outliers using symmetric neighborhood relationship. In: Proceedings of the 10th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), Singapore; 2006. p. 577–93.
  36. Kriegel H, Kröger P, Schubert E, Zimek A. LoOP: local outlier probabilities. In: Proceedings of the 18th ACM conference on information and knowledge management (CIKM), Hong Kong; 2009. p. 1649–52.
  37. Zhang K, Hutter M, Jin H. A new local distance-based outlier detection approach for scattered real-world data. In: Proceedings of the 13th Pacific-Asia conference on knowledge discovery and data mining (PAKDD), Bangkok; 2009. p. 813–22.
  38. Latecki L, Lazarevic A, Pokrajac D. Outlier detection with kernel density functions. In: Proceedings of the 5th international conference on machine learning and data mining in pattern recognition (MLDM), Leipzig; 2007. p. 61–75.
  39. Schubert E, Zimek A, Kriegel H. Generalized outlier detection with flexible kernel density estimates. In: Proceedings of the 14th SIAM international conference on data mining (SDM), Philadelphia; 2014. p. 542–50.
  40. Liu Y, et al. Generative adversarial active learning for unsupervised outlier detection. IEEE Trans Knowl Data Eng. 2020;32(8):1517–28.
  41. Abe N, Zadrozny B, Langford J. Outlier detection by active learning. In: Proceedings of the ACM SIGKDD international conference on knowledge discovery and data mining; 2006. p. 504–9.
  42. Yang X, Latecki LJ, Pokrajac D. Outlier detection with globally optimal exemplar-based GMM. In: Proceedings of the 9th SIAM international conference on data mining; 2009. p. 144–53.
  43. Cohen G, Sax H, Geissbuhler A. Novelty detection using one-class Parzen density estimator. An application to surveillance of nosocomial infections. Stud Health Technol Inform. 2008;136:21–6.
  44. Platt JC, Shawe-Taylor J, Smola AJ, Williamson RC, Schölkopf B. Estimating the support of a high-dimensional distribution. Neural Comput. 2001;13(7):1443–71.
  45. Kriegel H, Schubert M, Zimek A. Angle-based outlier detection in high-dimensional data. In: Proceedings of the 14th ACM international conference on knowledge discovery and data mining (SIGKDD), Las Vegas; 2008. p. 444–52.


Acknowledgements

The authors would like to thank the Qatar National Library (QNL) for supporting the publication charges of this article.

Funding

Open access funding provided by the Qatar National Library.

Author information

Affiliations

Authors

Contributions

Both the authors have equally contributed in this work. All authors read and approved the final manuscript.

Authors’ information

Atiq Ur Rehman received the master’s degree in computer engineering from the National University of Sciences and Technology (NUST), Pakistan, in 2013, and the PhD degree in computer science and engineering from Hamad Bin Khalifa University, Qatar, in 2019. He is currently a postdoctoral researcher with the College of Science and Engineering, Hamad Bin Khalifa University, Qatar. His research interests include the development of pattern recognition and machine learning algorithms.

Samir Brahim Belhaouari received the master’s degree in telecommunications from the National Polytechnic Institute (ENSEEIHT) of Toulouse, France, in 2000, and the PhD degree in applied mathematics from the Federal Polytechnic School of Lausanne (EPFL), in 2006. He is currently an associate professor in the Division of Information and Communication Technologies, College of Science and Engineering, HBKU. He has also held several academic and administrative positions, including Vice Dean for Academic & Student Affairs at the College of Science and General Studies at Alfaisal University (KSA), as well as positions at the University of Sharjah (UAE), Innopolis University (Russia), Petronas University (Malaysia), and EPFL (Switzerland). His main research interests include stochastic processes, machine learning, and number theory. He is actively developing machine learning algorithms applied to visual surveillance and biomedical data, with the support of several international research funds in Russia, Malaysia, and the GCC.

Corresponding author

Correspondence to Atiq ur Rehman.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

ur Rehman, A., Belhaouari, S.B. Unsupervised outlier detection in multidimensional data. J Big Data 8, 80 (2021). https://doi.org/10.1186/s40537-021-00469-z


Keywords

  • Anomaly/outliers detection
  • Advanced statistical methods
  • Computationally inexpensive methods
  • High dimensional data