Unsupervised outlier detection in multidimensional data

,

cluster, and the observations that are not assigned to any cluster are identified as outliers. The nearest neighbor techniques are mostly based on a calculation of the distance or similarity measure between the observation and its neighboring observations. Suppose the calculation is greater than a certain threshold, that means that the observation lies far apart from the rest of the observations and is considered as an outlier. Statistical methods usually fit a statistical distribution (mostly normal distribution) to the data and conduct a statistical inference test to see if the observation belongs to the same distribution or not. If not, the observation is marked as an outlier. Information-theoretic techniques use different information theoretic measures for example entropy, relative entropy, etc., to analyze the information content of the data. These techniques are based on an assumption that the outliers or anomalies in the data induce irregularities in the information content. Spectral methods transform the data to a new dimensional space such that the outliers are easily identified and separated from the data in the new space. Furthermore, some outlier detection techniques are also based on geometric methods [13] and neural networks [14].
All the techniques mentioned above are based on some assumptions and all the techniques have some pros and cons as described in Table 1. The ideas proposed in this work are based on the novel statistical methods considering data properties like compactness. Where, the compactness of data is estimated either by the interactions between different kernel function tails, new and adapted kernel functions are proposed, or by the variance of independent gaussian distributions over different regions that are captured as a new clustering idea. Moreover, contrary to the existing approaches, statistical methods are modeled based on the transformation of data into a unidimensional distance space. The newly proposed methods are based on boxplot adjustment, kernel based probability density estimation and neighborhood information. The aim here is to utilize the power of some statistical methods and to enhance the performance of the outlier detection algorithms in an unsupervised way, while keeping their implementation easy and computationally efficient. The proposed methods are evaluated using both the synthetic and the real datasets and are found better in scenarios where the traditional approaches fail to perform well. Especially the cases where the data is contaminated with a mixture of different noise distributions.
The rest of the paper is organized as follows: "Background" section, "Proposed methods" section, Evaluation on synthetic datasets "Evaluation using synthetic examples" section, Evaluation on real data "Evaluation of boxplot adjustments using a real example" section, "Comparison with State-of-Art" section and "Conclusions" section.

Background
After the initial pivotal works for outlier detection based on the distance measure [15,16], several new methods based on the distance measure were also proposed by different authors in literature [17,18]. The difference between the latterly proposed methods and the previous studies is the use of nearest neighbors in distance calculation. Among the variants of the actual work, either a single distance based on the kth closest neighbor is calculated [19] or the aggregate of the distance of k closets points is calculated [20]. Among other unsupervised outlier detection algorithms are the local approaches originated from the concept of Local Outlier Factor [21]. ur Rehman and Belhaouari J Big Data (2021) 8:80  Furthermore, boxplot outlier detection scheme is also one of the fundamental unsupervised approach and the concept of univariate boxplot analysis was first proposed by Tukey et. al. [22]. In a univariate boxplot, there are five parameters specified as: (i) the upper extreme bound (UE), (ii) the lower extreme bound (LE), (iii) the upper quartile Q3 (75th percentile), (iv) the lower quartile Q1 (25th percentile) and (v) the median Q2 (50th percentile). The best way to estimate the extreme boundaries is to estimate Probability Density Function (PDF), f (x) , at first step from where the boundaries will be defined, as follows: where τ is the significance level, the region of suspected outliers is defined for τ = 0.05 and the region of extremely suspected outliers is defined for τ = 0.01 . The Eq. (1) estimates well the boundaries only if the distribution is unimodal, i.e., a distribution that has single peak or at most one frequent value.
However, in a standard boxplot the UE and LE values are computed and well estimated only under the assumption that the PDF is symmetric, as: where the term IQR is defined as the Inter Quartile Range and is given by: A common practice to identify the outliers in a dataset using a boxplot is to mark the points that lie outside the extreme values, that is, the points greater than UE and less than LE are identified as outliers. This version of outlier detection scheme works well for the symmetric data. However, for skewed data different other schemes are proposed in the literature. For example, different authors have used the semi-interquartile range i.e. Q3 − Q2 and Q2 − Q1 to define the extreme values as: where c 1 and c 2 are the constants and different authors have adjusted their values differently for example, c 1 = c 2 = 1.5 [23] c 1 = c 2 = 3 [24] or calculation based on the expected values of the quartiles [25] and few more adjustments to the boxplot for outliers detection are also available, for example [26].
The traditional methods of boxplot for detecting the outliers sometimes fails in situations where the noise in the data is a mixture of distributions, multimodal distribution, or in the presence of small outlier clusters. In this paper, some novel statistical schemes based on (i) the boxplot adjustments, and (ii) a new probability density estimation using k-nearest neighbors' distance vector are proposed to overcome the problem faced by traditional methods. These proposed methods are described in detail in the next section. (1) . ur Rehman and Belhaouari J Big Data (2021) 8:80 Proposed methods

Boxplot adjustments using D-k-NN (BADk)
Traditional boxplot identifies the outliers from unique dimensions. One useful idea other than using all the dimensions for identifying outliers is to transform the data into a unidimensional distance space and to identify the outliers in new space. This can simply be done by measuring the distance between data points considering all the dimensions and calculating the resulting distance vector. The idea of using a single dimension distance vector is useful not only for avoiding problem of sorting data in high dimension but also in terms of computational cost, and can be further enhanced in terms of performance by extending it to consider k number of neighbors in the distance calculation. This idea of boxplot adjustment based on the Distance vector considering k number of Nearest Neighbors (D-k-NN) is presented here and the resulting extreme values estimation from the modified boxplot are found to be quite useful in identifying the right outliers. Furthermore, the proposed scheme is useful in the cases where the distribution of the noise is not normal or is a mixture of different distributions, and can identify small outlier clusters in the data. Suppose a dataset in R N , this dataset is transformed from N dimensional space to a unidimensional distance space by using a distance metric such as 'Euclidian distance' . This is done by computing the distance of each observation in N dimensional space to its kth closest neighbor. This transformation results in a set that contains the distance of each observation to its kth closest neighbor, and the resulting set is represented as d k ∈ R . This transformation can be represented as: The set d k is used for computing the extreme value of the boxplot as follows: From the extreme values defined in (6), the outliers are identified as points those lie outside the boundaries of LE d k and UE d k . The two constant values c 1 and c 2 are adjustable with respect to the dataset under consideration and selection of smaller values for these constants results in more points being marked as outliers. The suggested values of these constants by different authors in the literature are c 1 = c 2 = 1.5 or c 1 = c 2 = 3.
Furthermore, another useful idea to identify the outliers in a data is to adjust the UE and LE values of a boxplot as follows: . ur Rehman and Belhaouari J Big Data (2021) 8:80 where var is defined as the variance and the quartiles are computed from the set d k ∈ R .
The extreme values can also be estimated based on the calculation of the separation threshold between centers of two variances, see Eq. (9), as: where M is a value that separate the one-dimension region in order to calculate the variance of two centers. Let xǫR be any random variable with PDF f (x) ; and the values of µ 1 and µ 2 are calculated such that: Both, Var 1 and Var 2 can be partially differentiated with respect to µ 1 and µ 2 respectively, to find the minimum. After simplification the minimization occurs when: where X − = X.1 X<M and X + = X.1 X≥M. and the value of M is calculated as: For further details on the idea proposed in (8)- (11), the readers are referred to [27]. Detecting outliers based on Boxplot is efficient only if the data is unimodal distribution. To overcome the drawbacks of the boxplot estimation, some other statistical methods based on the probability density estimation computed from either the set d k ∈ R or the actual data D ∈ R N are also proposed for outlier's detection, which are discussed below.

Joint probability density estimation using D-k-NN
The methods proposed in this section compute the set d k from the actual data and utilize it for estimating some parameters of the joint distribution function. Three different schemes are proposed here which are described as follows: Scheme 1: Normal distributions are often used for representing the real value random variables with unknown distributions [28,29]. The joint probability density function of independent and identically normal distribution is given as: where ζ is the standard deviation modeled differently in (15) and (17), µ is the mean of the random variable and N is the dimension of the data. Here, some functions based on the normal distribution to identify the outliers in a dataset are proposed. Suppose a twodimensional dataset D(x, y) , we can define a separation threshold T based on the normal distribution for detecting the outliers such that: where Z is joint probability distribution function after normalization, n is the total number of observations and the function f x, y can be defined as: The σ in Eq. (14) can be computed as: where Q3 d k is the third quartile computed from the set d k as defined in Eq. (5) and β is a constant value. The points below the threshold value T defined in Eq. (13) are considered as outliers and the points above T are considered normal inlier data points. The α used in Eq. (13) is the significance value and it can be used to control the percentage amount of data to be removed as outliers. Scheme 2: To better detect the outliers, a better function f x, y needs to be constructed in order to weaken the position of the outlier in terms of support and amplitude of the function. Furthermore, another scenario can be defined to detect the outliers based on the threshold defined in (13) by using the below function: where k defines the kth closest neighbor for the distance metric and γ is a constant whose value can be adjusted to control the smoothness of the gaussian distribution. The concept is demonstrated in Fig. 1, where (a) shows the effect of traditional gaussian approach on compression and (b) shows the effect of proposed scheme 2 on compression.
Both of the above schemes proposed in this section are based on a single gaussian distribution and are expected to work well for the datasets which can be well approximated using a single gaussian distribution. However, if a dataset can be better approximated using multiple gaussians then a better idea is to use a model based on the variable number of gaussians. A new and robust estimation of multiple gaussian distribution is proposed in the next subsection.
(1 + d k ) 2 ur Rehman and Belhaouari J Big Data (2021) 8:80 Scheme 3: The scenarios where the data is estimated using a gaussian distribution, the outliers are identified as the points lying on the extreme tails of the gaussian distribution, as shown in Fig. 2a. However, if the better estimation of underlying data is possible through multiple gaussians, the outliers located at the connecting points of different gaussians might remain unidentified using a single gaussian estimation. In order to identify the outliers existing at the connecting points of the multiple gaussians, an idea based on multiple gaussian estimation is proposed, where a Rejection Area (RA) is defined and computed as: where Cv is defined as a critical value or a threshold value below which is the rejection area or where the outliers are identified, and τ is the significance level. The concept is shown in Fig. 2b, where as an example a single dimensional data is estimated using two gaussians and the outliers can be identified as the points below Cv.
In order to find the optimum number of gaussians that better approximate the joint probability distribution for a given dataset the sorted values of the vector d k can be where R j represents the jth region, m is the total number of estimated gaussians and n i is the total number of elements in the respective region. The combined multiple gaussians model is then estimated by: where In order to determine the regions, lets define an application S U that sorts any given sequence U i , i = 1, . . . , N such that U S U (1) ≤ U S U (2) ≤ · · · ≤ U S U (N ) . For any given data − → X , the sorted data can be represented as − → X S U (i) and suppose that −→ �X S U (i) represents the difference between two consecutive elements of − → X S U (i) . Similarly, the sorted difference can be represented as −→ �X S �X (S U (i)) . In order to define the regions, the elements are grouped together sequentially until�X S �X (S U (i)) ≤ −→ �X S �X (S U (N −m+1)) , once this condition is not true, start grouping the remaining elements as a new region until all the elements are assigned to a region.

Evaluation using synthetic examples
The ability of proposed methods is demonstrated here by the use of some two-dimensional synthetic datasets. The results for each of the proposed method are discussed in the following subsections. Figure 4a shows an example of the data used for evaluating the proposed methods to detect the outliers. From the data shown in Fig. 4a, it can be seen that the actual data is composed of different clusters with different shapes and is contaminated by a mixture of sinusoidal and gaussian distribution of noise. The aim here is to detect the noise as outliers and different shape clusters as inliers. The traditional boxplot is used to detect the outliers from the data and the resulting boxplot is shown in Fig. 4b. It can be seen from the boxplot in Fig. 4b that the traditional boxplot is unable to identify any outliers in the data.

Evaluation of boxplot adjustments using D-k-NN (BADk)
The same dataset is used to evaluate the proposed adjusted boxplot with extreme values define in Eq. (6) and the results are shown in Fig. 5. Different values of k are used to see how it effects the outcome in identifying the outliers. It can be observed from the results shown in Fig. 5 that for the smaller values of k only the gaussian noise is identified and while we keep on increasing the value of k the outliers with sinusoidal distribution are also identified. However, after a certain value of k the data points from the actual clusters (inliers) are also marked as the outliers, while the outliers started to reappear as the inliers. This shows that although the selection of value of k is flexible in this case, still an optimum value of k has to be selected for the optimum performance based on the data. Another example shown in Fig. 6a with a different distribution of noise is also tested for evaluating the ability of the proposed method in (6) for outlier detection. The results for this example are shown in Fig. 7 using three different values of k . It can be seen from the results in Fig. 7 that the selection of value of k is very flexible and

Evaluation (joint probability density estimation) scheme 1
The density estimation refers to estimation of an unobservable Probability Density Function (PDF) associated with an observable data. The PDF gives an estimate of the density according to which a large population in a data is distributed. In this proposed method, the PDF is computed by placing a gaussian density function at each data point, and then summing the density functions over the range of data, and a threshold value α defines the margin between the inlier data and the outliers. The value of α is computed as a percentage amount of the maximum value of the PDF. The value of σ in Eq. (14) is computed utilizing the d k vector as defined in Eq. (15). The results for scheme 1 when evaluated using the same example data as shown in Fig. 4a are given in Fig. 8. The example is evaluated using only two different values of α and a fixed value of β . The outliers are shown in green color and the inlier data is shown in red color. Figure 8 also shows the associated 3D plots of the probability density estimations computed using Eq. (14). For α = 0.1 the proposed method is able to identify the outliers having gaussian distribution only while placing α = 0.3 the proposed method has identified both the gaussian and the sinusoidal outliers in the data. Figure 9 shows the

Evaluation (joint probability density estimation) scheme 2
The results for scheme 2 proposed in Eqs. (16)- (17) with different value of parameters are shown in Fig. 10. It can be observed from the results in Fig. 10 that the small value of γ produces sharp density distribution while a larger value of γ produced a smoother distribution. For a small value of γ the inlier data points are also identified as the outliers which is not the case with a comparatively larger value of γ for this particular dataset. However, optimum values for the parameters need to be tuned to get the optimum results using this scheme.

Evaluation (joint probability density estimation) scheme 3
The idea proposed in Eq. (18) based on the different ways of estimation of gaussians of the distance vector d k is evaluated on three different synthetic examples having different distribution of noise and the results are shown in Fig. 11. It can be seen from the visual results depicted in Fig. 11 that this scheme is successful in identifying the outliers of different distributions and even the noisy data that lies in close proximity to the inlier data. The value of G represents the number of gaussians estimated from the d k .

Evaluation of boxplot adjustments using a real example
The ideas proposed for boxplot adjustments in Eq. (6) and Eq. (7) are also evaluated on a real dataset. The dataset used is a subset of the original KDD Cup 1999 dataset from the UCI machine learning repository, the subset used is still a large data containing 95,156 observations and three attributes. The dataset is publicly available online. 1 The ground truth of the dataset used is shown in Fig. 12a, where the blue data points represent the actual inliers and the yellow points represent the actual outliers. The results for boxplot using extreme values defined in Eq. (6) are shown in Fig. 12b and the achieved value for Area Under Curve (AUC) evaluation parameter is 0.83 for this dataset. The results achieved for the proposed idea in (7) are shown in Fig. 13a, b, respectively for Eqs. 7a and 7b. The detected outliers are shown in green color and the inliers are shown in red color. The achieved value of AUC using both Eq. 7a and 7b is 0.833.
Initially, some of the fundamental outlier detection algorithms are compared with the proposed algorithms using the same three synthetic datasets. To test the unsupervised outlier detection methods, the most popular evaluation measure proposed in literature is based on the Receiver Operating Characteristics (ROC) and is computed as the Area Under Curve (AUC) [30]. The ROC AUC is computed for these three datasets using the proposed schemes and are compared with some of the fundamental state-of-art algorithms in Table 3. Hyperparameters of all the methods are tuned and the best results are reported. It can be seen from the results in Table 3 Fig. 10 Outlier detection results using scheme 2 with different values of α, γ and k. Green points represent outliers and red points represent the normal data points ur Rehman and Belhaouari J Big Data (2021) 8:80 that the proposed schemes are performing better than the existing algorithms. As these three datasets are only two dimensional the visual comparison is also possible which is provided in Fig. 14. From the visual inspection it is clearer that the newly proposed methods are better than the existing ones in identifying the outliers lying in close proximity to the inliers. Although, all the proposed schemes out-performed the existing approaches when tested on two dimensional synthetic datasets, only BADk and scheme 3 are recommended for large scale datasets. This is because the computational complexity of scheme 1 and scheme 2 is relatively higher than the BADk and scheme 3, as given in Table 4. Therefore, we recommend scheme 1 and scheme 2 only for dataset with dimensions less than or equal to 3 and for small datasets. For large scale datasets we recommend using scheme 3 and BADk. However, with a compromise on the computational complexity, scheme 1 and scheme 2 have the ability to perform better in terms of ROC AUC on individual datasets. Although, the best running time complexity is achieved by LOF but the proposed methods are performing much better than LOF in terms of ROC AUC values on individual datasets. Furthermore, the proposed methods are evaluated and compared with eight existing approaches using five real benchmark datasets and the results are reported in Table 5. From the results in Table 5 it can be seen that the proposed schemes outperformed the existing approaches when tested on five real benchmark datasets. As, the proposed BADk and scheme 3 performed better than the proposed scheme 1 and scheme 2 when tested on synthetic datasets in terms of average ROC AUC value and computational time, therefore only these two methods are included for comparison with existing approaches using the real datasets.
In order to perform more comprehensive comparison, ten more benchmark datasets and twelve state-of-art methods reported in [30] are also used and are compared with  newly proposed unsupervised outlier detection methods. The results for these state-ofart methods and the newly proposed methods for the ten benchmark datasets are given in Table 6. It can be seen from the results in Table 6 that the newly proposed methods are clearly outperforming the existing algorithms in most of the cases. However, for two of the datasets (Hepatits and Parkinson) the proposed scheme 3 is performing equally well as compared to the LDF and both these methods are performing best as compared to the existing state-of-art approaches. For the Shuttle and the WPBC datasets the LDF approach is outperforming all other methods. However, the difference between the performance of LDF with proposed schemes on these two datasets is marginal, especially on WPBC. Furthermore, the two newly proposed methods BADk and scheme 3 make use of the distance vector for detection of outliers, so irrespective of the increasing  dimensions of the input data the computational complexity of these proposed algorithms remains low.
A visual comparison of proposed methods with the existing state-of-art methods is also provided in Fig. 15. From the results in Fig. 15, it is clearer that the newly proposed scheme 3 is outperforming rest of the methods in terms of AUC. Furthermore, the required computational cost for the proposed method is also low because of using the d k vector for outlier detection, instead of using the entire input data dimensions. As the proposed method is using only a single dimension distance vector of outlier detection, this makes it independent of the dimensions of the input data in terms of computational cost, which in turn makes it more feasible for high dimensional data.

Conclusions
Outlier detection is one of the most important preprocessing steps in data analytics, and for best performance consideration, it is considered a vital step for machine learning algorithms. Different methods are presented in this paper, keeping in view the need for a robust and easy-to-implement outlier detection algorithm. The newly proposed methods are based on novel statistical techniques considering data compactness, which resulted in an added advantage of easy implementation, improved accuracy, and low computational cost. Furthermore, to demonstrate the proposed ideas' performance, several benchmark multidimensional datasets and three complex synthetic two-dimensional datasets containing the different shapes of clusters contaminated with a mixture of varying noise distributions are used. The proposed methods are found accurate and better in terms of outlier detection as compared to the state-of-art. It is also an observation that some of the fundamental state-of-art methods cannot detect the outliers in scenarios where the outliers are a mixture of two different distributions. Moreover, two of the newly proposed schemes use only a single dimension distance-vector instead of utilizing the entire data dimensions for outlier detection. This makes the proposed methods more feasible and computationally inexpensive, irrespective of the input data's large sizes and growing dimensions.
Moreover, the evaluation of proposed unsupervised outlier detection methods on several benchmark real datasets reveal the usefulness of the proposed methods in detection of multivariate outliers in real datasets. The work can be extending by performing optimization on distance calculation method for the proposed scheme 3 and BADk. This will further enhance the computational complexity of these methods. Moreover, investigation of other distance metrics other than Euclidian can also be studied in future, as this metric suffers a lot for high dimensions.