Table 1 Assumptions, advantages, and disadvantages of different outlier detection approaches

From: Unsupervised outlier detection in multidimensional data

Method

Assumption(s)

Advantages

Disadvantages

Classification

Assumption(s): A classifier that can distinguish between the inlier and outlier classes can be learned in the given feature space

Advantages:

1. Multi-class approaches can make use of powerful algorithms to discriminate between instances belonging to different classes

2. The testing phase is fast, since each test instance only needs to be compared against the pre-computed model

Disadvantages:

1. Depend on the availability of accurate labels, which are often unavailable

Nearest neighbor

Assumption(s): Inliers occur in dense neighborhoods, while outliers occur far from their closest neighbors

Advantages:

1. Unsupervised in nature and purely data driven

2. Adapting them to a different data type is straightforward and primarily requires a suitable distance measure for the given data

Disadvantages:

1. The performance of these methods degrades if the assumption does not hold

2. The computational complexity of these algorithms is a significant challenge

3. Performance depends heavily on the distance measure used

Clustering

Assumption(s):

1. Outliers do not belong to any cluster, while inliers belong to a cluster in the data

2. Outliers are far from their closest cluster centroid, while inliers lie close to their closest cluster centroid

3. Outliers belong to either small or sparse clusters, while inliers belong to large and dense clusters

Advantages:

1. Ability to operate in an unsupervised mode

2. Ability to adapt to other complex data types by simply using a clustering algorithm that can handle the particular data type

3. The testing phase is fast, since the number of clusters against which every test instance needs to be compared is a small constant

Disadvantages:

1. Highly dependent on the effectiveness of the clustering algorithm in capturing the cluster structure of the inlier data points

2. Many techniques detect anomalies as a by-product of clustering, and hence are not optimized for anomaly detection

3. Several clustering-based techniques are effective only when the outliers do not form significant clusters among themselves

Statistical

Assumption(s): Outliers occur in the low-probability regions of a stochastic model, while inliers occur in its high-probability regions

Advantages:

1. Provide a statistically justifiable solution if the assumptions regarding the underlying data distribution hold true

2. The anomaly score is associated with a confidence interval, which can be used as additional information when making a decision about any test instance

3. If the distribution estimation step is robust to outliers in the data, these techniques can operate in an unsupervised setting

Disadvantages:

1. Rely on the assumption that the data is generated from a particular distribution, which is often not true, especially for high-dimensional real data sets

2. Constructing hypothesis tests for the complex distributions required to fit high-dimensional data sets is nontrivial

Information theoretic

Assumption(s): Outliers in data induce irregularities in the information content of the data set

Advantages:

1. Ability to operate in an unsupervised setting

2. They do not make any assumptions about the underlying statistical distribution of the data

Disadvantages:

1. Sensitive to the choice of information-theoretic measure. Often, these measures can detect outliers only when a significantly large number of them are present in the data

2. When applied to spatial and sequential data sets, these techniques rely on the size of the substructure, which is often nontrivial to obtain

3. It is difficult to associate an outlier score with a test instance

Spectral

Assumption(s): Data can be transformed into a lower-dimensional subspace, where the inliers and the outliers appear significantly different

Advantages:

1. Automatically perform dimensionality reduction and are therefore suitable for high-dimensional data sets

2. Can be used in an unsupervised setting

Disadvantages:

1. Useful only if the inliers and outliers are separable in the lower-dimensional transformation of the data

2. Typically have high computational complexity
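
As an illustration of the nearest neighbor row of the table, the following is a minimal Python sketch, not a method taken from the source. It assumes that each point is scored by the Euclidean distance to its k-th nearest neighbor; the function name `knn_outlier_scores`, the value of `k`, and the toy data are all hypothetical choices made for this example.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor.

    A larger score means the point lies farther from its closest neighbors,
    which is the nearest neighbor assumption for outliers in the table.
    """
    scores = []
    for i, p in enumerate(points):
        # Brute force: distances from p to every other point, so the whole
        # loop needs a number of distance computations quadratic in len(points).
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical toy data: a tight group near the origin plus one distant point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
scores = knn_outlier_scores(points, k=2)

# Index of the point with the largest score, i.e. the most outlying one.
most_outlying = max(range(len(points)), key=scores.__getitem__)
print(most_outlying, [round(s, 3) for s in scores])
```

The sketch also makes two of the listed disadvantages visible: the brute-force loop compares every pair of points, and replacing `math.dist` with another distance measure can change which points receive the highest scores.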