This research builds primarily on the previous work by Megantara and Ahmad [28], which focuses on the feature selection process. That study proposed a feature importance ranking based on the decision tree method for selecting features, together with a wrapper method using recursive feature elimination to score the resulting feature subsets. In this research, we propose a scheme called the Hybrid Machine Learning Method, which combines a feature selection process with a data reduction process. The proposed method is described in detail as follows (see Fig. 1).

In the feature selection process, the importance value of each feature with respect to the dataset label is calculated. We introduce a threshold mechanism that first removes features with zero importance and then splits the remaining features into high- and low-importance groups using the median importance value.

In the data reduction process, we use the local outlier factor to detect outliers among the data points. A normal (Gaussian) distribution is then used to configure the cut-off value for the anomaly score.
Feature importance ranking
The purpose of the feature selection process is to keep only the features that are significantly relevant to the label decision. Feature selection methods fall into three categories: filter-based, wrapper-based, and embedded-based. A filter-based method selects features by using statistical measures or algorithms to calculate each feature's importance value. In contrast, a wrapper-based method chooses the dataset's features by searching for the best-performing feature subset. In this research, we explore an embedded-based method, which combines the filter-based and wrapper-based approaches. The dataset's significant features are identified from the probability of the data for each feature reaching a decision node of the decision-tree-based method [29]. Eq. (1) [29] is applied to calculate each node's importance value for every feature in the dataset. In this formula, \({ni}_{j}\) is the importance value of node j, \({w}_{j}\) is the weighted number of instances in node j, \({C}_{j}\) is the impurity value of node j, and \(left(j)\) and \(right(j)\) are the child nodes of node j.
$${ni}_{j}= {w}_{j}{C}_{j}-{w}_{left(j)}{C}_{left(j)}-{w}_{right(j)}{C}_{right(j)}$$
(1)
Once the importance value of each node has been generated, the importance value of each feature can be calculated using Eq. (2) [29]. Here, \({fi}_{i}\) is the importance value of feature i, and \({ni}_{j}\) is the importance value of node j.
$${fi}_{i}= \frac{{\sum}_{j:\,node\, j\, splits\, on\, feature\, i}{ni}_{j}}{{\sum}_{k\in all\, nodes}{ni}_{k}}$$
(2)
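As an illustration of Eqs. (1)–(2), the sketch below uses scikit-learn's `DecisionTreeClassifier`, which computes these impurity-based node and feature importances internally and exposes the normalized result as `feature_importances_`. The synthetic dataset is purely illustrative, not from the paper's experiments.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends only on features 0 and 1,
# so features 2 and 3 should receive low importance values.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# feature_importances_ is fi_i from Eq. (2): per-feature sums of node
# importances (Eq. (1)), normalized so they add up to 1.
ranking = sorted(enumerate(tree.feature_importances_),
                 key=lambda p: p[1], reverse=True)
for idx, fi in ranking:
    print(f"feature {idx}: importance {fi:.3f}")
```

Sorting the importances, as above, gives the feature importance ranking that the threshold mechanism in Fig. 2 operates on.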
After the importance value of each feature is obtained, a threshold separating the selected features from the discarded ones is configured. For this purpose, we propose the mechanism given in Fig. 2.
In previous research, especially for most filter-based feature selection methods, the threshold value is configured manually, for example by testing every possible candidate value. Such methods still suffer from several problems, mainly because the threshold is user-dependent. In this research, we therefore present a mechanism to solve these problems. First, the importance values of the features are sorted from highest to lowest. Features with a zero importance value are irrelevant to the label decision, so they are removed first. Second, the remaining features are divided into two groups using the median importance value as the threshold. The median can be calculated using Eq. (3) for an odd and Eq. (4) for an even number of data points.
$$Me={X}_{\frac{(n+1)}{2}}$$
(3)
$$Me=\frac{{X}_{\frac{n}{2}}+{X}_{\frac{n}{2}+1}}{2}$$
(4)
Here, \({X}_{i}\) is the i-th value of the sorted data, and \(n\) is the total number of data points.
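The threshold mechanism above can be sketched in a few lines of Python. The importance values and feature names here are illustrative placeholders; `np.median` implements Eqs. (3)–(4) for the odd and even cases.

```python
import numpy as np

# Illustrative per-feature importance values (e.g. from a decision tree).
importances = {"f1": 0.42, "f2": 0.25, "f3": 0.18, "f4": 0.10,
               "f5": 0.05, "f6": 0.0}

# Step 1: remove features with zero importance (irrelevant to the label).
nonzero = {f: v for f, v in importances.items() if v > 0}

# Step 2: split the rest at the median importance (Eqs. (3)/(4)).
threshold = np.median(list(nonzero.values()))
high = [f for f, v in nonzero.items() if v >= threshold]
low = [f for f, v in nonzero.items() if v < threshold]

print("median threshold:", threshold)
print("high-importance features:", high)
print("low-importance features:", low)
```

For the five remaining values the median is 0.18, so `f1`, `f2`, and `f3` land in the high-importance group without any user-chosen cut-off.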
Local Outlier Factor
The Local Outlier Factor (LOF) calculates a local outlier score for each data point. The LOF process produces a score known as the anomaly score, which represents how far each data point lies from its neighbors. The higher a data point's anomaly score, the farther it is from the other data, and such points are called anomalies [30].
The LOF method consists of four steps: determining the k-value to initialize the cluster size, calculating the reachability distance for each data point, calculating the local reachability distance value, and generating the LOF score (anomaly score) for each data point. The LOF method is illustrated in Fig. 3. The k-value initializes the size of a cluster and affects which data are flagged as anomalies.
The reachability distance (RD) between two data points is the larger of their actual distance and the k-distance of the neighboring point, i.e., the distance from that point to its k-th nearest neighbor. Its function is to define a perimeter area around each data point within which the other nearby data points are counted. If many other data points lie inside the perimeter area, the data point is not an outlier. This value can be calculated using Eq. (5) [30], where \(RD\) is the reachability distance, \({X}_{A}\) is a data point, \({X}_{A{\prime}}\) is a neighboring data point, and \(K\text{-}\mathrm{distance}({X}_{A{\prime}})\) is the distance from \({X}_{A{\prime}}\) to its k-th nearest neighbor.
$$\mathrm{RD}\left({X}_{A},{X}_{A{\prime}}\right)=\mathrm{max}(K\text{-}\mathrm{distance}\left({X}_{A{\prime}}\right),\mathrm{distance}({X}_{A},{X}_{A{\prime}}))$$
(5)
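Eq. (5) can be made concrete with a small numpy sketch on a toy one-dimensional dataset (the values and k are illustrative): the k-distance of a point is the distance to its k-th nearest neighbor, and the reachability distance takes the maximum of that and the actual pairwise distance.

```python
import numpy as np

# Toy 1-D dataset; the last point is far from the others.
points = np.array([1.0, 2.0, 3.0, 10.0])
k = 2

def k_distance(i: int) -> float:
    """Distance from points[i] to its k-th nearest neighbor."""
    d = np.sort(np.abs(points - points[i]))
    return float(d[k])  # index 0 is the point itself (distance 0)

def reachability_distance(a: int, b: int) -> float:
    """Eq. (5): RD(X_A, X_A') = max(k-distance(X_A'), distance(X_A, X_A'))."""
    return max(k_distance(b), abs(points[a] - points[b]))

# The outlying point keeps a large RD to its neighbor,
# while the clustered points' RDs stay small.
print(reachability_distance(3, 2))  # max(2.0, 7.0) = 7.0
print(reachability_distance(0, 1))  # max(1.0, 1.0) = 1.0
```

Taking the maximum with the k-distance smooths out small distance fluctuations inside dense clusters, which stabilizes the LRD ratio computed next.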
After the RD value is obtained, the Local Reachability Distance (LRD) value is calculated using Eq. (6) [30] to determine the average distance ratio over the nearest neighbors inside the cluster, where \({LRD}_{k}(A)\) is the local reachability distance of data point A and \({N}_{k}(A)\) is its set of k nearest neighbors.
$${LRD}_{k}\left(A\right)= \frac{1}{{\sum }_{{X}_{j} \in {N}_{k}(A)}\frac{RD(A, {X}_{j})}{\left|{N}_{k}(A)\right|}}$$
(6)
The LRD value is then used to calculate the anomaly score (LOF score) for each data point. The LOF score is the ratio between the average LRD of a data point's neighbors and its own LRD, and it compares the distance ratio of each data point with those of the other data points. The anomaly score is calculated with Eq. (7) [30], where \({LOF}_{k}(A)\) is the LOF score of data point A, \({LRD}_{k}\) is the local reachability distance, and \({N}_{k}(A)\) is the set of k nearest neighbors.
$${LOF}_{k}\left(A\right)= \frac{{\sum }_{{X}_{j}\in {N}_{k}(A)}{LRD}_{k}({X}_{j})}{\left|{N}_{k}(A)\right|}\cdot \frac{1}{{LRD}_{k}\left(A\right)}$$
(7)
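Steps (5)–(7) are implemented by scikit-learn's `LocalOutlierFactor`, whose `n_neighbors` parameter plays the role of the k-value. The sketch below, on illustrative synthetic data, shows how the anomaly scores are obtained; note that scikit-learn reports the negated LOF score.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# 100 inliers drawn from a standard normal, plus one obvious outlier.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])

# n_neighbors is the k-value that initializes the cluster size.
lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X)

# negative_outlier_factor_ holds -LOF_k(A); flip the sign to get
# the anomaly score of Eq. (7) (values near 1 mean inlier).
scores = -lof.negative_outlier_factor_
print("outlier score:", round(float(scores[-1]), 2))
print("first inlier scores:", scores[:3].round(2))
```

Inliers score close to 1 (their LRD matches their neighbors'), while the isolated point receives a much larger ratio; these scores are what the Gaussian cut-off step below operates on.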
The generated LOF scores are then fitted to a standard normal (Gaussian) distribution to configure the cut-off value between normal and anomalous data. First, each LOF score is converted into a Z-score to determine how far it lies from the mean of the dataset. The conversion is performed with Eqs. (8)–(10).
$$m= \frac{1}{n}\sum_{i=1}^{n}{X}_{i}$$
(8)
$$std= \sqrt{\frac{\sum_{i=1}^{n}{({X}_{i}-m)}^{2}}{n}}$$
(9)
$$Z= \frac{(X-m)}{std}$$
(10)
Equations (8)–(10) convert the LOF score (anomaly score) into the Z-score, where \(m\) is the mean value, \(std\) is the standard deviation, \(Z\) is the Z-score, \(n\) is the total number of data points in the dataset, and \(X\) is the LOF score.
After the Z-score of each data point is generated, the scores are mapped onto the standard normal distribution, whose peak lies at the mean of the data. A multiple of the standard deviation is then used as the cut-off value for generating the data subsets. A cut-off of 3 standard deviations means that 99.73% of the data are kept and the rest removed; a cut-off of 2 standard deviations keeps 95.45% of the data; and a cut-off of 1 standard deviation keeps 68.27%. These cut-off values describe how far each LOF score (anomaly score) may deviate from the mean before the data point is discarded.
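The Z-score conversion and cut-off step can be sketched as follows. The `scores` array stands in for LOF scores and is synthetic here; with real LOF output the same three lines of Eqs. (8)–(10) apply unchanged.

```python
import numpy as np

# Synthetic stand-in for LOF/anomaly scores (illustrative only).
rng = np.random.default_rng(0)
scores = rng.normal(loc=1.0, scale=0.1, size=1000)

m = scores.mean()            # Eq. (8): mean
std = scores.std()           # Eq. (9): population standard deviation
z = (scores - m) / std       # Eq. (10): Z-score of each LOF score

cutoff = 3                   # keep data within 3 std (~99.73%)
kept = scores[np.abs(z) <= cutoff]
print(f"kept {kept.size} of {scores.size} data points")
```

Tightening `cutoff` to 2 or 1 removes progressively more of the distribution's tails, trading dataset size for a stricter notion of "normal" data.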