A reconstruction error-based framework for label noise detection

Label noise is an important data quality issue that negatively impacts machine learning algorithms. For example, label noise has been shown to increase the number of instances required to train effective predictive models. It has also been shown to increase model complexity and decrease model interpretability. In addition, label noise can cause the classification results of a learner to be poor. In this paper, we detect label noise with three unsupervised learners, namely principal component analysis (PCA), independent component analysis (ICA), and autoencoders. We evaluate these three learners on a credit card fraud dataset using multiple noise levels, and then compare results to the traditional Tomek links noise filter. Our binary classification approach, which considers label noise instances as anomalies, uniquely uses reconstruction errors for noisy data in order to identify and filter label noise. For detecting noisy instances, we discovered that the autoencoder algorithm was the top performer (highest recall score of 0.90), while Tomek links performed the worst (highest recall score of 0.62).

is typically one class label, and this label often has a significant effect on classification performance [1,6].
There are several common issues that can cause label noise. It may stem from inadequate information provided to the expert that, in turn, results in inaccurate labeling [1,7]. Label noise may also be caused by the limited specificity of classification [7], which can decrease the true accuracy of the classification. Thirdly, label noise may be a consequence of the unreliable quality of information [8]. To alleviate these issues, various manual and/or automated services are used to assist with classification [1]. Amazon Mechanical Turk, made popular with its inexpensive and user-friendly labeling processes, provides one such service [9].
Label noise can also come from data encoding or network connectivity problems [7,10]. Real-world databases are believed to contain about 5% encoding errors in their data [6]. As a result of these errors, there is a certain amount of unavoidable label noise [11]. Unfortunately, this necessitates the inclusion of either additional features or an increased training data size to compensate for this inherent label noise [7].
Our work is grounded in the concept that label noise data points are likely to be anomalies in the class indicated by the corresponding label [8,12]. Stated otherwise, after a feature space is defined for a binary class, there will be some instances in an assigned class that belong to the other class due to label noise. Our contribution involves the novel approach of treating reconstruction error as a mapping technique in order to detect label noise. This error is the mean square error (MSE) between the input vector and its reconstructed output.
Our machine learning models are trained by minimizing reconstruction error. Three unsupervised learning algorithms were used in this study: principal component analysis (PCA), independent component analysis (ICA), and autoencoders. They have previously been used by other researchers to detect anomalies in datasets [8,10,13]. We compared these unsupervised methods to Tomek links, a traditional label noise filter technique for removing data capable of impairing classification performance [14,15].
In this study, our autoencoder model produced the best scores for all three metrics used: Area Under the Receiver Operating Characteristic Curve (AUC), recall, and False Negative Rate (FNR). Our results also showed that the Tomek links algorithm was the worst performer. For a given metric, it was generally observed that for each algorithm, the highest noise level yielded the best score, and the lowest noise level yielded the worst score.
The remainder of this paper is organized as follows: Section "Related Work" provides an overview of literature related to label noise detection; Section "Background" provides relevant information on all algorithms used in this study (PCA, ICA, autoencoders, Tomek links); Section "Methodology" describes the training and testing of the unsupervised learners, the levels of noise used, and the metrics used; Section "Results and Discussion" presents and discusses our empirical results; and Section "Conclusion" concludes our paper with a summary of the work presented, along with suggestions for related future work.

Related work
There are two main techniques that address label noise. The first approach filters noisy instances and cleans the data, while the second develops methods that are robust when facing label noise during the modeling process.
In the filtering approach, noisy instances are removed or relabeled to get a cleaner dataset [16]. One of the earlier works on noise filtering is edited nearest neighbor (ENN) by Wilson [17]. This algorithm removes an instance from a training set if it does not agree with the majority of its k-nearest neighbors (k-NNs). Editing of the instances is done using the three-nearest neighbor rule followed by classification using the single-nearest neighbor rule. Tomek et al. [14] later suggested an extension of ENN. In their approach, they apply the nearest neighbor rule k times. For each execution, the nearest neighbor rule changes the number of neighbors considered between 1 and k. An instance that is misclassified by the rule is removed. In another related study, Khoshgoftaar et al. [18] introduced iterative partitioning filtering. This approach eliminates noisy instances in multiple iterations until a stopping criterion is met. The iterative process ceases if, for a number of consecutive iterations, the number of noisy examples found in each of these iterations is less than a proportion of the size of the original training dataset. Each iteration splits the current training dataset into subsets of equal sizes, followed by the creation of a C4.5 decision tree model for each of these subsets. A voting strategy is then used (consensus or majority voting) to identify noisy instances. Sáez et al. [19] performed a more recent study, where they introduced an iterative class noise filter based on the fusion of classifiers. Their technique eliminates noisy instances in multiple iterations. It is based on three major noise filter paradigms: ensemble-based filtering, iterative filtering, and metric-based filtering. For each iteration, the first step eliminates part of the existing noise in the current iteration to reduce its effect in the subsequent steps. Our study also adopts a filtering approach. However, we focus on unsupervised learning, which is ideal for studying raw data.
With regard to the development of robust methods, the aim is to render mislabeled instances less destructive to the model. Strategies such as robust loss functions [20] that prevent overfitting can be implemented during training to make the classifier less responsive to noisy data. In addition, several studies on label noise detection have explored ensemble learning, which is only suitable for comparatively simple cases [21]. On the other hand, several approaches are explicitly designed for label noise problems using various robust optimization methods [22]. We point out that these are primarily supervised approaches [23]. Zhang and Tan [24] used a reconstruction error minimization framework to further filter and relabel the mislabeled data. Their results, which are based on the MNIST dataset [25], show that their proposed method could significantly reduce the false positive outlier detection rate and improve both data cleaning and classification quality. The authors follow an unsupervised approach just like we did. However, they use a dataset of images.

PCA
In machine learning, PCA is an exploratory data analysis method that reveals the inner structure of the data and explains variance. To create a more compact feature space, the technique identifies correlated variables and effects a transformation that best reflects the differences from the input [26]. The application of PCA results in m principal components. If the amount of variance captured by the first r-principal components represents the majority of the variance in the data, the data can be mapped to the reduced r-dimensional space represented by the first r-principal components.
PCA is typically used for dimension reduction. In this study, however, we use it for label noise detection by applying the subspace method [27]. The subspace method works by dividing the principal axes into two sets representing normal and anomalous data variations. Any data instance represented by a row in the dataset can be written as y = ŷ + ỹ, where ŷ is its component in the normal subspace and ỹ is its component in the anomalous subspace. The concept of subspace anomaly detection uses an anomalous shift in the feature correlations in a vector function to represent an increment in the projection of that data point to an anomalous subspace. The magnitude of ỹ is higher than what is seen among normal instances [28]. To determine the magnitude of the projection of each instance to an anomalous subspace, we first arrange the set of principal components in the normal subspace as columns of the matrix P of size m × r.
The projections of y onto the two subspaces are

$$\hat{y} = PP^{T}y = Cy, \qquad \tilde{y} = (I - PP^{T})y = \tilde{C}y$$

Here PP^T y is the linear projection onto the normal subspace, C̃ is the linear projection onto the anomalous subspace, and abnormal variations in ỹ represent an anomaly. The square prediction error (SPE) is used to monitor the changes in ỹ. SPE is computed as shown in [29]:

$$\mathrm{SPE} = \|\tilde{y}\|^{2} = \|\tilde{C}y\|^{2}$$

If the SPE of an instance exceeds a threshold, the instance is identified as anomalous, and in our case, it is detected as an anomaly with respect to the class that the instance was labeled as. Each new input is analyzed for anomalies: the detection algorithm computes the projection of the input onto the eigenvectors and derives a normalized reconstruction error from it. The normalized error is used as the anomaly score, and we presuppose that the higher the error value, the more anomalous the instance [30,31]. This intuitive assumption also holds true for the ICA and autoencoder algorithms.
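As an illustration of the subspace method, the following minimal sketch computes the SPE with NumPy. The synthetic data, the choice r = 1, the 95th-percentile threshold, and the function name spe_scores are assumptions made for this example only and are not taken from our experiments.

```python
import numpy as np

def spe_scores(train, data, r):
    """SPE of each row of `data` relative to the top-r principal axes of `train`."""
    mean = train.mean(axis=0)
    # Rows of vt are the principal axes of the centered training data.
    _, _, vt = np.linalg.svd(train - mean, full_matrices=False)
    P = vt[:r].T                    # m x r matrix spanning the normal subspace
    centered = data - mean
    y_hat = centered @ P @ P.T      # projection onto the normal subspace
    y_tilde = centered - y_hat      # residual in the anomalous subspace
    return np.sum(y_tilde ** 2, axis=1)

rng = np.random.default_rng(0)
# Hypothetical clean data: three features varying mostly along one direction.
t = rng.normal(size=(200, 1))
clean = t @ np.array([[1.0, 2.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))
# Hypothetical mislabeled instance lying well off that direction.
outlier = np.array([[2.0, -1.0, 0.0]])

clean_spe = spe_scores(clean, clean, r=1)
outlier_spe = spe_scores(clean, outlier, r=1)
threshold = np.percentile(clean_spe, 95)   # assumed threshold choice
is_anomalous = outlier_spe > threshold
```

In this toy setting the off-axis instance has a much larger residual than any clean row, so it is flagged.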

ICA
ICA is a statistical technique for analyzing hidden factors (sources or features) from a collection of measurements or records. The algorithm is able to separate the sources according to the distribution of the data. To achieve the separation of mixed data into independent components, ICA exploits the independence of the sources. This technique is useful for separating anomalies [32-34]. To formally describe the ICA model, consider $X = (x_1, x_2, \ldots, x_n)$ as the input vector and $S = (s_1, s_2, \ldots, s_p)$ as a random vector of latent mutually independent sources, where p ≤ n. The ICA model can be described as

$$X = AS$$

ICA involves the calculation of both A and S where only X is known, with A being a matrix of size n × p. A is presumed to be fixed but unknown, and ICA finds a matrix W such that S = WX. ICA obtains S based on the fact that members of this vector are statistically independent and non-Gaussian random variables [13,35]. Ideally, W^{-1} should be equal to A. However, it differs from A due to noise and outliers in the mixed data. ICA methods define transformations such that the components derived from mixtures are as independent as possible by optimizing an objective function (e.g., kurtosis or entropy).
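The sketch below illustrates how an ICA model fitted on clean data can yield a per-instance reconstruction error, using FastICA from Scikit-learn. The synthetic sources, the mixing matrix, and the use of fewer components than features (so that a residual is visible in a toy example) are assumptions for illustration and do not reproduce the configuration of our experiments.

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
# Hypothetical clean data: two non-Gaussian sources mixed into four features.
sources = rng.uniform(-1.0, 1.0, size=(500, 2))
mixing = np.array([[1.0, 0.5, -0.3, 0.8],
                   [0.2, -1.0, 0.7, 0.4]])
X_clean = sources @ mixing + 0.01 * rng.normal(size=(500, 4))
# Hypothetical anomalies that do not follow the mixing model.
X_anom = rng.normal(size=(20, 4))

ica = FastICA(n_components=2, algorithm="parallel", max_iter=200, random_state=0)
ica.fit(X_clean)

def reconstruction_error(model, X):
    """Per-row mean squared error between X and its ICA reconstruction."""
    X_rec = model.inverse_transform(model.transform(X))
    return np.mean((X - X_rec) ** 2, axis=1)

err_clean = reconstruction_error(ica, X_clean)
err_anom = reconstruction_error(ica, X_anom)
```

Rows that follow the learned mixing model reconstruct almost exactly, while rows that do not follow it leave a larger residual.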

Autoencoder
In the field of neural networks, the concept of autoencoders has been popular for decades, with the most typical applications being the reduction of dimensionality or feature learning. More recently, the autoencoder concept has been used to learn generative data models and identify data anomalies [10,36]. An autoencoder is an artificial neural network used in an unsupervised fashion to learn data representation [37]. The simplest type of autoencoder is comparable to a simple multilayer perceptron (MLP), with an input layer, an output layer and one or several hidden layers connecting them. The output layer has the same number of nodes (neurons) as the input layer in order to recreate its inputs. An autoencoder can be linear or non-linear, depending on its activation function [38].
The purpose of an autoencoder is to learn a representation (encoding) while disregarding signal noise. The algorithm attempts to minimize the difference between the input and the output, instead of predicting the target value Y given inputs X. A reconstructing side is learned along with the reduction side, where the autoencoder attempts to produce a representation as similar as possible to its original input from the reduced encoding. The autoencoder consists of two main components: an encoder that maps the input to the code and a decoder that maps the code to a reconstruction of the original input [39]. The size of the hidden layer determines the size of the latent representation. The autoencoder reconstructs normal data efficiently after training, whereas it struggles to do so with anomalous data that was not encountered during training [40]. For high dimensional data with class anomalies, non-linear autoencoders have shown superior classification performance [41]. In this study, we use non-linear autoencoders.
To detect label noise in a dataset, the autoencoder is trained with clean data and learns the normal behavior of this data. The weights and architecture of the autoencoder are adjusted to minimize the reconstruction error for clean instances of the class. For our training dataset, each record can be described as $Z = (Z_1, Z_2, \ldots, Z_n)$. In this research, we inject various levels of noise into the test portion of the dataset by changing the label of a subset of the instances from one class to another. Reconstructed data points can be described as [24]

$$\hat{Z} = g(W_2 f(W_1 Z + b_1) + b_2)$$

where $W_1$ and $W_2$ represent the weight matrices and $b_1$ and $b_2$ the biases of the encoder and decoder, respectively, and f and g are tanh activation functions. The loss function based on reconstruction error is described as follows:

$$L(Z, \hat{Z}) = \frac{1}{n}\sum_{i=1}^{n}\left(Z_i - \hat{Z}_i\right)^{2}$$
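To make the roles of the weight matrices, biases, and tanh activations concrete, the following minimal NumPy sketch shows the forward pass of a single-hidden-layer autoencoder and a mean squared reconstruction error per record. The layer sizes, the random untrained weights, and the function names are assumptions for illustration; the sketch is not the nine-layer Keras model used in our experiments, and training (adjusting the weights to reduce the loss on clean data) is omitted.

```python
import numpy as np

def reconstruct(Z, W1, b1, W2, b2):
    """Encode and then decode each row of Z with tanh activations."""
    code = np.tanh(Z @ W1 + b1)        # encoder f
    return np.tanh(code @ W2 + b2)     # decoder g

def reconstruction_loss(Z, Z_hat):
    """Mean squared error between each record and its reconstruction."""
    return np.mean((Z - Z_hat) ** 2, axis=1)

rng = np.random.default_rng(0)
n_features, code_size = 6, 3           # assumed sizes
W1 = rng.normal(scale=0.5, size=(n_features, code_size))
b1 = np.zeros(code_size)
W2 = rng.normal(scale=0.5, size=(code_size, n_features))
b2 = np.zeros(n_features)

Z = rng.uniform(-1.0, 1.0, size=(4, n_features))   # hypothetical records
Z_hat = reconstruct(Z, W1, b1, W2, b2)
loss = reconstruction_loss(Z, Z_hat)
```

After training on clean data of one class, records with a large value of this loss would be the candidates for label noise.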

PCA vs Autoencoder
For PCA modeling, several algorithms are available. One such algorithm involves estimation by minimizing reconstruction error [32]. In this subsection, we compare a linear single-layer autoencoder with PCA (minimizing reconstruction error method) for simplicity. Ranjan's depiction [41] of a linear single-layer autoencoder is shown in Fig. 1.
As shown at the bottom of the diagram, the process of encoding is identical to the transformation of principal components (PCs). Likewise, the method of decoding is similar to the reconstruction of data from the principal scores. For both the autoencoder and PCA, model weights are calculated through the minimization of the reconstruction error.
Suppose we have a dataset with p features and an encoding layer of size k. With PCA, k denotes the number of PCs chosen. For both the autoencoder and PCA, dimension reduction occurs when k < p. An over-representative model exists when k ≥ p, which indicates a reconstruction error of virtually zero.
In the encoding layer of the autoencoder in Fig. 1, a colored cell is a computing node with p weights denoted as $W_{ij}$, where i = 1, ..., k and j = 1, ..., p. For each encoding node in 1, ..., k, there is a p-dimensional weight vector. This is equivalent to an eigenvector in PCA. The encoding layer output for an autoencoder is given as z = g(Wx) = Wx if the activation function g is linear, where x is the input and W is the weight matrix [41]. If g(Wx) is linear, this is the equivalent of the principal scores in PCA.
Fig. 1 Single layer autoencoder vs PCA [41]
For autoencoder and PCA reconstruction, the size of the decoding layer must be the size of the input data p. In a decoder, the data is reconstructed from the encodings and can be represented as $\hat{x} = g(W'z) = W'z$ if the activation function g is linear [41]. Similarly, for PCA, the data is reconstructed as $\hat{x} = W^{T}z$.
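The correspondence between linear encoding/decoding and PCA can be checked numerically. The following sketch uses NumPy only; the synthetic data and the helper name pca_reconstruction_mse are assumptions for illustration. It encodes with a k × p matrix of eigenvectors, decodes with its transpose, and compares the reconstruction error for k < p and k = p.

```python
import numpy as np

rng = np.random.default_rng(1)
p = 5
X = rng.normal(size=(300, p)) @ rng.normal(size=(p, p))  # hypothetical data
X = X - X.mean(axis=0)                                   # PCA works on centered data

# Rows of vt are the eigenvectors of the covariance matrix (principal axes).
_, _, vt = np.linalg.svd(X, full_matrices=False)

def pca_reconstruction_mse(X, vt, k):
    W = vt[:k]          # k x p weights, one eigenvector per encoding node
    z = X @ W.T         # encoding: principal scores, z = Wx for each row x
    x_hat = z @ W       # decoding: x_hat = W^T z
    return np.mean((X - x_hat) ** 2)

mse_reduced = pca_reconstruction_mse(X, vt, k=2)   # k < p: lossy
mse_full = pca_reconstruction_mse(X, vt, k=p)      # k = p: virtually zero
```

With k = p the weight matrix is orthogonal, so decoding inverts encoding and the error vanishes up to floating-point precision.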

Tomek links
In this study, we also incorporated Tomek links, which is a traditional label noise detection technique. The Tomek links algorithm acts as a label noise filter that operates on pairs of instances. A Tomek link is a pair of data points x and y from different classes, such that, if d stands for the distance metric, there exists no example z such that d(x, z) is lower than d(x, y), or d(y, z) is lower than d(x, y). Hence, where the two examples x and y form a Tomek link, either one is noise or both are borderline [42]. These two examples are thus eliminated from the training data. To elaborate further, in a binary classification environment with classes 0 and 1, a Tomek link pair would have an instance of each class, and the two instances would be each other's nearest neighbors across the dataset [43]. These cross-class pairs are valuable in defining the class boundary [44]. Figure 2 by Argawal [45] shows an alignment of Tomek link pairs at the class boundary. It is important to note that the use of Tomek links for label noise detection does not involve the calculation of reconstruction error.
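The definition above amounts to finding cross-class pairs that are mutual nearest neighbors. A brute-force NumPy sketch is given below; the function name tomek_links, the Euclidean distance, and the toy one-dimensional data are assumptions for illustration, and the pairwise distance matrix makes this approach suitable only for small datasets.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)        # a point is not its own neighbor
    nearest = dist.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# Hypothetical one-dimensional data with one cross-class pair at the boundary.
X = np.array([[0.0], [0.1], [1.0], [1.15], [2.0], [2.1]])
y = np.array([0, 0, 0, 1, 1, 1])
links = tomek_links(X, y)
```

Here only the pair at indices 2 and 3 straddles the class boundary while being mutual nearest neighbors, so it is the single link returned.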

Development of Models
We use a credit card fraud dataset [46] that was collected and analyzed through a research partnership between Worldline and the Université Libre de Bruxelles (ULB). The original dataset contains 284,807 instances and 31 columns (class label included). The credit card purchases were made by European cardholders in September 2013. There are 492 cases of fraud, or 0.172%, making the dataset highly imbalanced. A high imbalance exists within a dataset if the majority-to-minority class ratio ranges from 100:1 to 10,000:1 [47]. All but two of the independent variables in the fraud dataset are principal components obtained by PCA. The label is 1 in the event of fraud and 0 otherwise. For training purposes, the original dataset was split into normal and fraud sub-datasets, and we subsequently trained the models using clean sub-datasets. We introduced noise by changing the labels for a number of instances in each class. Since the original dataset is imbalanced, the ratio of noisy to clean instances for the normal sub-dataset is also imbalanced. If, for example, we changed the label for all fraud cases from 1 to 0, it means we added 492 cases of label noise to the normal sub-dataset. In this case, the noise ratio is still less than 1% for this sub-dataset.
Fig. 2 Tomek link pairs [45]
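The label-flipping step can be sketched as follows. The helper name inject_label_noise, the noise fraction, and the toy label vector are assumptions for illustration; the sketch covers only the case of minority (fraud) instances relabeled as majority (normal) instances.

```python
import numpy as np

def inject_label_noise(y, noise_fraction, rng):
    """Flip a fraction of the positive (1) labels to 0; return noisy labels and a noise mask."""
    y_noisy = y.copy()
    positives = np.flatnonzero(y == 1)
    n_flip = int(round(noise_fraction * positives.size))
    flipped = rng.choice(positives, size=n_flip, replace=False)
    y_noisy[flipped] = 0
    is_noise = np.zeros(y.size, dtype=bool)
    is_noise[flipped] = True              # ground truth for evaluating detection
    return y_noisy, is_noise

rng = np.random.default_rng(0)
y = np.array([0] * 990 + [1] * 10)        # hypothetical imbalanced labels
y_noisy, is_noise = inject_label_noise(y, noise_fraction=0.5, rng=rng)
```

The returned mask marks which instances carry a wrong label, which is what a detector is later scored against.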
Our approach entailed k-fold cross-validation (CV), where the model is trained on k-1 folds each time and tested on the remaining fold. This is to ensure that as much data as possible are used in the classification. More specifically, we use stratified CV, which attempts to ensure that each class is approximately equally represented across each fold [48]. In our study, we assigned a value of 4 to k: three folds for training and one fold for testing. We repeated the process of building and evaluating the models 10 times for each unsupervised learner and dataset. The use of repeats helps to reduce bias due to bad random draws when generating the samples. The final performance result is the average over all 10 repeats. The primary focus of our work is to detect fraudulent instances mislabeled as normal instances (minority class instances mislabeled as the majority class). The PCA configuration used in this work has the same number of principal components as the number of original features from our dataset. The inverse transform function from Scikit-Learn recreated the original transactions from the key components produced [49].
For our ICA algorithm, we implemented the sklearn.decomposition.FastICA function from Scikit-learn [50]. The FastICA algorithm is an efficient and popular variant of ICA. Based on best performance during preliminary experimentation, we set the algorithm to 'parallel' mode and set the maximum number of iterations (max_iter) to 200. Additionally, we set the number of components equal to the input vector dimension size [50].
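A repeated, stratified 4-fold scheme of this kind can be sketched with Scikit-learn's RepeatedStratifiedKFold. The synthetic features, the class counts, and the random seed are assumptions for illustration; model fitting is left as a comment.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))          # hypothetical features
y = np.array([0] * 380 + [1] * 20)     # hypothetical imbalanced labels

cv = RepeatedStratifiedKFold(n_splits=4, n_repeats=10, random_state=0)
fold_positive_counts = []
for train_idx, test_idx in cv.split(X, y):
    # A learner would be fitted on X[train_idx] and scored on X[test_idx] here.
    fold_positive_counts.append(int(y[test_idx].sum()))

n_evaluations = len(fold_positive_counts)   # 4 folds x 10 repeats
```

Stratification keeps the minority class evenly spread, so each test fold in this toy example holds the same number of positive instances.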
Our autoencoder was implemented in Keras using a TensorFlow backend [51]. Through preliminary experimentation, the best parameters were selected. The model contained nine densely connected hidden layers, along with the tanh activation function and L1 regularization. The input and output layers had the same dimensions as the number of features in the original dataset.
To identify an error threshold for each learner that discriminates between normal and anomalous behavior, we examined the distributions of the reconstruction errors. We then chose the nth percentile of the error distribution of the normal dataset [10].

100 × P + NL 10

If, for example, n = 95, this means that 95% of the errors in the normal dataset are smaller than a value v. Hence, any data point fed to a specific learner that generates an error greater than v would be classified as anomalous. Our analysis showed that the best outcomes were achieved at higher thresholds, i.e. n ≥ 93.
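The percentile-based threshold can be sketched in a few lines of NumPy. The function name flag_anomalies and the toy error values are assumptions for illustration.

```python
import numpy as np

def flag_anomalies(normal_errors, new_errors, n=95):
    """Flag errors above the nth percentile of the normal-data error distribution."""
    v = np.percentile(normal_errors, n)
    return new_errors > v, v

normal_errors = np.arange(1, 101, dtype=float)   # hypothetical errors 1, 2, ..., 100
new_errors = np.array([50.0, 96.0, 120.0])
flags, v = flag_anomalies(normal_errors, new_errors, n=95)
```

Raising n moves v further into the tail of the normal error distribution, so fewer clean instances are flagged at the cost of missing noisy instances with moderate errors.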

Metrics
Our work records the confusion matrix (Table 1) for a binary classification problem, where the class of interest is usually the minority class and the opposite class is the majority class, i.e. positives and negatives, respectively. A related list of simple performance metrics [52] is explained as follows:
• True Positive (TP) is the number of positive samples correctly identified as positive.
• True Negative (TN) is the number of negative samples correctly identified as negative.
• False Positive (FP), also known as Type I error, is the number of negative instances incorrectly identified as positive.
• False Negative (FN), also known as Type II error, is the number of positive instances incorrectly identified as negative.
Based on these fundamental metrics, other performance metrics are derived as follows:
• Recall, also known as True Positive Rate (TPR) or sensitivity, is equal to TP/(TP + FN).
• Specificity, also known as True Negative Rate (TNR), is equal to TN/(TN + FP).
• Fall-out, also known as False Positive Rate (FPR), is equal to FP/(TN + FP).
• Miss Rate, also known as FNR, is equal to FN/(TP + FN).
• AUC is the area under the Receiver Operating Characteristic (ROC) curve, which plots recall versus (1 − specificity), or TPR vs FPR, across all classifier decision thresholds [53]. The AUC is a single value that ranges from 0 to 1, with a perfect classifier having a value of 1.
In our study, we use more than one performance metric (recall, FNR, AUC). This strategy allows us to better understand the challenge of evaluating the machine learning algorithms with highly imbalanced data.
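The rate-based metrics follow directly from the four confusion matrix counts, as the short sketch below shows. The function name detection_metrics and the counts are hypothetical and are not results from our experiments.

```python
def detection_metrics(tp, tn, fp, fn):
    """Derive rate-based metrics from confusion matrix counts."""
    return {
        "recall": tp / (tp + fn),         # TPR, sensitivity
        "specificity": tn / (tn + fp),    # TNR
        "fall_out": fp / (tn + fp),       # FPR
        "miss_rate": fn / (tp + fn),      # FNR
    }

# Hypothetical counts for illustration only.
m = detection_metrics(tp=40, tn=900, fp=100, fn=10)
```

Recall and FNR always sum to 1, as do specificity and FPR, so each pair carries the same information from opposite viewpoints.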

Results and discussion
We expect to observe higher reconstruction errors for the mislabeled instances. Figures 3, 4, 5, 6, 7 and 8 are histograms that show error distribution for both clean and noisy data points for a noise level of 90, which corresponds to 452 noisy instances. These graphs plot the number of instances against reconstruction error. Tomek links are excluded for these figures because this algorithm does not rely on reconstruction error calculations for label noise detection. As indicated in Figs. 3, 4, 5, 6, 7 and 8, the distribution of reconstruction error is higher for the noisy instances than the clean instances. Performance results for label noise detection are presented as average scores in Table 2. Each value is the mean of 40 repetitions (10 runs per experiment × 4-fold cross validation). The best performance score within each column for each algorithm is in italic type.
When comparing the same noise levels (same number of noisy instances) among the algorithms for each metric in Table 2, we see that the autoencoder has the best scores. Overall, the highest scores for AUC (0.96) and recall (0.90) were obtained with the autoencoder, while the lowest FNR score (0.10) was also obtained with this model. Most tellingly, the Tomek links algorithm appears to be the worst performer. In general, we observe that for each algorithm, per metric, the highest noise level yielded the best score, and the lowest noise level yielded the worst score. The autoencoder model, for example, obtained its highest recall score of 0.90 for 452 noisy instances, and its lowest recall score of 0.82 for 51 noisy instances. This occurs because the algorithms become more efficient at detecting the class of interest (noisy instances) as the noise level increases. For a machine learning algorithm, the task of detecting a target class containing an extremely low number of instances is comparable to searching for a needle in a haystack [54]. We note that a few cases in Table 2 go against the norm, such as the best recall score for Tomek links (0.62), which is associated with the lowest noise level of 51 instances.
From a statistical point of view, it is beneficial to understand the significance of the performance scores in Table 2. Hence, we conduct ANalysis Of VAriance (ANOVA) tests to determine the impact of the algorithms on performance in terms of recall, FNR and AUC. ANOVA is a statistical test determining whether there is a significant difference between group means [55]. We use a 95% (α = 0.05) confidence level for the ANOVA tests shown in Tables 3, 4 and 5, where Df is the degrees of freedom, Sum Sq is the sum of squares, Mean Sq is the mean sum of squares, F value is the F-statistic, and Pr(>F) is the p-value. The algorithms are the factor of concern for our study. Because this factor has a p-value of less than 0.05 in all our ANOVA tables, algorithms are a significant factor for all our metrics. Therefore, we perform Tukey's Honestly Significant Difference (HSD) tests to determine which groups are significantly different from each other. Letter groups assigned via the Tukey method indicate similarity or significant differences in performance results within a factor [56]. For all metrics, the letter groups reported in Table 6 corroborate our earlier findings. Results show that autoencoders are in group 'a', which is the top group, while Tomek links are in groups 'c' and 'd', which are the worst-performing groups.
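For readers who wish to run a comparable one-way ANOVA, a minimal sketch with SciPy's f_oneway is given below. The three groups of scores are randomly generated, hypothetical values, not the data behind Tables 3, 4 and 5, and SciPy is an assumption for this example rather than the tooling used for our analysis.

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Hypothetical recall scores for three detectors, 40 evaluations each.
scores_a = rng.normal(loc=0.88, scale=0.02, size=40)
scores_b = rng.normal(loc=0.75, scale=0.02, size=40)
scores_c = rng.normal(loc=0.60, scale=0.02, size=40)

# One-way ANOVA with the detector as the single factor.
f_stat, p_value = f_oneway(scores_a, scores_b, scores_c)
significant = p_value < 0.05               # alpha = 0.05
```

A significant result only indicates that at least one group mean differs; a post hoc test such as Tukey's HSD is still needed to identify which groups differ.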
In essence, this study shows that our autoencoder model can efficiently detect label noise in large imbalanced datasets. The model can also be used to detect label noise in imbalanced big data, because non-linear autoencoders are capable of modeling complex functions by compressing data to lower dimensions. Detecting label noise in big data is important due to the unique challenges posed by this type of data. Such challenges arise from the volume, variety, velocity, variability, complexity, and value of big data [47].
It should be noted that the computational cost of training the autoencoder is relatively low, since for each class, the autoencoder needs to be trained only once. In addition, the comparatively high scores for recall and AUC, as well as the comparatively low scores for FNR, are a testament to the robust nature of our autoencoder model.

Conclusion
In this paper, we propose a novel and effective method to deal with the label noise problem. Our method considers label noise as anomalous instances in the class where they have been mislabeled. As a first step, we trained unsupervised learners (autoencoders, ICA, PCA) using clean data for each class. The trained models were then used to reconstruct unseen portions of data containing different levels of label noise, with higher reconstruction errors being produced for instances that were incongruent with their assigned class. The best detection rates for label noise were obtained by the autoencoders. Performance-wise, when the unsupervised learners were compared with Tomek links, we discovered that this traditional algorithm was the worst performer. Our proposed model is a potential solution for detecting label noise in imbalanced big data. The first-rate performance of our non-linear autoencoder is due to the low computational costs involved in training and the algorithm's ability to successfully model complex functions through dimension reduction. Future work will include additional performance metrics and also datasets from different application domains.