The effect of feature extraction and data sampling on credit card fraud detection

Training a machine learning algorithm on a class-imbalanced dataset can be a difficult task, a process that could prove even more challenging under conditions of high dimensionality. Feature extraction and data sampling are among the most popular preprocessing techniques. Feature extraction is used to derive a richer set of reduced dataset features, while data sampling is used to mitigate class imbalance. In this paper, we investigate these two preprocessing techniques, using a credit card fraud dataset and four ensemble classifiers (Random Forest, CatBoost, LightGBM, and XGBoost). Within the context of feature extraction, the Principal Component Analysis (PCA) and Convolutional Autoencoder (CAE) methods are evaluated. With regard to data sampling, the Random Undersampling (RUS), Synthetic Minority Oversampling Technique (SMOTE), and SMOTE Tomek methods are evaluated. The F1 score and Area Under the Receiver Operating Characteristic Curve (AUC) metrics serve as measures of classification performance. Our results show that the implementation of the RUS method followed by the CAE method leads to the best performance for credit card fraud detection.


Introduction
An unequal distribution of classes in a dataset is known as class imbalance. Under this condition, the majority class can overwhelm machine learning algorithms, making recognition of the minority class more challenging. As a result, classification performance scores for the affected algorithms can become biased in favor of the dominant class. Random Undersampling (RUS) [1], Synthetic Minority Oversampling Technique (SMOTE) [2], and SMOTE Tomek [3] are examples of data-level approaches for dealing with class imbalance. Algorithm-level approaches that address this imbalance typically involve various cost-sensitive strategies [4]. In this study, the focus is on data-level approaches that mitigate class imbalance.
The data sampling technique of RUS discards members of the majority class until the ratio of majority to minority instances reaches a predetermined level. SMOTE is a data augmentation technique used to increase the representation of the minority class in datasets. It synthesizes new minority-class instances from existing instances that are proximate to each other. SMOTE Tomek combines SMOTE, which boosts instances of the minority class, with Tomek links [5], which removes instances of the majority class that participate in Tomek link pairs. Tomek links is further described in the "Background information" section, as this technique is not as well known as SMOTE and RUS.
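To make the sampling mechanics concrete, the following is a minimal NumPy sketch of the core ideas behind RUS and SMOTE. It is illustrative only: the study itself uses the imbalanced-learn implementations, and the convention that label 1 marks the minority (fraud) class is an assumption of this sketch.

```python
import numpy as np

def random_undersample(X, y, ratio=1.0, rng=None):
    """Keep all minority rows; keep majority rows until minority:majority = ratio (1.0 means 1:1)."""
    rng = rng if rng is not None else np.random.default_rng(0)
    min_idx = np.flatnonzero(y == 1)              # assumes 1 = minority (fraud), 0 = majority
    maj_idx = np.flatnonzero(y == 0)
    n_keep = min(round(len(min_idx) / ratio), len(maj_idx))
    kept = rng.choice(maj_idx, size=n_keep, replace=False)
    idx = np.concatenate([min_idx, kept])
    return X[idx], y[idx]

def smote_synthesize(X_min, k=5, rng=None):
    """Create one synthetic minority instance by interpolating toward a random k-nearest neighbor."""
    rng = rng if rng is not None else np.random.default_rng(0)
    i = rng.integers(len(X_min))
    dist = np.linalg.norm(X_min - X_min[i], axis=1)
    neighbors = np.argsort(dist)[1:k + 1]         # k nearest neighbors, excluding the point itself
    j = rng.choice(neighbors)
    gap = rng.random()                            # random point on the segment between the two instances
    return X_min[i] + gap * (X_min[j] - X_min[i])
```

SMOTE Tomek would follow the synthesis step with a Tomek link cleaning pass over the resampled data.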
Feature extraction begins with a dataset containing the original features and uses them to generate derived features that are designed to be informative and nonredundant. This process facilitates generalization and may improve interpretation and classification performance scores. Feature extraction generally leads to dimensionality reduction. As depicted in Figure 1, the original features of a dataset are transformed into a reduced set of features. Before training classifiers on large-scale imbalanced datasets, feature extraction or dimensionality reduction is often performed. Feature extraction is carried out with various algorithms, such as Principal Component Analysis (PCA) [6] and Convolutional Autoencoders (CAEs) [7]. PCA is based on linear transformations, while autoencoders can model complex non-linear functions. The feature extraction techniques used in this study are further described in the "Background information" section.
Our motivation for this work comes from the fact that the number of credit card fraud incidents increases yearly [9] and that machine learning techniques have been successfully used to detect fraudulent activity. Our paper examines the use of feature extraction and data sampling on a class-imbalanced dataset. More specifically, our research evaluates PCA and CAE as feature extraction techniques, with RUS, SMOTE, and SMOTE Tomek used as the data sampling techniques. To the best of our knowledge, this is the first study to investigate the use of PCA, CAE, RUS, SMOTE, and SMOTE Tomek on class-imbalanced data. Our research uses a credit card fraud detection dataset from the Kaggle [10] community, aptly named the Credit Card Fraud Detection Dataset. Given the solid performance of ensemble classifiers in many studies [11][12][13], we use four ensemble learners based on the Decision Tree [14] classifier: Random Forest [15], XGBoost [16], LightGBM [17], and CatBoost [18]. Classification performance is measured with the F1 score and Area Under the Receiver Operating Characteristic Curve (AUC) metrics.
The contribution of our research is highlighted as follows:
• Examines the effect of PCA and CAE on ensemble classifiers
• Examines the effect of RUS, SMOTE, and SMOTE Tomek on ensemble classifiers
• Examines the effect of the order of preprocessing tasks on ensemble classifiers
The remainder of this paper is organized as follows: the "Background information" section provides background information on the Tomek links, PCA, and CAE algorithms; the "Related work" section reviews relevant literature; the "Methodology" section covers data preprocessing and classification tasks; the "Results and discussion" section provides and analyzes our findings; and the "Conclusion" section summarizes the key points of this paper and provides suggestions for future work.

Background information
The Tomek links algorithm acts as a label noise filter and is defined in terms of pairs of instances. A Tomek link is a pair of data points x and y from different classes such that, if d denotes the distance metric, there exists no example z for which d(x,z) is lower than d(x,y) or d(y,z) is lower than d(x,y). Hence, where two examples x and y form a Tomek link, either one is noise or both are borderline. These two examples are thus eliminated from the training data. To elaborate, in a binary classification environment with classes 0 and 1, a Tomek link pair consists of one instance of each class, where the two instances are each other's nearest neighbors across the dataset [19]. These cross-class pairs are valuable in defining the class boundary [20]. Figure 2 shows an alignment of Tomek link pairs at the class boundary. It is important to note that the use of Tomek links for label noise detection does not involve the calculation of reconstruction error.
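The definition above can be sketched directly: a Tomek link is a cross-class pair of mutual nearest neighbors. The following brute-force NumPy function is illustrative only and is not part of the study's pipeline, which relies on imbalanced-learn.

```python
import numpy as np

def tomek_links(X, y):
    """Return index pairs (i, j) where i and j are mutual nearest neighbors from different classes."""
    # Pairwise Euclidean distances, with self-distances masked out.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = d.argmin(axis=1)                            # nearest neighbor of each instance
    pairs = []
    for i, j in enumerate(nn):
        if nn[j] == i and y[i] != y[j] and i < j:    # mutual neighbors, opposite classes, deduplicated
            pairs.append((int(i), int(j)))
    return pairs
```

In a cleaning pass such as the one used by SMOTE Tomek, the majority-class member of each returned pair would then be removed.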
The PCA algorithm is a feature extraction technique with a variety of applications in exploratory data analysis, visualization, and dimensionality reduction [21]. It is an unsupervised algorithm that generates new features as linear combinations of the original features, with the new features uncorrelated with one another. Furthermore, the generated features are ranked according to the amount of variance they explain: Principal Component 1 is the principal component that explains the greatest amount of variance in the dataset, Principal Component 2 explains the second greatest amount of variance, and so on. It is therefore possible to reduce the dimensionality of data by retaining only the leading principal components.
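As an illustration of how principal components are derived and ranked, here is a compact NumPy sketch of PCA via eigendecomposition of the covariance matrix; Scikit-Learn's implementation, used later in this study, is based on the singular value decomposition instead.

```python
import numpy as np

def pca(X, n_components):
    """Rank directions by explained variance and project X onto the top n_components."""
    Xc = X - X.mean(axis=0)                       # center each feature
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]             # re-rank: largest variance first
    components = eigvecs[:, order[:n_components]]
    explained_ratio = eigvals[order] / eigvals.sum()
    return Xc @ components, explained_ratio
```

Because the components are orthogonal eigenvectors of the covariance matrix, the projected features are uncorrelated, matching the description above.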
Fig. 2 Tomek link pairs [8]
An autoencoder is a neural network architecture that tries to learn a compact, or latent, representation of an input. This latent representation contains the extracted features. The autoencoder is frequently part of a larger model that tries to recreate the input. Although an autoencoder is an unsupervised learning method, it is technically trained via supervised learning and hence can be considered a type of semi-supervised learning. Figure 3 illustrates the structure of a typical autoencoder, which is a feed-forward neural network containing an encoder, one or more hidden layers, and a decoder. The encoder feeds information from the input into the hidden layer, and the decoder feeds information from the hidden layer into the output layer. An autoencoder model is expected to reconstruct, at the output layer, the same inputs that flowed through the input layer during the training process. Consequently, the decoder acts as a mirror image of the encoder, with a matching number of neurons. For feature extraction and dimensionality reduction, the smallest hidden layer in the architecture (also referred to as the bottleneck) compresses the input into the lowest-dimensional space (also known as the latent space) in order to achieve the desired dimensionality reduction [22]. During the training phase, the decoder is used to calculate the error rate of the model, but it is not utilized afterward to recover the original input dimension of the data. Several distinct types of autoencoders are available, and their uses range widely.
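The encoder-bottleneck-decoder structure can be illustrated with a deliberately tiny linear autoencoder in NumPy. This is only a sketch of the idea; the study itself uses a convolutional autoencoder built with Keras and TensorFlow, and the dimensions and learning rate here are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8))                  # toy 8-dimensional input data
W_enc = rng.normal(scale=0.1, size=(8, 3))     # encoder weights: 8 inputs -> 3-unit bottleneck
W_dec = rng.normal(scale=0.1, size=(3, 8))     # decoder weights: bottleneck -> 8 outputs (mirror)

losses, lr = [], 0.05
for _ in range(300):
    Z = X @ W_enc                              # latent (extracted) features in the bottleneck
    X_hat = Z @ W_dec                          # reconstruction of the input
    err = X_hat - X
    losses.append(float(np.mean(err ** 2)))    # reconstruction error the training minimizes
    grad_dec = (Z.T @ err) / len(X)            # gradients of the reconstruction loss
    grad_enc = (X.T @ (err @ W_dec.T)) / len(X)
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc
```

After training, only the encoder (producing `Z`) is kept for feature extraction; the decoder exists solely to compute the reconstruction error during training, mirroring the description above.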
The CAE has a similar architecture to the Convolutional Neural Network (CNN) [7]. Both algorithms use some of the same fundamental components, including convolutional filters and pooling layers [24]. The encoder performs feature extraction and dimensionality reduction by using the convolution filters and pooling layers of the CNN. The decoder performs the reverse operation. Figure 4 shows the structure of a typical CAE.

Related work
The reduction of high-dimensional data, such as genomic information, images, videos, and text, is seen as an important and necessary data preprocessing step that generates high-level representations. One reason for reducing dimensionality is to provide deeper insight into the inherent structure of data. Various feature extraction techniques have been explored. The early approaches are based on projection and involve mapping input features in the original high-dimensional space to a new low-dimensional space while minimizing information loss [26]. PCA and Linear Discriminant Analysis (LDA) [27] are two of the most well-known projection techniques. The former is an unsupervised method that maximizes variance to project original data onto its principal directions. The latter is a supervised approach for locating a linear subspace that optimally distinguishes data between classes. The main disadvantage of these approaches is that they perform only linear projection. Subsequent research overcame this problem by utilizing nonlinear methods. Another drawback of the early approaches is that most of them map data from high-dimensional to low-dimensional space by extracting features once, rather than stacking transformations to build progressively deeper levels of representation [22]. Autoencoders reduce dimensionality by minimizing reconstruction loss using artificial neural networks. As a result, it is simple to stack autoencoders by adding hidden layers. This gives the autoencoder and its variants, such as the CAE, the ability to extract meaningful features.
Fig. 3 Autoencoder architecture [23]
Compared to the plain autoencoder, the CAE has the ability to extract smooth features by use of its pooling layers, which is advantageous for classification. Polic et al. [28] employ a CAE to reduce the optical-based output for a tactile sensor image. The authors validate their method with a set of benchmarking cases. Shallow neural networks and other machine learning models are used to estimate contact object shape, edge position, orientation, and indentation depth. A contact force estimator [29] is also trained, resulting in the confirmation that the extracted features contain sufficient information on both the spatial and mechanical properties of the object.
Meng et al. [22] note that the plain autoencoder fails to take into account relationships between data features. These relationships may impact results if original and/ or novel features are used. For feature extraction, Meng et al. propose a relational autoencoder model that factors in both data features and their relationships. The authors also make their model compatible with other autoencoder variants, such as a sparse autoencoder [30], denoising autoencoder [31], and variational autoencoder [32]. Upon testing the proposed model on a set of benchmark datasets, results show that the incorporation of data relationships generates more robust features with lower reconstruction error loss, when compared to the other autoencoder variants.
In another related work, Lee et al. [33] use a CAE to perform feature extraction and dimensionality reduction for radar data analysis. The aim of their study is to obtain a fast, accurate, and human-like image-processing algorithm. Finally, Maggipinto et al. [7] use a CAE to extract features in data-driven applications for virtual metrology, with values for optical emission spectrometry serving as the input data.
Fig. 4 [25]
Finally, we investigated autoencoder studies performed on the Credit Card Fraud Detection Dataset published by Kaggle. The relevant works are described in the following two paragraphs.
Using both a plain autoencoder algorithm and a Logistic Regression algorithm, Al-Shabi [34] evaluated balanced and imbalanced data to detect credit card fraud in the dataset. Results show that the autoencoder outperformed Logistic Regression. However, we note that the F1 score for the autoencoder is only 0.04, due to the low value for the Precision metric.
Within the framework of one-class classification, Chen et al. [35] combined a sparse autoencoder with a Generative Adversarial Network to detect credit card fraud in the dataset. Other one-class classification algorithms were evaluated, namely One-Class Gaussian Process [36] and Support Vector Data Description [37]. Based on the results, the authors' proposed model, with a top F1 score of 0.8736, performed the best. The reproducibility of their work is questionable, however, since the hyperparameters used for the One-Class Gaussian Process and Support Vector Data Description algorithms were not provided.
With regard to the final two related works, we point out that their best values for F1 score are noticeably lower than our best value obtained in this study. Further, we note that none of the related works discussed in this section use data sampling in conjunction with feature extraction.

Methodology
The Credit Card Fraud Detection Dataset [10] was published by Worldline and the Université Libre de Bruxelles (ULB). There are 284,807 instances and 30 independent variables in the raw dataset, which shows credit card purchases by Europeans in September 2013. The label (dependent variable) of this binary dataset is 1 for a fraudulent transaction and 0 for a non-fraudulent transaction. Fraudulent transactions constitute 492 instances, or 0.172%, thus rendering the dataset highly imbalanced with regard to the majority and minority classes.
In this study, we evaluate three data sampling techniques (RUS, SMOTE, and SMOTE Tomek) and two feature extraction techniques (PCA and CAE). The impact on classifier performance is evaluated with various scenarios, as depicted in Table 1.
The RUS technique is used to obtain five different ratios (1:1, 1:5, 1:10, 1:20 and 1:50), as shown in Tables 2, 3, 4, 5, 6. These ratios represent the minority to majority instances for each dataset obtained by down-sampling the original dataset. The use of this range of ratios strengthens the validity of our study. The SMOTE and SMOTE Tomek data sampling techniques are associated with Tables 7 and 8, respectively. Table 9 shows the results where no preprocessing was performed, i.e., no data sampling and no feature extraction.
The SMOTE, SMOTE Tomek, and RUS algorithms are included in the imbalanced-learn [38] Python library. For SMOTE and SMOTE Tomek, we set the k neighbors parameter to 5. The PCA algorithm was implemented with Scikit-Learn [39], while the CAE was implemented with Keras and TensorFlow, with optimum parameters selected during preliminary experimentation [40].
The learners used in this study are Random Forest, XGBoost, LightGBM, and CatBoost. Random Forest, which is an ensemble of Decision Trees, uses the bagging [41] technique. XGBoost, LightGBM, and CatBoost are Gradient-Boosted Decision Trees (GBDTs) [42], which are ensembles of Decision Trees trained sequentially with the boosting [43] technique. XGBoost is based on a weighted quantile sketch and a sparsity-aware function. The weighted quantile sketch uses approximate tree learning [44] for merging and pruning operations, while the sparsity-aware function handles zero or missing values. LightGBM is defined by Exclusive Feature Bundling and Gradient-based One-Side Sampling. Exclusive Feature Bundling reduces the number of variables by bundling mutually exclusive features, while Gradient-based One-Side Sampling excludes a large fraction of instances associated with small gradients. CatBoost is designed around Ordered Boosting, an algorithm that orders the instances used by Decision Trees.
Training and testing are performed with k-fold cross-validation, where the model is trained on k-1 folds each time and tested on the remaining fold. This ensures that as much data as possible is used during the classification phase. Our cross-validation process is stratified, which seeks to ensure that each class is proportionally represented across the folds. In this experiment, a value of five was assigned to k: four folds are used in training and one fold in testing. The process was repeated five times.
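The stratified splitting described above can be sketched as follows. This toy NumPy version is only illustrative of the fold-assignment logic (the study presumably relies on a library implementation); it distributes each class's instances proportionally across the k folds.

```python
import numpy as np

def stratified_kfold(y, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread proportionally across the k folds."""
    rng = np.random.default_rng(seed)
    folds = [[] for _ in range(k)]
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)                               # randomize before dealing into folds
        for f, chunk in enumerate(np.array_split(idx, k)):
            folds[f].extend(chunk.tolist())
    for f in range(k):
        test = np.array(folds[f])
        train = np.concatenate([folds[g] for g in range(k) if g != f])
        yield train, test
```

With k = 5, each iteration trains on four folds and tests on the remaining one, so every instance is used for testing exactly once, and the fraud class keeps its proportional share in every fold.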
The AUC metric is used to measure classifier performance. AUC refers to the area under the Receiver Operating Characteristic (ROC) curve, which plots True Positive Rate (TPR) against False Positive Rate (FPR). AUC summarizes overall model performance and reflects all classification thresholds along the curve [45]. The F1 score metric is the harmonic mean of precision and recall. Like AUC, the F1 score is well-suited for datasets with a high class imbalance [46]. For the F1 score, the default threshold of 0.5 was used. Tables 2, 3, 4, 5, 6, 7, 8, and 9 show the performance scores obtained for the F1 score and AUC metrics. In each table, each row provides results for a particular combination of data sampling and feature extraction technique. Table 9 reflects the baseline scores, where no preprocessing activity (no data sampling and no feature extraction) was implemented. The highest baseline values of 0.853 and 0.891 are for the F1 score and AUC, respectively. These two scores were obtained with CatBoost.
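For clarity, the two metrics can be expressed in a few lines of NumPy; these are illustrative definitions rather than the implementations used in the experiments. The AUC function uses the equivalent rank-statistic formulation: the probability that a randomly chosen positive instance is scored above a randomly chosen negative instance.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive (fraud) class."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auc_score(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative (ties count half)."""
    pos, neg = scores[y_true == 1], scores[y_true == 0]
    wins = sum((p > neg).sum() + 0.5 * (p == neg).sum() for p in pos)
    return wins / (len(pos) * len(neg))
```

Note the asymmetry that motivates reporting both metrics: F1 depends on a fixed decision threshold (0.5 here), while AUC integrates performance over all thresholds.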

Results and discussion
The highest values among the tabulated results obtained through the RUS sampling technique are in Table 6, which is associated with a minority-to-majority class ratio of 1:50. For the F1 score and AUC, the highest values in this table are 0.909 and 0.988, respectively. The score of 0.909 was obtained by LightGBM, while the score of 0.988 was obtained by XGBoost, LightGBM, and CatBoost, all GBDTs. Interestingly, the highest F1 score for the baseline (Table 9) is greater than any of the F1 scores for the RUS ratios of 1:1, 1:5, and 1:10 (Tables 2, 3, and 4, respectively).
Table 7 was obtained with the SMOTE sampling technique. The highest values in this table for the F1 score and AUC are 0.872 and 0.940, respectively. The score of 0.872 was obtained by XGBoost, while the score of 0.940 was obtained by LightGBM and CatBoost.
In Table 8, which was obtained with the SMOTE Tomek sampling technique, the highest values of 0.899 and 0.970 are associated with the F1 score and AUC, respectively. The score of 0.899 was obtained by Random Forest, while the score of 0.970 was obtained by Random Forest and XGBoost.
To determine the statistical significance of the performance scores, we perform three-way ANalysis Of VAriance (ANOVA) tests. ANOVA reveals whether there is a significant difference between group means [47]. A 95% (α = 0.05) confidence level is used for our ANOVA tests. The results are shown in Tables 10 and 11 for the F1 score and AUC, respectively.
In these tables, Df is the degrees of freedom, Sum Sq is the sum of squares, Mean Sq is the mean sum of squares, F value is the F-statistic, and Pr(>F) is the p-value. Note that for the Sampling Technique factor, only the 1:50 minority-to-majority class ratio is considered for the RUS technique, since this ratio yields the highest performance scores among all the RUS ratios evaluated. As shown in Tables 10 and 11, the p-value for each factor is practically 0, well below the level of α. Hence, we infer that all factors have a significant impact on performance in terms of both the F1 score and AUC. Since this is the case, Tukey's Honestly Significant Difference (HSD) tests [48] are carried out to determine which groups are significantly different from each other. For a particular experiment, letter groups assigned through the Tukey method indicate similarity or significant differences in performance results within a factor.
The Tukey method is first applied within the scope of the F1 score metric. With regard to the Scenario factor (Table 12), data sampling alone is ranked in group 'a', the best-performing group. Data sampling followed by CAE is ranked in group 'b', the second-best performing group. The bottom group 'f' consists of PCA followed by data sampling. In terms of the Classifier factor (Table 13), Random Forest, the top performer, is in group 'a', XGBoost is in group 'b', and at the bottom is LightGBM in group 'd'. For the Sampling factor (Table 14), RUS, the best performer, is in group 'a', SMOTE Tomek is in group 'b', and SMOTE is in group 'c'.
The Tukey method is next applied within the scope of the AUC metric. With regard to the Scenario factor (Table 15), data sampling followed by CAE is ranked in group 'a', the best-performing group. Data sampling alone is ranked in group 'b', the second-best performing group. In terms of the Classifier factor (Table 16), CatBoost, the top performer, is in group 'a', XGBoost is in group 'b', and at the bottom is LightGBM in group 'd'. For the Sampling factor (Table 17), RUS, the best performer, is in group 'a', SMOTE Tomek is in group 'b', and SMOTE is in group 'c'.
Based on the Tukey's HSD results for the F1 score and AUC metrics, the RUS technique is the clear-cut top choice for the Sampling factor. However, the best classifier could not be established from the rankings, because the HSD results for the F1 score metric (Table 13) show Random Forest as the best classifier, while the HSD results for the AUC metric (Table 16) show CatBoost as the best classifier. The Scenario factor shows data sampling followed by CAE as the second-best choice for the F1 score metric (Table 12) and the best choice for the AUC metric (Table 15). Conversely, data sampling alone is the top choice for the F1 score metric and the second-best choice for the AUC metric.
We recommend the use of data sampling followed by CAE for the Scenario factor. This is because the implementation of CAE, which is a feature extraction technique, tends to reduce computational burden and decrease the training time of machine learning algorithms. As stated earlier, feature extraction may also improve generalization and the interpretation of results. With regard to the Scenario factor for the F1 score and AUC, the following was observed: Sampling + CAE is better than Sampling + PCA; CAE + Sampling is better than PCA + Sampling; and CAE + None is better than PCA + None. We believe that CAE has an advantage over PCA because the autoencoder can model non-linear functions.

Conclusion
In this research, we use a credit card fraud dataset to investigate the effect of data sampling and feature extraction on four ensemble classifiers. Three data sampling techniques (RUS, SMOTE, and SMOTE Tomek) and two feature extraction techniques (PCA and CAE) are evaluated. The results indicate that applying the RUS data sampling technique followed by the CAE feature extraction technique yields the best performance for credit card fraud detection.