
Threshold optimization and random undersampling for imbalanced credit card data

Abstract

Output thresholding is well-suited for addressing class imbalance, since the technique does not increase dataset size, run the risk of discarding important instances, or modify an existing learner. Using the Credit Card Fraud Detection Dataset, this study proposes a threshold optimization approach that factors in the constraint True Positive Rate (TPR) ≥ True Negative Rate (TNR). Our findings indicate that an increase in the Area Under the Precision–Recall Curve (AUPRC) score is associated with an improvement in threshold-based classification scores, while an increase in the positive class prior probability causes optimal thresholds to increase. In addition, we found that the best overall results for the selection of an optimal threshold are obtained without the use of Random Undersampling (RUS). Furthermore, with the exception of AUPRC, we established that the default threshold yields good performance scores at a balanced class ratio. Our evaluation of four threshold optimization techniques, eight threshold-dependent metrics, and two threshold-agnostic metrics defines the uniqueness of this research.

Introduction

Class imbalance within a dataset occurs when there is a higher number of instances in one or more classes than in the other class(es). From a binary class perspective, this imbalance means that there is one majority (typically negative) class and one minority (typically positive) class. If the difference between the number of majority and minority class instances is significant, as in the case of high class imbalance, the results of a machine learning study could be skewed. As stated by several researchers, a condition of high class imbalance exists when the minority-to-majority ratio ranges from 1:100 to 1:10,000 [1].

Various techniques are employed to reduce class imbalance or to reduce its effect. These techniques can be implemented at the data level, at the algorithm level, or both. The most popular and established data-level approaches [1] are Random Undersampling (RUS), Random Oversampling (ROS), and the Synthetic Minority Oversampling Technique (SMOTE). However, these three techniques have well-known disadvantages. RUS, a method for randomly discarding instances from the majority class, may remove important instances. ROS, a process for duplicating instances of the minority class, runs the risk of overfitting. Developed as a more intelligent form of oversampling, SMOTE generates synthetic instances between existing instances of the minority class, which greatly reduces the risk of overfitting. One disadvantage, characteristic of any oversampling technique, is the increase in the size of the dataset. Algorithm-level approaches include class-weighting and output thresholding. Class-weighting, which is a direct algorithm-level method, modifies the learner [2]. It is a popular technique that has been integrated into many machine learning algorithms. Output thresholding is a process for tuning the decision threshold that is used to associate class labels with a model’s probability estimates [3]. While both algorithm-level approaches reduce bias toward the majority class, output thresholding is the more beneficial technique because it does not modify existing learners and can be performed on any learner that provides probability scores. Therefore, this paper focuses primarily on threshold optimization. In addition, six different levels of RUS are applied to evaluate the interaction between threshold optimization and changing class distributions. We use RUS because studies show that, in most cases, it performs as well as or better than other methods of addressing class imbalance [4, 5].

In this paper, threshold optimization is used to assign class labels to a model’s output probability scores. The optimal or best threshold is one that maximizes the score of a specified performance metric. A valuable tool in our study is the application of the constraint True Positive Rate (TPR) ≥ True Negative Rate (TNR). This constraint ensures that a threshold will not be selected where the positive class has been ignored by a classifier. For comparative purposes, the default threshold of 0.5 is also investigated to determine whether it is suitable for classifying imbalanced data. Since this is a comprehensive study, we investigate four threshold optimization techniques based on metrics: F-measure, Geometric Mean of TPR and TNR, Matthews Correlation Coefficient (MCC), and Precision. In addition, we evaluate eight threshold-dependent metrics: TPR, False Positive Rate (FPR), False Negative Rate (FNR), TNR, F-measure, Geometric Mean of TPR and TNR, MCC, and Precision. Also under evaluation are two threshold-agnostic metrics, Area Under the Receiver Operating Characteristic Curve (AUC) and AUPRC. The learners in this work are XGBoost [6], CatBoost [7], Random Forest [8], Extremely Randomized Trees [9], and Logistic Regression [10].

Our research is centered on the Credit Card Fraud Detection Dataset, a set of anonymized transactions available for download from Kaggle [11]. The dataset is based on credit card purchases made by European cardholders in 2013. There are 284,807 instances and 30 independent variables in the Credit Card Fraud Detection Dataset. Other publicly available datasets used for credit card fraud detection are orders of magnitude smaller. Fraudulent transactions comprise 0.172% of the total number of records, which means that the dataset is highly imbalanced. We use this dataset because it consists of real-world transactions and because it is on track to become a gold standard for credit card fraud detection.

Our research findings are highlighted as follows:

  • As the AUPRC score increases, the threshold-based performance scores also improve.

  • As RUS is used to increase the positive class prior probability, the optimal thresholds also increase.

  • Best overall results for the selection of an optimal threshold are obtained without the use of RUS.

  • For most metrics, the default threshold yields its best results at a balanced (1:1) class ratio.

  • However, the combination of the default threshold and balanced class ratio yields the lowest AUPRC scores for all classifiers, implying a significant tradeoff for balancing the classes.

  • The default threshold does not yield good results when the dataset is imbalanced.

To the best of our knowledge, this is the first study to investigate threshold optimization using four different metric-based techniques while considering the TPR ≥ TNR constraint. Moreover, this is the first study to evaluate threshold optimization with eight threshold-dependent and two threshold-agnostic metrics. The remainder of this paper is organized as follows: the “Related work” section reviews relevant literature on output thresholding; the “Data description” section describes the dataset; the “Methodology” section covers the methodology, learners, and non-default hyperparameters used; the “Results and discussion” section presents and analyzes our findings; and the “Conclusion” section summarizes the key points of this paper and provides suggestions for future work.

Related work

The objective of this section is to discuss similar studies that use optimal thresholds for dataset classification. We did not come across any studies on the use of output thresholding with the Credit Card Fraud Detection Dataset.

In relation to one dataset of scenic images and another of health records, Zhang et al. [12] proposed the use of threshold moving techniques to address class imbalance. This involved the adjustment of decision thresholds for binary classification, so that the class distribution of training data could match the predicted outcomes of new data. Using a multi-label version of Random Forest, the authors then performed multi-label classification, where instances may belong to more than one label. Performance-wise, their results indicate that the Random Forest model was just as good or better than more complex multi-label classifiers. Both our work and theirs incorporate the positive class prior probability threshold. However, we go further by comparing the positive class probability threshold with other thresholds.

Buda et al. [13] investigated the effect of output thresholding on Convolutional Neural Networks (CNNs) [14], with the aid of three benchmark datasets: MNIST [15], CIFAR-10 [16], and ImageNet [17]. Subsampling was used to render the datasets sufficiently imbalanced. The authors showed that making the threshold equivalent to the positive class prior probability noticeably improved accuracy. Not only do we use the positive class prior probability technique in our work, but we also evaluate four threshold optimization techniques. In addition, the inclusion of eight performance metrics in our study makes it a more comprehensive paper.

With a focus on the network security domain, Calvert and Khoshgoftaar [18] used threshold optimization for establishing alternatives to the AUC metric. They determined that optimal thresholds could be obtained with the Geometric Mean and F-Measure metrics. Using these optimal thresholds, the authors were able to evaluate the performance of various classifiers with different metrics. We note that the authors do not assess the effect of optimized thresholds for Geometric Mean and F-Measure on eight metrics. Another contribution of our work is the inclusion of the TPR ≥ TNR constraint.

Finally, Zou et al. [19] developed a method for finding optimal classification thresholds during experimentation with a protein homology dataset. Their technique involved the comparison of optimal thresholds against “uniform” thresholds of 0.1, 0.2, 0.3, and the default threshold of 0.5. According to their results, the optimal thresholds yielded better scores than the “uniform” thresholds. Our approach is more general, in that we assess several optimal thresholds obtained by different techniques. Moreover, we demonstrate how constraints can be effectively imposed on the threshold optimization process.

We discovered that many studies use RUS to address class imbalance [1]. As stated earlier, there is an inherent risk of discarding important instances when using RUS. This risk is non-existent for output thresholding. In concluding this section, we reaffirm that this is the first paper to include four threshold optimization techniques based on metrics, while taking into account the TPR ≥ TNR constraint. It is also the first paper to do so using eight threshold-dependent and two threshold-agnostic metrics.

Data description

The Credit Card Fraud Detection Dataset [11] was published by Worldline and the Université Libre de Bruxelles (ULB). There are 284,807 instances and 30 independent variables, or input features, in the raw dataset, which records credit card purchases made by European cardholders in September 2013. Using Principal Component Analysis (PCA) [20], the dataset publishers transformed 28 of the 30 input features. The remaining two features, “Time” and “Amount”, were not transformed. “Time” contains the seconds elapsed between each transaction and the first transaction in the dataset. “Amount”, which we normalized, is the transaction amount. “Time” was dropped because it is an uninformative feature for the purpose of this study.
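To make this preprocessing concrete, a minimal sketch is given below; the file name, the use of z-score scaling for “Amount”, and the stratified 80/20 split (described in the “Methodology” section) are our illustrative assumptions rather than the exact code used in the study.

```python
# Illustrative preprocessing sketch (not the authors' exact code). Assumptions:
# the Kaggle file is named "creditcard.csv", "Amount" is normalized with
# z-score scaling, and the 80/20 split is stratified on the class label.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")
df["Amount"] = StandardScaler().fit_transform(df[["Amount"]]).ravel()  # normalize transaction amount
df = df.drop(columns=["Time"])                                         # drop the uninformative Time feature

X = df.drop(columns=["Class"])  # V1-V28 plus the normalized Amount
y = df["Class"]                 # 1 = fraudulent, 0 = non-fraudulent

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
```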

The label (dependent variable) of this binary dataset is 1 for a fraudulent transaction and 0 for a non-fraudulent transaction. Fraudulent transactions constitute 492 instances, or 0.172%, thus making the dataset highly imbalanced with regard to the minority and majority classes.

Methodology

Experiments were run on a distributed computing platform where the available nodes have Intel Xeon Central Processing Units (CPUs) with 16 cores, 256 GB of RAM per CPU, and Nvidia V100 GPUs. Our programs for training and testing machine learning models were implemented in the Python programming language. CatBoost and XGBoost are standalone Python libraries. Random Forest, Extremely Randomized Trees, and Logistic Regression are part of the Scikit-learn library [21]. These five learners represent different families of machine learning algorithms, which benefits the generalization of results.

CatBoost is designed around Ordered Boosting, an algorithm that orders instances used by Decision Trees. XGBoost is based on a weighted quantile sketch and a sparsity-aware function. A weighted quantile sketch uses approximate tree learning [22] for merging and pruning operations of Decision Trees, while the sparsity-aware function is an optimization that efficiently locates the best value to split a dataset on for a Decision Tree node when data is sparse. Random Forest is an ensemble of Decision Trees, and it uses the bagging [23] technique. The Extremely Randomized Trees learner, which also relies on the bagging technique, is an extension of Random Forest. However, for Random Forest, the optimal values for splits in the Decision Tree are usually calculated systematically, whereas these optimal values are selected randomly for Extremely Randomized Trees. Logistic Regression produces a value corresponding to the probability of belonging to a particular class. It is a linear model that relies on a sigmoid function to output a number between 0 and 1.
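For concreteness, the snippet below shows one way these five learners can be instantiated in Python; the depth values are illustrative placeholders rather than the tuned maximum depths reported in Table 1.

```python
# Illustrative instantiation of the five learners; the depth values below are
# placeholders, not the tuned maximum depths reported in Table 1.
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

learners = {
    "XGBoost": XGBClassifier(max_depth=1),
    "CatBoost": CatBoostClassifier(depth=5, verbose=0),
    "Random Forest": RandomForestClassifier(max_depth=4),
    "Extremely Randomized Trees": ExtraTreesClassifier(max_depth=8),
    "Logistic Regression": LogisticRegression(),  # default hyperparameters, as in the study
}
```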

For every experiment, we select a classifier and a class ratio that we wish to apply to the training data. We use the Imblearn [24] library’s RandomUnderSampler module to control class ratios for all experiments, except for the case when we leave the class ratio at its initial value. Hence, we perform experiments with the six class ratios of 1:1 (balanced class ratio), 1:3, 1:9, 1:27, 1:81, and 1:578 (original class ratio of the Credit Card Fraud Detection dataset). Selection of the first five ratios is based on preliminary experimentation through RUS.
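A minimal sketch of this sampling step, assuming a binary y_train in which the fraud class (label 1) is the minority, is shown below; the random seed and variable names are illustrative.

```python
# Sketch of RUS at the class ratios used in the study; the original 1:578
# ratio corresponds to skipping this step entirely. Variable names are ours.
from imblearn.under_sampling import RandomUnderSampler

ratios = {"1:1": 1.0, "1:3": 1 / 3, "1:9": 1 / 9, "1:27": 1 / 27, "1:81": 1 / 81}

resampled = {}
for name, minority_to_majority in ratios.items():
    rus = RandomUnderSampler(sampling_strategy=minority_to_majority, random_state=0)
    X_res, y_res = rus.fit_resample(X_train, y_train)  # randomly discards majority-class instances
    resampled[name] = (X_res, y_res)
```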

For each experiment, training is subsequently performed on 80% of the data using k-fold cross-validation, where the model is trained on k-1 folds each time and tested on the remaining fold. This ensures that as much data as possible is used during the classification phase. Our cross-validation process is stratified, which seeks to ensure that each class is proportionally represented across the folds. In this experiment, a value of five was assigned to k, where four folds were used in training and one fold was used in testing. We perform 10 iterations of cross-validation as a precaution against data loss due to random sampling of instances from the majority class.
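The cross-validation scheme can be expressed as in the following sketch; the use of RepeatedStratifiedKFold and the placeholder classifier are our assumptions about one way to implement 10 iterations of stratified fivefold cross-validation.

```python
# Sketch of 10 repetitions of stratified fivefold cross-validation, assuming
# scikit-learn's RepeatedStratifiedKFold; the classifier here is a placeholder.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold

clf = LogisticRegression()  # stand-in for any of the five learners
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)

for tr_idx, te_idx in cv.split(X_train, y_train):
    clf.fit(X_train.iloc[tr_idx], y_train.iloc[tr_idx])        # train on four folds
    fold_probs = clf.predict_proba(X_train.iloc[te_idx])[:, 1]  # score the held-out fold
```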

The threshold optimization technique, outlined in Algorithm 1, is applied to the training data. After the model is fit to the training partitions, Algorithm 1 identifies the optimal decision threshold that maximizes a user-defined performance metric: given the training data’s probability estimates and ground-truth labels, we enumerate all candidate thresholds and select the one that maximizes the desired metric. This design is flexible, since the user can optimize against any performance metric suitable for the problem. It also accommodates additional constraints; for example, in many classification problems we require that the TPR be greater than or equal to the TNR. Finally, the optimal threshold learned from the training partitions is applied to the remaining 20% of the data: classifier output probabilities are calculated for the test instances, class labels are assigned based on the computed threshold, and test performance is recorded.

Algorithm 1 Optimal decision threshold search

We point out that our threshold optimization algorithm takes a function as an argument. This function parameter may be any one of four classification metrics: Geometric Mean of TPR and TNR, MCC, Precision, or F-measure. Furthermore, the algorithm has a flag that controls whether optimization is constrained. If the flag is set to true, the search is restricted to thresholds where TPR ≥ TNR, and the best value of the specified metric is chosen from among those candidates. Optimized thresholds are computed for all combinations of the constraint flag and the four classification metrics.
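The sketch below captures the essence of Algorithm 1 as described above; the function and variable names are ours, and enumerating candidate thresholds over the distinct training-set probability estimates is an implementation assumption.

```python
# A minimal sketch of the constrained threshold search (not the authors'
# exact implementation). Candidate thresholds are the distinct training-set
# probability estimates; metric_fn is any score function of (y_true, y_pred).
import numpy as np
from sklearn.metrics import confusion_matrix

def rates(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return tpr, tnr

def optimize_threshold(y_true, y_prob, metric_fn, constrained=True):
    best_t, best_score = 0.5, -np.inf
    for t in np.unique(y_prob):                 # enumerate all candidate thresholds
        y_pred = (y_prob >= t).astype(int)
        tpr, tnr = rates(y_true, y_pred)
        if constrained and tpr < tnr:           # enforce TPR >= TNR when requested
            continue
        score = metric_fn(y_true, y_pred)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score

# Example: optimize the Geometric Mean of TPR and TNR under the constraint,
# then apply the learned threshold to held-out test probabilities.
def g_mean(y_true, y_pred):
    tpr, tnr = rates(y_true, y_pred)
    return np.sqrt(tpr * tnr)

# t_opt, _ = optimize_threshold(y_train, train_probs, g_mean, constrained=True)
# y_test_pred = (test_probs >= t_opt).astype(int)
```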

For each of the optimized thresholds, as well as the default threshold of 0.5 and the positive class prior probability threshold, scores are calculated for the following metrics: TPR, FPR, FNR, TNR, F-measure, Geometric Mean of TPR and TNR, MCC, and Precision. For Logistic Regression, the hyperparameter values were not changed. To prevent overfitting of the Decision Tree-based classifiers, the maximum tree depths shown in Table 1 are used. These depths were obtained from preliminary experimentation. Overall classification performance for each model at each ratio was evaluated with the threshold-agnostic AUC and AUPRC metrics.
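A sketch of this scoring step is shown below, assuming probability outputs and ground-truth labels from a fitted classifier; here AUPRC is approximated with scikit-learn’s average precision.

```python
# Sketch of the per-threshold evaluation; metric names mirror the result tables.
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score, matthews_corrcoef,
                             precision_score, roc_auc_score,
                             average_precision_score)

def threshold_scores(y_true, y_prob, threshold):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    tpr, tnr = tp / (tp + fn), tn / (tn + fp)
    return {
        "TPR": tpr, "FPR": fp / (fp + tn), "FNR": fn / (fn + tp), "TNR": tnr,
        "F-meas": f1_score(y_true, y_pred),
        "G-mean": np.sqrt(tpr * tnr),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
    }

# Threshold-agnostic scores are computed once per model from the probabilities:
# auc = roc_auc_score(y_test, test_probs)
# auprc = average_precision_score(y_test, test_probs)  # average-precision approximation of AUPRC
```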

Table 1 Maximum tree depths used in experiments

Results and discussion

To start off, we define the words and abbreviations used in the tables. “Classifier” is the type of learning algorithm used for classification. AUC is the Area under the Receiver Operating Characteristic Curve. AUPRC is the Area under the Precision–Recall Curve. “Technique” is the thresholding technique used. An example is the selection of the output probability threshold that optimizes F-measure. “Threshold” is the value of the threshold found for a particular thresholding technique. TPR stands for True Positive Rate. FPR stands for False Positive Rate. FNR stands for False Negative Rate. F-meas stands for the F-measure score. G-mean stands for the score of the Geometric Mean of TPR and TNR. MCC stands for Matthews Correlation Coefficient.

Under the “Technique” column, there are several abbreviations for the thresholding techniques. F-meas, G-mean and MCC have already been defined. NC stands for no constraint; in other words, the constraint has not been applied. The constraint is that the True Positive Rate is greater than or equal to the True Negative Rate. Therefore, “NC” after any thresholding technique means that the constraint is not used. C stands for class prior; i.e., the threshold value chosen is the fraction of positive instances in the dataset. Since the class prior threshold is predetermined, and not calculated via optimization code, the constraint that the True Positive Rate be greater than or equal to the True Negative Rate cannot be applied. D stands for the default threshold of 0.5. The constraint cannot be applied to the default threshold either.

Several tables of results were generated for the various class ratios for each classifier. For ease of understanding, in this section we show the results of CatBoost, which is the top performing classifier overall. Results obtained with the remaining classifiers are shown in “Appendices”.

Classification results for the original class ratio (no RUS applied)

In Table 2, CatBoost is the best performer with regard to AUC and AUPRC, obtaining scores of 0.9834 and 0.8592, respectively. For Table 3, use of the default threshold yields comparatively low TPR scores and comparatively high FNR scores. Furthermore, the threshold values obtained using constrained optimal thresholds are lower than their non-constrained counterparts.

Table 2 Mean AUC and AUPRC scores for 10 iterations of fivefold cross validation
Table 3 Results for the CatBoost depth 5 classifier

Classification results for the 1:1 class ratio

In Table 4, Extremely Randomized Trees is the best performer in reference to AUC and AUPRC, obtaining scores of 0.9803 and 0.7379, respectively. For Table 5, use of the default threshold yields comparatively low TPR scores and comparatively high FNR scores. In addition, the threshold values obtained using constrained optimal thresholds are lower than their non-constrained counterparts.

Table 4 Mean AUC and AUPRC scores for 10 iterations of fivefold cross validation
Table 5 Results for the CatBoost depth 1 classifier

Classification results for the 1:3 class ratio

In Table 6, CatBoost generated the highest score for AUC (0.9790), while the top score for AUPRC (0.7481) was obtained with XGBoost. For Table 7, use of the default threshold yields comparatively low TPR scores and comparatively high FNR scores. Also, the threshold values obtained using constrained optimal thresholds are lower than their non-constrained counterparts.

Table 6 Mean AUC and AUPRC scores for 10 iterations of fivefold cross validation
Table 7 Results for the CatBoost depth 1 classifier

Classification results for the 1:9 class ratio

In Table 8, XGBoost produced the top scores for both AUC (0.9801) and AUPRC (0.7804). For Table 9, use of the default threshold yields comparatively low TPR scores and comparatively high FNR scores. In addition, the threshold values obtained using constrained optimal thresholds are lower than their non-constrained counterparts.

Table 8 Mean AUC and AUPRC scores for 10 iterations of fivefold cross validation
Table 9 Results for the CatBoost depth 1 classifier

Classification results for the 1:27 class ratio

In Table 10, CatBoost is the top performer for both AUC and AUPRC, registering scores of 0.9817 and 0.7963, respectively. For Table 11, use of the default threshold yields comparatively low TPR scores and comparatively high FNR scores. Also, the threshold values obtained using constrained optimal thresholds are lower than their non-constrained counterparts.

Table 10 Mean AUC and AUPRC scores for 10 iterations of fivefold cross validation
Table 11 Results for the CatBoost depth 5 classifier

Classification results for the 1:81 class ratio

In Table 12, CatBoost is the top performer for both AUC (0.9832) and AUPRC (0.8490). For Table 13, use of the default threshold yields comparatively low TPR scores and comparatively high FNR scores. Also, the threshold values obtained using constrained optimal thresholds are lower than their non-constrained counterparts.

Table 12 Mean AUC and AUPRC scores for 10 iterations of fivefold cross validation
Table 13 Results for the CatBoost depth 5 classifier

Overall analysis of results

A thorough investigation of results should also involve a collective analysis of the various experiments performed. As such, interesting observations about the overall study are discussed in the paragraphs below.

As RUS is used to increase the positive class prior probability, the AUC metric remains practically unchanged. For the original class ratio, the highest score in Table 2 is 0.9834 (CatBoost). At the balanced class ratio, the highest score in Table 4 is 0.9803 (Extremely Randomized Trees). In terms of AUPRC, the increase of positive class prior probability negatively affects the scores. At the original class ratio, the highest score in Table 2 is 0.8592 (CatBoost), which is also the highest value of AUPRC for the entire study. At the balanced class ratio, the highest score in Table 4 is 0.7379 (Extremely Randomized Trees). AUPRC is impacted due to the sensitivity of this metric to the percentage of positive class instances. When RUS is used to balance training data, the models become less and less biased toward the majority class. Instead, they begin to favor the minority class more, causing them to over-predict the minority class and obtain many false positives. These large numbers of false positives are detrimental to Precision and AUPRC.

Increasing model bias toward the minority class may result in a higher TPR score on the test set. This action can significantly lower TNR scores, which in turn negatively impacts Precision and F-Measure. In general, the increase in bias results in lower G-Mean, F-Measure, MCC, and Precision scores (all metrics that take both classes into consideration). This is because overall, the models trained with RUS are worse at discriminating between the classes, as observed with the AUPRC results. Some may argue that RUS should be used to obtain higher TPR scores. Based on our results, however, we counterargue that instead of applying RUS to raise TPR scores, the threshold optimization process should be used. This optimization would yield better results, because a model trained without RUS is better at discriminating between classes.

Since the AUPRC indicates how good a model is at separating classes, an increase in the AUPRC score means that, in theory, there will be an improvement in threshold-based performance. Hence, the use of AUPRC and optimal thresholding are two techniques that complement each other. We point out that the AUPRC does not identify a specific threshold for selection. Rather, the AUPRC is helpful because it allows for the selection of the best model, model hyperparameters, or sampling rates. Within this framework, optimal thresholding should be implemented after the Precision–Recall curve is plotted, in order to select the correct operating point on the curve. Our results show that the best AUPRC values occur when RUS is not applied (at the original class ratio). Therefore, in the absence of RUS, optimal thresholding can be used to identify the ideal operating point on the Precision–Recall curve. This does not mean that the use of RUS should be avoided at all costs, as there is always a run-time tradeoff to consider. For example, undersampling to a 1:81 class ratio may yield statistically similar results, which translates to a saving of training resources in the case of big data.
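This model-selection-then-thresholding workflow can be sketched as follows; the helper is illustrative and assumes the optimize_threshold and g_mean routines from the earlier sketch.

```python
# Hypothetical helper: select the configuration with the best AUPRC
# (average-precision approximation) and then fix its operating point with the
# constrained threshold search from the earlier sketch.
from sklearn.metrics import average_precision_score

def select_model_and_threshold(y_val, candidate_probs, metric_fn):
    """candidate_probs maps a configuration name to its validation probabilities."""
    best = max(candidate_probs,
               key=lambda name: average_precision_score(y_val, candidate_probs[name]))
    t_opt, _ = optimize_threshold(y_val, candidate_probs[best], metric_fn, constrained=True)
    return best, t_opt
```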

An increase in positive class prior probability with RUS generally increases the optimal thresholds. This phenomenon can be observed in the results obtained with CatBoost (and also in the results shown in “Appendices” for the remaining classifiers). For the original class ratio, Table 3 shows the constrained optimal threshold for Geometric Mean as 0.0015, while for the balanced class ratio in Table 5, the corresponding value is 0.2992. This behavior occurs because the optimal thresholds shift toward the now larger positive class prior probability.

Finally, it should be noted that the best results for the default threshold are obtained at the balanced class ratio. This observation can be shown using results obtained with CatBoost (and also results shown in “Appendices” for the remaining classifiers). For the original class ratio, Table 3 associates the default threshold with a Geometric Mean score of 0.8905. For the balanced class ratio, the related value in Table 5 is 0.9375. This observation can be attributed to the shifting of probability scores closer to the value of 0.5 when RUS is used to obtain the 1:1 class ratio. We also point out that there is a comparatively significant difference between the Geometric Mean score of 0.9375 at the balanced class ratio and 0.8905 at the original class ratio. For the 1:81, 1:27, 1:9, and 1:3 ratios, the related Geometric Mean scores for Logistic Regression at the default threshold are 0.9126, 0.9229, 0.9259, and 0.9349, respectively. Hence, it should also be noted that the default threshold does not yield strong results when the dataset is imbalanced.

Conclusion

In our research, we use the Credit Card Fraud Detection Dataset to investigate output thresholding. A useful aid in this work is the application of the constraint TPR ≥ TNR, which ensures that the positive class is never ignored during the selection of an optimal threshold. We evaluate four threshold optimization techniques, eight threshold-dependent metrics, and two threshold-agnostic metrics.

Our primary observations made in this research suggest that: an increase of the AUPRC score is associated with an improvement of threshold-based performance scores; increasing the positive class prior probability will increase optimal thresholds; best overall results for an optimal threshold are obtained without the need for RUS; determining whether to use the default threshold with a 1:1 class ratio obtained via RUS requires a proper consideration of the tradeoff involved; the default threshold yields poor results when the dataset is imbalanced. Future work will use our threshold optimization approach with datasets from other application domains. Also, the use of additional constraints will be evaluated.

Availability of data and materials

Not applicable.

Abbreviations

ANN: Artificial neural network
ANOVA: Analysis of variance
AUC: Area Under the Receiver Operating Characteristic Curve
AUPRC: Area Under the Precision–Recall Curve
CAE: Convolutional autoencoder
CNN: Convolutional neural network
ET: Extremely Randomized Trees
FAU: Florida Atlantic University
FN: False negative
FNR: False negative rate
FP: False positive
FPR: False positive rate
GBDT: Gradient-boosted decision tree
HSD: Honestly significant difference
k-NN: k-Nearest neighbor
MCC: Matthews Correlation Coefficient
PCA: Principal component analysis
ROS: Random oversampling
RUS: Random undersampling
SMOTE: Synthetic minority oversampling technique
SVM: Support vector machine
TN: True negative
TNR: True negative rate
TP: True positive
TPR: True positive rate
ULB: Université Libre de Bruxelles

References

  1. Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data. 2018;5(1):42.
  2. Kesici M, Saner CB, Yaslan Y, Genc VI. Cost sensitive class-weighting approach for transient instability prediction using convolutional neural networks. In: 2019 11th international conference on electrical and electronics engineering (ELECO). IEEE; 2019. p. 141–5.
  3. Johnson JM, Khoshgoftaar TM. Output thresholding for ensemble learners and imbalanced big data. In: 2021 IEEE 33rd international conference on tools with artificial intelligence (ICTAI). IEEE; 2021. p. 1449–54.
  4. Hasanin T, Khoshgoftaar TM, Leevy JL, Bauder RA. Severely imbalanced big data challenges: investigating data sampling approaches. J Big Data. 2019;6(1):1–25.
  5. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A. A comparative study of data sampling and cost sensitive learning. In: 2008 IEEE international conference on data mining workshops. IEEE; 2008. p. 46–52.
  6. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining; 2016. p. 785–94.
  7. Hancock JT, Khoshgoftaar TM. CatBoost for big data: an interdisciplinary review. J Big Data. 2020;7(1):1–45.
  8. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
  9. Acosta MRC, Ahmed S, Garcia CE, Koo I. Extremely randomized trees-based scheme for stealthy cyber-attack detection in smart grid networks. IEEE Access. 2020;8:19921–33.
  10. Wang Q, Yu S, Qi X, Hu Y, Zheng W, Shi J, Yao H. Overview of logistic regression model analysis and application. Zhonghua yu fang yi xue za zhi [Chin J Prev Med]. 2019;53(9):955–60.
  11. Kaggle: credit card fraud detection. https://www.kaggle.com/mlg-ulb/creditcardfraud.
  12. Zhang X, Gweon H, Provost S. Threshold moving approaches for addressing the class imbalance problem and their application to multi-label classification. In: 2020 4th international conference on advances in image processing; 2020. p. 72–7.
  13. Buda M, Maki A, Mazurowski MA. A systematic study of the class imbalance problem in convolutional neural networks. Neural Netw. 2018;106:249–59.
  14. Zhang H, Huang L, Wu CQ, Li Z. An effective convolutional neural network based on SMOTE and Gaussian mixture model for intrusion detection in imbalanced dataset. Comput Netw. 2020;177:107315.
  15. Cohen G, Afshar S, Tapson J, Van Schaik A. EMNIST: extending MNIST to handwritten letters. In: 2017 international joint conference on neural networks (IJCNN). IEEE; 2017. p. 2921–6.
  16. Yang L, Bankman D, Moons B, Verhelst M, Murmann B. Bit error tolerance of a CIFAR-10 binarized convolutional neural network processor. In: 2018 IEEE international symposium on circuits and systems (ISCAS). IEEE; 2018. p. 1–5.
  17. Krizhevsky A, Sutskever I, Hinton GE. ImageNet classification with deep convolutional neural networks. Commun ACM. 2017;60(6):84–90.
  18. Calvert CL, Khoshgoftaar TM. Threshold based optimization of performance metrics with severely imbalanced big security data. In: 2019 IEEE 31st international conference on tools with artificial intelligence (ICTAI). IEEE; 2019. p. 1328–34.
  19. Zou Q, Xie S, Lin Z, Wu M, Ju Y. Finding the best classification threshold in imbalanced classification. Big Data Res. 2016;5:2–8.
  20. Salekshahrezaee Z, Leevy JL, Khoshgoftaar TM. Feature extraction for class imbalance using a convolutional autoencoder and data sampling. In: 2021 IEEE 33rd international conference on tools with artificial intelligence (ICTAI). IEEE; 2021. p. 217–23.
  21. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
  22. Gupta A, Nagarajan V, Ravi R. Approximation algorithms for optimal decision trees and adaptive TSP problems. Math Oper Res. 2017;42(3):876–96.
  23. González S, García S, Del Ser J, Rokach L, Herrera F. A practical tutorial on bagging and boosting based ensembles for machine learning: algorithms, software tools, performance study, practical perspectives and opportunities. Inf Fusion. 2020;64:205–37.
  24. Lemaître G, Nogueira F, Aridas CK. Imbalanced-learn: a Python toolbox to tackle the curse of imbalanced datasets in machine learning. J Mach Learn Res. 2017;18(1):559–63.


Acknowledgements

We would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University.

Funding

Not applicable.

Author information


Contributions

JLL searched for relevant papers and drafted the manuscript. All authors provided feedback to JLL and helped shape the work. JLL, JMJ, and JH prepared the manuscript. TMK introduced this topic to JLL and helped to complete and finalize the work. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Joffrey L. Leevy.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix 1: Results for original class ratio

See Tables 14, 15, 16 and 17.

Table 14 Results for the extremely randomized trees depth 8 classifier
Table 15 Results for the logistic regression classifier
Table 16 Results for the random forest depth 4 classifier
Table 17 Results for the XGBoost depth 1 classifier

Appendix 2: Results for 1:1 class ratio

See Tables 18, 19, 20 and 21.

Table 18 Results for the extremely randomized trees depth 8 classifier
Table 19 Results for the logistic regression classifier
Table 20 Results for the random forest depth 4 classifier
Table 21 Results for the XGBoost depth 1 classifier

Appendix 3: Results for 1:3 class ratio

See Tables 22, 23, 24 and 25.

Table 22 Results for the extremely randomized trees depth 8 classifier
Table 23 Results for the logistic regression classifier
Table 24 Results for the random forest depth 4 classifier
Table 25 Results for the XGBoost depth 1 classifier

Appendix 4: Results for 1:9 class ratio

See Tables 26, 27, 28 and 29.

Table 26 Results for the extremely randomized trees depth 8 classifier
Table 27 Results for the logistic regression classifier
Table 28 Results for the random forest depth 4 classifier
Table 29 Results for the XGBoost depth 1 classifier

Appendix 5: Results for 1:27 class ratio

See Tables 30, 31, 32 and 33.

Table 30 Results for the extremely randomized trees depth 8 classifier
Table 31 Results for the logistic regression classifier
Table 32 Results for the random forest depth 4 classifier
Table 33 Results for the XGBoost depth 1 classifier

Appendix 6: Results for 1:81 class ratio

See Tables 34, 35, 36 and 37.

Table 34 Results for the extremely randomized trees depth 8 classifier
Table 35 Results for the logistic regression classifier
Table 36 Results for the random forest depth 4 classifier
Table 37 Results for the XGBoost depth 1 classifier

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Leevy, J.L., Johnson, J.M., Hancock, J. et al. Threshold optimization and random undersampling for imbalanced credit card data. J Big Data 10, 58 (2023). https://doi.org/10.1186/s40537-023-00738-z


Keywords