Evaluating classifier performance with highly imbalanced Big Data

Abstract

Using the wrong metrics to gauge classification of highly imbalanced Big Data may hide important information in experimental results. However, we find that analysis of metrics for performance evaluation and what they can hide or reveal is rarely covered in related works. Therefore, we address that gap by analyzing multiple popular performance metrics on three Big Data classification tasks. To the best of our knowledge, we are the first to utilize three new Medicare insurance claims datasets, which became publicly available in 2021. These datasets are all highly imbalanced. Furthermore, the datasets comprise completely different data. We evaluate the performance of five ensemble learners in the Machine Learning task of Medicare fraud detection. Random Undersampling (RUS) is applied to induce five class ratios. The classifiers are evaluated with both the Area Under the Receiver Operating Characteristic Curve (AUC) and the Area Under the Precision Recall Curve (AUPRC) metrics. We show that AUPRC provides better insight into classification performance. Our findings reveal that the AUC metric hides the performance impact of RUS. However, classification results in terms of AUPRC show RUS has a detrimental effect. We show that, for highly imbalanced Big Data, the AUC metric fails to capture information about precision scores and false positive counts that the AUPRC metric reveals. Our contribution is to show AUPRC is a more effective metric for evaluating the performance of classifiers when working with highly imbalanced Big Data.

Introduction

The use of a single metric to draw conclusions on the impact of a factor in classification experiments may lead to mistakes that we wish to help our fellow researchers avoid. Random Undersampling (RUS) is an appealing strategy for mitigating class imbalance in Big Data. It can drastically reduce the size of the training data used during the model training phase of Machine Learning. Less training data translates into faster training times for many Machine Learning algorithms. Therefore, applying RUS may save a researcher time when conducting experiments. Our contribution is to reveal a trade-off that one should consider before concluding that applying RUS is the best choice. We show that RUS may have a positive impact on Area Under the Receiver Operating Characteristic Curve (AUC) [1] scores. At the same time, we show RUS may have a clear negative impact on Area Under the Precision Recall Curve (AUPRC) [2] scores.

This underscores the importance of evaluating results in terms of more than one metric. Therefore, if performance in terms of AUPRC is important, applying RUS may not be a viable option. We validate these findings using three distinct datasets and five popular ensemble learners in the task of Medicare fraud detection. In our experiments, we apply RUS to induce five different levels of minority:majority class ratios, and classify datasets of varying sizes. The smallest dataset we work with has approximately 12 million instances. We also perform experiments with a dataset that has approximately 68 million instances, and another dataset that has approximately 175 million instances. For each dataset we find the same pattern holds: AUC scores are either unaffected or improved by RUS, while AUPRC scores are degraded.

The application domain of our study is automated Medicare insurance fraud detection. Medicare is the United States public health insurance program, primarily dedicated to individuals aged 65 and over. The organization responsible for the Medicare program is the Centers for Medicare and Medicaid Services (CMS). To foster research, the CMS maintains a repository of publicly available Medicare insurance claims data. We use data from three different sources in this study. Data from each source has unique attributes. The smallest dataset we use is from a section of the CMS website titled “Medicare Durable Medical Equipment, Devices & Supplies—by Referring Provider and Service” (DMEPOS) [3]. We construct a dataset of approximately 12 million instances from this data. The next largest data we use is from a section of the CMS website titled “Medicare Physician & Other Practitioners—by Provider and Service” (Part B) [4]. We derive a dataset with approximately 68 million instances from the Part B data. Finally, the largest data we use is from a section of the CMS website titled “Medicare Provider Utilization and Payment Data: Part D Prescriber” (Part D) [5]. We compile a dataset of nearly 175 million instances from the Part D data. The CMS regularly adds to each of these data sources, which lends them the aspects of volume and velocity that characterize Big Data [6].

The volume and velocity of the Part B, Part D, and DMEPOS data also reflect the quantity and speed at which the CMS receives insurance claims from healthcare providers. Currently, it is possible for some dishonest providers to get away with submitting fraudulent claims to Medicare and avoid detection. The volume of claims submitted is large enough that a small fraction of undetected fraudulent claims still translates to large dollar amounts. In 2019, the Department of Justice recovered approximately three billion dollars in fraudulently obtained funds by prosecuting rogue healthcare providers [7]. Nevertheless, there is a degree of uncertainty surrounding how much money the CMS loses to fraud. The CMS does not report an estimate of funds paid on fraudulent claims. Rather, it reports an estimate of “improper payments.” For 2019, the CMS reported it made approximately $100 billion in improper payments [8]. The CMS defines improper payments to include payments due to fraud, as well as payments due to mistakes on the CMS’s part. Reliable, automated fraud detection would provide a means for the CMS to estimate the percentage of improper payments due to fraud. This would in turn provide a stronger justification for law enforcement to pursue the recovery of money stolen by fraudsters. Our application domain is Machine Learning for Medicare fraud detection. Therefore, our work is a contribution towards the ultimate goal of automated Medicare fraud detection. The benefit of better fraud detection is that the government can put funds to better use, or lower taxes due to the reduced cost of the program.

There are valid concerns to be raised on the subject of automated fraud detection. Chief among them is the possibility of accusing legitimate providers of fraud when they are innocent. Since we refer to the population of fraudulent providers as the positive class in our classification framework, accusation of innocent healthcare providers is equivalent to a false positive. Clearly, numerous false positives would mean that automated fraud detection would be doing more harm than good. This is why the key finding in our work is important in the field of automated Medicare fraud detection: it reveals a better way to detect when a classifier applied to highly imbalanced Big Data yields many false positives. AUC is a popular metric for evaluating classifiers of highly imbalanced Big Data; see, for example, [9, 10]. However, our research shows AUC might not always provide a complete picture of classification results. Our experimental results show how the AUPRC metric can provide a clearer signal than AUC that a model is generating false positives. A look at the definitions of the components of AUC and AUPRC reveals why AUPRC is a better herald of false positives. On one hand, AUPRC is calculated by plotting the precision and recall scores a model yields as we vary the output probability threshold for the classification decision from zero to one. Precision is defined as

$$\begin{aligned} \frac{\text {true positives}}{\text {true positives} + \text {false positives,}} \end{aligned}$$

and the definition of recall is

$$\begin{aligned} \frac{\text {true positives}}{\text {true positives} + \text {false negatives.}} \end{aligned}$$

On the other hand, AUC is calculated by plotting true positive rate and false positive rate as the output probability threshold varies from zero to one. The true positive rate is

$$\begin{aligned} \frac{\text {true positives}}{\text {true positives}+\text {false negatives}} \end{aligned}$$

and the false positive rate is

$$\begin{aligned} \frac{\text {false positives}}{\text {true negatives}+\text {false positives}}. \end{aligned}$$

Since true positive rate and recall are actually the same quantity, the difference in AUC and AUPRC must come from the difference between precision and the false positive rate. We see that the false positive rate involves true negatives, whereas precision does not. In highly imbalanced Big Data, where the positive class is the minority class, the true positives in the formula for precision should be small numbers, so that when the number of false positives starts to grow, it can quickly dominate the value of precision. Hence, precision can easily reflect the number of false positives in classifying imbalanced Big Data. A similar analysis of the terms involved in calculating the false positive rate shows that false positives get drowned out due to the size of the negative class. Since the denominator in the definition of the false positive rate is the size of the negative class, and the size of the negative class is large in imbalanced Big Data, a change in the number of false positives may be difficult to perceive.

To solidify the argument, we give an example with some hypothetical numbers. Let us assume we have a dataset where the size of the negative class is two million. Furthermore, let us assume we have done a classification of the data for some output probability threshold, and we have 1800 true positives and 2000 false positives. Moreover, the sample has 2000 positive instances. From these numbers, we can calculate precision and false positive rate. The precision is

$$\begin{aligned} \frac{1800}{1800 + 2000} \approx 0.47, \end{aligned}$$

and the false positive rate is

$$\begin{aligned} \frac{2000}{2,000,000} = 0.001. \end{aligned}$$

Now let us assume that for a different output probability threshold the number of false positives has increased to 4000. Then the new value of precision is

$$\begin{aligned} \frac{1800}{1800 + 4000} \approx 0.31, \end{aligned}$$

and the false positive rate is

$$\begin{aligned} \frac{4000}{2,000,000} = 0.002. \end{aligned}$$

In this example, doubling the total number of false positives decreases the precision score by 0.16, but only increases the false positive rate by 0.001. Since both values are used directly as values of coordinates on curves that occupy the same square with area \(1\times 1\) in the xy coordinate plane, we can compare them directly. Therefore, for this example, we conclude that the increase in false positives has a bigger impact on precision and AUPRC, than on the false positive rate and AUC. Therefore, it is possible that if a factor in some experiments causes a larger number of false positives, AUC will not reflect the impact, but AUPRC will.
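
The short Python sketch below reproduces this arithmetic; the helper functions are purely illustrative and are not part of our experimental code.

```python
# Worked example from the text: how doubling the false positives moves
# precision versus the false positive rate, given 2,000,000 negative
# instances and 1800 true positives.

def precision(tp, fp):
    return tp / (tp + fp)

def false_positive_rate(fp, tn):
    return fp / (tn + fp)

NEGATIVES = 2_000_000
TP = 1800

for fp in (2000, 4000):
    tn = NEGATIVES - fp  # negatives that were not flagged as positive
    print(f"FP={fp}: precision={precision(TP, fp):.2f}, "
          f"FPR={false_positive_rate(fp, tn):.3f}")
# FP=2000: precision=0.47, FPR=0.001
# FP=4000: precision=0.31, FPR=0.002
```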

We perform a collection of experiments that provides an example of how AUC will not reflect the impact of a factor on classification results, but AUPRC will. One factor tested in this collection of experiments is the minority:majority class ratio in the training data. We apply RUS to induce five different class ratios. The second factor in our experiments is the type of classifier used. We use five popular, open source ensemble learners: CatBoost [11], XGBoost [12], LightGBM [13], Random Forest [14], and Extremely Randomized Trees (ET) [15]. Our results show that, regardless of learner and dataset, AUPRC scores are diminished as the class ratio gets closer to 1:1. The AUC scores for the same experiments do not reflect the relationship RUS has with AUPRC. This leads us to the conclusion that AUPRC scores reveal more about the impact of RUS than AUC scores. Our goal is to provide a thorough justification for this conclusion. Along the way to providing this justification, to the best of our knowledge, we make several novel contributions:

  • we are the first to use the Part B, Part D, and DMEPOS data, which became available in 2021, in a peer-reviewed study;

  • we are the first to show that classification of this new Big Data should be evaluated in terms of AUPRC;

  • we are the first to use five ensemble learners to do Medicare fraud detection;

  • we are the first to employ a Random Forest implementation that runs on Graphics Processing Units (GPUs) to do Medicare fraud detection; and

  • we are the first to use ET to do Medicare fraud detection.

The remainder of this study is organized into the following sections: Related Works, Algorithms, Data Description and Preparation, Methodology, Results, Statistical Analysis, and Conclusions.

Related work

“Data sampling approaches with severely imbalanced big data for medicare fraud detection” is a 2018 study by Bauder et al. [9]. In their study, the authors combine Part B, Part D, and DMEPOS Medicare claims data to form a dataset for Medicare fraud detection via classification. Hence, their study is in the same application domain as ours, albeit with less data than we use, since we use data that was not available at the time their study was written. The data they work with has under one million instances. Bauder et al. employ six data sampling techniques in their experiments. The six techniques are RUS, Random Oversampling (ROS), Synthetic Minority Oversampling Technique (SMOTE) [16], two variations of Borderline SMOTE (also covered in [16]), and Adaptive Synthetic (ADASYN) [17]. The sampling techniques are used to induce minority:majority class ratios of 1:99, 10:90, 25:75, 35:65, and 50:50. Experiments with the original class ratio of 473:759,267 (approximately 0.00062) are performed as well. For classification experiments, they use Apache Spark [18] implementations of Random Forest, Logistic Regression [19] and Gradient Boosted Trees [20]. To evaluate the performance of the combinations of classifiers and data sampling techniques, the authors use AUC. Bauder et al. conclude that classifiers trained on data with RUS applied to it yield significantly better performance, in terms of AUC, than classifiers trained on data with the original class ratio. Since Bauder et al. prefer RUS to other sampling techniques, we employ only RUS. However, we measure classification results in terms of the AUPRC metric as well. Our results are unique and meaningful since, on the one hand, we duplicate Bauder et al.’s results that show an improvement in AUC scores when RUS is applied, but on the other hand, we show that AUPRC scores decline when RUS is used.

Hasanin et al. [21] study the effect of RUS on Geometric Mean [22] and AUC scores. They find RUS has a positive impact on AUC and Geometric Mean scores in the classification of imbalanced Big Data. Here, we investigate the impact of RUS on AUC and AUPRC scores. One advantage AUC and AUPRC have over the Geometric Mean is that they reflect performance over a range of model output probability threshold values, whereas in order to calculate the Geometric Mean, one must select a specific output probability threshold. Therefore, a model’s Geometric Mean score can only tell us about the performance of the model for one particular threshold value. Another issue that sets our study apart from Hasanin et al. is that the largest dataset used in their study has under 1.7 million instances. The data we work with here is orders of magnitude larger. Hasanin et al. report that they use one-hot encoding for all categorical features. Here we use CatBoost encoding [11], a technique that is more scalable than one-hot encoding since it does not require the introduction of additional attributes to the dataset. For example, to one-hot encode a categorical feature that has thousands of possible values, we would need to add thousands of attributes to our dataset. However, CatBoost encoding consumes no additional space.

Another study where the authors choose to use the Geometric Mean as a performance metric is by Del Río et al. In this 2014 study, the authors investigate the impact of RUS on Big Data classification. A second performance metric they use is the \(F_\beta\) measure [23]. As a part of their study, the authors apply Random Forest to classify various datasets treated with undersampling. The largest dataset they employ in their experiments has less than six million instances. Apart from the size of the dataset, another aspect of our work that sets it apart from Del Río et al.’s is how they treat their data with RUS. Del Río et al. apply RUS to induce a 1:1 class ratio. We apply RUS to induce five different class ratios, which enables us to report the effect of varying levels of RUS. This is important since we aim to show the effects of RUS on AUC and AUPRC.

Research into the impact of RUS on the classification of Big Data often involves the Apache Spark framework since it is well suited to Big Data. One such study is by Sleeman and Krawczyk [24]. In their experiments, they work with datasets that have at most three million instances. Sleeman and Krawczyk do a thorough job of reporting performance metrics in their study, however, they do not report performance for the AUC or AUPRC metrics. The focus of our study is on the impact of RUS on two well-known metrics, AUC and AUPRC. A separate aspect of our study that differentiates it from Sleeman and Krawczyk’s is that we investigate the effect of RUS at different levels. Sleeman and Krawczyk investigate RUS to induce a 1:1 class ratio. We use RUS to induce five class ratios, thus we can treat RUS more thoroughly as a factor in our experiments. The differences between Sleeman and Krawczyk’s study and our own imply that our results serve different purposes.

In “Threshold based optimization of performance metrics with severely imbalanced big security data” Calvert and Khoshgoftaar use multiple metrics to evaluate multiple classifiers [25]. The application domain for their study is information systems network security. Hence, their results reveal the ability of Machine Learning algorithms to detect malicious network traffic. Since most of the traffic in their dataset is benign, the classification task is an exercise in the classification of imbalanced data. The data they use in their experiments has approximately 1.7 million instances. To give a sense of the level of class imbalance, the dataset Calvert and Khoshgoftaar use has a minority to majority class ratio of approximately 0.0014. Therefore, their data resembles ours; however, we find our dataset is more realistic, since Calvert and Khoshgoftaar generate the malicious traffic in the raw network traffic data they use in their experiments. In a sense our raw data is more natural, since we do not play a role in generating it. Their key finding is that one classifier yields the best performance in terms of AUC, but significantly worse performance in terms of other metrics. Furthermore, they find a different classifier yields the best performance in multiple metrics other than AUC. They claim this result indicates AUC alone cannot identify the best performing model. The principal item that differentiates our study from Calvert and Khoshgoftaar’s is that they do not use RUS as a factor in any of their experiments.

One type of classifier we have not discussed yet is the Neural Network. Johnson and Khoshgoftaar document experiments with Neural Network-based classifiers and RUS in [26]. The performance metrics they use to evaluate classification results are AUC, Geometric Mean, True Positive Rate, and True Negative Rate. The application domain for their study is Medicare fraud detection. One thing that sets our study apart from Johnson and Khoshgoftaar’s is our use of the AUPRC metric. What our studies have in common is the use of the AUC metric. Johnson and Khoshgoftaar find that when RUS is applied to their data to induce a class ratio with the minority class occupying more than 20% of the data, AUC scores begin to deteriorate. In a further demonstration of the effect of the RUS technique, they show that all metrics reflect worsening performance as RUS is used to grow the proportion of the minority class in the training data. Johnson and Khoshgoftaar work with data similar to ours; however, they apply an aggregation step to prepare their data for experiments. The aggregation reduces the size of their dataset to under five million instances. The aggregation also eliminates the highest cardinality categorical features. They use one-hot encoding for the remainder of the categorical features. For a study on the many available options for encoding categorical features, please see [27]. For this study, we selected CatBoost encoding, which has the advantage of supporting much higher cardinality categorical features than is practical with one-hot encoding. Due to the differences we have listed here, our study represents a contribution that is separate from what Johnson and Khoshgoftaar have to offer.

In a later work, Johnson and Khoshgoftaar use Geometric Mean and AUC to evaluate the performance of Deep Learning algorithms in classifying imbalanced Medicare Big Data [28]. They apply RUS to have the minority class constitute larger percentages of the training data. The dataset used in their experiments has approximately five million instances. Results in this study show the metrics are even more sensitive to RUS than in their previous study. In this study, they find performance, in terms of both metrics, begins to suffer when the minority class becomes more than one percent of the training data. One major difference between our study and Johnson and Khoshgoftaar’s is that we use AUPRC, whereas they use Geometric Mean. Also, as in their previous study, Johnson and Khoshgoftaar use an aggregation technique which eliminates high-cardinality categorical features, and then use one-hot encoding to encode the remaining categorical features. For the reasons mentioned previously, we prefer to apply CatBoost encoding to our categorical features. Hence, our study also differs significantly from this second study by Johnson and Khoshgoftaar.

In “An insight into imbalanced big data classification: outcomes and challenges”, Fernandez et al. report on the effects of RUS in the classification of imbalanced Big Data with the Apache Hadoop [29] and Spark distributed computing frameworks. The largest dataset used in their experiments contains approximately 12 million instances. While Fernandez et al. apply RUS to their data, they induce only the 1:1 class ratio. As stated previously, our experiments involve five different levels of RUS, which enables a more thorough review of the impact of RUS. Fernandez et al. find RUS improves classification performance, in terms of Geometric Mean. Our aim is to compare the impact of RUS on two different performance metrics, AUC and AUPRC, and hence our results serve a purpose separate from those of Fernandez et al. Although Fernandez et al. use imbalanced data in their experiments, the data we use is more imbalanced. Their data has an initial minority to majority class ratio of 1:50, whereas ours has an initial ratio of 1:256. Therefore, our study involves a larger, more imbalanced dataset.

In the aptly named “The Precision-Recall Plot Is More Informative than the ROC Plot When Evaluating Binary Classifiers on Imbalanced Datasets”, Saito and Rehmsmeier compare classification results in terms of AUC and AUPRC [30]. They perform experiments in the microRNA gene discovery application domain with multiple classifiers and datasets. Their results are similar to ours in that they demonstrate, for varying levels of class imbalance, that ROC curves show good performance. However, for the same experiments, the precision-recall curves reveal stark differences in performance. Saito and Rehmsmeier do not discuss whether their results apply to datasets with high-cardinality categorical features. We show our results apply to datasets with categorical features that have thousands of possible values. The datasets Saito and Rehmsmeier use are small, all with less than 15,000 instances each. We show the merits of AUPRC as a metric for classifying highly imbalanced Big Data. In addition, we show that as the size of the dataset grows, AUPRC is the more informative metric to assess the impact of experimental factors.

Related works cover concepts that overlap with our study. We find research where RUS is a factor in experiments with highly imbalanced Big Data. However, we do not find a study that reveals insights into the divergent effect of RUS on AUC and AUPRC scores in the classification of highly imbalanced Big Data. We feel our contribution is an important one since it shows that focus on AUC alone can cause one to overlook the negative impact of RUS, and therefore possibly other factors, on classification performance.

Classification algorithms

As a means of ensuring reproducible results, we employ five publicly available, open-source ensemble learners as the classifiers in our experiments. As mentioned previously, the learners are: LightGBM, XGBoost, ET, CatBoost, and Random Forest. The learners fall into two distinct families of algorithms. ET and Random Forest are members of the Bagging family of learners. CatBoost, XGBoost, and LightGBM hail from the Gradient Boosted Decision Tree family of Machine Learning algorithms. The advantage of using algorithms that exploit different general techniques is that we can show our results apply to more than just one type of algorithm. Bagging and Gradient Boosted Decision Trees take two different approaches to using a collection of learners to perform classification.

Breiman introduces the Bagging technique for Machine Learning in a 1996 study [31]. Breiman explains that Bagging can be used in classification and regression problems. Our study involves experiments in binary classification, so we focus on Breiman’s treatment of Bagging as it pertains to binary classification. The Bagging technique is based on applying a Machine Learning algorithm (learner) to bootstrap samples of the training data to train a collection (ensemble) of instances of the algorithm. A bootstrap sample is a sample, with replacement, from the training data [32]. After the learners are fit to the bootstrap samples, each learner classifies instances of the test data. The classification result is taken as the classification returned by the majority of learners in the ensemble. A probabilistic argument explains how Bagging may improve classification results.

Assume a poorly performing (weak) learner makes a correct classification more often than not. Then, the chance that a majority of weak learners in an ensemble makes the correct classification increases as we increase the number of weak learners. In that case, if we treat the ensemble as a classifier, it will perform better than any one of the constituent learners by itself.
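
The following minimal Python sketch illustrates this probabilistic argument numerically; it is illustrative only and is not part of our experimental code.

```python
# Numeric illustration of the majority-vote argument: if each independent weak
# learner is correct with probability p > 0.5, the chance that a majority of an
# ensemble of n learners is correct grows with n.
from math import comb

def majority_correct(p: float, n: int) -> float:
    """Probability that more than half of n independent learners are correct."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n // 2 + 1, n + 1))

for n in (1, 11, 101):
    print(n, round(majority_correct(0.6, n), 3))
# prints roughly 0.6, 0.75, and 0.98 for n = 1, 11, and 101
```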

Random Forest relies on the Bagging principle, and adds an enhancement to it. Breiman is also credited with the seminal implementation of Random Forest [14]. Random Forest is an application of the Bagging technique to decision trees, with an addition. In order to explain the enhancement to the Bagging technique, we must first define the term “split” in the context of decision trees. The internal nodes of a decision tree consist of rules that specify which edge to traverse next. The rule is based on a comparison of one numeric value, the split, versus the current value of one of the independent variables in the dataset. Hence, fitting a decision tree to a dataset heavily involves determining the optimal values for splits. The enhancement Random Forest makes to the Bagging approach is to randomly sample a subset of the attributes to use in determining the optimal value for a split. We include Random Forest since it has been applied successfully to the classification of highly or severely imbalanced Big Data [33]. As mentioned previously, to the best of our knowledge, this is the first study on Medicare fraud detection where an implementation of Random Forest that runs on Graphics Processing Units (GPUs) is used. In preliminary experiments we found this GPU implementation of Random Forest to have a much faster running time when compared to the CPU-based implementation we used previously.
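
As a point of reference, the sketch below shows how such a GPU Random Forest can be instantiated with the RAPIDS cuML library (the implementation named in the Methodology section). The toy data and parameter values are illustrative placeholders, not the exact settings used in our experiments.

```python
# Hedged sketch: training a GPU Random Forest with RAPIDS cuML on toy data.
# The feature matrix must be numeric (e.g., float32); in our pipeline the
# categorical features would already be CatBoost-encoded at this point.
import numpy as np
from cuml.ensemble import RandomForestClassifier as cuRF

X_train = np.random.rand(10_000, 16).astype(np.float32)     # stand-in for encoded features
y_train = np.random.randint(0, 2, 10_000).astype(np.int32)  # stand-in for fraud labels

rf_gpu = cuRF(n_estimators=100, max_depth=32, random_state=42)
rf_gpu.fit(X_train, y_train)
probabilities = rf_gpu.predict_proba(X_train)[:, 1]  # positive-class probabilities
```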

The second classifier from the Bagging family of Machine Learning algorithms we employ is the ET classifier [15]. ET is an extension of Random Forest where we choose the values for splits in the decision tree randomly. In Random Forest, and other decision-tree based learners, splits are usually calculated systematically. For example, one may calculate the optimal value for a split in a decision tree based on some metric that gauges how well the splitting rule divides the training data into subsets that all have the same label. ET does away with systematic ways of determining the values for splits and chooses them randomly. Perhaps surprisingly, our results show ET’s random selection of splits can turn out to yield the best performance in classifying highly imbalanced Big Data for Medicare fraud detection.

The remaining classifiers used in our study are descended from the Gradient Boosting Machine algorithm introduced by Friedman [34]. The Gradient Boosting Machine technique is an ensemble technique, but the way in which the constituent learners are combined is different from how it is accomplished with the Bagging technique. The Gradient Boosting Machine technique begins with a single learner that makes an initial set of estimates \(\hat{\textbf{y}}\) of the dependent variable \(\textbf{y}\). The differences (residuals) between the estimates \(\hat{\textbf{y}}\) and \(\textbf{y}\) form a vector \(\textbf{y}-\hat{\textbf{y}}\) that we can think of as a new dependent variable, which we can estimate with the original independent variables and a second learner. Then, the sum of the output values of the two models will be a more accurate estimate of the dependent variable than the output values of the first model. We can continue to add learners to the ensemble similarly, where each new learner is trained to predict the residuals of the current ensemble. Therefore, each learner we add to the ensemble provides a better estimate of the dependent variable. The Gradient Boosting Machine implementations we use are all enhancements to Friedman’s initial proposal. They all involve a specific type of learner, the Decision Tree, so we refer to them as Gradient Boosted Decision Trees (GBDTs).
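
To make the residual-fitting loop concrete, the sketch below implements the basic idea for squared-error loss with scikit-learn decision trees. It is a simplified illustration of Friedman’s procedure; the GBDT libraries we use add many refinements on top of this loop.

```python
# Minimal sketch of gradient boosting for squared-error loss: each new tree is
# fit to the residuals of the current ensemble's predictions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_trees=50, learning_rate=0.1, max_depth=3):
    prediction = np.full(len(y), y.mean())    # initial constant estimate
    trees = []
    for _ in range(n_trees):
        residuals = y - prediction            # what the ensemble still gets wrong
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        prediction += learning_rate * tree.predict(X)
        trees.append(tree)
    return y.mean(), trees

def predict_gbm(base, trees, X, learning_rate=0.1):
    # Sum the base estimate and the scaled contributions of every tree.
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```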

Of the three GBDT implementations we use, XGBoost was the first to be released. Chen and Guestrin released XGBoost in 2016 [12]. XGBoost offers several enhancements to the GBDT technique. The first enhancement is an improved loss function used during the training phase. The loss function contains an additional term for regularization to prevent overfitting. Another enhancement XGBoost makes to GBDTs has to do with calculating splits in the constituent decision trees of the GBDT ensemble. Chen and Guestrin introduce the so-called “approximate algorithm”, which is a technique for estimating optimal values of splits. The approximate algorithm is suitable for distributed environments, as well as applications where the entire dataset does not fit in main memory. A third enhancement in XGBoost is another algorithm for finding splits that works well with sparse data. Sparse data is data that is nearly constant in value, with infrequently occurring aberrations. XGBoost can take advantage of sparse data with its “sparsity aware split finding” feature.

Ke et al. released the seminal paper on LightGBM in 2017 [13]. Their goal was to offer a GBDT implementation that yields performance equivalent to XGBoost, while consuming fewer resources. In order to achieve their goal, Ke et al. make two key enhancements to the GBDT technique. The first is Exclusive Feature Bundling (EFB). EFB is a technique for reducing the dimensions of a dataset by combining two features (attributes) of a dataset into a single feature. EFB is an effective technique for sparse data. When two attributes of a dataset exhibit sparsity, and the infrequently occurring values of both attributes are mutually exclusive, they may be safely combined into a single feature without the loss of information. EFB reduces the number of dimensions of a dataset, which helps reduce training time. LightGBM’s second enhancement to the GBDT technique is called Gradient-based One-Side Sampling (GOSS). GOSS is a technique for intelligently reducing the number of training instances used. GOSS selects instances for training based on their contribution to the loss function that is calculated as part of fitting the GBDT ensemble to the training data. If an instance contributes more than a configurable threshold value to the loss of the model, then it is retained for further iterations of the fitting process. Likewise, instances that contribute less than the threshold amount are set aside. Via GOSS and EFB, Ke et al. deliver a GBDT implementation that consumes fewer computing resources.
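
The snippet below sketches how GOSS can be requested through LightGBM’s scikit-learn interface; the hyperparameter values are illustrative and are not the settings used in our experiments (in recent LightGBM releases GOSS may instead be selected through the data_sample_strategy parameter, and EFB is enabled by default).

```python
# Hedged sketch: a LightGBM classifier configured to use GOSS.
import lightgbm as lgb

lgbm_clf = lgb.LGBMClassifier(
    boosting_type="goss",  # Gradient-based One-Side Sampling
    n_estimators=200,
)
# LightGBM can encode categorical features internally, so unlike the other
# learners it does not need the external CatBoost encoder:
# lgbm_clf.fit(X_train, y_train, categorical_feature=categorical_column_names)
```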

The third GBDT implementation we use is CatBoost [11]. Prokhorenkova et al. introduced CatBoost in 2018. One may find more information on applications of CatBoost in various domains in [35]. Their motivation for developing CatBoost was to prevent overfitting. The first protection against overfitting that CatBoost provides is Ordered Boosting. Ordered Boosting is a technique for selecting training instances. There are two steps for adding a decision tree to the GBDT ensemble. The first is to fit the decision tree to the dependent variable in the training data. Multiple decision trees are fit to different samples of the training data. After the trees have been fit, they must be evaluated in order to select the tree that best enhances the overall performance of the ensemble. Under Ordered Boosting one can be sure that training instances used to fit a decision tree will not be used to evaluate it for inclusion into the ensemble. This helps prevent the ensemble from being overfit to the training data. The second protection against overfitting that CatBoost offers is its Ordered Target Statistics method of encoding categorical features. In simple target encoding, a categorical feature is assigned the mean value of the dependent variable that the feature is observed to co-occur with. This strategy for encoding may lead to information leakage in the sense that if the encoded feature co-occurs with different values of the dependent variable in the test data, the encoded feature will not be a useful predictor of the dependent variable. To avoid this issue, Ordered Target Statistics ensures that the encoded value for a categorical feature of a given instance is derived from other instances. Put another way, the encoded value of a categorical feature is not allowed to be calculated from the label it appears with. This makes it impossible for the encoded feature value of the instance to be directly related to the value of the dependent variable. This is a protection against what Prokhorenkova et al. call “target leakage”.
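
The sketch below shows how Ordered Boosting and native categorical handling can be requested from the CatBoost library; the parameter values are illustrative rather than the tuned settings reported in Table 4.

```python
# Hedged sketch: a CatBoostClassifier using Ordered Boosting and the library's
# internal (Ordered Target Statistics) encoding of categorical features.
from catboost import CatBoostClassifier

cb_clf = CatBoostClassifier(
    boosting_type="Ordered",  # Ordered Boosting
    depth=16,                 # the largest maximum tree depth CatBoost permits
    verbose=False,
)
# cat_features lists the indices (or names) of the categorical columns:
# cb_clf.fit(X_train, y_train, cat_features=categorical_column_indices)
```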

The five ensemble techniques we employ here lend robustness to our experimental design. Using different learners allows us to rule out the possibility that patterns we see in the results are due to the peculiarities of one model. ET and Random Forest are enhancements to Breiman’s Bagging technique. XGBoost, LightGBM, and CatBoost are enhancements to Friedman’s Gradient Boosting technique. In the next section we explain how we prepare the data we use as input to these learners.

Data description and preparation

As mentioned in the introduction, the CMS provides the data used in this study. The most recent CMS data we use in constructing all three datasets first became available in 2021. The latest Part B, Part D, and DMEPOS data all span the years 2013 through 2019. Previous studies on Medicare fraud detection use data that covers fewer years. Moreover, some of the attributes of the latest data are not available in previous studies where older versions of the Part B, Part D, and DMEPOS data are used. For example, in “Leveraging lightgbm for categorical big data” [36] we report our DMEPOS data has nine features, whereas one can see in Table 1 that our current version of the DMEPOS data has 18 features. Furthermore, in [36], Part B data is reported to have eight features, and Part D data is reported to have seven features. In Table 2 we show that the Part B data we use has 16 features, and in Table 3, we show our Part D data has 9 features. To the best of our knowledge, we are the first to employ the most recently available data from the CMS in a study.

Table 1 Features of the DMEPOS Dataset

We use the same process for labeling each of the three datasets with data from the List of Excluded Individuals and Entities (LEIE) [37]. The United States Office of the Inspector General publishes the LEIE on a monthly basis. If a healthcare provider appears in the LEIE, this means the provider has been convicted of some activity which prevents the provider from submitting insurance claims to Medicare. Records in the LEIE, Part B, Part D, and DMEPOS data sources have a National Provider ID (NPI) in common. There are different types of exclusions that a provider may fall under. The types of exclusions we consider to be indicative of fraud are those identified by Bauder and Khoshgoftaar [38]. When a provider appears in the LEIE under any of these exclusions, we label all records belonging to that provider in the Part B, Part D, and DMEPOS data as fraudulent.

Exclusions have start and end dates. We label all of a provider’s claims data pertaining to dates prior to the end of the exclusion period as fraudulent. Since the Part B, Part D, and DMEPOS data consists of records pertaining to entire years, and exclusion periods end at specific months, we round the end of the exclusion period to the nearest year. Once a provider’s exclusion period is over, the provider is removed from the LEIE. The provider may once again submit claims to Medicare, so data that pertains to claims the provider submits after the end of the exclusion period is labeled as not fraudulent. If one is compiling a dataset that spans all the available years, the current LEIE will not contain records of providers that were in the LEIE previously. Therefore, one should use a utility such as the Internet Archive tool to retrieve previous versions of the LEIE to label older records of the CMS data.
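
A minimal sketch of this labeling step is shown below. The column names (npi, year, exclusion_end_year) are hypothetical placeholders and do not match the CMS or LEIE field names; the sketch only illustrates the join on NPI and the year-based comparison described above.

```python
# Hedged sketch: join provider-year claims summaries to LEIE exclusions on NPI
# and mark years up to the (rounded) exclusion end year as fraudulent.
import pandas as pd

def label_fraud(claims: pd.DataFrame, leie: pd.DataFrame) -> pd.DataFrame:
    """claims: one row per provider/year summary; leie: fraud-related exclusions."""
    merged = claims.merge(
        leie[["npi", "exclusion_end_year"]], on="npi", how="left"
    )
    # Claims dated on or before the rounded end of the exclusion period are fraudulent.
    merged["fraud"] = (
        merged["exclusion_end_year"].notna()
        & (merged["year"] <= merged["exclusion_end_year"])
    ).astype(int)
    return merged.drop(columns="exclusion_end_year")
```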

The dataset we derive from the DMEPOS data is the smallest, containing 12,215,370 instances. The original fraction of minority instances in the DMEPOS data is 0.0044. The data pertains to three classes of healthcare expenditures: durable medical equipment; prosthetics, orthotics, and supplies; and drugs and nutrition-related products. The DMEPOS data has a HCPCS code feature that identifies the expenditure item precisely. Each record in the DMEPOS data is a summary of insurance claims the provider sent to Medicare for the item identified by the HCPCS code for the year. Please see Table 1 for descriptions of the elements of the DMEPOS data we use. The descriptions of the features are copied from the DMEPOS data dictionary [39]. We augment the descriptions with information on the categorical features in the DMEPOS data. Any feature that we document as categorical is encoded with CatBoost encoding during experiments. In the DMEPOS data, as well as the data for other parts of the program, there are several attributes that are not suitable for Machine Learning. These are attributes that pertain to the identity of providers, but provide no description of their activities or practices. These are the National Provider ID and other attributes related to the providers’ address.

Table 2 Features of the Part B Dataset

The next largest dataset we use is compiled from the Part B data. The Part B data describes treatments and procedures that a provider performs for patients. One record of the Part B data represents a summary of all the times a provider rendered a particular treatment or procedure for their patients for the year. The treatment or procedure is identified by the HCPCS code listed in the record. Table 2 contains names and descriptions of elements of the Part B data we use in this study. We copy the definitions from the Part B data dictionary [40], and add additional information about the number of possible values for categorical features. Our strategy for adopting elements of the Part B data is similar to the one we employ for the DMEPOS data. We discard the NPI and location data that would serve as the equivalent of a unique identifier for the provider. The final number of instances we obtain for the Part B data is 67,856,547. The fraction of instances of the minority class in the Part B data is 0.0019.

The Part D data comprises the largest of the three datasets that we work with. The Part D data pertains to medications that providers prescribe for their patients. A record of the Part D data relates to one particular medication that a provider prescribed for patients for 1 year. As with the DMEPOS and Part B data, we discard the NPI and location data that could interfere with a Machine Learning Model’s ability to generalize. Table 3 summarizes all the Part D data elements we use in this study. Similar to Tables 1 and 2, we copy data definitions from the Part D data dictionary [41], and augment them with information on distinct values of categorical features. Our finalized version of the Part D data contains 173,677,665 records. The fraction of instances in the minority class in the Part D data is 0.0039.

Table 3 Features of the Part D Dataset

Methodology

To perform experiments, we run programs that train Machine Learning models. Then we employ the trained models to classify the DMEPOS, Part B, or Part D data. The programs are implemented in the Python language [42]. We rely on publicly available, open source libraries to provide implementations of all the classifiers used in this study. One may install any one of them using the standard Python package manager. This contributes to the reproducibility of our work.

Due to the stochastic nature of Machine Learning algorithms (learners), in our experiments we do ten iterations of five-fold cross validation. We use unique seeds for the random number generators involved in the experiments to ensure unique initial conditions for every experiment. Each fold of five-fold cross validation produces one experimental outcome, consisting of an AUC value and an AUPRC value. We average these values across the five folds and ten iterations. Scikit-learn [43] is another library that is vital to the success of our work. It provides functions for performing five-fold cross validation and calculating the AUC and AUPRC metrics after classification is completed.
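
The evaluation loop can be sketched with scikit-learn as follows. The sketch assumes X and y are NumPy arrays of encoded features and labels, uses average precision as the AUPRC estimate, and omits the encoding and RUS steps described in the following paragraphs; it is an illustration, not our exact experimental code.

```python
# Hedged sketch: ten iterations of stratified five-fold cross-validation,
# collecting AUC and AUPRC per fold; 'model' stands in for any of the learners.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, average_precision_score

def evaluate(model, X, y, n_iterations=10, n_splits=5):
    aucs, auprcs = [], []
    for seed in range(n_iterations):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
        for train_idx, test_idx in skf.split(X, y):
            clf = clone(model).fit(X[train_idx], y[train_idx])
            scores = clf.predict_proba(X[test_idx])[:, 1]
            aucs.append(roc_auc_score(y[test_idx], scores))
            auprcs.append(average_precision_score(y[test_idx], scores))
    # 50 outcomes per metric: mean and standard deviation of each.
    return np.mean(aucs), np.std(aucs), np.mean(auprcs), np.std(auprcs)
```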

We use CatBoost encoding [44] for all learners except LightGBM since LightGBM has a built-in function for encoding categorical data. The CatBoost Encoder is another publicly available, open source library one can easily install with a Python package manager. We opted for a general purpose encoding method, since the focus of our study is on RUS and performance metrics. For a study on special purpose encoding techniques in the Medicare fraud detection application domain, please see [45]. The function of the CatBoost Encoder is to convert categorical features to floating point numbers, so they are suitable for use with the learners. The CatBoost encoder must be fit to data before it can be used to encode features. One must take care to fit the encoder to the training data only. Otherwise, the encoded features will contain information about the test data that may cause learners to overfit to the dataset, and yield unrealistically high performance metrics. This is commonly known as target leakage.
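
The snippet below sketches this precaution with the CatBoostEncoder from the category_encoders package; the column names and values are hypothetical toy data.

```python
# Hedged sketch: fit the CatBoost encoder on the training fold only, then
# apply it to both folds, so the encoder never sees test labels.
import pandas as pd
from category_encoders import CatBoostEncoder

X_train = pd.DataFrame({"provider_type": ["A", "B", "A", "C"], "claim_count": [3, 7, 2, 9]})
y_train = pd.Series([0, 1, 0, 1])
X_test = pd.DataFrame({"provider_type": ["B", "C"], "claim_count": [4, 1]})

encoder = CatBoostEncoder(cols=["provider_type"])
encoder.fit(X_train, y_train)              # fit on training data only
X_train_enc = encoder.transform(X_train)
X_test_enc = encoder.transform(X_test)     # test labels are never seen by the encoder
```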

After encoding categorical features, and before starting classification, we apply RUS to induce one of five minority:majority class ratios: 1:1, 1:3, 1:9, 1:27, and 1:81. During our literature review, we found most studies where RUS is applied utilize the 1:1 class ratio. This is our motivation for selecting it. We arrive at the remaining class ratios by iteratively tripling the size of the majority class. As a baseline for determining the effect of RUS, we also perform experiments where we leave the class ratio unchanged. We apply RUS to the training data only. Class ratios of the test data are left unchanged.
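
A minimal sketch of this sampling step is given below, using the RandomUnderSampler from the imbalanced-learn package as an illustrative stand-in; the data is synthetic and the class ratios are expressed as fractions.

```python
# Hedged sketch: apply RUS to the training fold only; sampling_strategy is the
# desired minority:majority ratio expressed as a fraction (e.g., 1:27 -> 1/27).
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

rng = np.random.default_rng(0)
X_train = rng.random((100_000, 10))                  # stand-in for encoded features
y_train = (rng.random(100_000) < 0.004).astype(int)  # roughly 0.4% positive class

for ratio in (1/81, 1/27, 1/9, 1/3, 1/1):
    rus = RandomUnderSampler(sampling_strategy=ratio, random_state=42)
    X_rus, y_rus = rus.fit_resample(X_train, y_train)
    # A classifier would be trained on (X_rus, y_rus); the test fold keeps its original ratio.
    print(ratio, np.bincount(y_rus))
```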

We use GPU implementations of Random Forest, XGBoost and CatBoost. In preliminary experiments we found the GPU implementations of these learners to be much faster than their CPU implementations. We attempted to use the GPU implementation of LightGBM, but we found that its built-in encoding for categorical features, when run on GPUs, is not compatible with high-cardinality categorical data. To the best of our knowledge, we are the first to apply a GPU implementation of Random Forest to the task of Medicare fraud detection in a study. The version of Random Forest we use is part of the cuML library.

Also, to the best of our knowledge, we are the first to apply the ET classifier to the task of automated Medicare fraud detection in a peer-reviewed study. ET is available as a component of the Scikit-learn library. Due to the robustness of results ET yields, we recommend it for use in future studies involving imbalanced Big Data. We did not find a GPU implementation of ET that is suitable for working with datasets as large as the ones we work with here. Since ET is based on Random Forest, one avenue for future research is to modify the CuML implementation of Random Forest to produce a GPU implementation of ET. Therefore, for our experiments we use a CPU implementation of ET.

In preliminary experiments, we found maximum tree depth to be an important hyperparameter to optimize in order to obtain the best results. The Scikit-learn implementations of Random Forest and ET have unlimited maximum tree depth as a default setting. We saw that these classifiers were initially outperforming CatBoost and XGBoost. According to their current on-line documentation, CatBoost and XGBoost both have a default maximum tree depth of 6 [46, 47]. We found it necessary to raise the maximum tree depth of CatBoost to 16 and the maximum tree depth of XGBoost to 24 in order to obtain the best results. The largest value for maximum tree depth that CatBoost will currently permit is 16. Due to resource limitations, we were not able to successfully execute experiments with XGBoost where the maximum tree depth was set to a value greater than 24. Table 4 contains the values of all the changed hyperparameter settings we use in our experiments. Hyperparameter values listed were discovered after doing tuning experiments. Hence, they represent the best settings we discovered for each classifier. We did not find it necessary to modify any hyperparameter settings for ET.

Table 4 Changed hyperparameter settings
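
As an illustration of these settings, the snippet below instantiates the depth-tuned learners with the maximum tree depths named above (CatBoost 16, XGBoost 24) and the depth of 32 implied by the RF-GPU-32 abbreviation used in the Results section; all other arguments are left at their defaults, so this is a sketch rather than the complete configuration of Table 4.

```python
# Hedged sketch: depth-related hyperparameter settings discussed above.
from catboost import CatBoostClassifier
from xgboost import XGBClassifier
from cuml.ensemble import RandomForestClassifier as cuRF

catboost_clf = CatBoostClassifier(depth=16, verbose=False)  # largest depth CatBoost permits
xgboost_clf = XGBClassifier(max_depth=24)                   # larger depths exceeded our resources
rf_gpu_clf = cuRF(max_depth=32)                             # depth implied by the RF-GPU-32 label
```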

Our experiments are all executed in a distributed computing platform as batch processes. The nodes available to us on the platform have Intel Xeon Central Processing Units (CPUs) with 16 cores, 256 GB RAM per CPU and Nvidia V100 GPUs. A single node has sufficient resources to run any experiment covered here.

Results

We report AUC and AUPRC scores for classification results. The values for AUC and AUPRC reported here are mean values computed by averaging 50 experimental outcomes. Each fold of five-fold cross validation yields one experimental outcome, consisting of one AUC and one AUPRC score. Since we do ten iterations of five-fold cross validation, we obtain 50 instances of each metric. We report the same data in graphical and tabular form. The graphical form shows relative performance and the trend in results as the size of the majority class is increased. In addition, we provide the data in tabular form along with standard deviations to give a sense of the spread of AUC and AUPRC scores over the ten iterations of five-fold cross validation.

As stated previously in the section on methodology, we apply RUS to induce a class ratio in the training data. We do not alter the class ratio in the test data. Therefore, scores one sees reported in this section are results for classifying data sampled from its original class ratio. Put another way, models are trained on data with RUS applied, and only evaluated on test data with the original class ratio.

The first results we report are the AUC and AUPRC scores the five classifiers yield for classifying the DMEPOS data. We provide plots of AUC and AUPRC scores in Fig. 1. Due to space limitations, we use the following abbreviations for classifier names in Tables 5, 6, 7, 8, 9, 10 and figures in this section. Classifier names are abbreviated as follows: CatBoost with maximum tree depth set to 16: CB-16, LightGBM: LGB, Extremely Randomized Trees: ET, and the GPU implementation of Random Forest with maximum tree depth set to 32: RF-GPU-32. In Fig. 1 one may notice some facts that hold for the other data as well. Mean AUC scores are high, and show little impact of RUS as the size of the majority class increases. However, there is more variance in results in terms of the mean AUPRC metric. ET yields consistently strong AUPRC performance regardless of the size of the majority class in the training data. Random Forest and XGBoost yield better performance in terms of AUPRC as the induced class ratio retains more of the majority class. CatBoost and LightGBM yield relatively poor performance in terms of AUPRC regardless of the level of RUS.

Fig. 1
figure 1

DMEPOS Data: AUC Scores (left) and AUPRC scores (right)

Table 5 DMEPOS mean and standard deviation of AUC with varying levels of RUS (10 iterations of fivefold cross-validation)
Table 6 DMEPOS mean and standard deviation of AUPRC with varying levels of RUS (10 iterations of fivefold cross-validation)

Next, we report results for classifying the Part B data, starting with Fig. 2. Results are similar to those we obtain for the DMEPOS data. Again, all classifiers show strong performance in terms of mean AUC. However, the mean AUPRC scores reveal a different picture altogether. Consistent with results for DMEPOS, we see ET yields consistently strong AUPRC scores. XGBoost and Random Forest AUPRC scores appear to increase with the size of the majority class in the training data. Also in keeping with their performance in classifying the DMEPOS data, LightGBM and CatBoost yield low AUPRC scores at every level of RUS.

Fig. 2
figure 2

Part B Data: AUC scores (left) and AUPRC scores (right)

Table 7 Part-B mean and standard deviation of AUC with varying levels of RUS (10 iterations of fivefold cross-validation)
Table 8 Part-B mean and standard deviation of AUPRC with varying levels of RUS (10 iterations of fivefold cross-validation)

Finally, we report classification results for the Part D data. We notice in Fig. 3 that while mean AUC scores are high, with the larger dataset there is a bifurcation of performance in terms of AUC. XGBoost, ET, and Random Forest yield AUC scores that appear noticeably higher than those of LightGBM and CatBoost. For classifying the Part D data, mean AUPRC scores are generally lower than those we see for the Part B or DMEPOS data. However, the same pattern we saw in the results for the Part B and DMEPOS data appears again in the Part D data: ET yields consistently high results, Random Forest and XGBoost scores improve as the induced class ratio retains more of the majority class, and LightGBM and CatBoost yield low scores relative to the other classifiers.

Fig. 3
figure 3

Part D Data: AUC scores (left) and AUPRC scores (right)

Table 9 Part-D mean and standard deviation of AUC with varying levels of RUS (10 iterations of fivefold cross-validation)
Table 10 Part-D mean and standard deviation of AUPRC with varying levels of RUS (10 iterations of fivefold cross-validation)

We find AUPRC scores for CatBoost and LightGBM to be surprisingly low. As a validation step, we performed one additional experiment. We wrote a separate program, completely independent of the program that produced the results above, and trained XGBoost, CatBoost and LightGBM on 80% of the shuffled Part B data without RUS, using the remaining 20% of the shuffled Part B data, also without RUS, as a test set. We then calculated the AUC and AUPRC scores for the three classifiers, and obtained results that align with the previous results we report here. In that experiment, CatBoost yields an AUC score of 0.89744 and an AUPRC score of 0.14076, XGBoost yields an AUC score of 0.99407 and an AUPRC score of 0.93796, and LightGBM yields an AUC score of 0.87017 and an AUPRC score of 0.07413. We include these results to show due diligence in ruling out any software bugs causing the extreme differences in AUPRC scores we report.
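
This validation check can be sketched with a few lines of scikit-learn code, shown below; the synthetic data stands in for the encoded Part B features and labels, and the split shown is stratified, which the original check may or may not have used.

```python
# Hedged sketch of the independent validation check: a single shuffled 80/20
# split with no RUS, scored with AUC and AUPRC.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X = rng.random((50_000, 16))                   # stand-in for encoded Part B features
y = (rng.random(50_000) < 0.002).astype(int)   # highly imbalanced stand-in labels

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, shuffle=True, stratify=y, random_state=42
)
clf = XGBClassifier(max_depth=24).fit(X_tr, y_tr)
probs = clf.predict_proba(X_te)[:, 1]
print("AUC:", roc_auc_score(y_te, probs))
print("AUPRC:", average_precision_score(y_te, probs))
```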

Statistical analysis

In order to make an informed decision on the results of the previous section, we apply statistical analysis in the form of Analysis of Variance (ANOVA) [48] and Tukey’s Honestly Significant Difference (HSD) [49] tests. The ANOVA tests tell us whether a factor has a significant effect on experimental outcomes. When the ANOVA test identifies that a factor has a significant impact, we can then perform an HSD test to rank the levels of the factor in terms of its impact on experimental outcomes. Here, the outcome is either an AUC score or an AUPRC score. For all statistical tests, we use a significance level of \(\alpha =0.01\).
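
For reference, the sketch below shows one way to carry out such an analysis in Python with statsmodels; the long-format table of outcomes (columns auprc, rus, clf, size) is a hypothetical layout filled with random numbers purely so the sketch runs, and the letter groupings reported in the tables that follow come from a conventional post-hoc grouping of the Tukey results rather than directly from this snippet.

```python
# Hedged sketch: multi-factor ANOVA followed by Tukey's HSD with statsmodels.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
results = pd.DataFrame({
    "auprc": rng.random(600),                                            # placeholder outcomes
    "rus": np.tile(["none", "1:81", "1:27", "1:9", "1:3", "1:1"], 100),  # RUS factor levels
    "clf": np.tile(["CB", "LGB", "XGB", "RF", "ET"], 120),               # classifier factor
    "size": np.tile(["DMEPOS", "PartB", "PartD"], 200),                  # dataset size factor
})

model = ols("auprc ~ C(rus) + C(clf) + C(size)", data=results).fit()
print(sm.stats.anova_lm(model, typ=2))   # the Pr(>F) column gives the p-values

# When a factor is significant, rank its levels with Tukey's HSD at alpha = 0.01.
hsd = pairwise_tukeyhsd(endog=results["auprc"], groups=results["rus"], alpha=0.01)
print(hsd.summary())
```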

The first outcome we perform analysis for is the AUC score. This is analysis of the variance in AUC scores for all classifiers, all datasets, and all levels of RUS. In all ANOVA tables we present, CLF indicates the classifier factor, RUS indicates the Random Undersampling factor, and Size indicates the number of instances in the dataset before RUS is applied. Please see Table 11 for the ANOVA test results. Since the Pr(>F), or p-values associated with each factor are practically zero, we conclude that all factors have a significant effect on experimental outcomes. Therefore, we can conduct HSD tests to rank the levels of each factor in terms of its impact on AUC score.

Table 11 ANOVA for RUS, CLF and Size as factors of performance in terms of AUC

HSD test results separate factors into groups such that levels of a factor that have a similar impact on performance are placed into the same group. The group that is associated with the highest value of the experimental outcome is labeled as group ‘a’, and lower ranked groups are labeled with letters that follow in alphabetical order. If there is overlap in the performance associated with two groups, then the overlapping groups will be labeled with letters in common. For example, in Table 12, the results for the 1:1 level of the RUS factor are placed in group ‘ab’, and results for the 1:81 level of the RUS factor are placed in group ‘bc’. This implies the confidence intervals for AUC scores associated with these levels of RUS overlap.

The HSD test results in Table 12 show that applying RUS to induce class ratios of 1:9, 1:3, or 1:27 yields the best performance. This is important to bear in mind, since the HSD tests for the impact of RUS on AUPRC scores will show there is a negative impact on performance.

Table 12 HSD test groupings after ANOVA of AUC for the RUS factor

The next HSD test we undertake is to determine which classifier yields the best performance. The results here are processed for all datasets, and all levels of RUS. Here we see ET and XGBoost yield the best performance (Tables 13, 14).

Table 13 HSD test groupings after ANOVA of AUC for the CLF factor

The final HSD test we can conduct is for the size factor. We have three datasets of varying size. The HSD test results do not reveal a trend on the impact of size. We see that the best AUC scores are associated with the medium size, Part B, dataset. However, the second best AUC scores are associated with the small size, DMEPOS dataset. Finally, the lowest AUC scores are coupled with the largest, Part D dataset.

Table 14 HSD test groupings after ANOVA of AUC for the size factor

Here we begin a series of statistical tests similar to those in the previous section, only now the experimental outcome we are analyzing is AUPRC instead of AUC. As is the case with the previous analysis of AUC scores, results here are obtained by processing experimental outcome data over all factors: dataset size, RUS, and classifier, for the Part B, Part D, and DMEPOS data. The factors in Table 15 are the same as in Table 11. Similar to Table 11, the Pr(>F) values for all factors in Table 15 are practically zero, which means each factor has a significant impact on AUPRC scores.

Table 15 ANOVA for RUS, CLF and size as factors of performance in terms of AUPRC

Since all factors have a significant impact on performance, we can conduct a Tukey HSD test to rank the levels of each factor in terms of their impact on performance. Here we see a clear relationship between the class ratio and AUPRC scores. The HSD results in Table 16 show that models built with a larger number of majority class instances in the training data are associated with higher AUPRC scores. Since the results here are AUPRC scores averaged across all datasets and learners, they imply that, in general, applying RUS to induce any class ratio more balanced than 1:81 degrades performance in terms of AUPRC.

Table 16 HSD test groupings after ANOVA of AUPRC for the RUS factor

Next we report the results of the HSD test that ranks the classifier factor. In keeping with the previous results for performance in terms of AUC, ET and XGBoost are associated with the best performance. However, we see in Table 17 that, for performance in terms of AUPRC, ET is in a group by itself, whereas for AUC, ET and XGBoost were grouped together as the best performers.

Table 17 HSD test groupings after ANOVA of AUPRC for the CLF factor

Interestingly, for performance in terms of AUPRC, there is a trend relating dataset size to performance. The smallest dataset is associated with the best performance in terms of AUPRC, as Table 18 reveals. Intuitively, the smallest dataset has the smallest negative class, so there may be fewer false positives to lower the precision score. It is an interesting question for future work whether RUS has a smaller impact on AUPRC for small imbalanced datasets than for large imbalanced datasets.

Table 18 HSD test groupings after ANOVA of AUPRC for the size factor

Now we move on to analyze the effect of the classifier and RUS factors on experimental outcomes for individual datasets. The first dataset we report on is the DMEPOS dataset. First, we present the ANOVA table for the variance in AUC scores attributable to the classifier and RUS factors. The ANOVA test results in Table 19 show that both the RUS and classifier factors have a significant impact on AUC scores.

Table 19 ANOVA for RUS and CLF as factors of performance in terms of AUC

Since both the choice of classifier and RUS have a significant impact on experimental outcomes, we can rank both factors with HSD tests. The first test ranks the levels of the RUS factor; the result is in Table 20. Similar to the previous case, where we looked at the impact of RUS across all three datasets, we see that for the DMEPOS data, applying RUS to induce class ratios of 1:9, 1:27, or 1:3 yields the best performance in terms of AUC.

Table 20 HSD test groupings after ANOVA of AUC for the RUS factor

Next we rank the classifiers in terms of their impact on AUC scores when classifying the DMEPOS data. This ranking is in Table 21. In this case XGBoost is the top performer; however, the pattern that the two best classifiers are XGBoost and ET continues to hold.

Table 21 HSD test groupings after ANOVA of AUC for the CLF factor

Here we take a look at how the RUS and classifier factors influence the outcome of DMEPOS classification as measured by AUPRC scores. The ANOVA results recorded in Table 22 show both factors have a significant impact on AUPRC scores. Therefore, it is worthwhile to conduct HSD tests.

Table 22 ANOVA for RUS and CLF as factors of performance in terms of AUPRC

The impact of RUS on AUPRC scores for the classification of the DMEPOS data is similar to what we find for the classification over all datasets. The outcome of the HSD test is reported in Table 23. We see that not applying RUS at all and applying it to induce class ratios of 1:81 or 1:27 yield results that are not significantly different, and these levels are associated with the best results.

Table 23 HSD test groupings after ANOVA of AUPRC for the RUS factor

Next we take a look at the impact of the choice of classifier on AUPRC scores for the classification of DMEPOS data. The result of the HSD test in Table 24 reflects the trend that XGBoost and ET yield the best performance.

Table 24 HSD test groupings after ANOVA of AUPRC for the CLF factor

Here we start the analysis of results for the classification of the Part B data. Table 25 contains the results of the ANOVA test for the impact of the classifier and RUS factors on the AUC scores recorded for classification of the Part B data. Since the Pr(>F) values for both factors are practically zero, we conclude that both factors have a significant impact on AUC scores.

Table 25 ANOVA for RUS and CLF as factors of performance in terms of AUC

Since the ANOVA test results in Table 25 show RUS has a significant impact, we conduct an HSD test to determine which level of RUS is associated with the best performance. The result of the HSD test is in Table 26. The HSD test results computed over all datasets, as well as those for the DMEPOS dataset, place the 1:27 level of RUS in the group with the best performance. However, for the Part B data, we see that the 1:3 and 1:9 levels are associated with the best performance.

Table 26 HSD test groupings after ANOVA of AUC for the RUS factor

In Table 27 we have the results of the HSD test for the classifier factor. In keeping with their performance across all datasets, and the DMEPOS dataset, ET and XGBoost are the two best performing classifiers.

Table 27 HSD test groupings after ANOVA of AUC for the CLF factor

Now we come to the analysis of AUPRC results for the Part B data. To analyze the impact of the RUS and classifier factors on the AUPRC scores recorded in our experiments with the Part B data, we ran an ANOVA test. The outcome of the ANOVA test is listed in Table 28. Since the Pr(>F) values are very small, we conclude that both the RUS and CLF factors have a significant impact on AUPRC scores.

Table 28 ANOVA for RUS and CLF as factors of performance in terms of AUPRC

Since the ANOVA test shows RUS has a significant effect on AUPRC scores, we can use an HSD test to rank the levels of the RUS factor and determine which level yields the best performance. For the Part B data, we see that not applying RUS yields the best performance. This result is recorded in Table 29.

Table 29 HSD test groupings after ANOVA of AUPRC for the RUS factor

The ANOVA test also shows that the classifier has a significant impact on AUPRC scores for the classification of Part B data. Therefore, we conduct an HSD test to rank the classifiers. In Table 30 we have the now-familiar results of ET doing the best in terms of AUPRC scores, and XGBoost coming in second.

Table 30 HSD test groupings after ANOVA of AUPRC for the CLF factor

Here we begin the analysis of results for the largest of the three datasets, the Part D data. The first set of results we analyze is for the AUC scores recorded for experiments in classifying the Part D data. The resulting ANOVA table is Table 31. It shows that both RUS and classifier choice have a significant impact on performance.

Table 31 ANOVA for RUS and CLF as factors of performance in terms of AUC

Since the ANOVA test shows that RUS has a significant impact, we can rank the levels of the RUS factor by their effect on AUC scores. In Table 32 we see that only the 1:9 RUS level is associated with the best performance.

Table 32 HSD test groupings after ANOVA of AUC for the RUS factor

The ANOVA test result for the classifier factor also implies that it has a significant impact on AUC scores. Interestingly, in Table 33 we see that ET is associated with the best performance in terms of AUC. For other datasets, and across all datasets, XGBoost is associated with the best performance.

Table 33 HSD test groupings after ANOVA of AUC for the CLF factor

The last statistical analysis we perform is for the effect of RUS and classifier on AUPRC scores for the classification of the Part D data. The outcome of the ANOVA test is listed in Table 34. The results show both factors have a significant impact on AUPRC scores.

Table 34 ANOVA for RUS and CLF as factors of performance in terms of AUPRC

Table 35 shows that not applying RUS yields the best AUPRC scores. This is consistent with the results for the other datasets, where applying RUS is likewise associated with lower AUPRC scores.

Table 35 HSD test groupings after ANOVA of AUPRC for the RUS factor

The HSD result for the effect of the classifier on AUPRC scores is in Table 36. As in the case of the other datasets, ET is associated with the highest AUPRC scores for classifying the Part D data as well.

Table 36 HSD test groupings after ANOVA of AUPRC for the CLF factor

Conclusion

We have presented a thorough review of experiments for Medicare fraud detection in three highly imbalanced Big Data datasets. The three datasets, DMEPOS, Part B, and Part D, range in size from about 12 million to 175 million instances. To the best of our knowledge, we are the first to present a study on the latest versions of all three datasets. The original class ratios of the DMEPOS, Part B, and Part D data are 0.0044, 0.0019, and 0.0039, respectively. Our primary goal in compiling this study is to show that the AUC metric does not give a clear signal on the negative impact of RUS in the classification of highly imbalanced Big Data. RUS is a tempting technique for addressing class imbalance with Big Data, since it lowers resource consumption. However, while our results and statistical analyses show that applying RUS to change class ratios appears to have a positive impact on AUC scores, for some classifiers we see a significant drop in AUPRC scores as RUS makes the majority class closer in size to the minority class. Moreover, we find that although other classifiers yield relatively high AUC scores, they yield relatively low AUPRC scores, regardless of the level of RUS used.
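To make the sampling step concrete, RUS simply discards randomly chosen majority class instances until a target class ratio is reached. The following is a minimal pandas sketch, assuming a DataFrame with a hypothetical binary label column named fraud; it illustrates the idea and is not the implementation used in our experiments.

import pandas as pd

def random_undersample(df: pd.DataFrame, label_col: str, ratio: float, seed: int = 0) -> pd.DataFrame:
    """Keep every minority (positive) instance and randomly sample the majority
    class so that the minority:majority ratio is approximately `ratio`
    (e.g. ratio=1/9 induces a 1:9 class ratio)."""
    minority = df[df[label_col] == 1]
    majority = df[df[label_col] == 0]
    n_majority = min(len(majority), int(len(minority) / ratio))
    sampled_majority = majority.sample(n=n_majority, random_state=seed)
    # Shuffle so minority and majority instances are interleaved.
    return pd.concat([minority, sampled_majority]).sample(frac=1, random_state=seed)

# Example usage on a hypothetical training set:
# train_1_9 = random_undersample(train, "fraud", ratio=1/9)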

For highly imbalanced Big Data, we find that the size of the majority class overwhelms the other terms in the calculation of the false positive rate, and therefore hides the detrimental effect of RUS. This in turn hides the effect of RUS in the calculation of AUC as well. Since the AUPRC metric involves precision, which is sensitive to the raw false positive count rather than the false positive rate, the impact of RUS is easier to detect in AUPRC scores. Moreover, as Table 18 illustrates, the size of the majority class magnifies the impact on AUPRC scores. Therefore, the size of the imbalanced dataset is an important consideration when selecting a performance metric. Two interesting avenues for future work are investigating whether RUS has a negative impact on the classification of smaller imbalanced Big Data, and whether there exist any classifiers that exhibit the same robustness to RUS that ET does.
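As a purely hypothetical illustration of this effect, suppose a test set contains 1,000,000 negative and 1,000 positive instances, and that at some threshold a classifier produces 900 true positives and 9,000 false positives. Then

\[ \mathrm{FPR} = \frac{FP}{FP+TN} = \frac{9{,}000}{9{,}000 + 991{,}000} = 0.009, \qquad \mathrm{Precision} = \frac{TP}{TP+FP} = \frac{900}{900 + 9{,}000} \approx 0.091. \]

The ROC point (FPR = 0.009, recall = 0.9) looks nearly ideal, yet fewer than one in ten flagged instances is actually positive; only precision, and hence AUPRC, exposes the 9,000 false positives. These counts are invented for illustration and do not come from our experiments.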

The choice of classifier has a significant effect on our experimental outcomes. XGBoost and ET are consistently the best classifiers, both for performance in terms of AUC and AUPRC. It is interesting to note that, were we to select XGBoost and ET on the basis of their performance in terms of AUC, we would find they both give strong performance for all levels of RUS. However, the AUPRC scores more accurately inform which classifier performs better. Overall, we find ET is the more robust of the two classifiers, since it is the least impacted by RUS. To the best of our knowledge, we are the first to apply ET in the domain of Medicare fraud detection.

All classifiers appear to do well when we look at AUC scores. However, when we look at AUPRC scores we see differences in classifier performance. XGBoost and Random Forest show improved performance as the class ratio moves from 1:1 back toward the original, highly imbalanced ratio. CatBoost and LightGBM yield relatively poor performance in terms of AUPRC. The type of ensemble technique does not, by itself, appear to determine the outcome: LightGBM, CatBoost, and XGBoost are all GBDT implementations, while Random Forest and ET use bagging, yet the two best performers, ET and XGBoost, come from different families. Moreover, the performance of CatBoost and LightGBM in terms of AUPRC also appears to diminish as the size of the dataset increases. Overall, we see RUS has the smallest effect on ET. We conjecture that ET’s random split selection makes it robust to the random deletion of majority class instances that RUS performs. Our results can be summarized as follows: in the classification of the highly imbalanced Part B, Part D, and DMEPOS Big Data, AUPRC shows that RUS can have a detrimental effect on performance, whereas AUC does not reveal this detrimental effect, and ET’s performance is the most robust to RUS among the learners we evaluated.
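Both metrics can be computed from the same set of predicted probabilities. The following minimal scikit-learn sketch uses synthetic labels and scores purely for illustration (none of these values come from our experiments) and shows how AUC and AUPRC would be obtained side by side.

import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

# Synthetic, highly imbalanced labels (~0.4% positive) and noisy scores.
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.004, size=100_000)
y_score = np.clip(0.3 * y_true + rng.random(100_000), 0.0, 1.0)

print("AUC  :", roc_auc_score(y_true, y_score))

precision, recall, _ = precision_recall_curve(y_true, y_score)
print("AUPRC:", auc(recall, precision))  # area under the precision-recall curve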

Availability of data and materials

Not applicable.

Notes

  1. https://docs.rapids.ai/api/cuml/stable/api.html#random-forest

  2. http://archive.org/web.

  3. https://docs.rapids.ai/api/cuml/stable/api.html#random-forest.

Abbreviations

ADASYN: Adaptive synthetic sampling technique

ANOVA: Analysis of variance

AUC: Area Under the Receiver Operating Characteristic Curve

AUPRC: Area Under the Precision Recall Curve

CMS: Centers for Medicare and Medicaid Services

DMEPOS: Durable Medical Equipment, Prosthetics, Orthotics and Supplies

EFB: Exclusive feature bundling

GOSS: Gradient-based one-side sampling

GPU: Graphics processing unit

HSD: Honestly significant difference

RUS: Random Undersampling

SMOTE: Synthetic minority oversampling technique

References

  1. Bekkar M, Djemaa HK, Alitouche TA. Evaluation measures for models assessment over imbalanced data sets. J Inf Eng Appl. 2013;3:10.

  2. Boyd K, Eng KH, Page CD. Area under the precision-recall curve: point estimates and confidence intervals. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer: New York; 2013. p. 451–66.

  3. The Centers for Medicare and Medicaid Services: Medicare Durable Medical Equipment, Devices & Supplies – by Referring Provider and Service. https://data.cms.gov/provider-summary-by-type-of-service/medicare-durable-medical-equipment-devices-supplies/medicare-durable-medical-equipment-devices-supplies-by-referring-provider-and-service 2021.

  4. The Centers for Medicare and Medicaid Services: Medicare Physician & Other Practitioners – by Provider and Service. https://data.cms.gov/provider-summary-by-type-of-service/medicare-physician-other-practitioners/medicare-physician-other-practitioners-by-provider-and-service. 2021.

  5. The Centers for Medicare and Medicaid Services: Medicare Part D Prescribers – by Provider and Drug. https://data.cms.gov/provider-summary-by-type-of-service/medicare-part-d-prescribers/medicare-part-d-prescribers-by-provider-and-drug 2021.

  6. De Mauro A, Greco M, Grimaldi M. A formal definition of big data based on its essential features. Library Review 2016.

  7. Civil Division, U.S. Department of Justice: Fraud Statistics, Overview. https://www.justice.gov/opa/press-release/file/1354316/download 2020.

  8. Centers for Medicare and Medicaid Services: 2019 Estimated Improper Payment Rates for Centers for Medicare & Medicaid Services (CMS) Programs. https://www.cms.gov/newsroom/fact-sheets/2019-estimated-improper-payment-rates-centers-medicare-medicaid-services-cms-programs 2019.

  9. Bauder RA, Khoshgoftaar TM, Hasanin T. Data sampling approaches with severely imbalanced big data for medicare fraud detection. In: 2018 IEEE 30th International Conference on Tools with Artificial Intelligence (ICTAI), 2018;137–142. IEEE

  10. Zuech R, Hancock JT, Khoshgoftaar TM. Detecting web attacks using random undersampling and ensemble learners. J Big Data. 2021;8(1):1–20.

  11. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. Adv Neural Inf Process Syst. 2018;31:8.

  12. Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining - KDD ’16 2016. https://doi.org/10.1145/2939672.2939785.

  13. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y. LightGBM: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.

  14. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

  15. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63(1):3–42.

  16. Han H, Wang W-Y, Mao B-H. Borderline-smote: a new over-sampling method in imbalanced data sets learning. In: International Conference on Intelligent Computing. Springer: Berlin; 2005. p. 878–87.

  17. He H, Bai Y, Garcia EA, Li S. Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In: 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), 2008;1322–1328. IEEE

  18. Zaharia M, Xin RS, Wendell P, Das T, Armbrust M, Dave A, Meng X, Rosen J, Venkataraman S, Franklin MJ, et al. Apache spark: a unified engine for big data processing. Commun ACM. 2016;59(11):56–65.

  19. Le Cessie S, Van Houwelingen JC. Ridge estimators in logistic regression. J R Stat Soc. 1992;41(1):191–201.

  20. Meng X, Bradley J, Yavuz B, Sparks E, Venkataraman S, Liu D, Freeman J, Tsai D, Amde M, Owen S, et al. Mllib: Machine learning in apache spark. J Mach Learn Res. 2016;17(1):1235–41.

  21. Hasanin T, Khoshgoftaar TM, Leevy JL, Bauder RA. Severely imbalanced big data challenges: investigating data sampling approaches. J Big Data. 2019;6(1):1–25.

  22. Kuncheva LI, Arnaiz-Gonzalez A, Díez-Pastor J-F, Gunn IA. Instance selection improves geometric mean accuracy: a study on imbalanced data classification. Prog Artif Intell. 2019;8(2):215–28.

  23. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng. 2009;21(9):1263–84.

  24. Sleeman WC IV, Krawczyk B. Multi-class imbalanced big data classification on spark. Knowl-Based Syst. 2021;212: 106598.

  25. Calvert CL, Khoshgoftaar TM. Threshold based optimization of performance metrics with severely imbalanced big security data. In: 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI), 2019. p. 1328–34.

  26. Johnson JM, Khoshgoftaar TM. Medicare fraud detection using neural networks. J Big Data. 2019;6(1):1–35.

  27. Hancock JT, Khoshgoftaar TM. Survey on categorical data for neural networks. J Big Data. 2020;7(1):1–41.

  28. Johnson JM, Khoshgoftaar TM. The effects of data sampling with deep learning and highly imbalanced big data. Inf Syst Front. 2020;22(5):1113–31.

  29. Apache Software Foundation: Hadoop. https://hadoop.apache.org.

  30. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 2015;10(3):e0118432.

  31. Breiman L. Bagging predictors. Machine learning. 1996;24(2):123–40.

  32. Efron B, Tibshirani RJ. An Introduction to the Bootstrap. Boca Raton: CRC Press; 1994. p. 5–6.

  33. Hasanin T, Khoshgoftaar TM. The effects of random undersampling with simulated class imbalance for big data. In: 2018 IEEE International Conference on Information Reuse and Integration (IRI), 2018. p. 70–9.

  34. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–232.

  35. Hancock JT, Khoshgoftaar TM. Catboost for big data: an interdisciplinary review. J Big Data. 2020;7(1):1–45.

  36. Hancock JT, Khoshgoftaar TM. Leveraging lightgbm for categorical big data. In: 2021 IEEE Seventh International Conference on Big Data Computing Service and Applications (BigDataService), 2021. p. 149–154.

  37. LEIE: Office of Inspector General Leie Downloadable Databases. https://oig.hhs.gov/exclusions/index.asp.

  38. Bauder RA, Khoshgoftaar TM. A novel method for fraudulent medicare claims detection from expected payment deviations (application paper). In: 2016 IEEE 17th International Conference on Information Reuse and Integration (IRI), 2016. p. 11–19.

  39. The Centers for Medicare and Medicaid Services: Medicare Durable Medical Equipment, Devices & Supplies – by Referring Provider and Service Data Dictionary. https://data.cms.gov/resources/medicare-durable-medical-equipment-devices-supplies-by-referring-provider-and-service-data-dictionary 2021.

  40. The Centers for Medicare and Medicaid Services: Medicare Physician & Other Practitioners – by Provider and Service Data Dictionary. https://data.cms.gov/resources/medicare-physician-other-practitioners-by-provider-and-service-data-dictionary. 2021.

  41. The Centers for Medicare and Medicaid Services: Medicare Part D Prescribers – by Provider and Drug Data Dictionary. https://data.cms.gov/resources/medicare-part-d-prescribers-by-provider-and-drug-data-dictionary 2021.

  42. Van Rossum G, Drake FL. Python/C Api Manual-Python 3. CreateSpace 2009.

  43. Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikit-learn: Machine learning in python. J Mach Learn Res. 2011;12:2825–30.

  44. McGinnis W. Category Encoders. https://contrib.scikit-learn.org/category_encoders/.

  45. Johnson JM, Khoshgoftaar TM. Medical provider embeddings for healthcare fraud detection. SN Computer Sci. 2021;2(4):276.

  46. XGBoost Parameters. XGBoost Developers. https://xgboost.readthedocs.io/en/stable/parameter.html Accessed 9 Jul 2022.

  47. Parameters. Yandex Corporation. https://catboost.ai/en/docs/references/training-parameters/common. Accessed 9 Jul 2022.

  48. Iversen GR, Norpoth H. Analysis of Variance, vol. 1. Newbury Park: Sage; 1987.

  49. Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949;5(2):99–114.

Acknowledgements

The authors would like to thank the various members of the Data Mining and Machine Learning Laboratory, Florida Atlantic University, for their assistance with the reviews.

Funding

Not applicable.

Author information

Contributions

JTH conducted experiments, and wrote the manuscript. JMJ prepared all datasets, and provided reviews and input for the manuscript. TMK provided oversight of experiments, coordinated research, and provided input for the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to John T. Hancock.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Hancock, J.T., Khoshgoftaar, T.M. & Johnson, J.M. Evaluating classifier performance with highly imbalanced Big Data. J Big Data 10, 42 (2023). https://doi.org/10.1186/s40537-023-00724-5

Keywords