The effects of class rarity on the evaluation of supervised healthcare fraud detection models

The United States healthcare system produces an enormous volume of data with a vast number of financial transactions generated by physicians administering healthcare services. This makes healthcare fraud difficult to detect, especially when there are considerably fewer fraudulent transactions (documented and readily available) than non-fraudulent ones. The ability to successfully detect fraudulent activities in healthcare, given such discrepancies, could recover up to $350 billion in monetary losses. In machine learning, when one class has a substantially larger number of instances (majority) compared to the other (minority), this is known as class imbalance. In this paper, we focus specifically on Medicare, utilizing three ‘Big Data’ Medicare claims datasets with real-world fraudulent physicians. We create a training and test dataset for all three Medicare parts, both separately and combined, to assess fraud detection performance. To emulate class rarity, which indicates particularly severe levels of class imbalance, we generate additional datasets, by removing fraud instances, to determine the effects of rarity on fraud detection performance. Before a machine learning model can be distributed for real-world use, a performance evaluation is necessary to determine the best configuration (e.g. learner, class sampling ratio) and whether the associated error rates are low, indicating good detection rates. With our research, we demonstrate the effects of severe class imbalance and rarity using a training and testing (Train_Test) evaluation method via a hold-out set, and provide our recommendations based on the supervised machine learning results. Additionally, we repeat the same experiments using Cross-Validation, and determine it is a viable substitute for Medicare fraud detection. For machine learning with the severe class imbalance datasets, we found that, as expected, fraud detection performance decreased as the fraudulent instances became rarer. 
We apply Random Undersampling to both Train_Test and Cross-Validation, for all original and generated datasets, in order to assess potential improvements in fraud detection by reducing the adverse effects of class imbalance and rarity. Overall, our results indicate that the Train_Test method significantly outperforms Cross-Validation.


Introduction
The healthcare system in the United States (US) contains an extremely large number of physicians, who perform numerous services for an even larger number of patients. Every day, a massive number of financial transactions are generated by physicians administering healthcare services, such as hospital visits, drug prescriptions, and other medical procedures. The vast majority of these financial transactions are conducted without any fraudulent intent, but a minority of physicians maliciously defraud the system for personal gain. In machine learning, when a dataset portrays this discrepancy in class representation (i.e. a low number of actual fraud cases), it is known as class imbalance [1,2]. The main issue attributed to class imbalance is the difficulty in discriminating useful information between classes due to the over-representation of the majority class (non-fraud) and the limited amount of information available in the minority class (fraud). In real-world medical practice, fraudulent physicians are in the minority, and even fewer are confirmed and well documented. To further confound the situation, we found that the number of these known fraudulent physicians is decreasing each year, trending towards class rarity. Class rarity is when the Positive Class Count (PCC), or number of minority class instances, becomes extremely small. We argue that qualifying for severe class imbalance and rarity does not necessarily rely on the proportion between classes, but on PCC. A common class imbalance percentage is 2% [3], but, for example, a dataset with 10,000,000 instances still has a PCC of 200,000 at that percentage. A machine learning model would most likely be able to effectively determine qualities and patterns in the minority class by using 200,000 instances, and would not be affected by issues normally attributed to class imbalance, especially with data sampling. 
However, these issues would be present if this dataset had an extremely small PCC, such as 100, where a machine learning model would have to contend with 9,999,900 negative instances during training. This latter example indicates the concerns presented with class rarity. These issues are further exacerbated when applied to Big Data, which can dramatically increase the number of majority instances while leaving the minority class representation relatively unchanged [4].
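The distinction between class proportion and PCC can be made concrete with the two examples above. The following is a minimal sketch; the helper name `positive_class_count` is ours and purely illustrative.

```python
def positive_class_count(total_instances: int, positive_fraction: float) -> int:
    """Return the Positive Class Count (PCC) implied by a minority-class fraction."""
    return round(total_instances * positive_fraction)

total = 10_000_000

# Imbalanced but not rare: 2% of 10 million instances.
pcc_imbalanced = positive_class_count(total, 0.02)

# Rare: the same dataset size with only 100 positive instances.
pcc_rare = 100
negatives_rare = total - pcc_rare

print(pcc_imbalanced)   # 200000
print(negatives_rare)   # 9999900
```

Both datasets have the same size, but only the second leaves the learner with too few positive instances to characterize the minority class, which is the sense in which rarity depends on PCC rather than on the ratio alone.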
Throughout the literature, the task of defining Big Data has proven rather complicated, without a universally accepted definition [5]. Recently, Senthilkumar et al. [5] provided a definition specifically for healthcare, categorizing Big Data into six V's: Volume, Variety, Velocity, Veracity, Variability, and Value. We employ three publicly available Medicare 'Big Data' datasets released by the Centers for Medicare and Medicaid Services (CMS): Part B, Part D, and DMEPOS, using all years currently available for these Medicare datasets. These CMS datasets include payment information for claims submitted to Medicare, payments made by Medicare, and other data points related to procedures performed, drugs administered, or supplies issued. Both individually and combined, the Medicare datasets provide an extensive view into a physician's annual claims, across three major parts of Medicare. Furthermore, we utilize the Office of Inspector General's (OIG) List of Excluded Individuals and Entities (LEIE) [6] to generate fraud labels since CMS does not provide any fraud information. For further detail related to Medicare and Medicare fraud, we refer the reader to [7][8][9][10][11].
Even though there are far fewer fraudulent physicians, they still contribute to large financial losses. Many citizens depend on healthcare, especially the elderly population, which comprises the majority of Medicare beneficiaries. The Federal Bureau of Investigation (FBI) concluded that fraud accounts for 3-10% of healthcare costs [12]. Throughout the United States, in 2017, healthcare generated approximately $3.5 trillion in spending [13], with Medicare contributing around 20% [14], receiving roughly $592 billion in federal funding while spending up to $702 billion. This translates to around $105 billion to $350 billion lost from fraud per year, with $21 billion to $70 billion from Medicare alone. In order to relate these monetary values to the number of affected citizens, the average healthcare cost in the United States, per person, per year, is about $10,000 [15]. Therefore, if healthcare fraud were eliminated altogether, a potential 35 million ($350 billion/$10,000) more Americans could receive full medical assistance per year, with about 7 million ($70 billion/$10,000) from Medicare alone. This demonstrates the potential impact of minimizing fraud. Note, healthcare spending is forecast to increase more than 4% per year through 2026 [13].
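The loss ranges quoted above follow directly from the cited percentages. The short calculation below reproduces them; it is only a worked check of the arithmetic, using rounded inputs (about $3.5 trillion in spending, a 20% Medicare share, a 3-10% fraud share, and $10,000 per person), and the variable names are ours.

```python
TOTAL_SPENDING = 3_500_000_000_000      # approx. 2017 US healthcare spending [13]
MEDICARE_SHARE_PCT = 20                 # approx. Medicare share of spending [14]
FRAUD_PCT_LOW, FRAUD_PCT_HIGH = 3, 10   # estimated fraud share of costs [12]
COST_PER_PERSON = 10_000                # approx. annual cost per person [15]

def fraud_loss(spending: int, fraud_pct: int) -> int:
    """Dollars lost to fraud for a given spending level and fraud percentage."""
    return spending * fraud_pct // 100

medicare_spending = TOTAL_SPENDING * MEDICARE_SHARE_PCT // 100

overall = (fraud_loss(TOTAL_SPENDING, FRAUD_PCT_LOW),
           fraud_loss(TOTAL_SPENDING, FRAUD_PCT_HIGH))
medicare = (fraud_loss(medicare_spending, FRAUD_PCT_LOW),
            fraud_loss(medicare_spending, FRAUD_PCT_HIGH))
people_covered = overall[1] // COST_PER_PERSON

print(overall)         # (105000000000, 350000000000)
print(medicare)        # (21000000000, 70000000000)
print(people_covered)  # 35000000
```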
A number of studies employed the CMS Medicare datasets to detect fraudulent physician behavior through data mining, machine learning and other analytical methods [16,17], with a large portion of these studies using only Part B data [18][19][20][21][22]. Even though the Part B dataset is comprehensive in its own right, employing only one Medicare part limits the comprehensive assessment of fraud detection performance available to a machine learning model compared to multiple parts. There are a few studies that utilized multiple parts of Medicare including [23][24][25]. Branting et al. [23] used the Part B (2012-2014) and Part D (2013) datasets. They utilize the LEIE for determining fraud labels through their identity-matching algorithm centered around a physician's National Provider Identifier (NPI) [26]. Through this algorithm, they matched over 12,000 fraudulent physicians, but the authors were not as discriminatory in fraud label mapping as we are in this study, leading to the possible inclusion of physicians excluded for charges unrelated to fraud. The authors employed sampling, balancing their dataset to a 50:50 class ratio. The authors may have benefited from examining other class ratios, possibly without removing as many non-fraudulent instances. They developed a method for discriminating fraudulent behavior by determining the fraud risk using graph-based features in conjunction with a decision tree learner resulting in good overall fraud detection. In [24], Sadiq et al. employ Part B, Part D, and DMEPOS datasets, but limit their study to Florida only. They venture to find anomalies that possibly indicate fraudulent or other interesting behavior. The authors use an unsupervised method (Patient Rule Induction Method based bump hunting) to try to detect peak anomalies by spotting spaces of higher modes and masses within the dataset. They conclude that their method can accurately characterize the attribute space of CMS datasets. 
In [25], we conducted an exploratory study to determine which part of Medicare and which learner allows for the better detection of fraudulent behavior. We used all available years from 2015 and before, while fraud labels were generated through LEIE mapping. However, in [25], our experiments were conducted to show the feasibility of Medicare fraud detection, without focusing on the problem of class imbalance, rarity, or any methods to mitigate adverse effects on model performance. Unfortunately, even with these and the other research currently available, according to [27], current methods are not significantly decreasing monetary losses facing the US healthcare system. Therefore, we continue the efforts to detect Medicare fraud in order to decrease monetary loss due to real-world fraud.
In real-world practice, machine learning models are built on a full training dataset and evaluated on a separate test dataset consisting of new, unseen data points (i.e. hold-out set). We denote this evaluation method as Train_Test, and emulate this process by splitting the CMS datasets into training datasets (all years prior to 2016) and test datasets (the full 2016 year). In essence, the problem can be summarized as: can a model accurately detect new, known fraudulent physicians (from 2016) based on historical patterns of fraud (prior to 2016)? The results from the Train_Test method will provide a clear evaluation of fraud detection performance with Medicare claims data using machine learning. Furthering this sentiment, Rao et al. [28] discuss that: "Any modeling decisions based upon experiments on the training set, even cross validation estimates, are suspect, until independently verified [by a completely new Test dataset]." Unfortunately, in practice, the means to generate both a separate training and test set are limited, such as when the number of positive cases is too small or when only prior data is available. Therefore, we also conduct all of our experiments using Cross-Validation (CV). CV emulates the Train_Test method by splitting a single dataset into smaller training and test datasets for building and evaluation. This allows practitioners to assess prediction performance without a separate test dataset. For our experiments, we apply CV to our training datasets, and through comparisons with Train_Test results, we determine if CV can be a useful substitute. There are different variants of CV including: k-fold, stratified k-fold, leave-p-out, and holdout [29]. In this study, our experiments employ stratified k-fold CV due to its usefulness and ubiquity in machine learning as well as its focus on creating a balanced class distribution across folds.
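The two evaluation schemes can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the implementation used in the experiments: the record layout (a `year` field), the choice of k = 5, and the seed are all hypothetical.

```python
import random
from collections import defaultdict

def train_test_by_year(rows, test_year=2016):
    """Train_Test: earlier years form the training set, test_year is held out."""
    train = [r for r in rows if r["year"] < test_year]
    test = [r for r in rows if r["year"] == test_year]
    return train, test

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which every fold receives
    roughly the same share of each class as the full dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so folds stay class-proportional.
        for position, i in enumerate(indices):
            folds[position % k].append(i)
    for f in range(k):
        test_idx = sorted(folds[f])
        train_idx = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train_idx, test_idx

# Example: 10 fraud and 90 non-fraud labels split into 5 folds.
labels = [1] * 10 + [0] * 90
for train_idx, test_idx in stratified_k_fold(labels, k=5):
    print(len(train_idx), len(test_idx), sum(labels[i] for i in test_idx))
```

With very small PCC values, stratification matters more: an unstratified split can easily produce folds that contain no positive instances at all.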
We present our novel procedure for data processing and fraud label mapping, to create our Medicare datasets. These Medicare datasets, with the added LEIE fraud labels, have severely imbalanced class distributions. In order to assess the effects of severe class imbalance and rarity, in addition to these original datasets, we generate datasets with growing class imbalance and increasing degrees of rarity by randomly removing positive class instances (i.e. lowering PCC). We also perform data sampling, specifically random undersampling (RUS), to determine whether sampling can effectively mitigate the negative effects of severe class imbalance and class rarity. For each original and generated dataset, we created five additional datasets with varying class ratios, ranging from balanced (50:50) to highly imbalanced (1:99). We evaluate our results over three different learners, across all Medicare parts and the combined dataset, using the area under the receiver operating characteristic (ROC) curve (AUC) and significance testing. Our study has several goals. Primarily, we are determining the effects severe class imbalance and rarity have on the Train_Test evaluation method in Medicare fraud detection, which to the best of our knowledge, we are the first to study. We also compare the Train_Test method to CV, across all experimental configurations. Lastly, we examine overall trends across Train_Test and CV to determine the optimal model configuration in terms of data sampling ratio and learner. From the Train_Test results, we determine that for the severely imbalanced Medicare claims, machine learning was able to discriminate between real-world fraudulent and non-fraudulent physician behavior reasonably well, but as the PCC trended toward rarity, the results generally decreased over all experiments. Overall, the results show that Train_Test significantly outperforms CV. 
Even so, we conclude that when necessary CV can be a viable substitute, but a practitioner should note that estimates may be conservative. Moreover, CV was similarly affected by severe class imbalance and rarity. Data sampling was demonstrated to mitigate the effects of having such a limited number of positive class instances, where the best class ratios were those with slightly larger negative class representation (i.e. less balanced).
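The two dataset manipulations described above, lowering PCC to emulate rarity and applying RUS to reach a target class ratio, can be sketched as follows. This is a minimal illustration, not the code used in the study: the `fraud` field name, the seeds, and the example PCC and ratio values are assumptions.

```python
import random

def reduce_positives(rows, target_pcc, seed=0):
    """Emulate rarity: keep a random subset of target_pcc fraud instances
    and all non-fraud instances."""
    rng = random.Random(seed)
    pos = [r for r in rows if r["fraud"] == 1]
    neg = [r for r in rows if r["fraud"] == 0]
    return rng.sample(pos, min(target_pcc, len(pos))) + neg

def random_undersample(rows, minority_pct, seed=0):
    """RUS: randomly discard non-fraud instances until fraud instances make up
    about minority_pct percent of the data (50 gives 50:50, 1 gives 1:99)."""
    rng = random.Random(seed)
    pos = [r for r in rows if r["fraud"] == 1]
    neg = [r for r in rows if r["fraud"] == 0]
    n_neg = round(len(pos) * (100 - minority_pct) / minority_pct)
    return pos + rng.sample(neg, min(n_neg, len(neg)))

# Toy data: 20 fraud and 1,980 non-fraud instances.
data = [{"id": i, "fraud": 1 if i < 20 else 0} for i in range(2000)]
rare = reduce_positives(data, target_pcc=10)
balanced = random_undersample(rare, minority_pct=50)
skewed = random_undersample(rare, minority_pct=1)
print(len(rare), len(balanced), len(skewed))  # 1990 20 1000
```

Note that RUS never touches the positive class, so the PCC fixed in the first step is preserved across every sampled ratio.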
The rest of the paper is organized as follows. The "Related works" section presents related works, focusing on studies that employ datasets with class rarity. We discuss the Medicare and LEIE datasets in the "Data" section, and summarize our data processing and fraud label mapping. We discuss the concepts of severe class imbalance and rarity in the "Severe class imbalance and rarity" section and explain the Train_Test and CV evaluation methods in the "Train_Test and Cross-Validation" section. In the "Methods/experimental" section, we outline the learners, performance metrics, and significance testing. We then present our experimental results in the "Discussion and results" section. Finally, we conclude our study and present future work.

Related works
Throughout the previous academic literature, class imbalance has been widely studied, and as in the real world, there are many cases where there is a large disparity between classes, such as online shopping (making purchase/not making purchase) [30] and healthcare fraud (fraud/non-fraud). The majority of these studies employ smaller datasets [31][32][33][34][35][36]. Experiments using class imbalance with smaller data could provide a basic understanding of the effects that class imbalance has on Big Data, but will be limited in understanding specific concerns when using Big Data. For instance, we determined throughout our research that, when applying sampling in machine learning, a balanced ratio (50:50) is not as beneficial for Big Data as it is for smaller datasets, at least in the Medicare fraud detection domain. Studies that focus on class rarity are far less common [2,37]. With regard to the research presented in this paper, we limit our discussion to studies that employ Big Data for studying class imbalance in relation to rarity.
In [38], Hasanin et al. use four real-world Big Data sources from the sentiment140 text corpus and the UCI Machine Learning Repository in order to assess the impacts of severe class imbalance on Big Data. To supplement the original datasets, the authors generate additional datasets ranging from imbalanced to severely imbalanced with positive class percentages of 10%, 1%, 0.1%, 0.01%, and 0.001%. They use the Random Forest (RF) learner on both Apache Spark and H2O Big Data frameworks to assess classification performance, determining that 0.1% and 1.0% can provide adequate results. They also experimented with data sampling, specifically RUS, and determined that balanced class ratios provided no benefit over the full datasets. Even though 0.001% is a very large disparity between classes, the authors do not generate any datasets that qualify as rarity. Fernandez et al. [3] provide a literature survey and experimentation focused on Big Data and class imbalance. They employ Hadoop with MapReduce using the Spark Machine Learning Library (MLlib) [39] versions of undersampling (RUS) and oversampling (ROS), and Synthetic Minority Over-sampling Technique (SMOTE). The authors compare RUS, ROS, and SMOTE over two Big Data frameworks using two imbalanced datasets, derived from the ECBDL14 dataset, which consist of 12 million and 600,000 instances, 90 features and class ratios of 98:2 (majority:minority). They compare these methods across two learners, RF and Decision Tree. They determine that as the number of partitions decreases, RUS has better performance, while ROS performs better with a larger number of partitions, and SMOTE performed inadequately across all experiments. They also recommend that newer, more advanced Big Data frameworks, such as Apache Spark, should be used in preference to more dated frameworks. The authors do not remove any positive class instances in order to study the effects of rarity or severe class imbalance. Another work by Rastogi et al. 
[40] uses the ECBDL14 dataset, with 1.7% positive class representation (PCC = 48,637) and a total of 2.8 million instances and 631 features. They split the data 80% for training and 20% for testing. They compare Python SMOTE to their own version of SMOTE based on Locality Sensitive Hashing implemented in Apache Spark, demonstrating their model is superior. The authors do not test their method on a dataset with a rare number of positive class members.
Two studies that employ relatively Big Data with rarity are [41] and [42]. Dong et al. [41] develop a deep learning model for very imbalanced datasets, employing batchwise incremental minority class rectification along with a scalable hard mining principle. They evaluate their method's performance on a number of datasets, including a clothing attribute benchmark dataset (X-domain) with a PCC of 20 and 204,177 negative class instances. This dataset contains multiple classes, but they also test a binary dataset with 3713 positive instances and 159,057 negative. Through their experiments, they found their method was superior, with a minimum 3-5% increase in accuracy, with the additional benefit of being up to seven times faster. In [42], Zhai et al. utilize seven different datasets from various data sources including one artificial dataset with 321,191 negative class instances and 150 positive. They developed an algorithm based on MapReduce and ensemble extreme learning machine (ELM) classifiers, and determine their method superior compared to three different versions of SMOTE. From Zhai et al.'s research, it is hard to infer the effects of rarity on the aforementioned dataset because their datasets are gathered from multiple sources. In order to determine the effects of rarity and the ability of their model, it would have been beneficial if datasets from the same source were tested with varying PCC values.
In a study conducted by Tayal et al. [43], the authors perform experiments with real-world datasets derived from the standard KDD Cup 1999 data, where the largest contained 812,808 instances. The positive class made up 0.098%, with a PCC of around 800, qualifying this dataset as severely imbalanced, but not as exhibiting rarity. This latter point is because 800 instances could still provide a reasonable level of discrimination for a machine learning model. They determined that their RankRC method was able to outperform several SVM methods and was more efficient with processing speed and space required. Maalouf et al. [44] present a truncated Newton method in prior correction logistic regression (LR) including an additional regularization term to improve performance. They also employ the KDD Cup 1999 dataset, along with six others. The largest dataset they use has 304,814 instances with the positive representation at 0.34%, translating to a PCC of a little over 1000. In [45], Chai et al. generate a dataset from a manufacturer and user facility device experience database with the goal of automatically identifying health information technology incidents. The subset consists of 570,272 instances with a PCC of 1534. They generate two additional subsets, one balanced (50:50) and another with 0.297% class representation. They employ statistical text classification through LR. These studies utilize data that can be described as relatively big, but do not assess the effects of rarity.
Zhang et al. [30] discuss the vast amounts of data created from online shopping websites and the large imbalance between purchases made versus visits made without a purchase. The level of imbalance quickly escalates when considering high spending customers (i.e. over $100). They found that in a week, a retail website had 42 million visits with only 16,000 purchases, resulting in a ratio of 1:2,500, while high spending customers were 1:10,000. They developed an adaptive sampling scheme that samples from severely imbalanced data. Through this method, the authors ensure that when sampling data, they obtain a satisfactory number of positive class instances by searching through the original data. We would argue that when the effects of imbalance can be solved by searching for more available positive class instances, then the domain in question does not suffer from the traditional effects attributed to class imbalance even if that ratio is 1:10,000 or worse. The real effects of severe class imbalance and rarity are felt when, throughout all available data, the resultant PCC is so minimal that a machine learning algorithm cannot discriminate useful patterns from the positive class. As mentioned, we note that the number of available real-world fraudulent physicians matching with the CMS Medicare datasets is decreasing, moving the Medicare fraud detection domain towards rarity. To the best of our knowledge, we are the first study to assess the effects of class rarity using the Train_Test evaluation method with Medicare Big Data using real-world fraud labels.

Data
In this section, we summarize the Medicare datasets, LEIE, and our data preparation and feature engineering. In conjunction with our brief summary, the discussions in [25] cover all data-related details not specifically covered in this work. Since this study aims to predict fraudulent behavior as it appears in real-world medical practice, we utilize the LEIE, which currently contains the most comprehensive list of real-world fraudulent physicians throughout the United States. To the best of our knowledge, there is no publicly available database containing both provider claims activity and fraud labels, and therefore, we use the LEIE to supplement the Medicare datasets, allowing for an accurate assessment of fraud detection performance. Additionally, we detail our training and test datasets, outlining the differences, and discuss our processes.
We utilize three publicly available Medicare datasets maintained by the CMS: Part B, Part D, and DMEPOS [46][47][48]. CMS is the Federal agency within the US Department of Health and Human Services that manages Medicare, Medicaid, and several other health related programs. These Medicare datasets are derived from administrative claims data for Medicare beneficiaries enrolled in the Fee-For-Service program, where all claims information is recorded after payments are made [49][50][51], and thus we assume these datasets are already reasonably cleansed. We employ all years currently available for all three parts, where Part B is available for 2012 through 2016 and Part D and DMEPOS are available for 2013 through 2016. The Part B dataset provides claims information for each procedure a physician performs. The Part D dataset provides information pertaining to the prescription drugs they administer under the Medicare Part D Prescription Drug Program. The DMEPOS dataset provides claims information about medical equipment, prosthetics, orthotics, and supplies that physicians referred patients to either purchase or rent from a supplier. Physicians are identified using their unique NPI established by CMS [26]. Part B and DMEPOS have all procedures labeled by their Healthcare Common Procedure Coding System (HCPCS) code [52], whereas Part D has each drug labeled by its brand and generic name. We create a training and test dataset for each Medicare part. The training datasets encompass all available years of Medicare data prior to 2016, for each Medicare part, by appending each annual dataset, aggregated over matching features. The test datasets were created using the same process, but only include the latest 2016 data. We also develop training and test combined datasets, which integrate features from all three Medicare parts, joined by NPI, provider type, and year (excluding the gender variable from DMEPOS). 
We develop the combined dataset because, in practice, a physician could submit claims to multiple Medicare parts with no dependable way of determining within which part a physician will target their fraud behavior. Through combining information relating to procedures, drugs and equipment, we are utilizing a more encompassing view of a physician's behavior for machine learning. One limitation to combining Medicare datasets is that it is only applicable to physicians who submit claims to multiple Medicare parts. The combined training dataset consists of the years 2013 through 2015, while the test set is only 2016.
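The construction of the combined dataset can be pictured as an inner join on the stated keys. The sketch below is an assumption-laden illustration rather than the actual pipeline: it assumes one provider-level row per key in each part, a physician must appear in all three parts to be kept, and the field names and prefixes (`npi`, `provider_type`, `year`, `gender`, `b_`, `d_`, `dme_`) are hypothetical.

```python
KEYS = ("npi", "provider_type", "year")

def row_key(row):
    """Join key shared by all three Medicare parts."""
    return tuple(row[k] for k in KEYS)

def combine_parts(part_b, part_d, dmepos):
    """Inner-join provider-level rows from the three parts on KEYS,
    prefixing feature names by part and dropping gender from DMEPOS."""
    d_index = {row_key(r): r for r in part_d}
    e_index = {row_key(r): r for r in dmepos}
    combined = []
    for b in part_b:
        k = row_key(b)
        if k not in d_index or k not in e_index:
            continue  # physician did not submit claims to every part
        row = dict(zip(KEYS, k))
        for prefix, src in (("b_", b), ("d_", d_index[k]), ("dme_", e_index[k])):
            for name, value in src.items():
                if name in KEYS or (prefix == "dme_" and name == "gender"):
                    continue
                row[prefix + name] = value
        combined.append(row)
    return combined

part_b = [{"npi": 1, "provider_type": "Cardiology", "year": 2015, "gender": "F", "paid": 100.0},
          {"npi": 2, "provider_type": "Urology", "year": 2015, "gender": "M", "paid": 50.0}]
part_d = [{"npi": 1, "provider_type": "Cardiology", "year": 2015, "drug_cost": 30.0}]
dmepos = [{"npi": 1, "provider_type": "Cardiology", "year": 2015, "gender": "F", "rentals": 4}]
print(combine_parts(part_b, part_d, dmepos))
```

In this toy example only the first physician survives the join, which mirrors the limitation noted above: physicians who do not bill every part are absent from the combined dataset.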
From these Medicare datasets, we select the features specifically related to claims information and select physician-specific data points, as we believe they provide value and are readily usable by machine learning models. Table 1 lists the features chosen for our study. Note the exclusion feature is generated through mapping to the LEIE, creating the fraud or non-fraud labels for classifying physicians. We excluded repetitious features including physician names, addresses, or code descriptions as they provide no extra value. We also did not include several features containing missing or constant values. NPI was used for identification purposes but not for building the models, and other features, such as Medicare participation, were used for data filtering. Features such as standardized payments and standard deviation values are also removed since they are not present in all of the Medicare years. Details on all of the available Medicare features can be found in the "Public Use File: A Methodological Overview" documents, for each respective dataset, available at [49][50][51].
The LEIE was established and is maintained by the Office of Inspector General (OIG) [53] under the authority of Sections 1128 and 1156 of the Social Security Act [6]. The LEIE [54] contains information such as reason for exclusion, date of exclusion, and reinstate/waiver date for all current physicians who violated established rules and were found unsuitable to practice medicine. The LEIE, unfortunately, contains the NPI values for only a small percentage of physicians and entities within its database, contributing to the large class imbalance found after fraud labels are added to the Medicare datasets. We note that 38% of providers convicted of fraud continue practicing medicine and 21% of providers with fraud convictions were not suspended from practicing medicine, despite being convicted [55]. There are different categories of exclusions, based on severity of offense. As shown in Table 2, we chose only the mandatory non-permissive exclusions. We use these excluded providers as fraud labels in all of our training and test datasets. Note, the LEIE does not provide within which program a physician perpetrated their offenses (i.e. Medicare), meaning these excluded physicians were not necessarily convicted for committing criminal activities within Medicare, but we assume that a physician who commits such acts would continue their fraudulent behavior when submitting claims to Medicare. The LEIE is aggregated at the provider-level (i.e. a single recorded exclusion per provider by NPI) and does not contain information regarding procedures, drugs or equipment related to fraudulent activities. Therefore, we transform the Medicare data to the provider-(or NPI-) level, and then map LEIE exclusion labels (fraud and non-fraud) to each Medicare dataset by NPI and year. The transformation process consists of grouping the data by provider type (such as Cardiology), NPI, and gender (if available), and aggregating over each procedure/drug name and place of service/rental type. 
In order to minimize information loss from aggregation, we generate additional numeric features for each original numeric feature located in Table 1, including: mean, sum, median, standard deviation, minimum, and maximum. In preparing our categorical variables for data mining, we utilized one-hot encoding, which creates new binary features for each option within a categorical variable, assigning a one or zero based on membership. Physicians are labeled as fraudulent for claims within their exclusion period and the period prior to their recorded exclusion start date. The reason we decided to include claims submitted during the exclusion period is that these are payments that should not have been fulfilled by Medicare, and could be considered fraudulent per the federal False Claims Act (FCA) [11]. We include claims prior to the exclusion start date because these potentially reflect the fraudulent activities that resulted in the physician being placed on the LEIE, including criminal convictions, patient abuse or neglect, or revoked licenses. Table 3 summarizes each dataset used in our study, detailing the number of features, number of fraudulent and non-fraudulent instances, and the percentage of fraudulent cases after aggregation, one-hot encoding and fraud labeling. The main difference between the training and test datasets is in the provider type labels within the 2016 CMS datasets, as they were either entered incorrectly, slightly different, or completely changed. We adjusted as many of these provider type labels as possible when processing the 2016 data, for example changing Obstetrics and Gynecology to Obstetrics/Gynecology, Radiation Therapy Center to Radiation Therapy, and Oral Surgery (Dentists only) to Oral Surgery (dentists only). In addition, there were other provider types that were added or removed, which we were unable to match between the training and test datasets. 
To the best of our knowledge, there is no documentation discussing differences in provider types between years. For the Train_Test evaluation method, we removed any non-matching physician type from both the training and test datasets (after one-hot encoding), such as Hospitalist, which was added in 2016, and Physical Therapist, which was removed starting in 2016. The removal of these non-matching physicians does present some information loss in the Train_Test results, but we believe there is no significant impact on the results as this is a minimal modification. We document these non-matching provider types in Table 7 in Appendix A.
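The provider-level transformation, fraud labeling, and provider type matching described in this section can be sketched as below. This is a simplified illustration under several assumptions: grouping uses NPI, provider type, and year only (gender is omitted for brevity), the standard deviation of a single claim line is set to 0.0, the `exclusion_end_year` lookup is a hypothetical stand-in for the LEIE mapping, and non-matching provider types are filtered on the raw label rather than on one-hot columns.

```python
from collections import defaultdict
from statistics import mean, median, stdev

def aggregate_to_provider(claims, field):
    """Collapse claim-level rows to one row per (npi, provider_type, year),
    keeping six summary statistics of the numeric field."""
    groups = defaultdict(list)
    for c in claims:
        groups[(c["npi"], c["provider_type"], c["year"])].append(c[field])
    rows = []
    for (npi, ptype, year), v in groups.items():
        rows.append({
            "npi": npi, "provider_type": ptype, "year": year,
            f"{field}_mean": mean(v), f"{field}_sum": sum(v),
            f"{field}_median": median(v),
            # Sample standard deviation; 0.0 for a single value (assumption).
            f"{field}_std": stdev(v) if len(v) > 1 else 0.0,
            f"{field}_min": min(v), f"{field}_max": max(v),
        })
    return rows

def label_fraud(rows, exclusion_end_year):
    """Mark a provider-year as fraud if it falls within or before the
    provider's exclusion period (npi -> last year of that period)."""
    for r in rows:
        end = exclusion_end_year.get(r["npi"])
        r["exclusion"] = int(end is not None and r["year"] <= end)
    return rows

def keep_matching_types(train_rows, test_rows):
    """Drop provider types that do not appear in both training and test data."""
    shared = ({r["provider_type"] for r in train_rows}
              & {r["provider_type"] for r in test_rows})
    return ([r for r in train_rows if r["provider_type"] in shared],
            [r for r in test_rows if r["provider_type"] in shared])

claims = [
    {"npi": 1, "provider_type": "Cardiology", "year": 2015, "payment": 10},
    {"npi": 1, "provider_type": "Cardiology", "year": 2015, "payment": 20},
    {"npi": 1, "provider_type": "Cardiology", "year": 2015, "payment": 30},
    {"npi": 2, "provider_type": "Physical Therapist", "year": 2015, "payment": 5},
]
train = label_fraud(aggregate_to_provider(claims, "payment"), {1: 2017})
test = [{"npi": 3, "provider_type": "Cardiology", "year": 2016},
        {"npi": 4, "provider_type": "Hospitalist", "year": 2016}]
train, test = keep_matching_types(train, test)
print(train)
print(test)
```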

Severe class imbalance and rarity
Having a large difference between the number of majority and minority class instances can create bias towards the majority class when building machine learning models. This is known as class imbalance [56], which presents issues for machine learning algorithms when attempting to discriminate often complex patterns between classes, particularly when applied to Big Data. Rarity is an exceptionally severe form of class imbalance. In real-world situations, when severe class imbalance and rarity are present, the minority class is generally the class of interest [2]. When employing Big Data in machine learning, severe class imbalance and rarity exhibit a large volume of majority class instances, increased variability, and disjuncts. Small disjuncts are associated with issues such as between- and within-class imbalance [1, 57]. Generally, a learner will provide more accurate results for large disjuncts, which are created based on a large volume of instances. Large disjuncts can overshadow small disjuncts, leading to overfitting and misclassification of the minority class due to the under-representation of subconcepts [58]. Table 4a demonstrates the level of class imbalance present in each Medicare dataset, split by year. We observe that the number and percentage of fraudulent instances matching between the LEIE and the Medicare datasets decrease every year, across each dataset. There are a few possibilities explaining this decrease, including the continued efforts to remove fraudulent physicians from practice, fraudulent physicians more efficiently avoiding detection, law enforcement shifting focus from physician fraud, or the deterrent effect of technological advances in fraud detection. We also note from the labeled Medicare datasets that each year, the non-fraudulent instances generally increase at a faster rate than the fraudulent cases are decreasing. These two observations are pushing the imbalance in fraud instances from severe to rarity. 
Therefore, rarity is an important topic to study in Medicare fraud detection, and in order to study rarity, we generate additional training datasets as shown in Table 4b. All non-fraudulent instances are kept, while we remove a number of fraudulent instances, achieving further levels of severe class imbalance and rarity. The PCCs in these newly generated datasets range from 1000 to 100, based on the original number of fraudulent instances. These PCCs were further chosen based on preliminary results, which demonstrate that they adequately represent class rarity in Big Data. In order to get a thorough representation of fraudulent instances, we generate ten different datasets by re-sampling for each dataset/PCC pair. For example, with regard to the 200 PCC for the Part D dataset, we randomly select 200 instances from the original 1018, with this process repeated ten times. The final result for each dataset/PCC pair is the average score across all ten generated rarity subsets.
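The subset generation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this study: the function name `make_rarity_subsets`, the `(features, label)` representation with label 1 for fraud, and the fixed seed are all hypothetical.

```python
import random


def make_rarity_subsets(instances, pcc, repeats=10, seed=0):
    """Build `repeats` training sets, each keeping all non-fraud instances
    and a random draw of `pcc` fraud instances (without replacement).

    instances: list of (features, label) pairs; label 1 = fraud, 0 = non-fraud.
    """
    rng = random.Random(seed)
    fraud = [x for x in instances if x[1] == 1]
    non_fraud = [x for x in instances if x[1] == 0]
    if pcc > len(fraud):
        raise ValueError("pcc exceeds the number of available fraud instances")
    subsets = []
    for _ in range(repeats):
        kept_fraud = rng.sample(fraud, pcc)  # a fresh random draw per repeat
        subsets.append(non_fraud + kept_fraud)
    return subsets
```

With 1018 fraud instances and `pcc=200`, this yields ten datasets that each contain every non-fraud instance plus 200 randomly chosen fraud instances; a model's score for that dataset/PCC pair would then be averaged over the ten subsets.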
Data sampling is used to minimize the effects caused by severe class imbalance and rarity, and has two main branches: oversampling and undersampling. Oversampling generates new minority instances while undersampling removes majority instances. The goal of data sampling is to adjust the datasets to a given ratio of majority and minority representation. Oversampling has a few disadvantages, including decreased model generalization due to its process of duplicating existing minority class instances [59] and the increased processing time due to these additional instances. For these reasons, and based on our prior research where oversampling has been shown to decrease fraud detection performance [60], we select RUS. The main drawback of RUS is the potential removal of useful information, but it is beneficial when applied to Big Data as removing instances decreases both required computing resources and build time, a choice also supported by [38] and [56]. As we employ RUS, our goal is to incur minimal information loss while simultaneously removing the maximum number of majority instances (i.e. determine which ratio delivers the best fraud detection). Therefore, we chose the following class ratios: 1:99, 10:90, 25:75, 35:65, and 50:50 (minority:majority), including the full, non-sampled datasets as the baseline (labeled as Full). In applying these ratios, we generate ten datasets for each original and generated training dataset, to reduce bias due to poor random draws. These ratios were chosen because they provide a good distribution, ranging from balanced 50:50 to highly imbalanced 1:99 [61]. Note, for Train_Test, when applying RUS or creating the severe class imbalance and rare subsets, only the training datasets are sampled, as they build the model, while test datasets are kept unaltered for model evaluation.
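To make the ratio arithmetic concrete, random undersampling to a target minority:majority ratio could be sketched as below. This is an illustrative sketch only, not the study's implementation: the function name, the `minority_pct` parameter (10 for a 10:90 ratio), and the `(features, label)` representation are assumptions.

```python
import random


def random_undersample(instances, minority_pct, seed=0):
    """Keep every minority (fraud) instance and randomly drop majority
    instances until the minority makes up `minority_pct` percent of the data.

    instances: list of (features, label) pairs; label 1 = fraud, 0 = non-fraud.
    """
    rng = random.Random(seed)
    minority = [x for x in instances if x[1] == 1]
    majority = [x for x in instances if x[1] == 0]
    # For a 10:90 ratio, each minority instance is matched by 90/10 = 9 majority ones.
    target = round(len(minority) * (100 - minority_pct) / minority_pct)
    target = min(target, len(majority))  # cannot keep more than are available
    return minority + rng.sample(majority, target)
```

For instance, with 100 fraud instances a 10:90 ratio keeps 900 non-fraud instances, while 50:50 keeps only 100. Repeating the call with ten different seeds would correspond to the ten sampled datasets per configuration described above.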

Train_Test and Cross-Validation
In this study, we employ the Train_Test evaluation method, which uses a training dataset for building the model, and evaluate this model using a separate, distinct test dataset, as demonstrated in Fig. 1a. Instances from the test dataset are completely new, and never used for model building. Examining the Train_Test method's performance is necessary for assessing whether, based on past occurrences, a model can accurately predict new occurrences. Through our experimentation, Train_Test will determine, based on prior Medicare data (years < 2016), whether a physician can be accurately classified as fraudulent or non-fraudulent given Medicare data from the most recently released 2016 datasets. We are also assessing how rarity affects the Train_Test method's results, which is especially important since known fraudulent instances are decreasing year-over-year. Additionally, we use CV in order to compare results and determine whether CV estimates are similar to results from the Train_Test evaluation method. For CV, we use training datasets that were not altered to match the test datasets, as CV does not employ the test dataset. CV is very popular among the Data Mining and Machine Learning community [62] as an evaluation method for prediction performance in almost every application domain, and can be useful when a researcher only has access to prior data. We are performing this comparison with CV due to its popularity and potential drawbacks versus the Train_Test method. Rao et al. [28] recommend validating CV results with a separate test dataset. They also mention that when a model is tuned by a test dataset, this is no longer an accurate simulation of the real-world event. A few other drawbacks of CV, as found in the literature, are that CV can result in large errors with small sample sizes [63], can introduce error through bias or variance [29,64], and is vulnerable to high levels of variability. 
Therefore, by employing the 2016 test datasets with Train_Test, we are evaluating the viability of CV for providing estimates that lead to model selection in Medicare fraud detection. As mentioned, there are a number of different versions of CV and for this study, we chose stratified k-fold CV with k = 5. k-fold CV evenly splits a dataset into k folds, allowing a learner to be evaluated on a single dataset (the training dataset). The model is then built on k − 1 folds and evaluated on the remaining fold. This is repeated until each fold has been used for evaluation. The results from each repeat are averaged together to give the final result. The process for k-fold CV is demonstrated in Fig. 1b. This ensures that every instance located in the training dataset is used for both model building and evaluation. The process of stratification is important when applying CV to highly imbalanced data, and even more so with rare data, where an unstratified split could result in a fold without a single fraudulent instance. Stratification ensures that all k folds have approximately the same ratio of class representation as the original data. To avoid any bias caused by bad random draws when creating folds, we repeat the CV process 10 times for each learner/dataset pair. The final detection performance score is the average of all 10 CV repeats.
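The fold construction described above can be illustrated with a small, self-contained sketch. It is not the code used in this study (which relied on Spark); the function name `stratified_k_fold` and the round-robin assignment of shuffled class members to folds are assumptions chosen to keep the example short.

```python
import random


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, test_indices) for each of k stratified folds.

    Each class is shuffled and dealt round-robin across the folds, so every
    fold receives approximately the same class ratio as the full dataset.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for held_out in range(k):
        test_idx = sorted(folds[held_out])
        train_idx = sorted(i for f in range(k) if f != held_out for i in folds[f])
        yield train_idx, test_idx
```

With 10 fraud and 990 non-fraud labels and k = 5, every held-out fold contains exactly 2 fraud instances, whereas a purely random split could leave a fold with none. Calling the function with ten different seeds and averaging the per-fold scores would correspond to the repeated CV described above.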

Methods/experimental
In this section, we discuss the learners (machine learning models), as well as the performance metric and significance testing which will be used to evaluate the influence of severe class imbalance and rarity on Medicare fraud detection. Since our Medicare datasets have such a large volume of data, we required a machine learning framework that can handle Big Data. Therefore, we employ Apache Spark [65] on top of a Hadoop [66] YARN cluster for running and validating our models using its MLlib library. Apache Spark is a unified analytics engine capable of handling Big Data, offering dramatically quicker data processing over traditional methods or other approaches using MapReduce. MLlib is a scalable machine learning library built on top of Spark.

Learners
From Apache Spark 2.3.0 [67] MLlib [68], we chose LR [69], and two tree-based models: RF and Gradient Tree Boosting (GBT) [70]. As of this study, Spark's MLlib has eight available classifiers. We chose these three based on preliminary research where other learners, such as Multilayer Perceptron or Naive Bayes, provided relatively worse fraud detection. We used default configurations for each learner, unless noted otherwise. In Appendix B, we provide detailed descriptions for each learner, as well as indicate any configuration modifications.

Performance metric
In order to evaluate the fraud detection performance of each learner, we use the Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) [71,72]. AUC has demonstrated itself quite capable as a metric for quantifying results for machine learning studies employing datasets with class imbalance [73]. AUC shows performance over all decision thresholds, representing the ROC curve as a single value ranging from 0 to 1. An AUC of 1 denotes a classifier with perfect prediction for both the positive and negative classes, 0.5 represents random guessing, and any score under 0.5 means a learner demonstrated predictions worse than random guessing. The ROC curve is a plot comparing false positive rate (1 − specificity) against true positive rate (sensitivity), and is commonly used to visually represent binary classification results. The false positive rate is calculated by $\frac{FP}{FP + TN}$ and measures the number of negative instances (non-fraud) incorrectly classified as positive (fraudulent) in proportion to the total number of instances labeled as negative, also known as the false alarm rate. The true positive rate is calculated by $\frac{TP}{TP + FN}$ and measures the number of positive instances correctly classified as positive in proportion to the total number of instances labeled as positive. This relationship between (1 − specificity) and sensitivity, as portrayed in the ROC curve, illustrates a learner's capability to discriminate between both classes.
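As a worked illustration of these quantities, the sketch below computes the true and false positive rates at a single threshold, and the AUC through its well-known equivalence to the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one (ties counted as one half). The function names and the toy scores are hypothetical, and the pairwise computation is chosen for clarity rather than efficiency; it is not the Spark evaluator used in this study.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: two fraudulent (1) and three non-fraudulent (0) physicians.
y_true = [1, 1, 0, 0, 0]
scores = [0.9, 0.4, 0.5, 0.3, 0.1]
y_pred = [int(s >= 0.45) for s in scores]  # one point on the ROC curve
```

Here five of the six positive/negative pairs are ordered correctly, so the AUC is 5/6, while the single threshold of 0.45 gives a true positive rate of 1/2 and a false positive rate of 1/3.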

Significance testing
We perform hypothesis testing to demonstrate the statistical significance of our AUC results through ANalysis Of VAriance (ANOVA) [74] and Tukey's HSD tests [75]. ANOVA is a statistical test determining whether the means of several groups (or factors) are equal. Tukey's HSD test determines which factor means are significantly different from each other. This test compares all possible pairs of means using a method similar to a t-test, where statistically significant differences are grouped by assigning different letter combinations (e.g. group 'a' is significantly better than group 'b' with respect to the factor being tested). Both ANOVA and the Tukey's tests explore the differences between the following factors: datasets, learners, PCCs, and class ratios.
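For readers less familiar with ANOVA, the one-factor F statistic can be sketched in a few lines. This is an illustrative sketch only: the function name and the AUC-like numbers are made up for the example, and the Tukey's HSD step is omitted because it requires the studentized range distribution, which a statistics package would normally supply.

```python
def one_way_anova_f(groups):
    """Return (F statistic, between-group df, within-group df) for a list of groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual scores around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical AUC scores for two evaluation methods (not results from this study).
method_a = [0.80, 0.82, 0.81]
method_b = [0.70, 0.72, 0.71]
f_stat, df_b, df_w = one_way_anova_f([method_a, method_b])
```

A large F value relative to the F distribution with (df_between, df_within) degrees of freedom indicates that at least one group mean differs, which is the condition under which a follow-up pairwise test such as Tukey's HSD is informative.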

Discussion and results
In this section, we discuss our experimental results, assessing the impacts of rarity on Medicare fraud detection, as well as provide recommendations for practitioners based on these results. Table 5 presents the average AUC scores for Train_Test across PCC, consisting of original class distribution (All) and the selected severe class imbalance and rarity values, split by dataset (sub-tables), learner, and class ratio. The boldfaced values indicate the learner/ratio pair producing the best fraud detection performance per PCC. The effects of class rarity are demonstrated across each dataset, where the boldfaced values decrease as the PCCs decrease, and persist across nearly every learner/ratio pair. We notice that across all datasets, LR frequently presents the best scores, where the only outlier is DMEPOS, with GBT having better results for higher PCCs. Even though DMEPOS demonstrates better results with tree-based learners, we observe that as PCC decreases, LR begins to have better results. We believe that LR's results are due to a more successful strategy for handling class imbalance and rarity through regularization, which penalizes large coefficients (Ridge Regression) minimizing the adversities of noise and overfitting, leading to increased model generalization. Both GBT and RF also employ mechanisms to curtail the effects of noise and overfitting, but appear less robust to class imbalance and especially rarity, for Medicare fraud detection. Among the boldfaced values, we notice the less balanced ratios have the highest scores. Upon closer inspection, the 10:90 ratio most frequently scores higher across PCC/ratio pairs followed closely by 1:99 and Full, especially for the combined dataset. Note that the Full (non-sampled) results indicate good detection performance, again, showing that a good representation of the majority class is beneficial. 
We believe the diminishing results when approaching a balanced configuration are due to the removal of too many negative class instances, deterring the learner's ability to discriminate the details of the non-fraudulent class. The combined dataset has higher scores compared to the individual datasets, as indicated by the boldfaced values. Possible contributing factors are that the combined dataset contains a larger selection of attributes, which facilitates a broader view of physician behavior over the individual parts, and that it only concentrates on providers who submitted claims to all three parts of Medicare. Additionally, we perform this same experiment for Train_CV, and provide the results in Table 8 in Appendix C. Figure 2 presents multiple bar graphs, comparing the differences in average AUC scores between Train_Test and Train_CV across each learner, dataset, PCC, and ratio configuration, where each bar represents the average AUC score for Train_Test minus the average AUC score for Train_CV. These bar graphs demonstrate that Train_Test outperforms Train_CV in almost all cases, the only notable contradiction being for the Part B dataset when employing the tree-based learners, in particular RF. We note that Train_Test had superior results for every configuration when employing LR. This signifies that the models built using previous years (training dataset) and evaluated on a separate, new year (test dataset) provide better fraud detection over applying CV on the training dataset alone. As mentioned above, CV is susceptible to bias and variance, which could contribute to the moderate results compared to Train_Test. We observe that as PCC decreases, becoming more rare, the delta between Train_Test and Train_CV generally increases. We surmise that Train_Test handles imbalanced data and class rarity better due to the models being trained with all available positive class (fraudulent) instances. 
Thus, with Train_Test, there is a decreased chance of overfitting, compared to CV, where models are built using a sub-sample of instances in each fold, bringing the already small PCC even lower for each training dataset. Additionally, we performed hypothesis testing to demonstrate the significance of our results. We used a one-factor ANOVA test for Evaluation Method (Train_Test and Train_CV), and assess significance over learners, datasets, ratios and PCC as shown in Table 9c in Appendix C. Evaluation method was significant at a 95% confidence level, and therefore, we further conducted a Tukey's HSD test, presented in Table 6, to determine the significance between fraud detection results garnered from Train_Test and Train_CV. The Tukey's HSD test placed Train_Test in group 'a' and Train_CV in group 'b', signifying that evaluating a model on a segregated test set provides significantly better results over building and evaluating a model through CV.
Even though the Tukey's HSD test determined the evaluation methods are one group apart, we can argue that CV provides comparable results to Train_Test, albeit conservative. Therefore, we present the following results for both Train_Test and Train_CV in order to provide a thorough claim as to which learner and ratio yield the best results for class imbalance and rarity, as well as provide further insight into comparing these evaluation methods. We perform a 4-factor ANOVA test for both Train_Test and Train_CV, in Table 9a and b (in Appendix C), and evaluate the differences between datasets, learners, class ratios and PCCs. All factors and their interactions are shown as significant, at a 5% significance level. We perform further Tukey's HSD tests for each PCC, and assess any significant differences for learners (across ratios and datasets) and ratios (across learners and datasets), shown in Fig. 3a and b, respectively. We concentrate only on the results for the best scoring group (group 'a'). For these graphs, there can be multiple combinations within a group designation; for example, in Fig. 3a, for the Part D dataset with 200 and 400 PCC, Train_Test and Train_CV with a class ratio of 10:90 are each in group 'a'. The results in Fig. 3a show that LR contains the vast majority of group 'a' members, while Train_Test and Train_CV both have an almost identical distribution with the same number of group 'a' members. Therefore, regardless of model configuration across all PCCs, LR is able to provide the highest levels of discernment between fraudulent and non-fraudulent behavior patterns within each of our Medicare datasets. Figure 3b shows that the less balanced ratios contain the majority of group 'a' membership, with 10:90 having more representation than all other ratios combined. The more balanced ratios have significantly less group 'a' representation, where 50:50 has zero members. 
As seen with the learner results, Train_Test and Train_CV have similar distributions. However, Train_Test handles the more balanced datasets better, which is potentially due to the fact that Train_Test employs the entire training datasets whereas Train_CV splits the data, further reducing the already small number of fraud and non-fraud instances. Overall, from these results, we observe that for PCC, although the rarity experiments have similar group 'a' representation compared to the original class distribution, the overall AUC scores diminish as the level of rarity is increased. The main difference between evaluation methods from the learner and ratio Tukey's tests is that Train_Test generally has higher average AUC scores over comparable configurations. The complete Tukey's HSD results for all configurations for both learners and ratios are listed in Tables 10 and 11 in Appendix C.
In summary, we assessed the effects that class imbalance and rarity have on Medicare claims data, and compared how various machine learning techniques handle detecting fraudulent behavior when being subjected to these effects. Machine learning was able to improve results, with RUS, for the majority of the class imbalance and rarity experiments presented in this work. However, we found that the rarer fraudulent instances become, the less machine learning can effectively discern fraudulent behavior from nonfraudulent behavior. Therefore, if the PCC qualifies a Medicare claims dataset as rare, we recommend that a practitioner gather additional, quality data until there is a sufficient PCC. The Train_Test and Train_CV method, in general, had similar results, but the latter's results were somewhat conservative in comparison. We recommend practitioners

Conclusion
Two significant challenges facing healthcare fraud detection are the large amounts of data generated (Big Data) and the significant imbalance in fraudulent versus non-fraudulent behavior (class imbalance). The combination of these two issues leads to datasets that contain an extremely large volume of negative class instances (non-fraudulent) and very small numbers of positive class instances (fraudulent). In the case of fraud detection, the data is severely imbalanced. We focus our study on three 'Big Data' datasets released by CMS, specifically Part B, Part D, and DMEPOS (individually and combined), as well as the LEIE from the OIG in order to map our real-world fraud labels. We notice that the fraudulent physicians from the LEIE had fewer matching physicians in the Medicare datasets each year since CMS started releasing these datasets. Because of this, we experimented with further severe class imbalance leading into class rarity. We do this by generating additional datasets and randomly removing fraudulent instances in order to determine the effects of increasing rarity on real-world fraud detection performance (Train_Test). In order to minimize the effects of severe class imbalance and rarity, we also employ data sampling, with various class ratios. In applying RUS, we created a new dataset for each original, severe class imbalanced and rare dataset. Detecting fraudulent behavior is the first step towards eliminating, or at least minimizing, fraud in healthcare, which would allow programs such as Medicare to provide medical funding to a larger number of beneficiaries in the United States. Throughout our study, we employ three learners and assess model performance using AUC and significance testing. When utilizing the Train_Test evaluation method for severely imbalanced and rare datasets, we recommend building the model with LR and applying RUS with a 10:90 ratio. We noticed that as ratios approached balance (i.e. 50:50), performance decreased, and as such, determine that larger non-fraudulent representation is beneficial, with 10:90 being optimal. In practice though, a separate test dataset to evaluate a machine learning model may not be available due to a shortage of positive cases or lack of new data, which requires the use of other methods. To address this, we re-ran all experiments with CV, using the training datasets. CV emulates the Train_Test method, providing model generalization and error estimates on a single dataset by sub-setting the dataset into smaller training and test datasets, allowing all instances to be used to both build the model and evaluate performance. We found that Train_Test results were significantly better than CV, but we determine that CV can be a reliable substitute when necessary, provided a practitioner keeps in mind that its results will be conservative. CV also showed similar patterns to Train_Test in terms of observed effects due to severe class imbalance and rarity, as well as the improvement garnered upon applying RUS. Overall, we noticed that prediction performance decreased as the number of fraudulent instances trended towards rarity, and therefore, we recommend that when PCC becomes too small (rare), a practitioner should search for more quality data in order to allow for proper discrimination between fraudulent and non-fraudulent instances when applying machine learning. Future work will consist of employing other Big Data sources from other branches of Medicare or other healthcare programs, incorporating misclassification costs, and determining methods for obtaining more quality real-world fraudulent physicians.

Authors' contributions
MH and RAB performed the primary literature review, experimentation and analysis for this work, and also drafted the manuscript. TMK worked with MH to develop the article's framework and focus. TMK introduced this topic to MH and helped to complete and finalize this work.
All authors read and approved the final manuscript.