Data reduction techniques for highly imbalanced medicare Big Data

Hancock, John T.; Wang, Huanjing; Khoshgoftaar, Taghi M.; Liang, Qianxin

doi:10.1186/s40537-023-00869-3

Research
Open access
Published: 03 January 2024

Data reduction techniques for highly imbalanced medicare Big Data

John T. Hancock¹,
Huanjing Wang²,
Taghi M. Khoshgoftaar¹ &
…
Qianxin Liang¹

Journal of Big Data volume 11, Article number: 8 (2024) Cite this article

2009 Accesses
3 Citations
90 Altmetric
Metrics details

Abstract

In the domain of Medicare insurance fraud detection, handling imbalanced Big Data and high dimensionality remains a significant challenge. This study assesses the combined efficacy of two data reduction techniques: Random Undersampling (RUS), and a novel ensemble supervised feature selection method. The techniques are applied to optimize Machine Learning models for fraud identification in the classification of highly imbalanced Big Medicare Data. Utilizing two datasets from The Centers for Medicare & Medicaid Services (CMS) labeled by the List of Excluded Individuals/Entities (LEIE), our principal contribution lies in empirically demonstrating that data reduction techniques applied to these datasets significantly improves classification performance. The study employs a systematic experimental design to investigate various scenarios, ranging from using each technique in isolation to employing them in combination. The results indicate that a synergistic application of both techniques outperforms models that utilize all available features and data. Moreover, reduction in the number of features leads to more explainable models. Given the enormous financial implications of Medicare fraud, our findings not only offer computational advantages but also significantly enhance the effectiveness of fraud detection systems, thereby having the potential to improve healthcare services.

Introduction

Data reduction techniques for the classification of highly imbalanced Big Data are desirable since they may improve performance, and smaller data sizes generally lead to faster model training times and therefore accelerate research. We systemically investigate the application of Random Undersampling (RUS) and our novel ensemble supervised feature selection technique to the Machine Learning task of Medicare insurance fraud detection. Both feature selection and RUS are data reduction techniques. We perform experiments on two imbalanced Big Medicare Datasets. In the experiments, the data reduction techniques are applied alone, and in combination. The statistical analysis of the experimental outcomes indicates that the data reduction techniques, in combination, yield the best performance in terms of Area Under the Precision Recall Curve (AUPRC) [1]. Furthermore, that performance is significantly better than using all features. We prefer to report results in terms of AUPRC since it is threshold agnostic, and the only other widely used, threshold-agnostic metric, Area Under the Receiver Operating Characteristic Curve (AUC) [2], is shown in previous research to be misleading for evaluating classification of imbalanced Big Data [3, 4]. Our contribution is to show that intelligent data reduction techniques improve the classification of highly imbalanced Big Medicare data.

Medicare is the United States’ public health insurance program. Its mission is to provide insurance for people aged 65 and over. It is important to note that Medicare is sporadically compromised by fraudulent insurance claims. These illicit activities often go undetected, allowing unscrupulous healthcare providers to exploit weaknesses in the system. The Department of Justice managed to reclaim approximately $3 billion dollars from such fraudulent activities in 2019, as cited in their recovery report [5]. However, it is essential to recognize that this figure only represents a fraction of the total monetary loss, the full extent of which remains indeterminate. In 2019, the Centers for Medicare & Medicaid Services (CMS) estimated that improper payments, a category that includes both fraudulent and erroneous payments, amounted to roughly $100 billion [6]. Therefore, automated Medicare fraud detection has the potential to discover more fraudulent activity.

In the application domain of fraud detection, Machine Learning aids in pinpointing the small percentage data related to fraudulent activity in a vast sea of Big Medicare data. Since identification of fraud is the first step in stopping it, Machine Learning techniques may conserve substantial resources for the Medicare system by preventing fraud. In our study, we compile Medicare insurance claims datasets from several sources. The sources originate with the CMS. Furthermore, we label the datasets with the List of Excluded Individuals and Entities (LEIE). The LEIE is provided by of the United States Office of the Inspector General [7].

The performance of a classifier can be swayed by multiple effects. Two factors that can make data more difficult to classify are dimensionality, and class imbalance. Class imbalance in labeled data happens when the overwhelming majority of instances in the dataset have one particular label. This imbalance presents obstacles, as since it is possible for a classifier, optimized for a metric such as accuracy, will mislabel fraudulent activities as non-fraudulent to boost overall scores in terms of the metric. Yet, data reduction techniques provide promising avenues for addressing these challenges in Big Data. These methods, if properly applied to detect and stop Medicare insurance fraud, could substantially elevate the standard of healthcare service. This would be made possible by reducing costs related to fraud.

To address the challenges of imbalanced Big Data and high dimensionality, feature selection and data sampling are often utilized as initial data preparation steps. On one hand, data sampling is adopted to tackle the issue of class imbalance. It entails adjusting the training dataset by adding or subtracting examples to ensure a more even balance between fraudulent and non-fraudulent entries. On the other hand, feature selection, which addresses high dimensionality, focuses on choosing a specific group of attributes from the training data, and only these chosen attributes are used to construct the final model. This not only streamlines the learning process but can also enhance classification accuracy by discarding less relevant attributes.

The data sampling technique we use is RUS. RUS is a straightforward yet potent data sampling technique. It also has the added benefit of data reduction. RUS works by randomly removing samples from the majority class until a specific balance between the minority and majority classes is met. A crucial point to emphasize is that we applied RUS solely to the training datasets. In one set of experiments, we employ RUS to generate datasets with several minority to majority class proportions, such as 1:1, 1:3, 1:9, 1:27, and 1:81. It is also worth mentioning that we conduct experiments where the training datasets are left at their original class ratios. Following this approach, we developed six classification models from the datasets used in this study. This research enabled us to delve deeper into the influence of RUS on the efficiency of models, with an emphasis on how class ratios affect experimental outcomes.

For feature selection, we incorporated a supervised feature selection method based on feature ranking lists. Subsequently, through the implementation of an innovative approach, these lists are combined to yield a conclusive feature ranking. Upon the derivation of this consolidated ranking, features are selected based on their position in the list. Specifically, we select subsets comprising a top number of features, as dictated by their rank. Using these subsets, we build additional classification models. To furnish a benchmark, models were also built utilizing all features of the datasets. This systematic approach granted us a deeper comprehension regarding the interplay between feature selection and model robustness within the context of multiple learning algorithms.

In further experiments, we move on to combine data sampling and feature selection. The order of data sampling and feature selection is important, and different techniques are implemented in order to investigate the effect of applying the data reduction techniques in different orders. Therefore, an additional contribution of this study is a detailed explanation of how the techniques are applied in four distinct scenarios: RUS only, feature selection only, feature selection followed by RUS, and RUS followed by feature selection. There is an additional fifth scenario incorporated in experiments, where we do not apply data reduction techniques as a control. The remainder of our study is organized in the following sections: Related Work, Datasets, Classifiers, Methodology, Results, Statistical Analysis, and Conclusions. Following the conclusions, we include an appendix with a statistical analysis of additional experiments.

Related work

In the context of health insurance fraud detection, this section reviews existing literature that explores the classification of data with high dimensionality which is also highly imbalanced. Techniques such as data sampling and feature selection are commonly employed to mitigate the challenges associated with this specialized type of data. In our review of existing literature, we found an opportunity to contribute a study on the effect of RUS and feature selection. Since our supervised feature selection technique is novel, previous work does not cover it. Moreover, we did not find a study that provides the comprehensive analysis of the impact of feature selection techniques and sampling techniques on the performance of insurance fraud detection with Machine Learning models.

Sateesh et al. present a supervised learning framework for fraud detection in Big Medicare Data [8]. In their study, they use publicly available Medicare claims data from 2012 through 2015. We use data spanning a larger number of years, from 2013 through 2019. Sateesh et al. utilize the same LEIE dataset for labeling their dataset that we use here. They compile a highly imbalanced dataset with over 37 million records. They employ RUS to induce multiple class ratios. Sateesh et al. use only three classifiers: Decision Tree, Logistic Regression and Support Vector Machine, whereas we use six. Furthermore, they employ multiple performance metrics including AUC, false positive rate (FPR) and false negative rate (FNR) to assess experimental outcomes. However, as stated previously, we find AUC to be a misleading metric for evaluating the performance of classifiers on highly imbalanced, Big Data. We prefer not to publish results in terms of FPR and FNR since these are threshold-dependent metrics. Sateesh et al. find that the Decision Tree and Logistic Regression classifiers perform the best overall. Particularly they found that with the 1:4 class ratio, Decision Tree has the lowest FNR, which they deem critical for detecting fraud. We maintain that threshold-agnostic metrics, such as AUPRC, give a more informative picture of overall classifier performance. While Sateesh et al.’s findings are in alignment with ours as far as data reduction techniques improving performance, our study is far more extensive, since we use five ensemble classifiers, whereas they employ three single-learner classifiers. Another key element we found missing from Sateesh et al. is that they do not report results for experiments conducted at the original class ratio, so it is unclear whether their application of RUS yields better results than leaving the data at the original class ratio. More importantly, we consider an additional data reduction technique, feature selection, and we demonstrate a precise methodology for applying the data reduction techniques in a manner that is not covered by Sateesh et al. Our study extends beyond the work done by Sateesh et al. because we also investigate the combination of RUS and feature selection, and we provide a detailed exposition of our experimental methodology for combining data reduction techniques.

Focusing on the task of identifying Medicare fraud, Hancock et al. [4] employed datasets from the CMS, which varied considerably in size, ranging from approximately 12 million to 175 million instances. All the datasets exhibited a pronounced class imbalance, prompting the use of RUS to induce the minority-to-majority class ratios 1:1, 1:3, 1:9, 1:27, and 1:81. In their methodology, five ensemble classifiers were used, and their performance was assessed using both AUC and AUPRC metrics. The study revealed that irrespective of the degree of RUS applied, the classifiers produced high AUC but low AUPRC scores. Hancock et al. emphasized the superiority of the AUPRC metric over AUC for gauging the efficacy of classifiers on imbalanced datasets. However, their findings also indicated that RUS had an adverse impact on AUPRC scores. We believe the improvement in AUPRC scores we find when applying RUS to our data is the result of our data preprocessing step of aggregation, which Hancock et al. do not employ. Another significant difference between our studies is that Hancock et al.’s study did not involve any feature selection methodology. Not only do we cover feature selection, but also we cover RUS, and we explain the various ways RUS and feature selection can be combined. Furthermore, we provide a statistical analysis on the impact of the variations of the techniques on experimental outcomes.

In a study that introduces a novel deep learning architecture for Big Medicare Data fraud detection, Mayaki and Riveill [9] propose MINN-AE. MINN-AE is a specialized Medicare fraud detection model featuring a multiple-input deep neural network supplemented by a Long-short Term Memory autoencoder component. The model’s efficacy was evaluated against nine baseline models, which included traditional Machine Learning algorithms like Logistic Regression and Random Forest, as well as various forms of artificial neural networks. Evaluation metrics employed in the study included geometric mean, precision, AUC, and AUPRC. The authors underscored the significance of the AUPRC metric over the AUC metric, particularly when dealing with highly imbalanced datasets. We were unable to locate discussion of feature selection in Mayaki and Riveill’s study. Hence, Mayaki and Riveill’s study is another example of a study that does not combine feature selection and sampling techniques in the Machine Learning task of fraud detection. In our study, we portray four distinct scenarios for applying data reduction techniques, and two of the scenarios involve multiple data reduction techniques.

Similar to our study, Herland et al. use publicly available data from the CMS, in a study on Medicare fraud detection [10]. Their study employs a diverse set of Medicare data, including Part B data spanning 2012 to 2015, Part D data covering 2013 to 2015, and Medicare Durable Medical Equipment, Devices & Supplies (DMEPOS) [11] data from the same years. In addition to these three datasets, they create a fourth, dubbed the Combined dataset, by merging the aforementioned datasets. Their methodology involves comparing the performance of three classification models. These are Logistic Regression, Random Forest, and Gradient Boosted Trees. The Combined dataset, when paired with a Logistic Regression model, delivers the best performance in fraud detection, according to their results. However, they do not use any of the data reduction techniques we use in this study. Unlike Herland et al., we introduce a novel feature selection technique. Furthermore, their results are exclusively reported in terms of the Area Under the Curve (AUC) metric, a measure we find unsatisfactory for evaluating classifiers on imbalanced Big Data sets. This serves as a key point of departure between our research and that conducted by Herland et al. Finally, our study also expands the range of classifiers used, since we include results from six classifiers, two of which are representatives of the Bagging Family, and three of which are Gradient Boosted Decision Tree implementations.

Lopo and Hartomo [12] compare multiple sampling techniques: RUS, Random Oversampling (ROS), Synthetic Minority Over-sampling Technique (SMOTE) [13], and Instance Hardness Threshold (IHT), for addressing class imbalance in healthcare insurance fraud detection. Using a real-world Indonesian healthcare dataset with over 2 million records and a 6:94 class imbalance ratio, sampling methods are applied to induce 1:1, 3:7 and 1:9 class ratios. Lopo and Hartomo use one classifier, XGBoost, in conjunction with the sampling techniques. Multiple evaluation metrics including AUC, AUPRC are used to evaluate experimental outcomes. Results indicate RUS and ROS perform best with 1:1 distribution, while SMOTE and IHT are more effective at the 3:7 and 1:9 class ratios, respectively. SMOTE at the 3:7 distribution level demonstrates consistently high scores across all metrics. Longer computation times are a trade-off of SMOTE and IHT. Key predictive features identified include costs, diagnoses codes, healthcare service types, gender and disease severity. While providing insights on sampling techniques for imbalanced healthcare data, limitations of a single dataset and classifier indicate opportunities for the research we document here. We use multiple data sources and additional algorithms. In this study we use two datasets, and six learners. Moreover, we investigate the combination of RUS, and feature selection, whereas Lopo and Hartomo only perform experiments with sampling techniques.

In their investigation into the classification of imbalanced Medicare Big Data, Johnson and Khoshgoftaar apply Deep Learning algorithms and evaluate performance using Geometric Mean and Area Under the Curve (AUC) metrics [14]. Employing a dataset of approximately five million instances and increasing levels of RUS, their study finds that performance metrics deteriorate when the minority class exceeds one percent of the training data. Our research diverges from Johnson and Khoshgoftaar’s in several significant ways. Firstly, we use AUPRC for evaluation of experimental outcomes, in contrast to their use of Geometric Mean And AUC. Secondly, Johnson and Khoshgoftaar do not employ feature selection techniques as we do here. Hence, their study does not include a detailed exposition of the methodologies available for combining feature selection and RUS. A third major difference in our studies is the learners we employ. Johnson and Khoshgoftaar employ neural networks, one type of Bagging classifier, Random Forest, and one type of Gradient Boosting Decision Tree classifier. Here, we use three instances of Gradient Boosting Decision Tree classifiers, and two types of Bagging classifiers. These methodological choices create a distinct difference between our study and the research conducted by Johnson and Khoshgoftaar.

Hasanin et al. investigate RUS and feature selection for addressing class imbalance in bioinformatics Big Data [15]. The Evolutionary Computation for Big Data and Learning 2014 (ECBDL’14) competition dataset with approximately 32 million records is used in their study. The dataset is imbalanced, and has numerous features. Feature selection, based on feature importance with a single learner, Random Forest, is used to do feature selection. Here, we use a more sophisticated feature selection technique that involves six learners. Hasanin et al. employ RUS to address class imbalance. Random Forest, Logistic Regression and Gradient Boosted Trees are evaluated using true positive rate (TPR), true negative rate (TNR) and their product. Their study demonstrates RUS with feature selection can effectively address class imbalance and high dimensionality in Big Data, outperforming prior methods on the ECBDL’14 dataset while lowering computational costs. Our study reaches beyond the work of Hasanin et al. not only because of our more sophisticated feature selection technique, but also because of our evaluation of more classifiers from the Bagging and Boosting families of classifiers, as well as our use of two highly imbalanced, Big Medicare Data datasets. A significant difference between our studies is in their application domains. Our study is in the healthcare insurance fraud detection domain, whereas Hasanin et al.’s use data from the protein structure prediction application domain. A more important difference between our studies is the methodologies used. Hasanin et al. employ only one scenario of applying RUS and feature selection. Therefore, it seems more preliminary, and not comprehensive. We investigate four data reduction scenarios, two of which involve RUS and feature selection combined, and we provide results that indicate which scenario yields the best results.

“Explainable machine learning models for Medicare fraud detection” by Hancock et al. [16] is a study which focuses on the use of feature selection to build more explainable models. Feature selection engenders simpler Machine Learning models, which are thus easier to explain. While this study employs feature selection, and is in the same application domain as Hancock et al.’s, this study has a different focus, and provides a different contribution. First, this study is focused on data reduction. Hence, we employ sampling techniques as well as feature selection. Second, we cover the options one faces when designing experiments with multiple data reduction techniques. The study by Hancock et al. does not involve the different methodologies practitioners may use when designing experiments that involve multiple data reduction techniques. We provide a detailed explanation of the methodologies one may employ to use more than one data reduction technique. The statistical analysis we provide is an example of how to compare experimental outcomes involving multiple data reduction techniques. Such an analysis is not provided by Hancock et al. For these reasons, this study is vastly different from Hancock et al.

Our literature review reveals an opportunity to extend the field of research in the application of Machine Learning to the task of Medicare insurance fraud detection. Of the related studies we surveyed, we did not find a study that provides the in-depth coverage of the application of an ensemble feature selection technique. Moreover, we found some studies did not contain results of experiments combining feature selection and RUS. Of the studies that use a combination of RUS and Feature Selection, they do not contain an investigation of the different scenarios of RUS and Feature selection, the effects of applying them singly or combined, and the order of combination. Other studies only made use of one dataset. Many of the studies failed to provide the extensive array of classifiers that we use here. In summary, our review of literature exposed the need for a comprehensive study into the effects of feature selection and sampling techniques in the classification of multiple highly imbalanced Big Data datasets where experimental outcomes are documented in terms of AUPRC, and backed by the statistical analyses to prove the efficacy of the data reduction techniques.

Datasets

In our research, we employ datasets synthesized from two US government agencies, specifically the CMS and the United States Office of Inspector General (OIG). Data from the CMS is constituted by Medicare Health insurance plans, known as Part B Part D. To prepare these data for supervised Machine Learning applications, data aggregation is employed for both Part B and Part D datasets. Our datasets derive from publicly accessible Medicare information for the years 2013 to 2019. These datasets are procured in a comma-separated format from the CMS website. Within the sphere of health insurance, a plan delineates the coverage agreement between the insurer and the insured, specifying the range of treatments and medications that the insurance provider will pay for. Part B is oriented towards covering treatments and procedures, whereas Part D is tailored to cover prescription medications. After this preprocessing phase, the datasets are completed through a labeling process. This additional information is sourced from the OIG’s LEIE [7].

The methodologies for compiling and preprocessing these datasets have been previously detailed in [17]. Before elaborating on the data aggregation and labeling processes, we discuss the Medicare Parts B and D datasets in detail. To bolster our understanding of the attributes present within these datasets, we consult data dictionaries supplied by CMS, which are available in the public domain [18,19,20], and [21].

For our study’s focus on Medicare Part D data, we leverage two distinct datasets. We refer to the first dataset as the “provider-drug-level Part D data”. The CMS has designated this dataset as the Medicare Part D Prescribers – by Provider and Drug dataset [22]. This source offers granular data that captures each unique combination of healthcare provider, prescribed medication, and the year when the medications were prescribed. Contrastingly, the second source, which we call the “provider-level Part D data”, is collected from the Medicare Part D Prescribers – by Provider dataset [23]. This latter dataset provides a broader view, furnishing only one record for each healthcare provider per year. Thus, difference between the provider-drug-level Part D data and the provider-level Part D data lies in their respective degrees of specificity, with the former offering a more detailed account of prescription practices than the latter.

Table 1 Provider-drug-level part D base features, descriptions from [20]

Data reduction techniques for highly imbalanced medicare Big Data

Abstract

Introduction

Related work

Datasets

Classifiers

Methodology

Results

Part D scenario one: RUS only

Part D scenario two: feature selection only

Part D scenario 3: feature Selection, then RUS 1:81

Part D scenario four: RUS 1:81 then feature selection

Part B scenario one: RUS only

Part B scenario two: feature selection only

Part B scenario three: feature selection then RUS 1:81

Part B scenario four: RUS 1:81 then feature selection

Statistical analysis

Part D scenario one: RUS only

Part D scenario two: feature selection Only

Part D scenario 3: feature selection, then RUS 1:81

Part D Scenario Four: RUS 1:81 then feature selection

Part D, inter-scenario analysis

Part B statistical analysis

Conclusions

Availability of data and materials

Abbreviations

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Appendices

Appendix A statistical analysis of experiments with part B Data

Part B scenario 2: feature selection only

Part B scenario three, feature selection, then RUS

Part B, scenario four, RUS 1:81, then feature selection

Part B inter-scenario analysis

Rights and permissions

About this article

Cite this article

Share this article

Keywords