
Explainable machine learning models for Medicare fraud detection

Abstract

As a means of building explainable machine learning models for Big Data, we apply a novel ensemble supervised feature selection technique. The technique is applied to publicly available insurance claims data from the United States public health insurance program, Medicare. We approach Medicare insurance fraud detection as a supervised machine learning task of anomaly detection through the classification of highly imbalanced Big Data. Our objectives for feature selection are to increase efficiency in model training, and to develop more explainable machine learning models for fraud detection. Using two Big Data datasets derived from two different sources of insurance claims data, we demonstrate how our feature selection technique reduces the dimensionality of the datasets by approximately 87.5% without compromising performance. Moreover, the reduction in dimensionality results in machine learning models that are easier to explain, and less prone to overfitting. Therefore, our primary contribution, the exposition of our novel feature selection technique, leads to a further contribution in the application domain of automated Medicare insurance fraud detection. We utilize our feature selection technique to provide an explanation of our fraud detection models in terms of the definitions of the selected features. The ensemble supervised feature selection technique we present is flexible in that any collection of machine learning algorithms that maintain a list of feature importance values may be used. Therefore, researchers may easily employ variations of the technique we present.

Introduction

Highly dimensional Big Data can be a challenge to work with, since brute-force approaches are not practical. For example, in the process of feature selection, with smaller datasets that have a few attributes, one may simply try all possible combinations of features, build a model with each combination, and select the model that performs the best. This approach quickly becomes impractical as the number of features in the dataset grows. Put another way, high dimensionality may cause issues with machine learning model performance such as a diminished capacity to generalize and longer training times. Hence, feature selection techniques are a popular topic of research in machine learning for Big Data application domains. For more information on feature selection techniques, please see [1]. Another benefit of applying feature selection to the modeling process is that it yields more explainable models. Models with fewer features are easier to explain, since, at the very least, there are fewer factors to consider when theorizing about their effect on the dependent variable. In this study, we describe and apply an ensemble of supervised feature selection techniques. To the best of our knowledge, we are the first to describe this feature selection technique, and we are certainly the first to apply it to the Medicare insurance fraud detection application domain. We show that our technique outperforms the baseline scenario of building models that use all features.

We apply our feature selection technique to build machine learning models for the automated detection of Medicare insurance fraud. Medicare is a public health insurance program in the United States. It is primarily tasked with insuring individuals aged 65 and older. The Centers for Medicare and Medicaid Services (CMS) [2] is the institution responsible for overseeing the Medicare program. The CMS encourages research by maintaining publicly accessible repositories of Medicare insurance claims data. Records in the claims data have dozens of attributes. Therefore, feature selection is an appropriate subject of research that involves this data.

The principal sources of data for our investigation are two Medicare plans, Medicare Part B, which covers treatments and procedures, and Medicare Part D, which covers prescription medications. The size and the rate at which these datasets grow reflect the flow of insurance claims submitted to CMS by healthcare providers. At present, fraudulent claims can go unnoticed, enabling dishonest providers to exploit the system. Even if only a small percentage of claims are fraudulent, the substantial volume of claims still translates to large amounts of money. In 2019, approximately three billion dollars were reclaimed from fraudulent activities by the Department of Justice [3]. However, the total amount lost to fraud remains uncertain, as the CMS reports “improper payments,” a category that encompasses both fraudulent and mistaken payments [4]. In 2019, the CMS reported about $100 billion in improper payments. For more details on the characteristics of criminal Medicare fraud activity, please see [5].

Not only is the data provided by CMS highly dimensional, in the sense that it has many attributes, but it also contains many records, since millions of records are added annually. An effective feature selection technique is therefore highly desirable, since it reduces the overall size of the data that must be processed.

Automated, reliable fraud detection could help the CMS estimate the proportion of improper payments due to fraud, providing a solid foundation for law enforcement to recuperate stolen funds. Our interest in the Medicare fraud detection application domain is focused on employing machine learning techniques to detect Medicare fraud, contributing towards the overall aim of automated Medicare fraud detection. Enhancing fraud detection capabilities could lead to the more efficient use of government funds, and potentially lower taxes as a result of reduced program costs. At the same time, we would like to be able to prove that the fraud detection process is fair, something that can only be accomplished with explainable machine learning models. Models with fewer features are more explainable. Therefore, the feature selection technique we present here is a method for building more explainable models.

In order to show our technique is viable and should be the subject of ongoing research, we compare the performance of models built after applying our feature selection technique to the performance of models built with all features of the datasets. We show that models built on datasets to which the feature selection technique has been applied outperform models built with all features. Therefore, this exposition of our feature selection technique constitutes a contribution to the field of machine learning applied to highly imbalanced Big Data that future researchers can employ to reduce the size of the datasets they work with and achieve more explainable models. In order to show the effectiveness and benefits of our feature selection technique, we have devised a study with the following sections: a survey of related work, a discussion of the datasets used, a discussion of the machine learning algorithms used, a description of our experimental methodology, including the feature selection technique, presentation of results, statistical analysis, and conclusions.

Related work

We explore the extensive research conducted in the field of machine learning on the subject of feature selection, with an emphasis on their applications in fraud detection. Although the body of work specifically addressing Medicare fraud detection is somewhat limited, we have incorporated studies examining other forms of fraud detection for a comprehensive understanding. The methodologies employed in these studies on fraud detection apply to the task of Medicare fraud detection. Although there are related studies, we find our study stands apart for its exposition of a novel feature selection technique, and its use of statistical analysis to prove the benefits of applying the technique.

Mayaki and Riveill [6] compile CMS data from the years 2017 to 2019, and label it with the List of Excluded Individuals and Entities (LEIE) [7] to form a dataset for supervised machine learning. They then build a model for detecting Medicare fraud in their dataset, which they named “Multiple Inputs Neural Network Auto-Encoder” (MINN-AE). The auto-encoder component of MINN-AE is a Long-Short Term Memory (LSTM). Auto-encoders have been successfully employed in other application domains involving the classification of highly imbalanced data [8]. Mayaki and Riveill measure the effectiveness of MINN-AE against Logistic Regression, Random Forest, Gradient Boosting, XGBoost, and five other artificial neural network models. The evaluation metrics reported in the study include precision, AUC, Area Under the Precision Recall Curve (AUPRC) [9], and geometric mean. In alignment with our views on AUC versus AUPRC, Mayaki and Riveill write about the usefulness of AUPRC when compared to AUC for datasets that are highly imbalanced. Their findings demonstrate that MINN-AE surpassed the performance of the other nine models. Despite this, the authors neglected to provide detailed information about the experimental dataset, including the count of instances and features. Moreover, unlike our study, Mayaki and Riveill do not discuss a feature selection technique.

Waspada et al. [10] apply a supervised feature selection technique, Random Forest, to rank the features in the Kaggle Credit Card Fraud Detection Dataset [11]. This dataset is similar to the Medicare Part B and Part D datasets that we use here because it is highly imbalanced. They carry out experiments with many factors. In the combination that yields the best results, they report that the top five features yield the best performance. Waspada et al. report results in terms of several metrics, including AUPRC. We agree with their conclusion that AUPRC is a more reliable metric for evaluating the classification of highly imbalanced Big Data. However, in the multifactorial experiments that Waspada et al. conduct, we do not find a report of a statistical analysis performed to determine the significance of the effects of the factors. In our study, we conduct a statistical analysis of the impact of our feature selection technique on experimental outcomes to give a clear idea of how varying the number of features used in experiments affects classification scores. Similar to Waspada et al., we use a supervised feature selection technique. However, ours is an ensemble feature selection technique, and we present a novel method for combining multiple supervised feature selection techniques. For an overview of ensemble feature selection methods, please refer to [12].

In their research, Sailaja et al. [13] perform experiments with CMS Medicare data from 2012–2015. They label the data with the LEIE. From the CMS and LEIE data, they compile a dataset with 37,147,213 instances and nine features. Their final dataset contains 3,331 instances marked as fraudulent. Therefore, their dataset exhibits a high class imbalance, with 0.008967% of instances being fraudulent. The machine learning algorithms Sailaja et al. use are Decision Tree, Support Vector Machine (SVM), and Logistic Regression. They report that Decision Tree outperformed both SVM and Logistic Regression models in overall effectiveness. Moreover, it was found that Decision Tree worked best with an 80:20 class distribution when Random Undersampling was applied to address the class imbalance in their dataset. Sailaja et al. use Area Under the Receiver Operating Characteristic Curve (AUC) [14] as the metric to evaluate the results of their experiments. Over the course of our research, we have found that AUC is a misleading metric for evaluating the results of classification experiments involving highly imbalanced Big Data [10].

Gupta et al. [15] conducted a comparative study aimed at detecting fraudulent cases in Indian health insurance claims data. They observed that certain features within the dataset showed strong correlations with each other. To tackle this, they removed one feature from each correlated pair and used the remaining subset for fraud detection. Gupta et al.’s feature selection technique is not an ensemble technique, such as the one we present in our study. While removing correlated features is a sensible step for preparing a dataset for classification, calculating feature correlation does not provide a ranking of the features that enables one to intelligently remove features from a dataset. Our study provides an exposition of an ensemble feature selection technique, which is flexible and extensible in the sense that any technique that provides a ranking of features can be incorporated in it. Furthermore, since it provides an ordered list of features as a result, one may leverage it to control the size of the feature set for experiments. Feature correlation does not provide a sense of what features may be removed without impacting performance, whereas a ranking informs one of which features are less important, and hence may be discarded without negatively affecting classification scores.

Herland et al. [16] focus on the detection of Medicare fraud with the use of datasets derived from the CMS’s publicly available data. They use Medicare Physician & Other Practitioners—by Provider and Service (Part B) [17] data from the years 2012–2015, Medicare Part D Prescribers—by Provider and Drug (Part D) [18] data from the years 2013–2015, and data from a third part of Medicare, known as Medicare Durable Medical Equipment, Devices & Supplies—by Referring Provider and Service (DMEPOS) [19], from the years 2013–2015. Furthermore, Herland et al. compile a fourth dataset, known as the Combined dataset, by merging the Part B, Part D, and DMEPOS datasets. All four of the datasets are documented as highly imbalanced in their study. They provide details on the data processing methods for each of the four datasets, and they demonstrate how to label the datasets with the LEIE. Herland et al. build classification models with Logistic Regression, Random Forest, and Gradient Boosting classifiers for all four datasets. Their results show that the Combined dataset, when used with Logistic Regression, yields the best overall performance in detecting fraud. A key distinguishing factor between our studies is that Herland et al. do not employ feature selection techniques. Moreover, we only find results reported in terms of AUC, which, as mentioned previously, we find to be a misleading metric for classifying imbalanced Big Data.

Our review of related work leads us to the conclusion that the documentation of our work in the form of a study represents a contribution. We are the first to present the application of a new feature selection technique in the Medicare insurance fraud detection application domain. Moreover, we are the first to offer an explanation of our model’s results in terms of the reduced feature set that results from applying our ensemble supervised feature selection technique.

Datasets

The datasets in this study are compiled from information provided by the CMS and the United States Office of Inspector General (OIG). First, we discuss characteristics of the CMS Medicare data. Next, we discuss how we aggregate it, as a preprocessing step. Later we describe the labeling process, which involves the data from the OIG. We utilize two primary sources for data from the CMS in our study. Both are Medicare health insurance plans. One plan is known as Part D, and the other is known as Part B. In the context of health insurance, a plan is simply the agreement between the insurer and the insured as to what things are covered under the insurance policy. Part D covers prescription medications, and Part B covers treatments and procedures. CMS makes different raw data available for the two programs; however, we use the same technique to compile both sources into datasets suitable for supervised machine learning. This technique also involves a third source of data, from the United States Office of Inspector General, which we use for labeling. The third source is the List of Excluded Individuals and Entities (LEIE) [7]. The datasets used in this study are compiled in the manner described in [20].

We obtain the Part D and Part B data from online sources. The CMS website offers a user interface as well as an application programming interface for examining these datasets and carrying out rudimentary exploration of the data. We acquired the Medicare data from the CMS site as comma-separated values files, which form the basis for our datasets. The datasets are publicly available for download. We use data spanning the years 2013–2019. The CMS provides supplementary documentation that explains the Medicare data. We utilize publicly available, CMS-provided methodology documents detailing their data gathering and processing methods. We also use CMS-provided data dictionaries that explain all accessible attributes [21,22,23,24].

We use two sources for the Part D data. The first is Medicare Part D Prescribers—by Provider and Drug [18], and the second is Medicare Part D Prescribers—by Provider [25]. The key difference between the two sources is the level of specificity of the data. The first source, Medicare Part D Prescribers—by Provider and Drug has a record for every combination of health care provider, medication that the health care provider prescribes, and year. We refer to this as the “provider-drug-level Part D data”. The second source, Medicare Part D Prescribers—by Provider is less specific. It contains a record for each provider for each year. We refer to this as the “provider-level Part D data”.

The provider-drug-level Part D data has 22 attributes. Not all of these attributes are relevant for machine learning; attributes related to the provider’s name and address are examples. We eschew these attributes since they could form a unique identifier that a machine learning model could memorize instead of properly generalizing the data. We retain one identifier, the provider’s national provider identifier (NPI), which we use later for labeling purposes. The provider-drug-level Part D data has two categorical features that identify the type of medication prescribed. During the aggregation phase of our dataset compilation, we discard these categorical features. Another feature that is ultimately discarded, but useful for processing, is the year in which the claim was made. We use this feature for aggregating the Part D provider-drug-level data by year, but we do not use it as an attribute for supervised machine learning. We retain a categorical feature for the provider type, since it is at the provider level and therefore survives aggregation. Numeric features in the provider-drug-level Part D data are readily useful for supervised machine learning. These include data on the total volume and frequency of prescriptions a provider submits claims for, the number of patients, as well as the total cost of the claims. Furthermore, there are similar, additional features for patients aged 65 and over. There are approximately 174 million records in the collection of provider-drug-level Part D data files.

The provider-level Part D data contains 51 additional attributes pertaining to the claims the provider submits to Medicare, across all the medications the provider prescribes for the year. They are listed in Table 1. The feature descriptions we provide here are from the provider-level Part D data dictionary [24]. The provider-level Part D data has ten features of summary statistics about the beneficiaries of the claims the provider submits. There is also an average beneficiary risk score. The score is calculated with a model that adjusts risk based on hierarchical condition categories (HCC). As per CMS’s methodology, beneficiaries possessing risk scores higher than the average HCC score of 1.08 are projected to have Medicare spending that exceeds the average. The provider-level Part D data also has features for the total number of claims, total number of 30-day prescription orders, total drug cost, total days’ supply dispensed, and the total number of beneficiaries seen, in the form of subtotals within various categories of claims. The categories are Low-Income Subsidy (LIS) claims, Medicare Advantage Prescription Drug Plan (MAPD) coverage claims, and Medicare Prescription Drug Plan (PDP) claims. The statistics are also divided by several drug categories, including claims for opiate drugs, long-acting (LA) opiate drugs, antibiotic drugs, and anti-psychotic drugs.

Table 1 Provider-level Part D features, descriptions copied from [24]

We also have two sources for the Part B data, which differ in specificity just as the sources of the Part D data do. The first source of the Part B data, which we refer to as the “provider-service-level Part B data”, is Medicare Physician & Other Practitioners—by Provider and Service [17]. The second source, which we refer to as the “provider-level Part B data”, is Medicare Physician & Other Practitioners—by Provider [22]. The provider-service-level Part B data contains one record, for each year, for each specific treatment or procedure that the provider submits claims to Medicare for. The provider-level Part B data has records that contain information about the provider’s activity over all claims for all treatments and procedures for a year.

The provider-service-level Part B data has 29 features, but as is the case with the Part D data, many of the attributes are provider demographic data, which we do not use for modeling purposes. It also has categorical data that describes the treatment or procedure rendered to the patient, which we do not include in the final datasets, since we aggregate at the provider level. On the other hand, we utilize categorical features at the provider level for place of service, provider gender, and provider type. The provider-service-level Part B data also includes numeric data on claims submitted for treatments and procedures. There are features for the total number of times the service is provided, the total number of patients the service is provided to, the daily number of beneficiaries treated, the average amount the provider charges for the service, and data on the average amounts Medicare pays for the service. The provider-service-level Part B data has approximately 68 million records.

The provider-level Part B data has 47 features. They are listed in Table 2. The simplest group of attributes in the provider-level Part B data has data on the total number of patients treated, by age. There are four such features: one for patients under 65, one for patients aged 65–74, one for patients aged 75–84, and one for patients aged 85 and over. There is a final age-related feature for the average age of the provider’s patients. There are seven features which pertain to claims the provider sends to Medicare for all treatments and procedures that the provider renders to their patients for the year. These features contain data on the total number of distinct procedures, total number of patients treated, and the sum of money the provider has billed Medicare for, for the year. The seven features also include the amount Medicare is allowed to pay a provider, and the amount Medicare actually paid the provider. Finally, there is a feature for a standardized total payment amount. The standardized total Medicare payment amount is calculated by adjusting for geographic differences in costs, making it easier to compare prices across regions. The provider-level Part B data has another 14 features that are the result of splitting these seven features into two groups: one group for medication-related services, and the second for all other services. There are features for the total numbers of male and female patients. The provider-level Part B data has another 18 features for the percentages of patients with certain chronic conditions. Alzheimer’s, kidney disease, and asthma are all examples of chronic conditions. There is also a feature for the HCC risk score that has the same definition as in the provider-level Part D data.

Table 2 Provider level Part B features, descriptions copied from [21]

For both the Part B and Part D data, our approach is to aggregate data from the first source at the drug/service level, and then enrich the aggregated data with the data from the second source that is at the provider level. For the provider-service-level Part B data, we retain the features for the provider type, place of service, and provider gender, since they are provider-level features. We discard features related to specific services, such as the Healthcare Common Procedure Coding System (HCPCS) code. We then take the remaining numeric features and for each of them, we calculate six summary statistics, sum, mean, median, minimum, maximum, and standard deviation, using all records for the provider, for the year. The final base features for the Part D data are listed in Table 3. For every one of the features listed in Table 3, except Prscbr_Type, there are six features calculated from summary statistics. Table 4 is similar to Table 3, but for the Part B data. For the features listed in Table 4, there are six features of descriptive statistics in the actual dataset, except for the Rndrng_Prvdr_Type, Place_Of_Srvc, and Rndrng_Prvdr_Gndr features.

Table 3 Prescription-level Part D base features, descriptions copied from [23]
Table 4 Service-level Part B base features, descriptions copied from [26]

We aggregate the provider-drug-level Part D data almost exactly as we do the provider-service-level Part B data. The provider-drug-level Part D data does not have attributes for the provider gender, or place of service. However, it has a feature for provider type, so we retain that since it is at the provider level. We then utilize the numeric features in the same manner as for the provider-service-level Part B data, calculating the six descriptive statistics for each numeric feature, at the level of every provider for the year. Aggregation reduces the size of the Part B and Part D datasets. The aggregated Part D data has approximately 6.3 million instances, and the aggregated Part B data has approximately 8.7 million instances.
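As a rough illustration of this aggregation step, the following sketch assumes the raw drug-level or service-level records are held in a pandas DataFrame; the column names ("npi", "year", and the entries of numeric_cols) are placeholders rather than the exact CMS field names.

```python
import pandas as pd

# The six summary statistics computed for every numeric feature.
SUMMARY_STATS = ["sum", "mean", "median", "min", "max", "std"]

def aggregate_to_provider_level(records: pd.DataFrame, numeric_cols: list) -> pd.DataFrame:
    """Collapse drug/service-level records to one row per provider per year."""
    agg = records.groupby(["npi", "year"])[numeric_cols].agg(SUMMARY_STATS)
    # Flatten the MultiIndex columns, e.g. ("tot_clms", "mean") -> "tot_clms_mean".
    agg.columns = [f"{col}_{stat}" for col, stat in agg.columns]
    return agg.reset_index()
```

Provider-level categorical features, such as the provider type, are constant within a provider-year group and can simply be carried along with the aggregated rows.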

Next we enrich both the Part D and the Part B data by joining them to the provider-level datasets. This means we join the aggregated Part D data to the provider-level Part D data by NPI, and we join the aggregated Part B data to the provider-level Part B data, also by NPI. The joining process yields an unlabeled dataset. “Joining data by NPI” means that we add a record to our unlabeled Part D dataset if, and only if, a record exists in the aggregated Part D data and the provider-level Part D data with the same NPI and year values. The same rule applies for our unlabeled Part B dataset as well. The unlabeled Part B dataset is exactly the same size as the aggregated Part B dataset. However, the unlabeled Part D dataset has about one million fewer records than the aggregated Part D dataset. This is due to a lack of provider-level Part D records for some providers and some years. The unlabeled Part B dataset has 82 attributes, and the unlabeled Part D dataset has 80 attributes.
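Continuing the sketch above, the enrichment step amounts to an inner join on NPI and year: a record is kept only when the provider appears in both the aggregated data and the provider-level data for the same year. The DataFrame names are placeholders.

```python
# Keep a record only when both sources contain the same NPI and year.
unlabeled_part_d = aggregated_part_d.merge(
    provider_level_part_d, on=["npi", "year"], how="inner"
)
```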

As a last step in compiling the Part B and Part D datasets, we label them with records from the LEIE. We adopt a uniform approach to labeling the two datasets. The LEIE is released on a monthly schedule by the OIG. When a healthcare provider is listed in the LEIE, it may denote a conviction for actions that prohibit them from filing insurance claims with Medicare. The LEIE, Part B, and Part D data sources all share a common element, the NPI. There are different types of exclusions that apply to providers that are on the LEIE. We use the same exclusion types as fraud indicators that Bauder and Khoshgoftaar use in [27]. When a provider is listed in the LEIE for any of these types of exclusions, we mark all associated records dated before the end of the exclusion period as fraudulent.

The data from Part B and Part D covers entire calendar years, while exclusion periods end at specific months. For this reason, we approximate the end of an exclusion period to the closest year. When an exclusion period concludes, the healthcare provider in question is expunged from the LEIE, allowing them to once again file claims with Medicare. Consequently, any data related to claims made by the provider after the expiration of the exclusion period is considered non-fraudulent. A dataset compiled using only the current LEIE will therefore miss records of providers whose exclusions have ended, since the current LEIE no longer lists them. As such, tools like the Internet Archive should be utilized to access earlier versions of the LEIE and label the older records in the CMS data accordingly. Table 5 summarizes the outcome of the labeling process for the Part B and Part D datasets.
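The labeling rule can be sketched roughly as follows, assuming claims holds one row per provider per year and leie holds LEIE entries with an NPI, an exclusion type, and an exclusion end date; the exclusion-type codes and column names are placeholders, not the exact values used in the study.

```python
import pandas as pd

FRAUD_EXCLUSION_TYPES = {"1128a1", "1128a2", "1128a3"}  # placeholder codes

def label_with_leie(claims: pd.DataFrame, leie: pd.DataFrame) -> pd.DataFrame:
    fraud = leie[leie["excltype"].isin(FRAUD_EXCLUSION_TYPES)].copy()
    end = pd.to_datetime(fraud["exclusion_end_date"])
    # Round the exclusion end date to the closest calendar year.
    fraud["end_year"] = end.dt.year + (end.dt.month >= 7).astype(int)
    # If a provider has several qualifying exclusions, keep the latest end year.
    fraud = fraud.groupby("npi", as_index=False)["end_year"].max()
    labeled = claims.merge(fraud, on="npi", how="left")
    # Records dated before the (rounded) end of the exclusion period are fraudulent.
    labeled["fraud"] = (labeled["year"] < labeled["end_year"]).astype(int)
    return labeled.drop(columns="end_year")
```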

Table 5 Summary of Part B and Part D datasets

Algorithms

For reproducible results, we utilize five well-known, open-source ensemble learning methods for our classification tasks, and one linear algorithm. As stated above, the ensemble algorithms are XGBoost [28], LightGBM [29], Extremely Randomized Trees (ET) [30], Random Forest [31], and CatBoost [32]. The linear algorithm is Logistic Regression [33]. For feature selection, we use the five ensemble methods, as well as Decision Tree [34]. Here we review the key attributes of all machine learning algorithms used in this study.

The first algorithm we discuss is Decision Tree, since all the ensemble techniques we use rely on it, and it is used in our ensemble feature selection technique as well. More specifically, we use Decision Trees for binary classification. A Decision Tree is built in an iterative fashion. To start, the Decision Tree consists of a single rule that separates the dataset into two classes by comparing the value of a single attribute to a threshold value, known as a “split”. We can visualize the rule as a node, with two edges emanating from it. One edge ends at a node that represents a classification of the instance as a member of one class, and the other edge ends at a node that represents the alternative classification. Nodes that represent classifications are called leaf nodes. As a Decision Tree is built, one repeatedly splits the paths to the leaf nodes by adding more rules. This can be visualized as adding more steps along additional paths to the leaf nodes, and leaf nodes being duplicated to accommodate the increasing number of paths. Decision rules are added based on their capacity to segregate the data into subsets that exhibit greater homogeneity in relation to the target variable. Put another way, those features that lead to splits significantly enhancing the reduction of impurity (for instance, via Gini impurity or entropy measures) are deemed more critical for models based on Decision Trees.
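To make impurity-based feature importance concrete, the following is a small, self-contained illustration on synthetic data (not the study's data or code); features whose splits reduce Gini impurity the most receive the largest importance values.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic dataset with only three informative features.
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           random_state=0)
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)

# Report the five most important features according to total impurity reduction.
for idx in tree.feature_importances_.argsort()[::-1][:5]:
    print(f"feature {idx}: importance {tree.feature_importances_[idx]:.3f}")
```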

The ensemble algorithms in our study belong to two families: Bagging and Gradient Boosted Decision Trees (GBDTs). ET and Random Forest are part of the Bagging family, while CatBoost, XGBoost, and LightGBM are representatives of the Gradient Boosted Decision Trees family. The rationale behind using methods that incorporate diverse underlying techniques is to demonstrate that our findings are not confined to a single algorithmic type. Both Bagging and Gradient Boosted Decision Trees embody distinct strategies for leveraging a set of learners for classification tasks. We employ Logistic Regression as a baseline to determine whether the computing overhead of the ensemble techniques yields better performance. Here we provide an overview of these algorithms.

The GBDT classifiers in our study are derived from the Gradient Boosted Machine algorithm, first introduced by Friedman [35]. Previous research shows GBDTs yield successful results in the Medicare fraud detection application domain [36]. Friedman’s technique is an ensemble technique. It proceeds in an iterative fashion. This method starts with one learner that provides an initial set of predictions \(\hat{\textbf{y}}\) for the dependent variable \(\textbf{y}\). The discrepancies, or residuals, between these predictions \(\hat{\textbf{y}}\) and the actual values \(\textbf{y}\) form a vector \(\textbf{y}-\hat{\textbf{y}}\). This vector can be regarded as a new dependent variable which we can estimate using the original independent variables with a second learner. Subsequently, the combined output of the two models will offer a more precise prediction of the dependent variable compared to the first model’s outputs. This process can be extended with the addition of more learners to the ensemble, with each new learner trained to predict the residuals of the current ensemble. Consequently, every additional learner enhances the estimation of the dependent variable. XGBoost, LightGBM, and CatBoost are all advanced adaptations of Friedman’s initial idea. Since they all utilize Decision Trees, we refer to them as Gradient Boosted Decision Trees (GBDTs).
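The residual-fitting idea can be shown with a toy sketch using squared-error loss, a fixed learning rate, and synthetic data; this illustrates the general boosting scheme rather than the internals of any particular GBDT library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=500)

prediction = np.full_like(y, y.mean())   # initial learner: predict the mean
trees, learning_rate = [], 0.1
for _ in range(50):
    residuals = y - prediction           # residuals become the new target
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print("final training MSE:", np.mean((y - prediction) ** 2))
```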

XGBoost was the earliest of the three GBDTs we use. It was released by Chen and Guestrin in 2016. Earlier research proves XGBoost is a good choice of classifier for classifying imbalanced Big Data [37]. XGBoost adds several features to the standard GBDT technique. XGBoost incorporates an improved loss function during the training phase, which includes an additional regularization term to mitigate overfitting. XGBoost enhances the calculation of splits in the ensemble of Decision Trees that it uses. Chen and Guestrin introduced an “approximate algorithm” that estimates optimal split values. The approximate algorithm is useful in situations where the complete dataset cannot fit into main memory. The approximate algorithm is also advantageous in distributed environments. Finally, XGBoost brings another advancement in the form of an algorithm for determining splits that works effectively with sparse data. Sparse data typically exhibits near-constant values with sporadic deviations. XGBoost’s “sparsity aware split finding” feature allows it to effectively exploit sparse data to efficiently construct its constituent Decision Trees.

LightGBM is the second GBDT used in our study. Ke et al. published the foundational paper on LightGBM in 2017 [29]. Their aim was to develop a GBDT implementation that offered performance on par with XGBoost, but with lower resource consumption. To realize this goal, Ke et al. introduced two key improvements to the GBDT approach. The first is Exclusive Feature Bundling (EFB). EFB is a method for reducing the dimensions of a dataset by merging two attributes into one. It proves particularly effective for sparse data. If two attributes of a dataset display sparsity and the infrequent values of both attributes occur in mutually exclusive rows, they can be safely merged into a single feature with minimal loss of information. EFB lowers the number of the dataset’s dimensions, subsequently reducing training time. The second improvement LightGBM brings to the GBDT method is known as Gradient-based One-Side Sampling (GOSS). GOSS is a strategy for sensibly reducing the number of training instances when fitting Decision Trees. It chooses instances for training based on how much they add to the loss function. The loss function must be an aggregate function, which is computed as part of fitting the GBDT ensemble to the training data. If an instance contributes greater than an adjustable threshold value to the model’s loss, it is retained for further iterations of the fitting process. Conversely, instances that contribute less than the threshold value are discarded. Through GOSS and EFB, Ke et al. have crafted a GBDT implementation that requires fewer computational resources.

One more GBDT implementation used in our study is CatBoost, introduced by Prokhorenkova et al. in 2018 [32]. A recent study on CatBoost’s applications across various domains can be found in [38]. One objective expressed in the seminal paper on CatBoost is the mitigation of overfitting, with two strategies employed to meet the objective. The first strategy, called Ordered Boosting, pertains to the selection of training instances used to fit Decision Trees in CatBoost’s ensemble. In order to add a Decision Tree to the GBDT ensemble, there are two steps: first, fitting the candidate Decision Trees to the dependent variable in the training data, and then, evaluating these trees to select the one that most effectively enhances the ensemble’s overall performance. In Ordered Boosting, it is ensured that the training instances used to fit the Decision Tree are not used to evaluate it for inclusion into the ensemble. This prevents the ensemble from overfitting to the training data.

The second overfitting protection in CatBoost involves its Ordered Target Statistics method for encoding categorical features. Ordered Target Statistics stems from the idea of target encoding. To use target encoding for a categorical feature, the encoded value of a feature is set to the mean value of the dependent variable that it co-occurs with. However, this encoding strategy could lead to what Prokhorenkova et al. refer to as “target leakage”. To explain target leakage, we consider the scenario where one value of the encoded feature co-occurs with different dependent variable values. If the feature is encoded with target encoding, it may not be a useful predictor of the dependent variable since it carries information about the specific value of the target that it co-occurs with. To mitigate target leakage, the Ordered Target Statistics technique guarantees that the encoded value for a categorical feature of a given instance is derived from other instances. In other words, the encoded value of a categorical feature cannot be calculated from the specific label it appears with. This precludes the possibility of the encoded feature value of the instance being directly related to the value of the dependent variable. Hence, Ordered Target Statistics is CatBoost’s second protection against overfitting.

Breiman introduced the concept of Bagging in the domain of machine learning in a 1996 paper [39]. Breiman describes the versatility of Bagging in both classification and regression contexts. As our research revolves around binary classification, our focus is on Breiman’s ideas about Bagging applied to binary classification. Bagging involves fitting multiple instances of a machine learning algorithm to multiple bootstrap samples of the training data, thereby generating an ensemble of learners. A bootstrap sample is derived from the original training data by sampling with replacement [40]. Moreover, the machine learning algorithms are allowed to be so-called “weak” learners, in the sense that the Bagging technique may still be effective even when any one of the learners on its own would yield unacceptably poor performance. Once the learners are fit to these bootstrap samples, each of them classifies instances of the test data. The final classification output is the class identified by the majority of the ensemble learners. Bagging’s potential to enhance classification performance centers on probabilistic reasoning. Consider a scenario where a weak learner correctly classifies more often than it makes errors. In such a case, the probability of the majority of weak learners in an ensemble making the right classification grows as the number of these weak learners is increased. Consequently, if we consider the classification result to be the category selected by the majority of the weak learners, Bagging yields better results as the size of the ensemble grows.
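The majority-vote mechanism can be illustrated with a bare-bones sketch on synthetic data, using shallow Decision Trees as the weak learners; this is a sketch of the general idea, not of any particular Bagging implementation.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

rng = np.random.default_rng(1)
votes = []
for _ in range(25):
    # Bootstrap sample: draw len(X_tr) indices with replacement.
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    learner = DecisionTreeClassifier(max_depth=3).fit(X_tr[idx], y_tr[idx])
    votes.append(learner.predict(X_te))

# Majority vote across the ensemble.
majority = (np.mean(votes, axis=0) >= 0.5).astype(int)
print("bagged accuracy:", np.mean(majority == y_te))
```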

The first member of the bagging family we use is Random Forest, which was introduced by Breiman [31]. Random Forest builds upon the Bagging principle with an added improvement. It applies Bagging to Decision Trees, but includes a modification. To understand this modification, we first need to define what a “split” signifies within the scope of Decision Trees. The non-leaf nodes of a Decision Tree contain rules that dictate which node should be visited next. This rule is formed based on a comparison between a numerical value, termed as the split, and the current value of one of the dataset’s independent attributes. Therefore, the process of training a Decision Tree-based model largely revolves around determining the best values for these splits. Random Forest enhances the Bagging technique by randomly selecting a subset of the features when deciding the most suitable value for a split.

Another classifier that we utilize from the Bagging family of machine learning algorithms is the Extremely Randomized Trees (ET) classifier [30]. ET extends the concept of Random Forest by selecting values for Decision Tree splits at random. In contrast, Random Forest and other Decision Tree-based learners typically calculate splits according to some deterministic logic. For instance, the optimal value for a split could be determined based on a specific metric evaluating how effectively the split rule segregates the training data into subsets with identical labels. However, ET departs from this deterministic approach and opts for a random selection of split values. Interestingly, our findings indicate that this seemingly random selection strategy in ET often yields effective performance when it comes to classifying highly imbalanced Big Data, particularly in the context of Medicare fraud detection.

The last classifier left to discuss is Logistic Regression [33]. Logistic Regression is a method that fits a sigmoid function to a dataset by setting the parameters of the sigmoid function to the maximum likelihood estimate of their values, given the training data. A Logistic Regression model has on the order of one parameter for each independent variable, which in many cases makes it simpler than the ensemble methods we discussed above. Since the value of a split is a parameter of a Decision Tree, ensemble techniques that result in large collections of tall and wide Decision Trees would have far more parameters than a Logistic Regression model. We include Logistic Regression in our study as a check on whether a simpler model would suffice to carry out automated Medicare fraud detection.
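For reference, the sigmoid model referred to above has the familiar form \(P(y=1 \mid \textbf{x}) = \frac{1}{1 + e^{-(\beta _0 + \beta _1 x_1 + \cdots + \beta _p x_p)}}\), where the coefficients \(\beta _0, \ldots , \beta _p\) are chosen to maximize the likelihood of the training labels; hence the model has one parameter per independent variable plus an intercept.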

Hyperparameter settings

In our experiments, we have adopted the identical hyperparameter configurations that were used in the tests detailed in [41] for the classifiers. Preliminary trials for the aforementioned study indicated that these hyperparameter configurations exhibit robust performance in classifying highly imbalanced data, without showing tendencies of overfitting when determining the best classification thresholds based on multiple metrics. These metrics include f-Measure [42], geometric mean of true positive rate and true negative rate [43], Matthews Correlation Coefficient (MCC) [44], and precision [45].

Table 6 includes hyperparameter settings that were adjusted from their default values. Besides the settings outlined in Table 6, we have also fixed random number generator seeds for all classifiers to ensure the reproducibility of the results. All remaining hyperparameters are kept at their default values. Furthermore, we have not made any changes to the hyperparameter settings for Logistic Regression and Decision Tree.

Table 6 Hyperparameter settings used in experiments

This concludes the discussion of the seven types of machine learning algorithms used in this study. Most of the models discussed are used in both feature selection and classification. These are Random Forest, ET, XGBoost, CatBoost, and LightGBM. We use Logistic Regression for classification only, and we use Decision Tree for feature selection only.

Methodology

Our study has two main methodologies. We employ one methodology for ensemble feature selection, and a second one for Medicare fraud detection. To the best of our knowledge, the ensemble feature selection technique is novel. First, we provide a description of the ensemble feature selection technique. Then, we explain how we approach Medicare fraud detection as a supervised machine learning classification task.

Ensemble feature selection

Here we describe our novel ensemble feature selection technique. A synopsis of our technique is that it is an efficient means to intelligently leverage the built-in feature importance functionality of multiple machine learning algorithms.

Our technique begins with the selection of a collection of supervised machine learning algorithms. In the context of our technique for ensemble feature selection, we call a supervised machine learning algorithm a ranker. For our study, we employ six rankers: CatBoost, XGBoost, Random Forest, ET, LightGBM, and Decision Tree [34]. These algorithms are a convenient choice since they all maintain a built-in list of feature importance values during the model fitting phase of supervised machine learning. Each algorithm’s implementation may have a unique method for calculating feature importance. For example, a Decision Tree implementation may assign a feature its importance value based on the change in Gini Impurity [45] or Shannon Entropy [46] that is obtained by splitting the training data into subsets based on some criterion related to the feature. This is because a Decision Tree implementation could employ either Gini Impurity change or Shannon Entropy change as a splitting criterion during model training. Therefore, these statistics are readily available to return to the user as feature importance values. Once fitting is completed, one may access the ordered list of features, ranked according to their importance to the algorithm. Since we use six rankers, we have six ordered lists of features. We refer to an ordered list of features as a ranking.

The novelty of our feature selection technique is the method by which we combine the rankings. Our approach is to assign each feature the median value of its rank in each ranking. We use the median rank for several reasons. First, the median is robust to outliers. Therefore, if one ranker assigns a feature a rank r that is much different from the ranks the other rankers assign, the median is unaffected by how extreme r is, whether it lies above or below the ranks assigned by the other rankers. Therefore, using the median rank of a feature helps prevent a single ranker from exerting too much influence on the final rank assigned by the ensemble. Another reason we use the median value of a feature’s rank is that it preserves information. If all rankers agree that a feature has a particular ranking, then the median rank will also be that value. This is an advantage over a majority rules approach to combining rankings, where we select features based on the number of rankers that agree that the feature is among the highest n ranked features. Such an approach loses the relative position of features in the final ranking. Since selecting the median rank of a feature preserves information about the ranks assigned to it, the outcome leads to more explainable models.

Assigning each feature the median value of its positions in all of the rankings results in an ordered list of features. In order to complete the feature selection process, one decides on a number of features to retain by selecting the first n features in the ordered list, where n is an integer between one and the number of features in the dataset. Algorithm 1 below is a formal description of our ensemble feature selection technique.

Algorithm 1 Ensemble feature selection technique

We choose to make the return value of Algorithm 1 the ordered list of features \(\mathbb {R}\) to emphasize that one may then use the first n elements of \(\mathbb {R}\) to select n features. We do this to make it clear that our ensemble feature selection technique has an application for building explainable models. We recommend using statistical analysis, such as Analysis of Variance (ANOVA) [47] tests and Tukey’s Honestly Significant Difference (HSD) [48] tests, to find the minimum number of features n with which we can build a model that yields performance similar to, or better than, a model that uses all features. Then this minimal subset is the one we use to explain a model’s behavior.
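As a concrete illustration, the following is a minimal Python sketch of the median-rank combination described in Algorithm 1, assuming each ranker follows the scikit-learn convention of exposing a feature_importances_ attribute after fitting; X is assumed to be a pandas DataFrame of predictor columns and y the corresponding fraud labels.

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

def median_rank_feature_selection(X: pd.DataFrame, y, rankers) -> list:
    """Fit each ranker, rank features by its built-in importances
    (rank 1 = most important), and order features by their median rank."""
    ranks = pd.DataFrame(index=X.columns)
    for name, ranker in rankers:
        ranker.fit(X, y)
        importances = pd.Series(ranker.feature_importances_, index=X.columns)
        ranks[name] = importances.rank(ascending=False, method="min")
    return list(ranks.median(axis=1).sort_values().index)

rankers = [
    ("dt", DecisionTreeClassifier(random_state=0)),
    ("rf", RandomForestClassifier(random_state=0)),
    ("et", ExtraTreesClassifier(random_state=0)),
    ("xgb", XGBClassifier(random_state=0)),
    ("lgbm", LGBMClassifier(random_state=0)),
    ("cat", CatBoostClassifier(random_state=0, verbose=0)),
]
# ordered_features = median_rank_feature_selection(X, y, rankers)
# selected = ordered_features[:10]   # keep the first n features
```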

We refer to the output of the feature selection technique by the number of features selected. We define a “feature set” as a group of features identified by a feature selection technique. Therefore, for the Part D data, we have feature sets 7a, 7b, 8, 9, 10, 15, 20, 25, 30 and 80. Feature sets 7a and 7b arose from the fact that, in the Part D ranking, we have two features that share a median rank of 7. Furthermore, we decided to assess the performance of models built with fewer than 10 features, since we found that models built with ten features of the Part D data outperformed models built with all features. We did not see a similar trend in AUPRC scores from experiments with the Part B data. Hence, it was not necessary to continue with smaller feature sets for experiments with the Part B data. Therefore, for the Part B data we have feature sets 10, 15, 20, 25, 30, and 82. A possible explanation for the absence of this trend is that classification results for experiments on the Part B data with 10 features were not as strong as the results for experiments on the Part D data with 10 features. In the next section, we document how the application of our feature selection technique is used to build explainable models for Medicare fraud detection.

Classification

Our classification methodology is where all the various components of this study come together. Earlier we describe our technique for compiling the Part B and Part D datasets. Part of the description includes the labeling process. Since we have a labeled dataset, it is possible to apply popular machine learning algorithms to classify it. In our discussion of supervised feature selection, we explain how we obtain many datasets by selecting different numbers of features based on their median rank, derived from multiple feature importance lists. This results in ten Part D feature sets, and six Part B feature sets. In our discussion on machine learning algorithms above, we mention that we use six algorithms for classification. We use all combinations of feature sets and algorithms to classify the Part B and Part D data.

For every pair of classifier and feature set, we conduct fivefold cross validation [49]. This yields five Area Under the Receiver Operating Characteristic Curve (AUC) [14] and five Area Under the Precision Recall Curve (AUPRC) [9] values, which we record as experimental outcomes. Since we are dealing with imbalanced data, we consider AUPRC to be a more robust metric when it comes to evaluating the performance of classifiers. However, we include AUC as well, since it is a secondary, threshold-agnostic classification metric. In order to mitigate the effects of random chance on experimental outcomes, and to obtain a sufficient number of results for statistical analysis, we repeat fivefold cross validation ten times. Different random number generator seeds are used in order to ensure our machine learning algorithms run under different initial conditions every time. This results in fifty AUC and AUPRC values for each combination of classifier and feature set. Since we have six feature sets of Part B data, and six classifiers, we have a total of 1800 AUC and AUPRC values in the set of experimental outcomes. For experiments with the Part D data, we have ten feature sets, and therefore, 3000 AUC and AUPRC values in the set of experimental outcomes.
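A sketch of this evaluation loop follows, assuming clf is the classifier and X, y are the feature set and labels under evaluation; we use scikit-learn's average precision scorer as the AUPRC estimate and stratified folds, and the study's exact implementation may differ in these details.

```python
from sklearn.model_selection import StratifiedKFold, cross_validate

auc_scores, auprc_scores = [], []
for seed in range(10):                       # ten repetitions...
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)  # ...of fivefold CV
    scores = cross_validate(clf, X, y, cv=cv,
                            scoring={"auc": "roc_auc", "auprc": "average_precision"})
    auc_scores.extend(scores["test_auc"])
    auprc_scores.extend(scores["test_auprc"])
# 10 repetitions x 5 folds = 50 AUC and 50 AUPRC values per classifier/feature set.
```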

All the software used in our experiments is open-source, and therefore publicly available for free. Every classification experiment is run as a Python program [50]. We use the scikit-learn [51] library to handle fivefold cross validation, and its implementations of the Decision Tree, Random Forest, ET and Logistic Regression classifiers. CatBoost, XGBoost, and LightGBM are available as individual libraries. A program written in the R language is used to perform the analysis of variance (ANOVA) [47] and Tukey’s Honestly Significant Difference (HSD) [48] tests.

All of our experiments are carried out as batch operations on a distributed computing platform. Each node on this platform is equipped with an Intel Xeon Central Processing Unit (CPU), which comes with 16 cores, along with 256 GB of RAM per CPU and Nvidia V100 Graphics Processing Units (GPUs). The resources available on a single node are sufficient to conduct any of the experiments discussed in this study. Now that we have elucidated our methods, we move on to discuss the results they produce.

Results

Our primary metric for gauging performance is AUPRC. In their study [52], Calvert and Khoshgoftaar show that AUC on its own may be a misleading metric to gauge experimental outcomes, because it does not align with results in terms of several other metrics they report. Here we present experimental outcomes in tabular form. Tables 7 and 8 contain the mean AUPRC scores for each combination of classifier and feature set. For each combination of classifier and feature set, we perform ten iterations of fivefold cross validation. Therefore, each AUPRC score is the mean value of 50 recorded AUPRC scores.

Table 7 Mean AUPRC values by classifier and number of features for ten iterations of fivefold cross validation, for classifying the Medicare Part D data (Part 1)
Table 8 Mean AUPRC values by classifier and number of features for ten iterations of fivefold cross validation, for classifying the Medicare Part D data (Part 2)

Tables 9 and 10 are similar to Tables 7 and 8, but they contain AUC scores instead of AUPRC scores. In our opinion, it is more difficult to see a trend in the data in Tables 9 and 10 than Tables 7 and 8. Furthermore, we conjecture the cause for this is the fact that the AUC metric is calculated from the receiver operating characteristic (ROC) curve. The ROC curve is a plot of true positive and false positive rates. In a large, imbalanced dataset a model can have a low false positive rate, but still incorrectly classify a large fraction of the positive class. This is because the false positive rate is calculated as \(\frac{FP}{FP+TN}\), where FP is the number of false positives and TN is the number of true negatives. When TN is extremely large relative to the number of instances of the positive class, the false positive rate calculation may be overwhelmed by the TN factor. The AUPRC metric does not suffer this drawback since it involves precision in place of the false positive rate, and precision does not involve instances of the negative class. Therefore, the AUPRC metric may reveal the effect of a factor in a set of experiments on the classification of imbalanced Big Data. The visibility of the trend in AUPRC scores in Tables 7 and 8 is a testament to the advantage of using AUPRC to evaluate results in the classification of imbalanced Big Data.
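To make this point concrete with a hypothetical example, consider a test fold with 1,000 positive and 1,000,000 negative instances, where a model flags 1,500 instances as positive and 500 of them are true positives. The false positive rate is \(1000 / (1000 + 999000) = 0.001\), which appears excellent, yet precision is \(500 / (500 + 1000) \approx 0.33\), revealing that two thirds of the flagged instances are false alarms. A ROC-based metric can therefore look strong even when a model is of limited practical use on imbalanced data, whereas the precision-recall curve exposes the weakness.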

Table 9 Mean AUC values by classifier and number of features for ten iterations of fivefold cross validation, for classifying the Medicare Part D data (Part 1)
Table 10 Mean AUC values by classifier and number of features for ten iterations of fivefold cross validation, for classifying the Medicare Part D data (Part 2)

Table 11 is similar to Tables 7 and 8, but for the Part B data. We detect a trend in Table 11, similar to the trend we see in Tables 7 and 8. AUPRC scores are generally higher for models built with fewer features than for models built with the maximum number of features.

Table 11 Mean AUPRC values by classifier and number of features for ten iterations of fivefold cross validation, for classifying the Medicare Part B data

Also, as was the case with the Part D data, the AUC scores in Table 12 do not show as clear a trend in the effect of the number of features used. We find it difficult to distinguish which feature set out of 25, 30, and 82 yields the best performance. In our opinion, the results in Tables 9, 10, and 12 show that AUC is a less suitable metric for evaluating the classification of imbalanced Big Data, since trends are more difficult to detect. Put another way, we find that AUC yields ambiguous results.

Table 12 Mean AUC values by classifier and number of features for ten iterations of fivefold cross validation, for classifying the Medicare Part B data

Statistical analysis

Here we use statistical methods to process the data presented in the results section in a way that allows us to make definitive conclusions as to the effects of the feature selection and classifier factors in our experiments.

Two-factor ANOVA for feature selection experiments with Part D data: analysis of results in terms of AUPRC

The first statistical method we use is an ANOVA test to determine whether the classifier and number of features factors have a significant impact on experiments involving the Part D data. The result of the ANOVA test is in Table 13. The \(\text {Pr}(>\text {F})\), or p-values, in Table 13 are practically 0, so we conclude that both factors have a significant effect. In the ANOVA tables that follow, the term “Features” indicates the treatment of the number of features in the dataset as a factor in the experimental design. For example, in Table 13, there are ten possible values for the number of features used in the datasets, therefore, nine degrees of freedom (Df) for the “Features” factor.

Table 13 ANOVA for Features and Classifier as factors of performance in terms of AUPRC
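As a hedged illustration of this kind of analysis, the sketch below runs a two-factor ANOVA over experimental outcomes with Python's statsmodels; the long-format results table, its hypothetical file name, and the column names (auprc, features, classifier) are assumptions for illustration, not the exact tooling used in this study.

```python
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Assumed long-format table of experimental outcomes: one row per
# (classifier, number of features, cross-validation fold) combination,
# with columns 'auprc', 'features', and 'classifier'.
results = pd.read_csv("part_d_outcomes.csv")  # hypothetical file name

# Two-factor ANOVA with Features and Classifier treated as categorical factors.
model = smf.ols("auprc ~ C(features) + C(classifier)", data=results).fit()
anova_table = sm.stats.anova_lm(model, typ=2)

# Pr(>F) values near zero indicate that a factor has a significant effect.
print(anova_table)
```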

Since we conclude that both the classifier and the number of features have a significant effect on experimental outcomes, we conduct an HSD test to rank the levels of each factor in terms of its impact. The result of the HSD test is an assignment of the levels of the factor to groups. The alphabetical order of the group labels corresponds to the level of performance, or rank, of the groups.

Table 14 contains the result of the HSD test that we use to rank the impact of feature selection. Averaged across the performance of all the classifiers, 10 features yields the best performance. In Table 14, group ‘a’ is associated with the highest AUPRC scores, group ‘ab’ is associated with scores in a range that overlaps both group ‘a’ and group ‘b’, and so on. Put another way, the groups in the HSD test result are ranked in alphabetical order of their labels.

Table 14 HSD test groupings after ANOVA of AUPRC for the Features factor
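The letter groupings reported in our HSD tables come from our statistical tooling; as a rough sketch of the same idea, the snippet below runs Tukey's HSD pairwise comparisons for the Features factor with statsmodels, which reports pairwise reject/accept decisions from which letter groups such as ‘a’, ‘ab’, and ‘b’ can be derived. It assumes the same hypothetical long-format results table as the ANOVA sketch above.

```python
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Same assumed long-format results table as in the ANOVA sketch.
results = pd.read_csv("part_d_outcomes.csv")  # hypothetical file name

# Tukey's HSD compares every pair of levels of the Features factor;
# pairs whose confidence intervals exclude zero differ significantly.
hsd = pairwise_tukeyhsd(
    endog=results["auprc"],
    groups=results["features"].astype(str),
    alpha=0.05,
)
print(hsd.summary())
```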

Table 15 contains the HSD result for the effect of the choice of classifier on AUPRC scores in our experiments with the Part D data. The pattern apparent in the test result is that the Gradient Boosted Decision Tree algorithms CatBoost and XGBoost perform best, followed by the Bagging algorithms Random Forest and ET, followed by the linear method, Logistic Regression. The rank of LightGBM is an exception to this pattern.

Table 15 HSD test groupings after ANOVA of AUPRC for the Classifier factor

Two-factor ANOVA for feature selection experiments with Part D data: analysis of results in terms of AUC

Next, we include ANOVA and HSD test results similar to those in Tables 13, 14, and 15. The difference is that the results in Tables 16, 17, and 18 are based on the AUC scores recorded as experimental outcomes rather than the AUPRC scores. The ANOVA test result recorded in Table 16 shows that the Pr(>F) values are practically zero, which means that both the classifier and the number of features used have a significant impact on the AUC scores we record.

Table 16 ANOVA for Features and Classifier as factors of performance in terms of AUC
Table 17 HSD test groupings after ANOVA of AUC for the Features factor
Table 18 HSD test groupings after ANOVA of AUC for the Classifier factor

Although the ANOVA test result in Table 16 implies that both the classifier and the number of features have a significant impact on experimental outcomes, the HSD test result in Table 17 accentuates the difficulty one may encounter when using AUC to gauge model performance on imbalanced Big Data. It is more difficult to identify the best choice of number of features, since every level of the Features factor falls in a group whose range of scores overlaps with that of other groups.

The HSD results for the effect of the choice of classifier on AUC scores are more interpretable. However, they further illustrate the difficulties one may encounter when using AUC to gauge the performance of a model. Here we see that the more biased Logistic Regression model is ranked above Random Forest and ET, whereas the HSD test result for classifier performance in terms of AUPRC in Table 15 disagrees with this ranking. We speculate this is due to class imbalance. In particular, we conjecture that the higher rank of Logistic Regression in Table 18 versus Table 15 is due to the overwhelming number of instances of the negative class in the Part D data, and the fact that the AUC score is calculated from the false positive rate while the AUPRC score is calculated from precision.

Two-factor ANOVA for supervised feature selection experiments with Part B data: analysis of results in terms of AUPRC

Now we move on to present a statistical analysis of results of experiments conducted with the Part B data. The result of the ANOVA test in Table 19 shows that both the classifier and the number of features have a significant impact on AUPRC scores. Therefore, we conduct further HSD tests to rank levels of the factors according to their impact on AUPRC scores.

Table 19 ANOVA for Features and Classifier as factors of performance in terms of AUPRC

The HSD test result in Table 20 indicates a different effect of supervised feature selection than what we find for the Part D data. In Table 20, it is clear that applying supervised feature selection yields better performance in terms of AUPRC than not applying it. However, among the reduced feature sets, the particular number of features used does not have a significant effect. Therefore, we are free to select whichever number of features is most convenient. Since the 10-feature set yields the smallest dataset, we would select it, because in our experience machine learning algorithms run to completion more quickly on smaller datasets.

Table 20 HSD test groupings after ANOVA of AUPRC for the Features factor

The trend of classifier performance we observe for the Part B data in Table 21 is slightly clearer than the corresponding result for the Part D data, since here LightGBM is ranked directly under CatBoost and XGBoost. Thus, the ranking in Table 21 shows that the Gradient Boosted Decision Tree algorithms CatBoost, LightGBM, and XGBoost yield the best performance, followed by the Bagging technique algorithms Random Forest and ET, and finally the linear technique, Logistic Regression.

Table 21 HSD test groupings after ANOVA of AUPRC for the Classifier factor

Two-factor ANOVA for supervised feature selection experiments with Part B data: analysis of results in terms of AUC

Just as we do for the Part D data, here we analyze the AUC scores from experiments involving the Part B data. As in the other cases, the ANOVA test documented in Table 22 shows that both the classifier and the number of features have a significant impact on experimental outcomes. Therefore, we conduct additional HSD tests to rank the levels of each factor in terms of their impact.

Table 22 ANOVA for Features and Classifier as factors of performance in terms of AUC

The HSD test result in Table 23 highlights the potentially misleading nature of the AUC score. The outcome of the HSD test implies that the 25, 30, and 80 feature sets all yield overlapping AUC scores. We attribute this to the fact that the AUC score is based on the false positive rate, whereas the AUPRC score is based on precision and therefore reflects the fraction of predicted positives that are correct. We conjecture that, for many threshold values, models built with the 80-feature set have low false positive rates but also low precision, and that the differences between the AUC and AUPRC rankings of the feature sets are evidence of this.

Table 23 HSD test groupings after ANOVA of AUC for the Features factor

The HSD results in Table 24 are a case where we find AUC not to be misleading. The results align with the trend we see in the other HSD results: the GBDT algorithms are ranked highest, followed by the Bagging methods, and finally the linear technique.

Table 24 HSD test groupings after ANOVA of AUC for the Classifier factor

Model interpretability from feature selection and statistical analysis

Results of the statistical tests above enable an explanation of how the models used in this study accomplish Medicare fraud detection. Specifically, the HSD test results for the effect of the feature set (number of features) on the mean AUPRC scores for classifying both the Part B and Part D data tell us which subsets of features can be used to build models that yield performance similar to, or better than, using all features. Of particular interest is the smallest such number of features. We conclude that the smallest set of features that yields performance similar to using all features identifies the information a model needs to perform the machine learning task. In our case, the machine learning task is Medicare fraud detection.
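For context, the sketch below shows one way a reduced feature set can be derived by averaging feature-importance ranks across several tree-based learners and keeping the top ten. The particular learners and the mean-rank aggregation shown here are assumptions for illustration; the exact ensemble and aggregation we use are described earlier in this paper.

```python
import pandas as pd
from sklearn.ensemble import (
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)

def select_top_features(X, y, k=10):
    """Rank features by their mean importance rank across several learners."""
    learners = [
        RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0),
        ExtraTreesClassifier(n_estimators=100, n_jobs=-1, random_state=0),
        GradientBoostingClassifier(random_state=0),
    ]
    ranks = []
    for learner in learners:
        learner.fit(X, y)
        importance = pd.Series(learner.feature_importances_, index=X.columns)
        # Rank 1 = most important feature for this learner.
        ranks.append(importance.rank(ascending=False))
    mean_rank = pd.concat(ranks, axis=1).mean(axis=1)
    return mean_rank.nsmallest(k).index.tolist()

# Usage (X is a feature DataFrame, y holds the fraud labels):
# top_10 = select_top_features(X, y, k=10)
```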

The results of our experiments with the Part B data show that the models’ generalization improved with the use of 10 features, compared to using all 80 features, based on AUPRC scores. All 10 features are listed in Table 25.

Table 25 Medicare Part B 10 most important features

Looking at these 10 features for possible causal inferences can provide information on why fraud may have occurred and which features to focus on for particular fraudulent activities. Note that there are several engineered features with suffixes such as “_sum” or “_std”, which are lumped in with their base feature (a sketch of one possible way such aggregates could be derived appears after the list below). For this analysis we discard those engineered features and focus on the remaining six features. From these, there are some potential causal interpretations in relation to fraudulent activities.

  • Tot_Srvcs represents the number of services provided, which can vary by provider. A higher value of Tot_Srvcs could indicate fraudulent activities like unnecessary procedures or services.

  • Avg_Mdcr_Pymt_Amt depicts the average amount that Medicare paid after deductible and coinsurance amounts have been deducted for each service item. A higher value could indicate that a provider is charging more than what is typical for a service—possible fraud.

  • Tot_Bene_Day_Srvcs denotes the number of distinct Medicare beneficiary per day services. This metric removes double counting from the line service variable for an occurrence of a unique service. A higher value could imply that a provider is delivering unnecessary services or performing services more frequently than required.

  • Avg_Sbmtd_Chrg represents the average of the charges that the provider submitted for a service. A higher value could suggest that a provider is charging more than what is typical for a service—pointing toward possible fraud.

  • Tot_Benes represents the number of distinct Medicare beneficiaries receiving the service for each Rndrng_NPI, HCPCS_Cd, and Place_Of_Srvc. A higher value could indicate that a provider is performing unnecessary procedures or services or that they are seeing an unusually high number of patients.

  • Rndrng_Prvdr_Type is derived from the provider specialty code reported on the claim. For providers that reported more than one specialty code on their claims, this is the code associated with the largest number of services. This, coupled with other features, could provide insight into what type of procedures and services the provider is offering.
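As noted before the list above, several of the selected Part B features are provider-level aggregations of base claim attributes. The snippet below is a minimal sketch of how “_sum” and “_std” features could be engineered with pandas; the grouping key, the aggregated columns, and the file name are illustrative assumptions, and the actual feature engineering is described in the data section earlier in this paper.

```python
import pandas as pd

# Assumed Part B claims table with one row per provider/procedure/place-of-service.
claims = pd.read_csv("medicare_part_b.csv")  # hypothetical file name

base_cols = ["Tot_Srvcs", "Tot_Benes", "Avg_Mdcr_Pymt_Amt", "Avg_Sbmtd_Chrg"]

# Aggregate each base attribute to the provider (Rndrng_NPI) level,
# producing engineered columns such as Tot_Srvcs_sum and Tot_Srvcs_std.
agg = claims.groupby("Rndrng_NPI")[base_cols].agg(["sum", "mean", "std"])
agg.columns = ["_".join(col) for col in agg.columns]
features = agg.reset_index()
print(features.columns.tolist())
```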

The results of our experiments with the Part D data show that the models’ generalization improved with the use of 10 features, compared to using all 82 features, based on AUPRC scores. Moreover, using only 7 features also improved performance over using all features. Even so, all 10 features are listed in Table 26 for reference.

Table 26 Medicare Part D 10 most important features

Looking at these 10 features for possible causal inferences can provide information on why fraud may have occurred and what features to focus on for particular fraudulent activities. Note that there are several engineered features with suffixes such as “_sum” or “_std” which are lumped in with their base feature. From these, there are some potential causal interpretations in relation to fraudulent activities.

  • Tot_Clms, Tot_30day_Fills, and Tot_Day_Suply are variables related to the number of drugs prescribed and dispensed. High values in these features could indicate possible fraudulent activities such as over-prescribing, over-dispensing, or even drug diversion (see Note 2).

  • Tot_Drug_Cst, which reflects the total cost of the associated claims, can also be a useful variable to detect fraud, especially if there are significant deviations from the expected cost in total or in relation to other features like total claims or beneficiaries.

  • Tot_Benes is the total number of unique beneficiaries with one or more claims for a drug. An unusually high value could indicate that a provider is submitting claims for more beneficiaries or treatments than expected, which increases the possibility that the claims are fraudulent.

  • Additionally, other features like Gnrc_Tot_Clms and GE65_Tot_Clms could also be useful in detecting fraud. The former is related to the number of claims for generic drugs, which could be an indicator of lower quality care or prescription errors. The latter reflects the number of claims for beneficiaries aged 65 and older, a population that is at higher risk of healthcare fraud.

Conclusions

We have shown how a novel feature selection technique can be employed to generate more explainable models, and to significantly decrease the size of a highly imbalanced Big Data dataset without necessarily compromising classification performance. Statistical analysis is essential for determining the minimum number of features one must retain from a dataset while still building models that yield acceptable performance. Therefore, we recommend that one employ statistical tests similar to those we conduct as part of the feature selection technique.

In this study, we employ an ensemble supervised feature selection technique to rank the features of two Big Data datasets. The datasets are derived from Medicare Part D and Medicare Part B insurance claims data, and labeled with data from the LEIE. They qualify as Big Data, and they are highly imbalanced. Therefore, our results demonstrate that our feature selection technique is a viable approach for reducing the dimensionality of other highly imbalanced Big Data for supervised machine learning tasks. Taking our results for the Part D dataset as an example, our feature selection technique reduces the dataset from 82 to 10 features, yet models built on the reduced set of features yield classification results that are significantly better than those built using all 82 features.

Another important advantage of our feature selection technique is its running time. Our technique runs in the time that it takes to fit a collection of six classifiers to the dataset. If the fitting processes for the classifiers are run in parallel, then the running time of the feature selection technique is bounded above by the time it takes to fit the slowest classifier. This is far better than the brute-force approach of trying all possible combinations of features, which requires exponential resources. Since our technique reduces the size of the dataset, it also improves the running time of the many learners one may subsequently fit to train models for Medicare fraud detection. Moreover, the reduction in the size of the dataset helps control overfitting. We conjecture that this is why we observe better performance in terms of AUPRC for models built with fewer features.
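As a rough illustration of the parallel-fitting argument, the sketch below fits a small collection of scikit-learn-style classifiers concurrently with joblib; the synthetic data, the particular classifiers, and the worker count are assumptions for illustration, not the exact code used in our experiments.

```python
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Small synthetic stand-in for a (reduced) Medicare feature matrix and labels.
X, y = make_classification(n_samples=10_000, n_features=10,
                           weights=[0.99, 0.01], random_state=0)

def fit(estimator, X, y):
    # Each learner is fit independently of the others.
    return estimator.fit(X, y)

learners = [
    RandomForestClassifier(n_estimators=100, random_state=0),
    ExtraTreesClassifier(n_estimators=100, random_state=0),
    LogisticRegression(max_iter=1000),
]

# With one worker per learner, the wall-clock time of this step is
# bounded above by the time of the slowest single fit.
fitted = Parallel(n_jobs=len(learners))(
    delayed(fit)(est, X, y) for est in learners
)
print([type(est).__name__ for est in fitted])
```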

Explainable models are an additional result of our feature selection technique. When one can show that models built with a reduced number of features yield performance equivalent to using all features, one can make the case that the reduced set of features contains all the information the algorithm requires to successfully carry out the machine learning task. When fewer features are required, it is easier to understand and explain what sort of information a model uses. Future work should consider applying the techniques discussed here to other Big Data in other application domains.

Availability of data and materials

Not applicable.

Notes

  1. http://archive.org/web.

  2. https://www.cdc.gov/injectionsafety/drugdiversion/index.htm.

Abbreviations

ANOVA: Analysis of variance
AUC: Area under the receiver operating characteristic curve
AUPRC: Area Under the Precision Recall Curve
CMS: Centers for Medicare and Medicaid Services
CPU: Central Processing Unit
DMEPOS: Durable Medical Equipment Prosthetics Orthotics and Supplies
EFB: Exclusive Feature Bundling
ET: Extremely Randomized Trees
GBDT: Gradient Boosted Decision Trees
GOSS: Gradient Based One-Side Sampling
GPU: Graphics processing unit
HCC: Hierarchical Condition Categories
HCPCS: Healthcare Common Procedure Coding System
HSD: Honestly Significant Difference
LA: Long-acting
LEIE: List of Excluded Individuals and Entities
LIS: Low-Income Subsidy
LSTM: Long-Short Term Memory
MAPD: Medicare Advantage Prescription Drug Plan
MCC: Matthews’ Correlation Coefficient
MINN-AE: Multiple Inputs Neural Network Auto-Encoder
NPI: National Provider Identifier
OIG: Office of the Inspector General
PDP: Prescription Drug Plan
SVM: Support Vector Machine


Acknowledgements

The authors would like to thank the various members of the Data Mining and Machine Learning Laboratory, Florida Atlantic University, for their assistance with the reviews.

Funding

Not applicable.

Author information

Contributions

JTH conducted experiments, and contributed to the manuscript. RAB and HJW contributed to the manuscript. TMK provided oversight of experiments, coordinated research, and contributed to the manuscript.

Corresponding author

Correspondence to John T. Hancock.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article


Cite this article

Hancock, J.T., Bauder, R.A., Wang, H. et al. Explainable machine learning models for Medicare fraud detection. J Big Data 10, 154 (2023). https://doi.org/10.1186/s40537-023-00821-5
