Medicare fraud detection using neural networks

Johnson, Justin M.; Khoshgoftaar, Taghi M.

doi:10.1186/s40537-019-0225-0

Research
Open access
Published: 18 July 2019

Journal of Big Data volume 6, Article number: 63 (2019)

Abstract

Access to affordable healthcare is a nationwide concern that impacts a large majority of the United States population. Medicare is a Federal Government healthcare program that provides affordable health insurance to the elderly population and individuals with select disabilities. Unfortunately, there is a significant amount of fraud, waste, and abuse within the Medicare system that costs taxpayers billions of dollars and puts beneficiaries’ health and welfare at risk. Previous work has shown that publicly available Medicare claims data can be leveraged to construct machine learning models capable of automating fraud detection, but challenges associated with class-imbalanced big data hinder performance. With a minority class size of 0.03% and an opportunity to improve existing results, we use the Medicare fraud detection task to compare six deep learning methods designed to address the class imbalance problem. Data-level techniques used in this study include random over-sampling (ROS), random under-sampling (RUS), and a hybrid ROS–RUS. The algorithm-level techniques evaluated include a cost-sensitive loss function, the Focal Loss, and the Mean False Error Loss. A range of class ratios are tested by varying sample rates and desirable class-wise performance is achieved by identifying optimal decision thresholds for each model. Neural networks are evaluated on a 20% holdout test set, and results are reported using the area under the receiver operating characteristic curve (AUC). Results show that ROS and ROS–RUS perform significantly better than baseline and algorithm-level methods with average AUC scores of 0.8505 and 0.8509, while ROS–RUS maximizes efficiency with a 4× speedup in training time. Plain RUS outperforms baseline methods with up to 30× improvements in training time, and all algorithm-level methods are found to produce more stable decision boundaries than baseline methods. 
Thresholding results suggest that the decision threshold always be optimized using a validation set, as we observe a strong linear relationship between the minority class size and the optimal threshold. To the best of our knowledge, this is the first study to compare multiple data-level and algorithm-level deep learning methods across a range of class distributions. Additional contributions include a unique analysis of the relationship between minority class size and optimal decision threshold and state-of-the-art performance on the given Medicare fraud detection task.

Introduction

Medicare is a United States (U.S.) healthcare program established and funded by the Federal Government that provides affordable health insurance to individuals 65 years and older, and other select individuals with permanent disabilities [1]. According to the 2018 Medicare Trustees Report [2], in 2017 Medicare provided coverage to 58.4 million beneficiaries and exceeded $710 billion in total expenditures. Medicare enrollment has grown to 60.6 million as of February 2019 [3]. There are many factors that drive the costs of healthcare and health insurance, including fraud, waste, and abuse (FWA) within the healthcare system. The Federal Bureau of Investigation (FBI) estimates that fraud accounts for 3–10% of all billings [4], and the Coalition Against Insurance Fraud [5] estimates that fraud costs all lines of insurance roughly $80 billion per year. Based on these estimates, Medicare is losing between $21 and $71 billion per year to FWA. Examples of fraud include billing for appointments that the patient did not keep, billing for services more complex than those performed, or billing for services not provided. Abusive practice is practice inconsistent with providing patients medically necessary services according to recognized standards, e.g. billing for unnecessary medical services or misusing billing codes for personal gain. Federal laws are in place to govern Medicare fraud and abuse, for example the False Claims Act (FCA) and Anti-Kickback Statute [6].

One way to improve the cost, efficiency, and quality of Medicare services is to reduce the amount of FWA. Manually auditing and investigating all Medicare claims data for fraud is very tedious and inefficient when compared to machine learning and data mining approaches [7]. As of 2017, 86% of office-based physicians and more than 96% of reported hospitals have adopted electronic health record (EHR) systems in accordance with the Health Information Technology for Economic and Clinical Health Act of 2009 and the Federal Health IT Strategic Plan [8, 9]. This explosion in healthcare-related data encourages the use of data mining and machine learning for detecting patterns and making predictions. The Centers for Medicare and Medicaid Services (CMS) joined this data-driven effort by making Medicare data sets publicly available, stating that bad actors intent on abusing federal health care programs cost taxpayers billions of dollars and risk the well-being of beneficiaries [6].

Fraud detection using CMS Medicare data presents several challenges. The problem is characterized by the four Vs of big data: volume, variety, velocity, and veracity [10, 11]. The 9 million records released by CMS each year satisfy both high volume and velocity. Variety arises from the mixed-type high-dimensional features and the combining of multiple data sources. These data sets also exhibit veracity, or trustworthiness, as they are provided by reputable government resources with transparent quality controls and detailed documentation [12, 13]. The processing of big data often exceeds the capabilities of traditional systems and demands specialized architectures or distributed systems [14]. Another challenge is that the positive class of interest makes up just 0.03% of all records, creating a severe class-imbalanced distribution. Learning from such distributions can be very difficult, and standard machine learning algorithms will typically over-predict the majority class [15]. This paper expresses the level of class imbalance within a given data distribution as $N_{neg}{:}N_{pos}$, where $N_{neg}$ and $N_{pos}$ correspond to the percentage of samples in the negative and positive classes, respectively.

We believe that deep learning is an important area of research that will play a critical role in the future of modeling class-imbalanced big data. Over the last 10 years, deep learning methods have grown in popularity as they have improved the state-of-the-art in speech recognition, computer vision, and other domains [16]. Their recent success can be attributed to an increased availability of data, improvements in hardware and software [17,18,19,20,21], and various algorithmic breakthroughs that speed up training and improve generalization to new data [22]. Deep learning is a sub-field of machine learning that uses the artificial neural network (ANN) with two or more hidden layers to approximate some function $f^{*}$, where $f^{*}$ can be used to map input data to new representations or make predictions [23]. The ANN, inspired by the biological neural network, is a set of interconnected neurons, or nodes, where connections are weighted and each neuron transforms its input into a single output by applying a non-linear activation function to the sum of its weighted inputs. In a feedforward network, input data propagates through the network in a forward pass, each hidden layer receiving its input from the previous layer’s output, producing a final output that is dependent on the input data, the choice of activation function, and the weight parameters [24]. Gradient descent optimization adjusts the network’s weight parameters in order to minimize the loss function, i.e. the error between expected output and actual output. Composing multiple non-linear transformations creates hierarchical representations of the input data, increasing the level of abstraction through each transformation. The deep learning architecture, i.e. deep neural network (DNN), achieves its power through this composition of increasingly complex abstract representations [23]. 
Despite the success of DNN models in various domains, there is limited research that evaluates the use of deep learning for addressing class imbalance [25].
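
The forward pass described above can be illustrated with a minimal sketch in plain Python. The layer sizes, weights, and the choice of ReLU and sigmoid activations are illustrative assumptions and are not the architectures used in this study.

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dense(inputs, weights, biases, activation):
    """One fully connected layer: each neuron applies the activation
    to the weighted sum of its inputs plus a bias."""
    return [activation(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x, layers):
    """Feed the input through each layer in turn (the forward pass)."""
    for weights, biases, activation in layers:
        x = dense(x, weights, biases, activation)
    return x

# Hypothetical network: two hidden layers of two neurons, one sigmoid output.
layers = [
    ([[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1], relu),
    ([[0.3, 0.7], [-0.6, 0.2]], [0.0, 0.0], relu),
    ([[1.0, -1.0]], [0.0], sigmoid),
]
score = forward([1.0, 2.0], layers)[0]  # a value in (0, 1)
```

Training would adjust the weight values with gradient descent; that step is omitted here.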

This study compares six deep learning methods for addressing class imbalance and assesses the importance of identifying optimal decision thresholds when training data is imbalanced. We expand upon existing Medicare fraud detection work [26] using the Medicare Provider Utilization and Payment Data: Physician and Other Supplier Public Use File provided by CMS, as it provides a firm baseline with traditional machine learning methods. This data set, referred to as Part B data hereafter, provides information on the services and procedures provided to Medicare beneficiaries and is currently available on the CMS website for years 2012–2016 [27]. The Part B data set includes both provider-level and procedure-level attributes, including the amounts charged for procedures, the number of beneficiaries receiving the procedure, and the payment reimbursed by Medicare. To enable supervised learning, fraud labels are mapped to the Part B claims data using the List of Excluded Individuals and Entities (LEIE) [13]. Since we are most interested in detecting fraud, we refer to the group of fraudulent samples as the positive class and the group of non-fraudulent samples as the negative class. The LEIE is maintained by the Office of Inspector General (OIG), and its monthly releases list providers that are prohibited from participating in Federal healthcare programs. Under the Exclusion Statute [28], the OIG must exclude providers convicted of program-related crimes, patient abuse, and healthcare fraud.

With three data-level and three algorithm-level methods for addressing class imbalance, multiple configurations for each method, and two network architectures, we evaluate the performance of 42 distinct DNN models. Data-level methods for addressing class imbalance include random over-sampling (ROS), random under-sampling (RUS), and combinations of random over-sampling and random under-sampling (ROS–RUS). Multiple class distributions are tested for each method, i.e. 40:60, 50:50, 60:40, 80:20, and 99:1. Algorithm-level methods include cost-sensitive learning and two loss functions specifically designed to increase the impact of the minority class during training, i.e. Mean False Error Loss (MFE) [29] and Focal Loss (FL) [30]. To further offset the bias towards the majority class, we calculate an optimal decision threshold for each method using a validation set. For each method configuration, we train 30 models and report the average area under the receiver operating characteristic curve (ROC AUC) [31] score on a 20% holdout test set. Analysis of variance (ANOVA) [32] and Tukey’s HSD (honestly significant difference) [33] tests are used to estimate the significance of the results. The mean optimal decision threshold, true positive rate (TPR), true negative rate (TNR), geometric mean, and training time are also reported for each method.
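
As a reminder of what the ROC AUC metric measures, the sketch below computes it as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative sample, with ties counted as one half. This pairwise form is quadratic in the number of samples and is meant for illustration only; it is not the implementation used in the experiments.

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs that are
    ranked correctly; tied pairs count as half a correct ranking."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

Because the metric depends only on the ranking of scores, it is unaffected by the choice of decision threshold.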

Results indicate that eliminating class imbalance from the training data through ROS or ROS–RUS produces significantly better AUC scores than all other methods, i.e. 0.8505 and 0.8509, respectively. While ROS methods perform best using the 50:50 class ratio, plain RUS outperforms baseline methods and achieves its highest AUC score with a 99:1 class ratio. Tukey’s HSD test shows that there is no significant difference between AUC scores of algorithm-level methods and baseline models, but we show that the algorithm-level methods yield more stable decision boundaries than the baseline models. Analysis of training times further suggests combining ROS and RUS when working with big data and class imbalance, as the balanced training distribution yields superior results and the under-sampling component improves efficiency. Results also show that the optimal decision threshold is highly correlated with the minority class size. Hence, we suggest that the decision threshold always be optimized with a validation set when training data is imbalanced. To the best of our knowledge, this is the first study to compare DNN loss functions designed for addressing class imbalance with random sampling methods that consider multiple class distributions. Additional contributions include a unique thresholding assessment that stresses the importance of optimizing classification decision thresholds and state-of-the-art performance on the given Medicare Part B fraud detection task.

The remainder of this paper is outlined as follows. The "Related works" section discusses other works related to CMS Medicare data, fraud detection, and deep learning with class-imbalanced data. The CMS and LEIE data sets are described in full, including all pre-processing steps, in the "Data sets" section. The "Methodology" section explains the experiment framework, hyperparameter tuning, class imbalance methods, and performance criteria. Results are presented in the "Results and discussion" section, and the "Conclusion" section concludes the study with areas for future works.

Related works

Since CMS released the Public Use Files (PUF) in 2014, a number of studies relating to Medicare anomaly and fraud detection have been conducted. We have selected this data set to evaluate deep learning methods for addressing class imbalance because it exhibits severe class imbalance (99.97:0.03) and previous work has left an opportunity for improvement. This section discusses the fraud-related works performed by our research group and others and outlines studies that consider the effects of class imbalance on deep learning.

Medicare fraud detection

Our research group has performed extensive research on detecting anomalous provider behavior using the CMS PUF data. In [34], Bauder and Khoshgoftaar proposed an outlier detection method based on Bayesian inference that detected fraud within Medicare. This study used a small subset of the 2012–2014 Medicare Part B data by selecting dermatology and optometry claims from Florida office clinics for analysis. The authors demonstrated the model’s ability to identify outliers with credible intervals, and successfully validated the model using claims data from a known Florida provider that was under criminal investigation for excessive billing. In another study [35], Bauder and Khoshgoftaar use a subset of the 2012–2013 Medicare Part B data, i.e. Florida claims only, to model expected amounts paid to providers for services rendered to patients. Claims data is grouped by provider type, and five different regression models are used to model expected payment amounts. Actual payment amount deviations from the expected payment amounts are then used to flag potentially fraudulent providers. Of the five regression methods tested, the multivariate adaptive regression splines [36] model is shown to outperform others in most cases, but the authors state that model selection varies between provider types. In [37], Bauder et al. used a Naive Bayes classifier to predict provider specialty types, suggesting that providers practicing outside their specialty norm warrant further investigation. This study also used a Florida-only subset of 2013 Medicare Part B claims data, but it included all 82 provider types, or classes, yielding 40,490 unique physicians and 2789 unique procedure codes. Recall, precision, and F1-scores were used to evaluate the model, showing that 7 of 82 classes scored very highly ($\text {F1-score} > 0.90$), and 18 classes scored reasonably ($0.5< \text {F1-score} < 0.90$). The authors conclude that specialties with unique billing procedures, e.g. audiologist or chiropractic, can be classified with high precision and recall.
Herland et al. [38] expanded upon the work from [37] by incorporating 2014 Medicare Part B data and real-world fraud labels defined by the LEIE data set. Providers are labeled as fraudulent when the Naive Bayes model misclassifies the provider’s specialty type, and LEIE ground truth fraud labels are used to evaluate performance. They found that removing specialty types that have many overlapping procedures improves overall performance, e.g. Internal Medicine and Family Practice. Similarly, the authors showed that grouping like specialties improves performance further still, yielding an overall accuracy of 67%. In a later study, Bauder and Khoshgoftaar [39] merge 2012–2015 Medicare Part B data sets, map fraud labels using LEIE data, and compare multiple learners on all available data. Rather than focus on Florida-specific claims, like earlier reports, this study includes all available data for the given years, yielding 37,147,213 instances. Class imbalance is addressed with RUS, and various class distributions are generated to identify the optimal imbalance ratio for training. ANOVA is used to evaluate the statistical significance of ROC AUC scores, and the C4.5 decision tree and logistic regression (LR) learners are shown to significantly outperform the support vector machine (SVM). The 80:20 class distribution outperformed all other distributions tested, i.e. 50:50, 65:35, and 75:25. These studies jointly show that Medicare Part B claims data contains sufficient variability to detect bad actors and that the LEIE data set can be reliably used for ground truth fraud labels.

Our study is most closely related to the work performed by Herland et al. in [26], which uses three different 2012–2015 CMS Medicare PUF data sets, i.e. Part B, Part D [40], and DMEPOS [41]. Part B, Part D, and DMEPOS claims data are used independently to perform cross-validation with LR, random forest (RF), and Gradient Boosted Tree (GBT) learners. The authors also construct a combined data set by merging Part B, Part D, and DMEPOS and assess the performance of each learner to determine if models should be trained on individual sectors of Medicare claims or all available claims data. The combined and Part B data sets scored the best on ROC AUC, and the LR learner was shown to perform significantly better than GBT and RF with a maximum ROC AUC score of 0.816. We follow the same protocol in preparing data for supervised learning, described in the "Data sets" section, so that we may compare deep learning results to those of traditional learners.

A number of other research groups have explored the use of CMS Medicare and LEIE data for the purpose of identifying patterns, anomalies, and potentially fraudulent activity. Feldman and Chawla [42] explored the relationship between medical school training and the procedures performed by physicians in practice in order to detect anomalies. The 2012 Medicare Part B data set was linked with provider-level medical school data obtained through the CMS physician compare data set [43]. Significant procedures for schools were used to evaluate school similarities and present a geographical analysis of procedure charges and payment distributions. Ko et al. [44] used the 2012 CMS data to analyze the variability of service utilization and payments, and found that the number of patient visits was strongly correlated with Medicare reimbursement. They also found that in terms of services per visit there was a high utilization variability and a possible 9% savings within the field of Urology. Chandola et al. [45] use healthcare claims and fraudulent provider labels provided by the Texas Office of Inspector General’s exclusion database to detect anomalies and bad actors. They employ social network analysis, text mining, and temporal analysis to show that typical treatment profiles can be used to compare providers and highlight abuse. A weighted LR model was used to classify bad actors, and experimental results showed that the inclusion of the provider specialty attribute increases the ROC AUC score from 0.716 to 0.814. Branting et al. [46] propose a graph-based method for estimating healthcare fraud risk within the 2012–2014 CMS PUF and LEIE data sets. Since the LEIE data set contains many missing NPI values, the authors use the National Plan and Provider Enumeration System (NPPES) [47] data set from 2015 to identify additional NPIs within the LEIE data set. 
This allowed the authors to increase the total positive fraudulent provider count to 12,000, which they then combined with 12,000 randomly selected non-fraudulent providers. Features are constructed from behavioral similarity between known fraudulent providers and non-fraudulent providers and risk propagation through geospatial collocation, i.e. shared addresses. A J48 decision tree learner was used to classify fraud with tenfold cross-validation, yielding a mean ROC AUC of 0.96. We believe this high AUC score is misleading, however, as the class-balanced context created by the authors is not representative of the naturally imbalanced population.

Deep learning with class imbalance

In a recent paper [25], we surveyed deep learning methods for addressing class imbalance. Despite advances in deep learning, and its increasing popularity, many researchers agree that the subject of deep learning with class-imbalanced data is understudied [29, 48,49,50,51,52]. For the purpose of this study, we have selected a subset of data-level and algorithm-level methods for addressing class imbalance to be applied to Medicare fraud detection.

Anand et al. [53] studied the effects of class imbalance on the backpropagation algorithm in shallow networks. The authors show that when training networks with class-imbalanced data, the length of the majority class’s gradient component that is responsible for updating network weights dominates the component derived by the minority class. This often reduces the error of the majority group very quickly during early iterations while consequently increasing the error of the minority group, causing the network to get stuck in a slow convergence mode. The authors of the related works in this section apply class imbalance methods to counter this effect and improve the classification of imbalanced data with neural networks.

The related works in this section often use Eq. (1) to describe the maximum between-class imbalance level, i.e. the size of the largest class divided by the size of the smallest class. $C_{i}$ is the set of examples in class i, and $\max_i\{|C_{i}|\}$ and $\min_i\{|C_{i}|\}$ return the maximum and minimum class size over all classes i, respectively. This can be used interchangeably with our notation, e.g. a class distribution of 80:20 can be denoted by $\rho = 4$.

$$\begin{aligned} \rho = \frac{\max_{i}\{|C_{i}|\}}{\min_{i}\{|C_{i}|\}} \end{aligned}$$

(1)
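
As a quick check of Eq. (1), the following sketch converts class sizes, or class percentages, into the imbalance ratio. The function name is an illustrative choice.

```python
def imbalance_ratio(class_sizes):
    """Eq. (1): size of the largest class divided by the size of the smallest."""
    sizes = list(class_sizes)
    if not sizes or min(sizes) <= 0:
        raise ValueError("every class needs a positive size")
    return max(sizes) / min(sizes)

rho_80_20 = imbalance_ratio([80, 20])          # the 80:20 example from the text
rho_medicare = imbalance_ratio([99.97, 0.03])  # the 99.97:0.03 Medicare distribution
```

Under this definition, the 99.97:0.03 distribution of the Medicare data corresponds to a ratio above 3000, far beyond the 80:20 example.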

Data-level methods

Hensman and Masko [54] explored the effects of ROS on class imbalanced image data generated from the CIFAR-10 [55] data set. The authors generated ten imbalanced distributions by varying levels of imbalance across classes, testing a maximum imbalance ratio of $\rho = 2.3$. The ROS method duplicates minority class examples until all classes are balanced, where any class whose size is less than that of the largest is considered to be a minority. This increases the size of the training data, therefore increasing training time, and has also been shown to cause over-fitting in traditional machine learning models [56]. Applying ROS until class imbalance was eliminated succeeded in restoring model performance, and achieved results comparable to the baseline model that was trained on the original balanced data set.
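
The duplication step of ROS can be sketched as follows. This is a minimal illustration under the stated definition (every class is grown to the size of the largest class); the function name and the fixed seed are assumptions, and it is not the code used in [54] or in this study.

```python
import random

def random_over_sample(samples, labels, seed=0):
    """Duplicate randomly chosen examples of each smaller class until
    every class is as large as the largest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = max(len(xs) for xs in by_class.values())
    data = []
    for y, xs in by_class.items():
        # Sampling with replacement: the same example may be copied many times.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        data.extend((x, y) for x in xs + extra)
    rng.shuffle(data)
    return [x for x, _ in data], [y for _, y in data]

# Toy data: 2 positive and 8 negative examples.
X_ros, y_ros = random_over_sample(list(range(10)), [1] * 2 + [0] * 8)
```

The output has 16 examples, which shows directly why ROS increases training time: no majority example is discarded and minority examples are repeated.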

Buda et al. [52] presented similar results and showed that ROS generally outperforms RUS and two-phase learning. The RUS method used by Buda et al. randomly removes samples from the majority group until all classes are of equal size, where any class larger than the smallest class is treated as a majority class. If the data is highly imbalanced, under-sampling until class balance is achieved may result in discarding many samples. This can be problematic with high capacity neural network learners, as more training data is one of the most effective ways to improve performance on the test set [23]. Two-phase learning addresses this issue by first training a model on a balanced data set, generated through ROS or RUS, and then fine-tuning the model on the complete data set. By simulating class imbalance ratios in the range $\rho \in [10, 100]$ on three popular image benchmarks, it was shown that applying ROS until class imbalance is eliminated outperforms both RUS and two-phase learning in nearly all cases. Results from Buda et al. discourage the use of RUS, as it generally performs the worst. Experiments by Dong et al. [57] support these findings and show that over-sampling outperforms under-sampling on the CelebA data set [58] with a max imbalance ratio of $\rho = 49$.
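
For comparison, a minimal sketch of RUS as defined above, where every class is reduced to the size of the smallest class. The function name and seed are illustrative assumptions.

```python
import random

def random_under_sample(samples, labels, seed=0):
    """Randomly discard examples from each larger class until every
    class is as small as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    target = min(len(xs) for xs in by_class.values())
    data = []
    for y, xs in by_class.items():
        # Sampling without replacement: kept examples are never duplicated.
        data.extend((x, y) for x in rng.sample(xs, target))
    rng.shuffle(data)
    return [x for x, _ in data], [y for _, y in data]

# Toy data: 2 positive and 8 negative examples.
X_rus, y_rus = random_under_sample(list(range(10)), [1] * 2 + [0] * 8)
```

Here six of the eight negative examples are thrown away, which is the information loss that the paragraph above warns about when imbalance is severe.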

In this paper, we explore the use of ROS and RUS to address class imbalance at the data-level. Due to time constraints, we leave two-phase learning and other advanced data sampling strategies for future works, e.g. dynamic sampling [50]. The related works listed here conclude that over-sampling the minority class until the imbalance is eliminated from the training data yields the best results. They do not, however, consider big data problems exhibiting class rarity, i.e. where very few positive examples exist. Contrary to these related works, comprehensive experiments by Van Hulse et al. [15] suggest that RUS outperforms ROS when using traditional machine learning algorithms, i.e. non-deep learning. We believe that RUS will play an important role in training deep models with big data. We extend these related works by testing various levels of class imbalance and combining ROS with RUS to generate class-balanced training data. This is the first study to compare ROS, RUS, and ROS–RUS deep learning methods across a range of class distributions.
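
One way to combine the two sampling steps into a balanced training set is sketched below for the binary case. The parameter `neg_keep_fraction`, the 0/1 output labels, and the order of operations are assumptions made for illustration; the exact sampling rates used in this study are described in the "Methodology" section.

```python
import random

def ros_rus(samples, labels, neg_keep_fraction, positive=1, seed=0):
    """Hybrid sampling for binary data: under-sample the negative class,
    then over-sample the positive class until both classes are equal."""
    rng = random.Random(seed)
    pos = [x for x, y in zip(samples, labels) if y == positive]
    neg = [x for x, y in zip(samples, labels) if y != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    # RUS step: keep a random fraction of the negatives, without replacement.
    n_keep = max(1, round(len(neg) * neg_keep_fraction))
    neg_kept = rng.sample(neg, n_keep)
    # ROS step: duplicate positives, with replacement, up to the same size.
    extra = [rng.choice(pos) for _ in range(max(0, n_keep - len(pos)))]
    data = [(x, 0) for x in neg_kept] + [(x, 1) for x in pos + extra]
    rng.shuffle(data)
    return [x for x, _ in data], [y for _, y in data]

# Toy data: 4 positive and 100 negative examples, keeping half of the negatives.
X_mix, y_mix = ros_rus(list(range(104)), [1] * 4 + [0] * 100, neg_keep_fraction=0.5)
```

Compared with plain ROS on the same toy data, the balanced set has 100 examples instead of 200, which is the efficiency argument for the hybrid.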

Algorithm-level methods

Wang et al. [59] employed a cost-sensitive deep neural network (CSDNN) method to detect hospital readmissions, a class imbalanced problem where a small percentage of patients are readmitted to a hospital shortly after their original visit. The authors used a Weighted CE loss (Eq. 2) to incorporate misclassification costs directly into the training process, where $p_{i}$ is the model output activation that denotes the estimated probability of observing the ground truth label $y_{i}$.

$$\begin{aligned} \textit{Weighted CE loss} = - \sum _{i}^{C} w_{i} \cdot y_{i} \cdot \log (p_{i}) \end{aligned}$$

(2)

The weighted CE loss multiplies the loss for each class $i \in C$, i.e. $y_{i} \cdot \log (p_{i})$, by the corresponding class weight $w_{i}$. They show that increasing the weight of the minority class to $1.5\times$ and $2\times$ that of the majority class improves classification results and outperforms several baselines, e.g. decision trees, SVM, and a baseline ANN. Incorporating the cost matrix into the CE loss is a minor implementation detail that is often built into modern deep learning frameworks, making the selection of an optimal cost matrix the most difficult task. Cost matrices can be defined by empirical work, domain knowledge, class priors, or through a search process that tests a range of values while monitoring performance on a validation set.
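
Eq. (2) can be written out for a single sample as in the sketch below, assuming a one-hot label vector. The clipping constant `eps` is an implementation assumption added to avoid taking the logarithm of zero.

```python
import math

def weighted_cross_entropy(y_true, p_pred, class_weights, eps=1e-12):
    """Eq. (2) for one sample: -sum_i w_i * y_i * log(p_i),
    where y_true is a one-hot label vector."""
    return -sum(w * y * math.log(max(p, eps))
                for w, y, p in zip(class_weights, y_true, p_pred))

# A positive sample (class index 1) predicted with probability 0.8.
plain = weighted_cross_entropy([0, 1], [0.2, 0.8], class_weights=[1.0, 1.0])
doubled = weighted_cross_entropy([0, 1], [0.2, 0.8], class_weights=[1.0, 2.0])
```

Doubling the minority class weight doubles the loss, and therefore the gradient, contributed by each minority sample, while leaving majority samples unchanged.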

Wang et al. [29] present the novel Mean False Error (MFE) loss function and compare it to the Mean Squared Error (MSE) loss function using imbalanced text and image data. They constructed eight imbalanced data sets with $\rho \in [5,20]$ by sampling the CIFAR-100 [60] and 20 Newsgroup [61] image and text data sets. After demonstrating that the MSE loss is dominated by the majority class and that their models fail to converge, they proposed the MFE loss to increase the sensitivity to errors in the minority class. The proposed loss function was derived by splitting the MSE loss into two components, Mean False Positive Error (FPE) and Mean False Negative Error (FNE). The FPE (Eq. 3) and FNE (Eq. 4) values are combined to define the total system loss, MFE (Eq. 5), as the sum of the mean errors from each class. The proposed MFE loss function, and its Mean Squared False Error (MSFE) (Eq. 6) variant, are shown to outperform the MSE loss in nearly all cases. Improvements over the baseline MSE loss are most apparent when class imbalance is greatest, i.e. imbalance levels of 95:5. For example, the MSFE loss improved the classification of Household image data and increased the F1-score from 0.1143 to 0.2353 when compared to MSE.

$$\begin{aligned} FPE= & {} \frac{1}{N} \sum \limits _{i=1}^{N} \sum _{n} \frac{1}{2} (d_{n}^{(i)} - y_{n}^{(i)})^{2} \end{aligned}$$

(3)

$$\begin{aligned} FNE= & {} \frac{1}{P} \sum \limits _{i=1}^{P} \sum _{n} \frac{1}{2} (d_{n}^{(i)} - y_{n}^{(i)})^{2} \end{aligned}$$

(4)

$$\begin{aligned} MFE= & {} FPE + FNE \end{aligned}$$

(5)

$$\begin{aligned} MSFE= & {} FPE^{2} + FNE^{2} \end{aligned}$$

(6)
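
Eqs. (3) to (6) can be sketched for one batch as follows, where N and P are the numbers of negative and positive samples in the batch, $d^{(i)}$ is the target vector, and $y^{(i)}$ is the network output. Treating an empty class as contributing zero error is an assumption of this sketch, not something specified in [29].

```python
def false_errors(targets, outputs, labels, positive=1):
    """Return (FPE, FNE): the mean per-sample squared error computed
    separately over negative samples (Eq. 3) and positive samples (Eq. 4)."""
    def sample_error(d, y):
        return sum(0.5 * (dn - yn) ** 2 for dn, yn in zip(d, y))

    def mean(values):
        # Assumption: a class that is absent from the batch contributes 0.
        return sum(values) / len(values) if values else 0.0

    neg = [sample_error(d, y) for d, y, l in zip(targets, outputs, labels) if l != positive]
    pos = [sample_error(d, y) for d, y, l in zip(targets, outputs, labels) if l == positive]
    return mean(neg), mean(pos)

def mfe(targets, outputs, labels):
    fpe, fne = false_errors(targets, outputs, labels)
    return fpe + fne            # Eq. (5)

def msfe(targets, outputs, labels):
    fpe, fne = false_errors(targets, outputs, labels)
    return fpe ** 2 + fne ** 2  # Eq. (6)

# Toy batch: three well-classified negatives and one poorly classified positive.
targets = [[0.0], [0.0], [0.0], [1.0]]
outputs = [[0.1], [0.1], [0.1], [0.5]]
labels = [0, 0, 0, 1]
fpe, fne = false_errors(targets, outputs, labels)
```

Because each class is averaged separately, the single positive sample determines FNE on its own instead of being diluted by the three negatives, as it would be in a plain mean over all four samples.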

It is unclear if Wang et al. performed mini-batch stochastic gradient descent (SGD) or standard batch gradient descent. We feel that this should be considered in future works, because when class imbalance levels are high, mini-batch gradient descent may produce many batches that contain no positive samples. This leads to many weight updates uninfluenced by the positive class of interest. Larger mini-batches or batch gradient descent will alleviate this problem, but the benefits of smaller mini-batches may prove more valuable [23].
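
The scale of this effect can be estimated with a short calculation. The sketch assumes that each sample in a batch is drawn independently, which is a reasonable approximation for a very large data set, and the batch sizes are illustrative rather than those used in any of the cited works.

```python
def prob_no_positive(pos_fraction, batch_size):
    """Probability that a mini-batch contains no positive sample when each
    sample is positive independently with probability pos_fraction."""
    return (1.0 - pos_fraction) ** batch_size

# With a 0.03% positive class, as in the Medicare data:
p_small = prob_no_positive(0.0003, 256)     # most batches have no positives
p_large = prob_no_positive(0.0003, 10_000)  # few batches have no positives
```

With 256 samples per batch, more than nine in ten batches would contain no fraudulent sample at all, whereas with 10,000 samples per batch only about one in twenty would.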

Lin et al. [30] proposed the FL (Eq. 7) function to address the class imbalance inherent to object detection problems, where positive foreground samples are heavily outnumbered by negative background samples. The FL reshapes the CE loss in order to reduce the impact that easily classified samples have on the loss by multiplying the CE loss by a modulating factor, ${\alpha _{t}(1 - p_{t})^{\gamma }}$. The hyperparameter $\gamma \ge 0$ adjusts the rate at which easy examples are down-weighted, and $\alpha _{t} \ge 0$ is a class-wise weight that is used to increase the importance of the minority class. Easily classified examples, where ${p_{t} \rightarrow 1}$, cause the modulating factor to approach 0 and reduce the sample’s impact on the loss.

$$\begin{aligned} FL(p_{t}) = -\alpha _{t}(1 - p_{t})^{\gamma }\log (p_{t}) \end{aligned}$$

(7)
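
Eq. (7) translates directly into code. In the sketch below, $p_{t}$ is the probability the model assigns to the true class of a sample; the clipping constant `eps` and the example values of $\gamma$ and $\alpha_{t}$ are illustrative assumptions.

```python
import math

def focal_loss(p_t, gamma, alpha_t, eps=1e-12):
    """Eq. (7): -alpha_t * (1 - p_t)^gamma * log(p_t) for one sample."""
    p_t = min(max(p_t, eps), 1.0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# With gamma = 0 and alpha_t = 1 the FL reduces to the plain CE loss.
ce_easy = focal_loss(0.9, gamma=0.0, alpha_t=1.0)
# With gamma = 2 an easy sample (p_t = 0.9) is scaled by (0.1)^2 = 0.01,
# while a hard sample (p_t = 0.1) is scaled by (0.9)^2 = 0.81.
fl_easy = focal_loss(0.9, gamma=2.0, alpha_t=1.0)
fl_hard = focal_loss(0.1, gamma=2.0, alpha_t=1.0)
```

The relative weight of hard samples therefore grows with $\gamma$, independently of which class they belong to; the class-wise emphasis comes only from $\alpha_{t}$.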

The proposed one-stage FL model, RetinaNet, is evaluated against several state-of-the-art one-stage and two-stage detectors on the COCO [62] data set. It outscores the runner-up one-stage detector (DSSD513 [63]) and the best two-stage detector (Faster R-CNN with TDM [64]) by 7.6-point and 4.0-point precision gains, respectively. When compared to several Online Hard Example Mining (OHEM) [65] methods, RetinaNet outscores the best method with an increase in AP from 32.8 to 36.0. By down-weighting the loss of easily learned samples, the FL lends itself to not just class imbalanced problems, but also hard-sample problems. Nemoto et al. [66] later used the FL for the automated detection of rare building changes, e.g. new construction, and concluded that FL improves problems related to class imbalance and over-fitting. The results provided by Nemoto et al. are difficult to interpret, however, because the FL and baseline experiments are conducted on different data distributions.

The final algorithm-level technique we consider is output thresholding, i.e. adjusting the output decision threshold that computes class labels from the model’s output activation. Buda et al. applied output thresholding to all experiments by dividing network outputs for each class by their estimated priors and found that combining ROS with thresholding worked especially well. Appropriate decision thresholds can also be identified by varying the threshold and comparing results on a validation set. While it is rarely discussed in related deep learning work, we believe that thresholding is a critical component of training neural networks with class imbalanced data.
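
A validation-based threshold search can be sketched as below. Selecting the threshold that maximizes the geometric mean of TPR and TNR is an assumption made for this illustration; the selection criterion used in this study is defined in the "Methodology" section.

```python
import math

def rates(labels, scores, threshold):
    """TPR and TNR when scores at or above the threshold are labeled positive."""
    tp = fn = tn = fp = 0
    for y, s in zip(labels, scores):
        predicted_positive = s >= threshold
        if y == 1:
            tp += predicted_positive
            fn += not predicted_positive
        else:
            fp += predicted_positive
            tn += not predicted_positive
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr, tnr

def best_threshold(labels, scores, candidates):
    """Return the candidate threshold with the highest G-mean and that G-mean."""
    best_t, best_g = None, -1.0
    for t in candidates:
        tpr, tnr = rates(labels, scores, t)
        g = math.sqrt(tpr * tnr)
        if g > best_g:
            best_t, best_g = t, g
    return best_t, best_g

# Toy validation set: a model trained on imbalanced data outputs low scores overall.
val_labels = [0, 0, 0, 0, 1, 1]
val_scores = [0.01, 0.02, 0.03, 0.20, 0.15, 0.30]
threshold, g_mean = best_threshold(val_labels, val_scores, [0.5, 0.25, 0.1])
```

In this toy example the default threshold of 0.5 labels every sample as negative, so its TPR is zero, while a much lower threshold recovers both positive samples at the cost of one false positive.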

In this study, we evaluate the use of cost-sensitive learning, MFE loss, FL, and output thresholding. The first three methods are modifications to the loss function that influence network weight updates by increasing the impact of the minority class during training. Output thresholding, on the other hand, does not affect training and only changes the cutoff threshold that is used for determining class labels from output scores. Unlike data-level methods, these algorithm-level methods do not change the data distribution and should only have a marginal effect on training times. One disadvantage is that two of the three loss-based methods increase the number of tunable hyperparameters, making the search for appropriate hyperparameters more time-consuming.

Data sets

In this paper we use the Medicare Part B data set provided by CMS [27] for years 2012–2016, namely the Medicare Provider Utilization and Payment Data: Physician and Other Supplier PUF. To enable supervised learning, a second data set, the List of Excluded Individuals and Entities (LEIE) [13], is used to label providers within the Medicare Part B data set as fraudulent or non-fraudulent. In this section, we describe these data sets in detail and discuss the pre-processing steps that we take to create the final labeled data set. This process follows the procedures outlined by Herland et al. [26].

Medicare Part B data

The Medicare Part B claims data set describes the services and procedures that healthcare professionals provide to Medicare’s Fee-For-Service beneficiaries. Records within the data set contain various provider-level attributes, e.g. National Provider Identifier (NPI), first and last name, gender, credentials, and address. The NPI is a unique 10-digit identification number for healthcare providers [67]. In addition to provider-level details, records contain claims information that describes a provider’s activity within Medicare over a single year. Examples of claims data include the procedure performed, the average charge amount submitted to Medicare, the average amount paid by Medicare, and the place of service. The procedures rendered are encoded using the Healthcare Common Procedure Coding System (HCPCS) [68]. For example, HCPCS codes 99219 and 88346 are used to bill for hospital observation care and antibody evaluation, respectively. Also included in the claims data is the provider type, a categorical value describing the provider’s specialty that is derived from the original claim.

For each annual release, CMS aggregates the data over: (1) provider NPI, (2) HCPCS code, and (3) place of service. This produces multiple records for each provider, with one record for each HCPCS code and place of service combination. CMS decided to separate claims data by place of service, i.e. facility versus non-facility, because procedure fees will vary depending on where the service was performed [12]. An example of the 2016 Part B data set is presented in Table 1.
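The aggregation described above can be illustrated with a small sketch that groups claim-level rows by NPI, HCPCS code, and place of service. The claim rows, field names, NPI values, and the `F`/`O` place-of-service codes are made up for illustration; the public data set contains only the aggregated records, not the underlying claims.

```python
from collections import defaultdict

def aggregate_claims(claims):
    """Group claim rows by (npi, hcpcs_code, place_of_service).

    Returns one record per key with a service count and the
    average submitted charge (hypothetical field names).
    """
    groups = defaultdict(list)
    for c in claims:
        key = (c["npi"], c["hcpcs_code"], c["place_of_service"])
        groups[key].append(c["submitted_charge"])
    return {
        key: {
            "service_count": len(charges),
            "avg_submitted_charge": sum(charges) / len(charges),
        }
        for key, charges in groups.items()
    }

# Hypothetical claim-level rows for two providers.
claims = [
    {"npi": "1000000001", "hcpcs_code": "99219", "place_of_service": "F", "submitted_charge": 200.0},
    {"npi": "1000000001", "hcpcs_code": "99219", "place_of_service": "F", "submitted_charge": 300.0},
    {"npi": "1000000001", "hcpcs_code": "88346", "place_of_service": "O", "submitted_charge": 50.0},
    {"npi": "1000000002", "hcpcs_code": "99219", "place_of_service": "O", "submitted_charge": 120.0},
]
records = aggregate_claims(claims)
```

The first provider ends up with two records, one per HCPCS code and place-of-service combination, which mirrors the multiple-records-per-provider structure of each annual release.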

Table 1 Sample of Part B data set
