Investigating the relationship between time and predictive model maintenance

A majority of predictive models should be updated regularly, since the most recent data associated with the model may have a different distribution from that of the original training data. This difference may be critical enough to impact the effectiveness of the machine learning model. In our paper, we investigate the relationship between time and predictive model maintenance. Our work incorporates severely imbalanced big data from three Medicare datasets, namely Part D, DMEPOS, and Combined, that have been used in several fraud detection studies. We build training datasets from year-groupings of 2013, 2014, 2015, 2013–2014, 2014–2015, and 2013–2015. Our test datasets are built from the 2016 data. To mitigate some of the adverse effects of the severe class imbalance in these datasets, the performance of five learners and five class ratios obtained by Random Undersampling is evaluated with the Area Under the Receiver Operating Characteristic Curve metric. The models producing the best values are as follows: Logistic Regression with the 2015 year-grouping at a 99:1 class ratio (Part D); Random Forest with the 2014–2015 year-grouping at a 75:25 class ratio (DMEPOS); and Logistic Regression with the full 2015 year-grouping (Combined). Our experimental results show that the largest training dataset (year-grouping 2013–2015) was not among the selected choices, which indicates that the 2013 data may be outdated. Moreover, because the best model differs for Part D, DMEPOS, and Combined, these three datasets may actually be sub-domains requiring unique models within the Medicare fraud detection domain.

Note that our paper focuses only on predictive model maintenance vis-à-vis the dynamic nature of data distributions. Using new or improved machine learning algorithms to facilitate this maintenance is outside the scope of this paper.
We adopt an existing model for detecting Medicare fraud as our frame of reference [9]. This model was selected out of several possible combinations constructed from processed Medicare datasets (Part B, Part D, Durable Medical Equipment, Prosthetics, Orthotics and Supplies (DMEPOS), and Combined), and employs the Logistic Regression (LR) learner after applying Random Undersampling (RUS) with a 90:10 majority-to-minority class ratio. The Combined dataset was created from a join operation of the other three processed datasets. To determine whether or not a physician has committed fraud, we consult the List of Excluded Individuals/Entities (LEIE) to obtain fraud labels, which are subsequently mapped to the Medicare datasets. Our work investigates the effect of using training datasets of various year-groupings, where a grouping refers to one or more years' worth of collected data. Training datasets are created from the following year-groupings: 2013, 2014, 2015, 2013-2014, 2014-2015, and 2013-2015. The test datasets are created from 2016 data. Both training and test datasets are constructed from Part D, DMEPOS, and Combined, with the full datasets characterized as highly imbalanced big data.
Although there is no universally accepted definition of big data, data scientists often refer to the six V's: volume, variety, velocity, variability, value, and veracity [10]. Volume, the best-known characteristic of big data, is associated with the amount of data produced by an entity. Variety encompasses the handling of structured, semi-structured, and unstructured data. Velocity considers the speed at which data is produced, delivered, and handled. Variability pertains to data fluctuations. Value is frequently designated as a critical attribute with respect to effective decision-making. Veracity involves the fidelity of data. If a dataset has distinct majority and minority classes, e.g., normal and fraudulent transactions for an international bank, the data can be regarded as class-imbalanced [8]. The imbalance is often associated with binary classification, a framework of only a majority class and a minority class, in contrast to a multi-class classification [11] framework of more than two classes. With binary classification, the minority (positive) class comprises a smaller portion of the dataset and is usually the class of interest in real-world problems [12,13]. For instance, defective hard drives advancing along the production line of a factory constitute the class of interest, while non-defective hard drives make up the majority (negative) class [14]. According to one school of thought, high or severe class imbalance can be expressed in terms of a majority-to-minority ratio between 100:1 and 10,000:1 [15].
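As an illustration of this threshold, the majority-to-minority ratio can be computed directly from the class labels. The following is a minimal sketch in Python; the helper name and the label counts are ours, not from any of the cited works:

```python
def imbalance_ratio(labels):
    """Return the majority-to-minority class ratio for binary labels."""
    pos = sum(1 for y in labels if y == 1)   # minority (positive) class
    neg = len(labels) - pos                  # majority (negative) class
    return max(pos, neg) / min(pos, neg)

# A hypothetical dataset with 10,000 negative and 50 positive instances
labels = [0] * 10_000 + [1] * 50
ratio = imbalance_ratio(labels)
print(ratio)                   # 200.0
print(100 <= ratio <= 10_000)  # True -> severely imbalanced per [15]
```

A ratio of 200:1 therefore falls squarely within the severe-imbalance band described above.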
From a classification viewpoint, machine learning algorithms are usually more effective than traditional statistical techniques [16][17][18]. However, these algorithms may be unable to distinguish between majority and minority classes if the dataset is highly imbalanced. As a result, practically every instance may be labeled as the majority (negative) class, and performance metric values based on the misclassified instances could be deceptively high. For cases where a false negative incurs a greater penalty than a false positive, a learner's bias in favor of the majority class may have adverse consequences [19,20]. As an example, in a national hospital's database of patients with migraine, a very small number will most likely test positive for brain cancer, i.e., minority class, while most are expected to test negative, i.e., majority class. In this situation, a false negative means that a patient with brain cancer has been misclassified as not having the disease, which is a very grave error.
Our contribution shows how changes in data distribution over time affect predictability with regard to the maintenance of machine learning models. To achieve this, we utilize five learners, five class ratios obtained by RUS, the Area Under the Receiver Operating Characteristic Curve (AUC) metric, and three Medicare datasets. Upon evaluation, the top models selected are as follows: Logistic Regression with the 2015 year-grouping at a 99:1 class ratio (Part D); Random Forest with the 2014-2015 year-grouping at a 75:25 class ratio (DMEPOS); and Logistic Regression with the full 2015 year-grouping (Combined). Empirical results indicate that the largest training dataset, i.e., year-grouping 2013-2015, was not among the selected models, thus suggesting that the 2013 data may be outdated and not beneficial for Medicare fraud detection. Since the top choice of model is different for the Part D, DMEPOS, and Combined datasets, we postulate that each of the three datasets may be a unique sub-domain under the broader domain of Medicare fraud detection. To the best of our knowledge, we are the first to investigate, through the use of several big datasets, the effect of time on predictive model maintenance.
The remainder of this paper is organized as follows: "Related work" section covers related literature investigating the relationship between time and predictive model maintenance; "Datasets" section describes the Medicare datasets, along with our data processing approach; "Learners" section provides information on the learners and their configuration settings; "Methodologies" section describes the different aspects of the methodology used to develop and implement our approach; "Results and discussion" section presents and discusses our empirical results; and "Conclusion" section concludes our paper with a summary of the research work and suggestions for related future work.

Related work
There are various approaches for addressing data distribution changes with time [2]. However, apart from our recently published conference paper [21], we could not find other research works that investigate the relationship between time and predictive model maintenance for big data.
In [22], Raza et al. proposed a solution that detects dataset shift with an Exponentially Weighted Moving Average (EWMA) control chart. The chart is an established statistical method for identifying minor shifts in time-series data [23], and it joins past and current data to enable rapid detection of these shifts. The researchers evaluated model performance using both real-world and synthetic datasets. Through their model, Raza et al. attempted to detect the shift-point in a data stream. A shift-point is a marked change in one or more slopes of a linear time-series model [24]. The work in [22] is limited due to the EWMA chart assigning a time-weighted constant to past and current observations, where the inclusion of an incorrect constant could lead to shift-point misidentification.
Ikonomovska et al., in [25], recognized that data streams are a promising research area, and they developed a regression tree algorithm for identifying variations in data distributions. Starting with an empty leaf node, the algorithm sequentially reads instances and determines the best split for each feature. Features are then ranked, with a split made on the most favorable attribute once a specific threshold is reached. Each data stream instance that arrives triggers a change detection check, and on detection of a change, the tree structure is updated. We note that if the tree model becomes too large, the model becomes complex and interpretability suffers.
With Multivariate Relevance Vector Machines [26], Torres et al. [27] examined direct and indirect approaches for forecasting daily evapotranspiration. The estimation of evaporation is important for water management and irrigation scheduling. Utilizing the Multilayer Perceptron (MLP) [28] learner as a benchmark, the researchers calculated potential crop evapotranspiration and compared crop value results in their study location. Their results showed that it is possible to accurately forecast up to four days of potential crop evapotranspiration, and that the indirect approach outperformed both the direct approach and MLP learner. In other words, after four days the phenomenon of concept drift adversely affected model performance. The study is limited by the use of only one learner as a benchmark.
Using an Adaboost-Support Vector Machines (SVMs) ensemble, Sun et al. [29] implemented time-based weighting on data batches to predict dynamic financial distress. This distress is associated with conditions such as bankruptcy and debt default [30]. The researchers designed two different algorithms for predicting financial distress. The first merges the outputs of a time-based and error-based decision expert system, and the second applies a time-based weight updating function during iterations of Adaboost. As noted previously, the use of time weighting could impact the effectiveness of study results.
Finally, in our recent conference paper [21], we examined the effect of time on the maintenance of a predictive model to detect Medicare Part B billing fraud. Training datasets were built from year-groupings of 2015, 2014-2015, 2013-2015, and 2012-2015, while the test datasets were built from 2016 data. Our study incorporated five class ratios obtained by RUS, and five learners. Using the AUC performance metric, we showed that the Logistic Regression learner produces the highest overall value for the year-grouping of 2013-2015, with a majority-to-minority ratio of 90:10. Furthermore, we concluded that a sampled dataset should be selected over the full dataset and that the largest training dataset, i.e., 2012-2015, does not always yield the best results. The work in [21] is limited to the Medicare Part B dataset, and therefore, in our current paper we remove this limitation by performing experimentation on Part D, DMEPOS, and Combined Medicare datasets.
Throughout our search for related works, we observed that existing literature on the dynamism of data distribution is mainly centered around data streams, i.e., real-time data. In our current paper, however, constructed models are based on static datasets, meaning that our predictive models have been trained on static data that was collected and processed in an offline mode. On the other hand, online processing is a requirement for data streams because this data arrives in real time and may overburden the computer system. For these online cases, predictive models are retrained with recent data batches or incrementally trained [2]. Although investigating the dynamic nature of data distribution is equally important for static databases and data streams, it is easier to observe the distribution variations with real-time data. On account of this, static databases are frequently omitted from such studies.

Datasets
The Centers for Medicare and Medicaid Services (CMS) datasets used in this work (Part B, Part D, DMEPOS, and Combined) are discussed in this section, along with our data processing methodology and the LEIE dataset that provides the fraud labels. Our training and test datasets are derived from these original CMS and LEIE datasets. CMS records all claims information after payments are disbursed [31][32][33], and therefore, we consider the Medicare data to be cleansed and accurate. We note that the National Provider Identifier (NPI) [34] is utilized for aggregation and identification, but not during the data mining stage. Furthermore, a year variable was added for each dataset.

Part B
The Part B dataset contains claims information for each procedure a physician performs in a specific year [35]. Physicians are identified by their unique NPI, and procedures are assigned their respective Healthcare Common Procedure Coding System (HCPCS) codes [36]. The number of procedures performed, average payments and charges, and medical specialty (referred to as provider type) are also covered by Medicare Part B. CMS has aggregated Part B data by NPI, HCPCS code, and the place of service (facility (F) such as a hospital, or non-facility (O) such as an office). Every dataset row includes an NPI, provider type, one HCPCS code matched to place of service along with related information, and other static features such as gender. For each physician, each dataset row represents a unique combination of NPI, provider type, HCPCS code, and place of service. Note that the Part B dataset is not the focus of our paper. However, Part B is a component of the Combined, a dataset integral to our current work.

Part D
The Part D dataset contains information on prescription drugs provided under the Medicare Part D Prescription Drug Program in a specific year [37]. Physicians are identified by their unique NPI and drugs are labeled according to their brand and generic name. Other information contained in the dataset includes average payments and charges, variables describing the drug quantity prescribed, and medical specialty. CMS has aggregated Part D data by NPI and the drug name. Every dataset row includes an NPI, provider type, drug name along with related information, and other static features such as gender. For each physician, each dataset row represents a unique combination of NPI, provider type, and drug name. Aggregated records, obtained from fewer than 11 claims, are omitted from the Part D data. This is done to safeguard the privacy of Medicare beneficiaries.

DMEPOS
The DMEPOS dataset contains information on Medical Equipment, Prosthetics, Orthotics and Supplies that physicians referred their patients to either buy or rent from a supplier in a specific year [38]. This dataset is derived from claims that suppliers have submitted to Medicare. The role of the physician in this case is to refer the patient to the supplier. Physicians are identified by their unique NPI [34], and products are assigned their HCPCS code. Other claims information includes the number of services/products rented or sold, average payments and charges, and medical specialty. CMS has aggregated DMEPOS data by NPI, HCPCS code, and supplier rental indicator obtained from DMEPOS supplier claims. Every dataset row includes an NPI, provider type, one HCPCS code matched to rental status along with related information, and other static features such as gender. For each physician, each dataset row represents a unique combination of NPI, provider type, HCPCS code, and rental status.

LEIE
A dataset of physicians who committed fraud is necessary to accurately assess fraud detection performance in the real world. For this reason, we utilized the LEIE [39], which provides information such as reason for exclusion and date of exclusion. The LEIE was established by the Office of Inspector General (OIG) [40], which has a mandate to exclude individuals and entities from federally funded healthcare programs. We note, however, that the LEIE dataset contains NPI values for only a fraction of fraudulent physicians and entities in the US. Nationally, approximately 21% of convicted fraudulent providers have not been suspended from medical practice, and about 38% of those convicted continue to practice medicine [41].
The LEIE does not provide specific information relating to drugs, equipment, or procedures involving fraudulent activities. There are several types of exclusions that are described by various rule numbers, and we selected only rules indicating fraud was committed, as shown in Table 1 [42].

Data processing
When this study was conducted, Part B was available for 2012 through 2016, while Part D and DMEPOS were available for 2013 through 2016. We selected specific attributes among the three datasets in order to provide a solid foundation for our analyses. Also, for consistency purposes, the 2012 data of Part B was removed. For Part B, Part D, and DMEPOS, we selected eight, seven, and nine features, respectively. Excluded features carried no information on drugs provided, claims, or referrals; they consisted of provider-related details, such as location and name, as well as redundant variables. Table 2 shows the features that we selected from each original dataset [9]. All three original datasets are at the procedure level, which means they were aggregated by NPI and HCPCS codes. To conform to our methodology of mapping fraud labels with the LEIE, each dataset was aggregated to the provider level, a rearrangement that groups all information over each NPI (and other particular attributes) [43]. For each numeric value per year, we replace the variable in each dataset with the aggregated mean, median, sum, standard deviation, minimum, and maximum values, creating six new attributes for each original numeric attribute.
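The six-attribute aggregation described above can be sketched with Python's statistics module. The per-procedure values below are invented for illustration and stand in for one provider-year's numeric records:

```python
from statistics import mean, median, stdev

def aggregate_provider(values):
    """Collapse one provider-year's numeric values (e.g., per-HCPCS claim
    counts) into six summary attributes: mean, median, sum, standard
    deviation, minimum, and maximum."""
    return {
        "mean": mean(values),
        "median": median(values),
        "sum": sum(values),
        "std": stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }

# Hypothetical claim counts for one NPI in one year
feats = aggregate_provider([12, 7, 30, 7])
print(feats["sum"], feats["median"])  # 56 9.5
```

Each original numeric attribute thus yields six provider-level attributes, matching the rearrangement described in the text.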

Combined dataset
The Combined dataset was created by a join operation on NPI, provider type, and year across Part B, Part D, and DMEPOS, after individual processing of these datasets [43]. As Part D contains no gender variable, this feature was not included [43]. Note that the combining of these datasets limits us to physicians who have participated in all three parts of Medicare. However, the Combined dataset has more numerous and inclusive features than the other three Medicare datasets.
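The join that produces the Combined dataset can be illustrated with a minimal sketch, assuming toy records keyed by (NPI, provider type, year); the field names and values are hypothetical:

```python
# Hypothetical minimal records keyed by (npi, provider_type, year)
part_b = {("100", "Cardiology", 2015): {"b_sum": 56}}
part_d = {("100", "Cardiology", 2015): {"d_sum": 12},
          ("200", "Internal Medicine", 2015): {"d_sum": 9}}
dmepos = {("100", "Cardiology", 2015): {"e_sum": 3}}

# Inner join: keep only providers present in all three parts
combined = {}
for key in part_b.keys() & part_d.keys() & dmepos.keys():
    row = {}
    row.update(part_b[key])
    row.update(part_d[key])
    row.update(dmepos[key])
    combined[key] = row

print(len(combined))  # 1 -- NPI "200" is dropped
```

As in the paper, a provider appearing in only one or two parts of Medicare (NPI "200" here) is excluded from the Combined dataset.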

Fraud labeling
For our processed Medicare datasets, we obtain fraud labels from the LEIE dataset [43]. Only physicians within the LEIE are considered fraudulent for the purpose of this study. This dataset is joined to the Medicare datasets by NPI, and physicians practicing within a year prior to their exclusion end year are labeled fraudulent. Table 3 shows the distribution of fraud to non-fraud within the full datasets [9], which are highly or severely imbalanced. Note that year-groupings can correspond to single years, as in the case of 2015, or a combination of years, as in the case of 2013-2015. Part B, which is a component of the Combined but not the focus of this paper, is shown in the table for informational purposes only.
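The labeling step can be sketched as below. The exclusion-window logic is one reading of the rule described above, and the NPI and year values are invented for illustration:

```python
# Hypothetical LEIE excerpt: NPI -> exclusion end year
leie_end_year = {"100": 2016}

def fraud_label(npi, claim_year):
    """Label a provider-year as fraudulent (1) if the physician appears in
    the LEIE and was practicing within a year prior to the exclusion end
    year (one reading of the rule in the text); otherwise 0."""
    end = leie_end_year.get(npi)
    if end is None:
        return 0
    return 1 if end - 1 <= claim_year <= end else 0

print(fraud_label("100", 2015))  # 1
print(fraud_label("100", 2012))  # 0
print(fraud_label("999", 2015))  # 0 -- not in the LEIE
```

Any physician absent from the LEIE is labeled non-fraudulent, which reflects the paper's assumption that only LEIE-listed physicians are considered fraudulent.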

One-hot encoding
One-hot encoding is used in our model construction to transform categorical features into numerical ones [43]. For instance, one-hot encoding of gender generates extra features equal to the number of options (male and female). If the physician is male, the new male feature would be assigned a 1 and the female feature a 0, and vice-versa if the physician is female. Both male and female features could be assigned a 0 in cases where the original gender feature is not provided.
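A minimal sketch of this encoding, including the all-zeros case for a missing gender value:

```python
def one_hot(value, categories):
    """Return one indicator feature per category; all zeros if the value
    is missing."""
    return {c: int(value == c) for c in categories}

print(one_hot("male", ["male", "female"]))  # {'male': 1, 'female': 0}
print(one_hot(None, ["male", "female"]))    # {'male': 0, 'female': 0}
```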

Learners
Our work uses five popular learners (k-Nearest Neighbor (k-NN), C4.5 decision tree, Random Forest (RF), LR, and Support Vector Machine (SVM)), all of which are available within the Waikato Environment for Knowledge Analysis (WEKA), an open source collection of machine learning algorithms. These classifiers were chosen for their good coverage of several Machine Learning (ML) model families. Performance-wise, the five classifiers are regarded favorably, and they incorporate both ensemble and non-ensemble algorithms [44,45]. In this section, we describe each model and note configuration and hyperparameter changes that differ from the default settings in WEKA.
The k-NN learner [46], also called IBk (Instance Based Learner with parameter k) in WEKA, specifies the number of nearest neighbors to use for classification and implements distance-based comparisons among instances. The performance of k-NN relies on the distance measure, with Euclidean distance being the typical choice. We assigned a value of 5 to k (5-NN), and set the 'distanceWeighting' parameter to 'Weight by 1/distance' in order to use inverse distance weighting for determining class membership [47].
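The inverse-distance vote can be sketched as follows. The neighbor distances and labels are invented, and ties and zero distances are ignored for brevity; this is an illustration of the weighting scheme, not WEKA's implementation:

```python
def knn_vote(neighbors):
    """Predict a class from the k nearest neighbors, weighting each
    neighbor's vote by 1/distance (as with WEKA's 'Weight by 1/distance'
    setting)."""
    weights = {}
    for dist, label in neighbors:
        weights[label] = weights.get(label, 0.0) + 1.0 / dist
    return max(weights, key=weights.get)

# Five hypothetical nearest neighbors as (distance, class) pairs
pred = knn_vote([(0.5, "fraud"), (1.0, "ok"), (2.0, "ok"),
                 (4.0, "ok"), (0.8, "fraud")])
print(pred)  # fraud -- two close fraud neighbors outweigh three distant ok ones
```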
C4.5 decision tree [48] uses a divide-and-conquer approach to split the data at each node based on the feature with the most information. Node attributes are automatically chosen by maximizing information gain and minimizing entropy. Entropy is a measure of the uncertainty of attributes, with information gain being the means to find the most informative attribute. Features that are most valuable are located near the root node, and the leaf nodes contain the classification results. The J48 decision tree is the standard implementation within WEKA. We set the J48 parameters to 'Laplace Smoothing' and 'no pruning', which can improve results for imbalanced data [47].
RF [49] is an ensemble technique for assembling multiple, unpruned decision trees into a forest. Class membership is calculated by combining the results of the individual trees, usually by majority voting. Through sampling with replacement, RF produces random datasets to build each decision tree. Node features are automatically chosen based on entropy and information gain. In addition, RF uses feature subspace selection to randomly assign i features for each tree. Since RF is a random ensemble technique, data is not likely to be overfitted. With preliminary analysis indicating no difference between 100 and 500 trees, our RF learners were constructed with only 100 trees [47].
Logistic Regression [50] utilizes a sigmoid function to produce values in (0, 1), which translate into class probabilities. A sigmoid function is a special case of the logistic function. LR, unlike linear regression, predicts class membership by means of a separate hypothesis class. We did not change the default setting in WEKA for the 'ridge' parameter, which is the penalized maximum likelihood estimation with a quadratic penalty function (also called L2 regularization) [47].
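The sigmoid mapping can be illustrated in a few lines; the scores are arbitrary examples:

```python
import math

def sigmoid(z):
    """Map a real-valued score to a class probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))            # 0.5 -- the usual decision boundary
print(round(sigmoid(4), 3))  # 0.982 -- strongly positive score
```

A score of 0 sits exactly at the 0.50 decision boundary, while increasingly positive or negative scores push the probability toward 1 or 0, respectively.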
The Support Vector Machine learner [51] assumes that class instances are linearly separable and uses hyperplanes to separate them. The hyperplane maximizes the distance between the two classes. SVM uses regularization to prevent overfitting via the complexity parameter 'c'. In WEKA, we set the complexity parameter 'c' to 5.0. The 'buildLogisticModels' parameter, which allows probability estimates to be returned, was set to true [47].

Methodologies
This section is a report on our research methodologies, including reasons for choosing them. We discuss the performance metric, model evaluation, machine learning framework, Random Undersampling, addressing randomness, and experiment design.

Performance metric
Accuracy is often obtained from a simple 0.50 threshold that is incorporated into a formula for predicting one of the two binary classes. For most real-world situations, however, the two classes are imbalanced, leading to a majority and minority class grouping. The Confusion Matrix (CM) for a binary classification problem is depicted in Table 4 [20], where Positive, the class of interest, is the minority class and Negative is the majority class. Based on these four fundamental CM metrics, other performance metrics that consider the rates between the positive and the negative class are derived as follows:
• True Positive Rate (TP rate), also known as Recall or Sensitivity, is equal to TP/(TP + FN).
• True Negative Rate (TN rate), also known as Specificity, is equal to TN/(TN + FP).
• False Positive Rate (FP rate), also known as the false alarm rate, is equal to FP/(FP + TN).
• Positive Predictive Value (PPV), also known as Precision, is equal to TP/(TP + FP).
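These four rates can be computed directly from the confusion-matrix counts; the counts below are invented for illustration:

```python
def cm_rates(tp, fn, tn, fp):
    """Derive the four rate metrics from confusion-matrix counts."""
    return {
        "TPR": tp / (tp + fn),  # Recall / Sensitivity
        "TNR": tn / (tn + fp),  # Specificity
        "FPR": fp / (fp + tn),  # false alarm rate
        "PPV": tp / (tp + fp),  # Precision
    }

# Hypothetical fraud-detection outcome: 40 frauds caught, 10 missed,
# 900 legitimate providers cleared, 50 falsely flagged
rates = cm_rates(tp=40, fn=10, tn=900, fp=50)
print(rates["TPR"])           # 0.8
print(round(rates["PPV"], 4)) # 0.4444
```

Note that TN rate and FP rate are complements: a specificity of about 0.947 here implies a false alarm rate of about 0.053.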
The AUC metric calculates the area under the Receiver Operating Characteristic (ROC) curve, which graphically shows TP rate versus FP rate for various classification cut-offs. AUC represents the behavior of a classifier across all thresholds of the ROC curve and is a popular metric that mitigates the negative effects of class imbalance [47]. A model whose predictions are 100% correct has an AUC of 1, a model whose predictions are 100% incorrect has an AUC of 0, and a model that guesses at random has an expected AUC of 0.5.
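AUC can equivalently be computed as the probability that a randomly chosen positive instance is ranked above a randomly chosen negative one, with ties counting half. A minimal sketch with invented scores:

```python
def auc(scores_pos, scores_neg):
    """AUC as the probability that a positive instance outranks a negative
    one (ties count half) -- equivalent to the area under the ROC curve."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

print(auc([0.9, 0.8], [0.1, 0.2]))  # 1.0 -- perfect ranking
print(auc([0.1, 0.2], [0.9, 0.8]))  # 0.0 -- perfectly wrong
print(auc([0.7, 0.3], [0.7, 0.3]))  # 0.5 -- no better than chance
```

This pairwise formulation makes clear why AUC is threshold-free: it depends only on how the classifier ranks positives relative to negatives.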

Model evaluation
A popular evaluation method in ML is train-test, in which one dataset trains the model while a separate dataset tests the model, with all instances in the test dataset completely new [9]. The train-test method determines whether, based on past occurrences, a model can accurately predict new occurrences. For our study, the train-test method indicates whether, based on previous information (year < 2016), physicians can be classified as fraudulent or non-fraudulent given new information (year = 2016).

Machine learning framework
All learners in our study were implemented within WEKA [16], an open source framework of machine learning techniques issued under the GNU General Public License. Written in Java, this framework is used for various types of machine learning tasks, such as data preparation, classification, and regression. The graphical user interfaces of WEKA contribute to its ease of use, with the software being widely used by ML researchers, industrial scientists, and students.

Random Undersampling
RUS is beneficial for imbalanced big data, as removing instances decreases computational burden and build time [52,53]. When applying RUS, the aim is to strike a balance between discarding the maximum number of majority instances and incurring the least information loss. For our research, we selected the following majority-to-minority class ratios: 50:50, 65:35, 75:25, 90:10, and 99:1. These ratios were chosen because they collectively provide good distribution coverage, ranging from the balanced ratio of 50:50 to the highly imbalanced ratio of 99:1. In addition, we included the full datasets as the baseline, where RUS is not applied. For the train-test method, only the training datasets were sampled. The test datasets were only used for model evaluation, and therefore were not sampled.
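A sketch of RUS to a requested class ratio, assuming the majority and minority instances are held in separate lists (the sizes below are invented; this is an illustration, not WEKA's sampler):

```python
import random

def random_undersample(majority, minority, maj_ratio, min_ratio, seed=None):
    """Discard majority instances at random until the dataset reaches the
    requested majority-to-minority class ratio (e.g., 90:10)."""
    rng = random.Random(seed)
    keep = int(len(minority) * maj_ratio / min_ratio)
    sampled = rng.sample(majority, min(keep, len(majority)))
    return sampled, minority

# Hypothetical severely imbalanced training data: 99,000 vs. 100 instances
maj, mino = random_undersample(list(range(99_000)), list(range(100)), 90, 10)
print(len(maj), len(mino))  # 900 100
```

Only the majority class is sampled; the minority instances are always kept, and (as in the paper) the test data would be left untouched.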

Addressing randomness
Since a sample of instances is often relatively small compared to its respective original dataset, the randomization process in RUS may result in information loss, thus impacting classification performance [54]. This also means that the classification outcome could differ each time RUS is carried out, creating splits that may be deemed favorable, fair, or unlucky to the learner. Splits viewed as favorable may retain very good or clean instances that improve learner performance, but could potentially overfit the model. On the other hand, unlucky splits may retain noisy instances that weaken classification performance.
It is worth noting that some ML algorithms, such as RF, have an inherent randomness within their implementation. Furthermore, the random shuffling of instances performed before the start of each training process may cause other algorithms, such as LR, to produce different results if the order of instances is altered.
The use of repetitive methods is a proven technique for reducing the potential negative effects of randomness [55]. To address randomness during our sampling and model building stages, we performed ten repetitions per built model and selected the average of each set of repetitions.
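The repetition-and-average procedure can be sketched as below; evaluate_once is a hypothetical stand-in for one full sample-train-test run, and the AUC values it returns are invented:

```python
import random
from statistics import mean

def evaluate_once(seed):
    """Stand-in for one sample-train-test run; in the actual experiments
    this would apply RUS, train a learner, and return its test AUC."""
    rng = random.Random(seed)
    return 0.80 + rng.uniform(-0.02, 0.02)  # hypothetical AUC

# Ten repetitions per built model, with the average reported
aucs = [evaluate_once(seed) for seed in range(10)]
print(round(mean(aucs), 4))
```

Averaging over repetitions smooths out favorable or unlucky sampling splits, as well as any randomness internal to the learner itself.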

Experiment design
This subsection highlights the main points of the "Methodologies" section. To mitigate the adverse effect of high class imbalance in the full training datasets, RUS was applied. Using RUS, we obtained the following class ratios: 50:50, 65:35, 75:25, 90:10, and 99:1. The AUC metric was selected to evaluate classifier performance as it helps correct the distortion of results due to class imbalance [47]. Within the WEKA framework, model prediction was evaluated against the 2016 test set. The evaluation process was repeated ten times per training dataset, with average results reported. Table 5 shows the mean AUC values of the five learners for the Part D, DMEPOS, and Combined datasets, with values ranked in descending order for each dataset. Individual rows are distinguished by their specific combination of sampled class ratio and year-grouping. The term "None_Full" indicates that the full dataset, with no sampling, was used as training data.

Results and discussion
For Part D, the highest value (0.8167) corresponds to LR with the 2015 year-grouping at a 99:1 class ratio. The lowest value (0.7567) is associated with the C4.5 decision tree (Table 5). Box plots are shown in Figs. 2, 3, and 4, which represent LR for Part D, RF for DMEPOS, and LR for Combined, respectively. A box plot depicts the median (50th percentile) as a thick line, two hinges (25th and 75th percentiles), two whiskers, and outlying points. With regard to Fig. 2, at the 99:1 ratio, the 2015 box does not overlap with those of the other year-groupings for LR. This indicates that the difference between year-grouping 2015 and the other year-groupings is significant. An analysis of Fig. 3 indicates that at the 75:25 ratio, the 2014-2015 box does not overlap with those of the other year-groupings for RF, which translates into a significant difference between year-grouping 2014-2015 and the rest of the year-groupings. In Fig. 4, there is no overlap between the short, vertical line representing 2015 and the lines for the other year-groupings, indicating that the difference between year-grouping 2015 and the other year-groupings is significant.

Fig. 2 Logistic Regression box plots (Part D)
Figures 5, 6, and 7 show Tukey's Honestly Significant Difference (HSD) plots representing LR for Part D, RF for DMEPOS, and LR for Combined, respectively. Each vertical bar represents the AUC score of a group (year-grouping or distribution ratio) for a specific learner. A Tukey's HSD [56] test determines the group factors that are significant. For our experiments, we use a 5% significance level. Letter groups assigned by the test denote similarity or significant differences in results within each group or factor, with an 'a' representing the top group. In summary, the top models are as follows: LR with the 2015 year-grouping at a 99:1 class ratio (Part D); RF with the 2014-2015 year-grouping at a 75:25 class ratio (DMEPOS); and LR with the full 2015 year-grouping (Combined). It is important to note that the largest training dataset (2013-2015) did not feature among our top choices. We attribute this outcome to the likelihood that the 2013 data is outdated. In addition, the top choice of model for Part D, DMEPOS, and Combined is not the same. Although Part D and Combined share the same LR learner and 2015 year-grouping for their top choice, experimental results show that RUS is needed to produce the best results for the former, while the full dataset yields the best results for the latter. This disparity in top choices hints that for fraud detection purposes, the Part D, DMEPOS, and Combined datasets are sub-domains within the main domain of Medicare fraud detection. Each sub-domain would have a distinct data distribution, and thus require a model and RUS distribution that are also unique.

Conclusion
The regular updating of machine learning models is necessary because their original data distributions tend to change over time. These temporal changes are often detrimental to predictive effectiveness. In this paper, we analyze the impact of incorporating training data from several year-groupings on an existing predictive model. Based on our results, we determined that the following models yield the top results: for Part D, LR with the 2015 year-grouping at a 99:1 class ratio; for DMEPOS, RF with the 2014-2015 year-grouping at a 75:25 class ratio; and for Combined, LR with the full 2015 year-grouping. Notably, the largest year-grouping of training data (2013-2015) did not produce the highest AUC values, which signals that the 2013 data may be outdated. In addition, because the top choice of predictive model differs across Part D, DMEPOS, and Combined, these three datasets, for the purposes of Medicare fraud detection, may be sub-domains.
Future work will examine the effect of using learners, class ratios and performance metrics that are different from those utilized in this study, and also investigate the impact of sourcing big data from different application domains.