Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods

Abstract

In the context of high-dimensional credit card fraud data, researchers and practitioners commonly utilize feature selection techniques to enhance the performance of fraud detection models. This study presents a comparison of model performance using the most important features selected by SHAP (SHapley Additive exPlanations) values and by the model's built-in feature importance list. Both methods rank features and choose the most significant ones for model assessment. To evaluate the effectiveness of these feature selection techniques, classification models are built using five classifiers: XGBoost, Decision Tree, CatBoost, Extremely Randomized Trees, and Random Forest. The Area under the Precision-Recall Curve (AUPRC) serves as the evaluation metric. All experiments are executed on the Kaggle Credit Card Fraud Detection Dataset. The experimental outcomes and statistical tests indicate that feature selection methods based on importance values outperform those based on SHAP values across classifiers and various feature subset sizes. For models trained on larger datasets, we recommend using the model's built-in feature importance list as the primary feature selection method rather than SHAP. The rationale is that computing SHAP feature importance requires a separate, additional step, whereas models naturally provide built-in feature importance as part of the training process, requiring no extra effort. Consequently, opting for the model's built-in feature importance list can offer a more efficient and practical approach for larger datasets and more intricate models.

Introduction

Detecting credit card fraud is crucial within the finance industry and relies heavily on the information stored in transaction datasets. However, data quality poses a significant research challenge for both finance and machine learning, as it directly influences decisions made during modeling and analysis [1, 2]. To tackle this issue, we delve into the available feature space and extract a pertinent set of features. This underscores the importance of feature selection as an essential data cleansing step before engaging in any modeling process. Feature selection has found application in various contexts within data mining and machine learning, with the goal of removing irrelevant or redundant features from the analysis. This not only results in expedited model training but also enhances classifier performance.

This study delves into a comparison between two feature selection methods: SHapley Additive exPlanations (SHAP)-value-based selection [3] and commonly used importance-based selection [4, 5]. SHAP leverages game theory concepts to compute feature importance in two steps: a classification model is first trained using all features, and SHAP values are then computed for each feature and ranked to identify the most significant features for modeling the target problem. Importance-based selection, on the other hand, computes feature importance for all features during the model training process. Both are embedded methods, since feature ranking is tied to the model-building process. In our feature selection process, we utilize five learners: Extreme Gradient Boosting (XGBoost) [6], Decision Tree (DT) [7], CatBoost [8], Extremely Randomized Trees (ET) [9], and Random Forest (RF) [10]. These five learners were selected because they generate an importance ranking list during the model-building process. LightGBM [11] was not included because our preliminary results indicated poor performance relative to the other learners. We designate the SHAP-value-based methods as SHAP-XGBoost, SHAP-DT, SHAP-CatBoost, SHAP-ET, and SHAP-RF, and refer to the importance-based methods simply as XGBoost, DT, CatBoost, ET, and RF. In total, there are 10 feature selection methods, five from each category.

To conduct our study, we focus on the Credit Card Fraud Detection Dataset, a set of anonymized financial transactions available on Kaggle [12]. This dataset is the only large, publicly available dataset for credit card fraud analysis; hence, the scope of the study is limited to one dataset. Of its 284,807 transactions and 30 independent features, only 492 (0.172%) records are labeled fraudulent. We assess the performance of five pairs of classification models built with the two feature selection techniques (SHAP-XGBoost vs. XGBoost, SHAP-DT vs. DT, SHAP-CatBoost vs. CatBoost, SHAP-ET vs. ET, and SHAP-RF vs. RF), each using its respective selected features. The top 3, 5, 7, 10, and 15 features are selected based on their respective scores. For classification, we build credit card fraud detection models using the five classifiers, the same learners used in feature selection. The classifiers are evaluated using the Area Under the Precision-Recall Curve (AUPRC) metric [13], and we additionally perform a statistical test with a significance level of \(\alpha =0.01\) to assess the statistical significance of our results.

To the best of our knowledge, this study is the first comprehensive empirical investigation comparing the performance of SHAP-value-based feature selection and importance-based feature selection in the context of fraud detection and potentially other application domains in machine learning.

The remainder of the paper is organized as follows. We begin with an overview of related work, which highlights the novelty of the research presented here. We then present the methodology used in the experiments, including descriptions of the two feature selection methods, the classifiers, cross-validation, and the performance metric. We next describe the dataset, experimental design, and experimental results. Finally, we conclude the article with the key highlights of this study and offer suggestions for future work.

Related work

Feature selection is a widely used technique in various data mining and machine learning applications. Its primary objective is to identify a subset of features that minimizes prediction errors for classifiers. In this study, we conducted a comprehensive literature review of research that employs either SHapley Additive exPlanations (SHAP) values or the model’s built-in feature importance list for feature selection. While we found a limited number of studies that utilized the model’s built-in feature importance list for feature selection in the context of the Credit Card Fraud Detection Dataset, we did not come across any studies that used SHAP for feature selection specifically in credit card fraud detection. Instead, we found a few studies that applied SHAP for feature selection in other application domains. Moreover, we did not encounter any studies that directly compared the performance of models built with features selected by SHAP feature importance versus models built with features selected by built-in feature importance. Therefore, our study presents a unique contribution to the field of credit card fraud detection, as it explores the comparison between SHAP and the model’s built-in feature importance list for feature selection, a perspective that has not been extensively explored in the existing literature.

Rtayli and Enneya [14] applied a supervised feature selection method, Random Forest, to identify the most predictive features. Random Forest (RF) is an ensemble learning algorithm that is trained in parallel through bagging [15]. Recently, RF has been increasingly exploited as a feature selection method because it can handle complex, high-dimensional datasets and can detect interactions between features. It also reduces the risk of overfitting, which occurs when a model is too complex and fits the training data too closely. Moreover, RF calculates feature importance by measuring the decrease in node impurity when a feature is used for a split: the more the impurity decreases, the more important the feature is considered. By ranking features based on their importance, RF can help select the most relevant features for the classification task. After selecting a feature subset from the Credit Card Fraud Detection Dataset, the authors ran a Support Vector Machine to find fraudulent transactions. The model achieved an Accuracy of 95.12%, a Sensitivity of 87%, and an AUC of 0.91, outperforming three other models (Isolation Forest, Decision Tree, and Local Outlier Factor). The study does not provide clear information regarding the number of selected features, and the authors did not compare the performance of the selected features against using all available features. Furthermore, it is worth noting that the use of AUC as a metric for classifying imbalanced data has been found to be misleading [16].

In their study using the Credit Card Fraud Detection Dataset [12], Rosley et al. [17] first filtered out records with a z-score greater than or equal to 3 and then normalized the remaining data using min-max scaling. They then used Boruta to compute an importance score for each feature. Boruta [18] is a supervised feature selection algorithm designed as a wrapper around a Random Forest classifier to identify important features in a dataset. For each iteration, they kept the features with an importance score of 0.5 or higher to train an Autoencoder. The model detected credit card fraud by defining a threshold on the reconstruction error to flag transactions as legitimate or fraudulent. However, the number of features selected in the preprocessing step was not specified by the authors. The authors evaluated the models using Accuracy, Precision, Recall, and F1 score; for datasets that exhibit significant class imbalance, these may not be suitable metrics due to the overwhelming size of the majority class.

Waspada et al. [4] used the RF classifier to calculate the importance score of each feature and discarded features with low importance scores; the paper lists the importance scores of all features. The authors analyzed several factors (dataset split ratio, the selection of the top k features, the amount of fraud data in the training data, and the hyperparameter settings) that influence the performance of the Isolation Forest (IF) model in detecting fraud in credit card transactions. Isolation Forest is a popular unsupervised outlier detection method. Their findings indicate that the best results are obtained with a training-testing ratio of 60:40, the top five features (\(V_{14}, V_{4}, V_{17}, V_{12}, V_{11}\)), only 60% of the fraud data, and hyperparameters set to 100 trees, a maximum of 128 samples, and a contamination of 0.001. The model shows impressive results, obtaining a precision of 80.7143%, a recall of 76.3514%, an F1 score of 78.4722%, an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.97371, and an Area under the Precision-Recall Curve (AUPRC) of 0.759228. Waspada et al. utilized only a single importance-based feature selection method and did not incorporate SHAP for feature selection, which we have implemented in our study.

In their study, Liu et al. [19] utilized SHAP for feature selection on the UCI Parkinson's disease medical dataset [20]. They combined SHAP values with four classifiers: Deep Forest (gcForest), Extreme Gradient Boosting (XGBoost), Light Gradient Boosting Machine (LightGBM), and Random Forest (RF). Each classifier was used to calculate the SHAP values of individual features. To assess the effectiveness of SHAP feature selection, they compared it with three filter-based feature selection methods: F-score, analysis of variance (ANOVA-F), and Mutual Information. The experiments were conducted with a training and testing ratio of 70:30, and feature selection was applied to the training dataset. The results showed that the gcForest model based on SHAP-value feature selection achieved an impressive classification Accuracy of 91.78% and an F1-score of 0.945 with 150 features selected, surpassing the other feature selection methods considered in their study. While the authors employed SHAP-value-based feature selection on the training dataset only, we utilized the SHAP method across the entire dataset and subsequently conducted cross-validation after the feature selection procedure.

Marcilio and Eler [21] employed the SHAP method as a feature selection technique and compared it against three widely used feature selection methods: Mutual Information, Recursive Feature Elimination, and ANOVA. Their SHAP process utilized XGBoost as the underlying model. They conducted experiments on five UCI datasets using the XGBoost classifier and three other UCI datasets using the XGBoost regressor. The results of their study revealed that SHAP outperformed the three commonly used methods in terms of the Area Under the Receiver Operating Characteristic Curve (AUC) metric, although SHAP required more computational time than the other feature selection methods. It is worth noting that the datasets used in Marcilio and Eler's experiments are not highly imbalanced, are not in the credit card fraud domain, and are significantly smaller than the Kaggle Credit Card Fraud Detection Dataset, which drew our attention to this gap.

In our review of the literature, we discovered that only a single method of feature selection, either based on SHAP values or importance, was employed. Notably, no research has been identified that compares these two methods, particularly within the domain of credit card fraud detection. In order to fill this gap, our study undertook a comparative analysis of these two feature selection methods, employing five learners in each approach.

Methodology

Importance-based feature selection methods

Importance-based feature selection methods leverage decision trees to identify relevant features from a given dataset. These decision tree-based classifiers, such as Extreme Gradient Boosting (XGBoost) [6, 22], Extremely Randomized Trees (ET) [9], Random Forest (RF) [23], CatBoost [8], and Decision Tree [7], possess a built-in capability to determine feature importance during model fitting in supervised machine learning. Consequently, they can rank features based on their significance in classification tasks, making them valuable for feature selection. By discarding less relevant features and retaining the most important ones, more efficient and accurate models can be created.

In this study, five importance-based feature selection methods were employed: XGBoost [22], Decision Tree (DT) [7], CatBoost [8], Extremely Randomized Trees (ET) [9], and Random Forest (RF) [10].

XGBoost and CatBoost stand out as widely used gradient boosting algorithms, each employing distinct approaches to compute feature importance scores. While both algorithms construct ensembles of decision trees, their methodologies for deriving feature importance scores vary. In XGBoost, these scores are calculated using the “gain” method, evaluating the influence of each feature on model performance throughout the boosting process. In contrast, CatBoost’s ensemble of decision trees calculates feature importance based on the frequency of a feature being utilized for splitting and the subsequent improvement in model performance achieved through those splits.

A Decision Tree classifier is a type of machine learning algorithm used for classification tasks. It constructs a tree-like model of decisions and their potential outcomes by recursively splitting the data based on the most informative features at each node. Decision trees generate feature importance scores by evaluating each feature's ability to reduce Gini impurity (or increase purity) within the data as the tree is built.

Extremely Randomized Trees and Random Forest, both rooted in decision tree ensembles, share common principles like Gini impurity and the Mean Decrease in Impurity to gauge feature importance. However, Extremely Randomized Trees introduce heightened randomness in the decision-making process during tree construction. This added stochasticity can result in divergent importance scores, potentially impacting the balance between model bias and variance.
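To make the ranking step concrete, the following minimal sketch illustrates how built-in importance scores can be extracted and sorted with scikit-learn-style estimators and XGBoost. The estimator settings, the helper function name, and the use of the "gain" importance type are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal sketch (illustrative, not the exact experimental configuration):
# fit a tree-based learner and rank features by its built-in importance scores.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

def rank_by_builtin_importance(model, X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Fit the model and return features sorted by built-in importance (descending)."""
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# Example usage with two of the five learners (hyperparameters are placeholders):
# rf_ranking  = rank_by_builtin_importance(RandomForestClassifier(random_state=42), X, y)
# xgb_ranking = rank_by_builtin_importance(XGBClassifier(importance_type="gain", random_state=42), X, y)
# top_10_features = rf_ranking.head(10).index.tolist()
```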

SHAP-value-based feature selection methods

Shapley Additive exPlanation (SHAP), introduced by Lundberg and Lee [3], has gained popularity as a method for interpreting machine learning model predictions. By utilizing Game Theory techniques [24], SHAP provides insights into the contribution of each feature to specific predictions. It falls under a family of additive feature attribution techniques that remain model-agnostic, making them universally applicable to various machine learning and deep learning models. These techniques attribute significance to individual input features, facilitating better understanding of model behavior.

In the context of feature selection, SHAP-based methods work as follows: a classification model, such as XGBoost or Decision Tree in this study, is trained on the entire dataset. SHAP values are then computed for each instance and aggregated across the dataset to obtain an average absolute value for each feature. This per-instance computation is what makes SHAP-based feature selection computationally expensive. The average SHAP value indicates the typical impact of each feature on model predictions across the entire dataset, while taking the absolute value captures the feature's importance irrespective of the direction (positive or negative) of its effect. Sorting features by their average absolute SHAP values in descending order identifies the features with the greatest influence on the model's predictions.
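As an illustration of this aggregation, the sketch below computes mean absolute SHAP values with the shap library's TreeExplainer and sorts features accordingly. The handling of per-class output (some library versions return one SHAP array per class for certain tree models) and the helper name are assumptions for this sketch.

```python
# Minimal sketch (illustrative): SHAP-value-based feature ranking for a fitted tree model.
import numpy as np
import pandas as pd
import shap

def rank_by_mean_abs_shap(fitted_model, X: pd.DataFrame) -> pd.Series:
    """Rank features by the mean absolute SHAP value across all instances."""
    explainer = shap.TreeExplainer(fitted_model)
    shap_values = explainer.shap_values(X)
    if isinstance(shap_values, list):      # some models return one array per class;
        shap_values = shap_values[1]       # use the positive (fraud) class
    mean_abs = np.abs(shap_values).mean(axis=0)
    return pd.Series(mean_abs, index=X.columns).sort_values(ascending=False)

# Example usage (the model is first trained on the entire dataset, as described above):
# shap_ranking = rank_by_mean_abs_shap(fitted_xgb_model, X)
# top_10_features = shap_ranking.head(10).index.tolist()
```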

Classification

In this study, credit card fraud detection models were built with five different classifiers, namely XGBoost [6], Decision Tree (DT) [7], CatBoost [8], Extremely Randomized Trees (ET) [9], and Random Forest (RF) [10]. Among these five learners, XGBoost, CatBoost, ET, and RF are ensembles of Decision Tree-based classifiers [25]. We selected these learners because they are highly effective for dealing with complex, high-dimensional data and are known for their excellent performance in a wide range of classification tasks [25].

XGBoost and CatBoost are both gradient boosting frameworks that are widely used for machine learning tasks, particularly classification. Both algorithms are known to be highly effective and to produce accurate predictions, although performance may vary depending on the specific dataset and problem at hand. XGBoost is an advanced refinement of the Gradient Boosted Decision Tree (GBDT) ensemble method; GBDTs were initially introduced by Friedman in 2001 [26]. XGBoost enhances GBDTs in multiple ways. First, it employs an improved loss function during training that includes an additional regularization term, effectively preventing overfitting. Second, XGBoost introduces an "approximate algorithm" for calculating splits in the constituent decision trees, which is well suited to distributed environments and cases where the entire dataset cannot fit into main memory. Moreover, XGBoost incorporates a specialized algorithm for handling sparse data, where most values are nearly constant with occasional aberrations; the "sparsity aware split finding" feature enables XGBoost to exploit sparse data efficiently. CatBoost, on the other hand, is known for its robustness in handling categorical features and missing values, making it suitable for datasets with such characteristics. CatBoost's core algorithm is Ordered Boosting, which involves sorting the instances used by Decision Trees. In contrast, XGBoost relies on a weighted quantile sketch and a sparsity-aware split-finding function. A weighted quantile sketch is an approximate tree learning [27] technique that is utilized for merging and pruning operations, while sparsity handling deals with values that are either zero or missing.

Breiman introduced the concept of Bagging to machine learning in a 1996 paper [28]. Because our research revolves around binary classification, our focus is on Breiman's ideas about Bagging as applied to binary classification. Extremely Randomized Trees (ET) and Random Forest (RF) are both ensemble learning algorithms that belong to the bagging family of decision tree-based methods. Random Forest, introduced by Breiman [10], builds upon the Bagging principle with an added improvement: each tree is constructed using a random subset of features and samples. This randomness helps decorrelate the trees and reduce overfitting. Extremely Randomized Trees extends the concept of Random Forest by selecting values for Decision Tree splits at random, potentially making it more robust and computationally efficient in some scenarios. The choice between the two often depends on the specific characteristics of the data and the desired trade-off between bias and variance. We omit further details about these learners and refer readers to [25].

Decision Tree (DT) is a widely used supervised machine learning algorithm, prominently applied to classification and regression tasks. It is a non-linear model that recursively partitions input data into subsets based on feature values. Each node in the decision tree represents a decision based on a specific feature and threshold, facilitating predictions based on the input data’s feature values. The resulting decision tree structure is highly interpretable, with each internal node representing a feature-based decision, edges signifying outcomes, and leaf nodes providing predictions.

To ensure the reproducibility of our results, we modified specific hyperparameter settings from their default values as listed in Table 1. Furthermore, we set random number generator seeds for all classifiers to ensure consistent and repeatable outcomes. All other settings were left at their default values. The determination of tree depths was guided by previous experimentation documented in [1], aiming to achieve a suitable trade-off between capturing complex patterns in the data and mitigating overfitting.

Table 1 Hyperparameter settings used in experiments

Performance metric

To assess the effectiveness of feature selection techniques, we constructed classification models subsequent to the feature selection process. The evaluation of these models in this study was based on the Area under the Precision-Recall Curve (AUPRC) metric.

In a two-class classification problem, such as distinguishing fraud (positive) and normal (negative) instances, we encounter four potential prediction outcomes: true positive (correctly classified positive instances), false positive (negative instance mistakenly classified as positive), true negative (correctly classified negative instances), and false negative (positive instance mistakenly classified as negative).

AUPRC represents the area under the Precision-Recall curve, which illustrates the trade-off between Recall (True Positive Rate) and Precision for specific classification thresholds. The definition of precision is

$$\begin{aligned} \frac{true~positives}{true~positives + false~positives} \end{aligned}$$
(1)

and the Recall or True Positive Rate is defined as

$$\begin{aligned} \frac{true~positives}{true~positives + false~negatives} \end{aligned}$$
(2)

To calculate AUPRC, we plot precision against recall for many classification thresholds and then determine the area under the curve. A higher AUPRC value indicates superior model performance. AUPRC ranges from a minimum of zero to a maximum of one.
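For reference, a minimal sketch of how AUPRC can be computed from predicted fraud probabilities with scikit-learn is shown below; the variable names are placeholders, and average_precision_score is included only as a closely related step-wise summary of the same curve.

```python
# Minimal sketch (illustrative): AUPRC from predicted probabilities.
from sklearn.metrics import auc, average_precision_score, precision_recall_curve

def auprc(y_true, y_scores) -> float:
    """Area under the Precision-Recall curve (trapezoidal rule)."""
    precision, recall, _ = precision_recall_curve(y_true, y_scores)
    return auc(recall, precision)

# Example usage with a fitted classifier `clf` and a held-out test split:
# y_scores = clf.predict_proba(X_test)[:, 1]        # probability of the fraud class
# fold_auprc = auprc(y_test, y_scores)
# ap = average_precision_score(y_test, y_scores)    # related alternative estimate
```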

Cross-validation

Cross-validation is a technique that allows machine learning models to be trained and tested without using the same data for both [29]. The process involves dividing the dataset into a predetermined number of subsets, or folds, in a relatively balanced manner. In this study, we utilized five-fold cross-validation, where each fold in turn served as the test data while the remaining four folds were designated as the training data. To minimize any potential bias arising from a fortuitous or unfavorable split, we conducted ten independent runs of the five-fold cross-validation.

It is important to note, for reproducibility, that the feature selection process was conducted separately from the cross-validation step. In other words, the feature selection procedures were performed on the original dataset.
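A minimal sketch of this evaluation loop is given below. It assumes stratified folds (the text above states only that the folds are relatively balanced) and uses scikit-learn's RepeatedStratifiedKFold together with the auprc helper sketched earlier; the seed and choice of splitter are assumptions.

```python
# Minimal sketch (illustrative): ten independent runs of five-fold cross-validation,
# scoring each fold with AUPRC. Feature selection has already been applied to X.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import RepeatedStratifiedKFold

def repeated_cv_auprc(model, X, y, n_splits=5, n_repeats=10, seed=42):
    """Return one AUPRC score per fold per repeat (n_splits * n_repeats values)."""
    scores = []
    rskf = RepeatedStratifiedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    for train_idx, test_idx in rskf.split(X, y):
        clf = clone(model)
        clf.fit(X.iloc[train_idx], y.iloc[train_idx])
        y_scores = clf.predict_proba(X.iloc[test_idx])[:, 1]
        scores.append(auprc(y.iloc[test_idx], y_scores))   # auprc() as sketched earlier
    return np.array(scores)
```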

Experiments

Dataset

The experiments conducted in this study utilized the Credit Card Fraud Detection Dataset, which is available for download from the Kaggle website [12]. This dataset consists of anonymized financial transactions, specifically credit card transactions conducted by European cardholders over a two-day period in September 2013. As stated previously, out of a total of 284,807 transactions, 492 of them are fraudulent transactions, resulting in an imbalanced dataset with only 0.172% of transactions being fraudulent, while the rest are considered normal or non-fraudulent transactions.

The Credit Card Fraud Detection Dataset has 30 numerical input features, of which \(V_{1}, V_{2},..., V_{28}\) have undergone numerical transformation using Principal Component Analysis (PCA) for data analysis and feature reduction purposes. The "Time" and "Amount" features were not transformed: the "Time" feature denotes the time in seconds since the first transaction, while the "Amount" feature represents the amount of the credit card transaction. The "Time" feature was excluded from the analysis because it is a unique feature that a model can memorize, which could compromise the reliability of the results. As a result, 29 input features are available for further experimentation. Prior to being input to the classifiers for training or classification, the features were normalized to the [0, 1] range. The class feature distinguishes between legitimate and fraudulent transactions: a value of 1 represents a fraudulent transaction, while a value of 0 signifies a normal transaction.
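A minimal sketch of this preparation is shown below, assuming the standard Kaggle CSV layout (creditcard.csv with columns Time, V1–V28, Amount, and Class); the file path and the use of MinMaxScaler are assumptions consistent with the description above.

```python
# Minimal sketch (illustrative): load the Kaggle data, drop "Time",
# and scale the remaining 29 input features to the [0, 1] range.
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.read_csv("creditcard.csv")            # assumed local path to the Kaggle file
y = df["Class"]                               # 1 = fraudulent, 0 = normal
X = df.drop(columns=["Class", "Time"])        # "Time" excluded, as explained above

X = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns, index=X.index)

print(X.shape)      # expected: (284807, 29)
print(y.mean())     # expected: about 0.00172 (0.172% fraudulent)
```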

Experimental design

In our experiments, we investigated two different feature selection techniques, SHAP-value-based feature selection and importance-based feature selection methods. To assess the efficacy of a feature selection method, we constructed classification models utilizing the subset of features chosen by the feature selection approach. Classification models were built with five classifiers, XGBoost, Decision Tree (DT), CatBoost, Extremely Randomized Trees (ET), and Random Forest (RF).

We conducted our experiments on a distributed computing platform consisting of nodes equipped with 16-core Intel Xeon CPUs, 256 GB RAM per CPU, and Nvidia V100 GPUs. All training and testing programs were implemented using the Python programming language. SHAP is publicly available as an open source library for the Python programming language [30]. In addition to the SHAP values for feature importance, this library also supplies several tools for visualizing SHAP feature importance values. The Python data science stack [31] was employed for experiment implementations.

First, we ranked the features using ten feature selection methods (SHAP-XGBoost, XGBoost, SHAP-DT, DT, SHAP-CatBoost, CatBoost, SHAP-ET, ET, SHAP-RF, and RF) separately. Following feature ranking, we chose the top 3, 5, 7, 10, and 15 features, including the class attribute, to construct the final training datasets. Subsequently, we applied classifiers to these training datasets, ensuring that the classifier used in the model-building process remained consistent with the one employed in feature selection. We used AUPRC to evaluate the performance of the classification models. For each feature selection method and classifier, we have a total of 5 (feature subset sizes) \(\times \) 10 (runs) \(\times \) 5 (folds) = 250 AUPRC scores.
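Putting these pieces together, the compact sketch below illustrates the experimental loop; the learner settings and the helper functions (rank_by_builtin_importance, rank_by_mean_abs_shap, and repeated_cv_auprc, sketched in earlier sections) are illustrative, not the authors' actual code.

```python
# Minimal sketch (illustrative): evaluate each feature selection method and subset
# size with the matching classifier, giving 5 x 10 x 5 = 250 AUPRC scores per
# (method, classifier) combination across the five subset sizes.
from catboost import CatBoostClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

learners = {
    "XGBoost": XGBClassifier(random_state=42),
    "DT": DecisionTreeClassifier(random_state=42),
    "CatBoost": CatBoostClassifier(random_state=42, verbose=0),
    "ET": ExtraTreesClassifier(random_state=42),
    "RF": RandomForestClassifier(random_state=42),
}

results = {}
for name, learner in learners.items():
    rankings = {
        name: rank_by_builtin_importance(learner, X, y),              # importance-based
        f"SHAP-{name}": rank_by_mean_abs_shap(learner.fit(X, y), X),  # SHAP-value-based
    }
    for method, ranking in rankings.items():
        for k in (3, 5, 7, 10, 15):
            top_k = ranking.head(k).index
            results[(method, k)] = repeated_cv_auprc(learner, X[top_k], y)
```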

Results and discussion

As mentioned earlier, we introduced ten feature selection methods: two feature selection techniques combined with five classifiers. We present the feature importance lists obtained from each method, focusing on the top 15 most important features. The importance is determined either by SHAP values (for SHAP-XGBoost, SHAP-DT, SHAP-CatBoost, SHAP-ET, and SHAP-RF) or by built-in importance scores (for XGBoost, DT, CatBoost, ET, and RF). Tables 2, 3, 4, 5, 6 display the feature rankings, where rank 1 corresponds to the highest SHAP value or importance score. It is important to note that SHAP values may vary when different trained models are utilized. Notably, feature \(V_{14}\) ranked among the top three features for all ten feature selection methods, and feature \(V_{4}\) consistently ranked within the top 15 across all feature selection methods.

Table 2 Features selected by SHAP-XGBoost and XGBoost; the features are listed in order of their importance values from top to bottom
Table 3 Features selected by SHAP-DT and DT; the features are listed in order of their importance values from top to bottom
Table 4 Features selected by SHAP-CatBoost and CatBoost; the features are listed in order of their importance values from top to bottom
Table 5 Features selected by SHAP-ET and ET; the features are listed in order of their importance values from top to bottom
Table 6 Features selected by SHAP-RF and RF; the features are listed in order of their importance values from top to bottom

The classification performance results in terms of AUPRC are shown in Tables 7, 8, 9, 10, 11. The reported values represent averages across ten rounds of five-fold cross-validation. The results were obtained by creating new datasets using the 3, 5, 7, 10, and 15 highest-ranked features along with the class attribute to form the final training data. We conducted statistical z-tests [32] on pairs of models (same classifier but different feature selection methods), where each pair consists of two models built with the n most important features, selected either by SHAP values or by the model's built-in feature importance list. The value of n ranges from 3 to 15. The null hypothesis is that there is no significant difference between the mean AUPRC scores of the two models. In Tables 7, 8, the Winner column indicates whether the SHAP or built-in feature selection method has a higher mean AUPRC value based on the outcome of a z-test with a significance level of \(\alpha =0.01\). If the difference in means is not significant, we report a tie.
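To make the testing procedure concrete, a minimal sketch of an unpaired two-sample z-test over two sets of fold-level AUPRC scores is shown below; the unpooled-variance form and the helper name are assumptions, since the exact z-test variant used is not stated.

```python
# Minimal sketch (illustrative): two-sample z-test comparing mean AUPRC scores
# of a SHAP-based and an importance-based model at significance level alpha = 0.01.
import numpy as np
from scipy.stats import norm

def z_test(scores_a, scores_b, alpha=0.01):
    """Return (z statistic, two-sided p-value, reject-H0 flag) for equal means."""
    a, b = np.asarray(scores_a), np.asarray(scores_b)
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)  # unpooled standard error
    z = (a.mean() - b.mean()) / se
    p = 2 * norm.sf(abs(z))
    return z, p, p < alpha

# Example usage with the results dictionary from the earlier sketch:
# z, p, significant = z_test(results[("SHAP-XGBoost", 10)], results[("XGBoost", 10)])
```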

Table 7 shows a tie for XGBoost models built on feature subset sizes of 5, 7, and 15. However, for feature subset size 3, the p-value is less than the significance level of 0.01, indicating a significant difference in the AUPRC scores; therefore, XGBoost outperforms SHAP-XGBoost for this feature subset size. On the other hand, for feature subset size 10, SHAP-XGBoost outperforms XGBoost.

Table 7 Comparison of SHAP and XGBoost feature selection methods in terms of their AUPRC scores
Table 8 Comparison of SHAP and DT feature selection methods in terms of their AUPRC scores
Table 9 Comparison of SHAP and CatBoost feature selection methods in terms of their AUPRC scores
Table 10 Comparison of SHAP and ET feature selection methods in terms of their AUPRC scores
Table 11 Comparison of SHAP and RF feature selection methods in terms of their AUPRC scores

Table 8 indicates that there is no significant difference in the AUPRC scores between SHAP-DT and DT for any of the feature subset sizes tested (3, 5, 7, 10, and 15). As a result, we cannot declare a winner between the two feature selection methods based on the AUPRC scores. Tables 10 and 11 show similar results to Table 8. These results suggest that, for the given dataset and evaluation metric, there is no consistently superior performer between the SHAP-value-based methods and the importance-value-based Decision Tree, Extremely Randomized Trees, and Random Forest methods across the different feature subset sizes.

Table 9 presents a comparison between the SHAP-CatBoost and CatBoost feature selection methods in terms of their AUPRC scores for different feature subset sizes. In summary, for feature subset sizes 3 through 10, CatBoost consistently outperforms SHAP-CatBoost in terms of AUPRC, and the differences are statistically significant with p-values of 0.0000. However, for a feature subset size of 15, there is no statistically significant difference between the two methods, resulting in a tie.

In general, the performance of the two feature selection methods is comparable across various scenarios. However, there are specific instances, such as with certain XGBoost and CatBoost models, where distinctions arise. Notably, XGBoost demonstrates superior performance over SHAP-XGBoost when the feature subset size is 3, while CatBoost outperforms SHAP-CatBoost for feature sizes 3, 5, 7, and 10. Moreover, SHAP-XGBoost surpasses XGBoost when the feature subset size is 10.

An analysis of variance (ANOVA) [33] was performed on the AUPRC performance metric, and the results are reported in Table 12. Three factors, Size, Classifier, and Technique, were considered in the analysis. The Size factor included feature subset sizes of 3, 5, 7, 10, and 15; the Classifier factor included the five classifiers; and the Technique factor included the two feature selection approaches, SHAP-value-based (represented as SHAP) and importance-value-based (represented as Importance). The statistical test used a significance level of \(\alpha =1\%\). The ANOVA results indicate that there were significant differences among the groups of each main factor in terms of the AUPRC metric, as all Pr(>F) or p-values in the last column of the table were less than the cutoff of 0.01.

Since the ANOVA test results revealed that all factors had a significant impact on AUPRC scores, we conducted Tukey's Honestly Significant Difference (HSD) tests [34] to rank the levels of the Size, Classifier, and Technique factors by their impact on AUPRC scores. Groups are labeled alphabetically, with group 'a' having the highest AUPRC scores; items in the same performance group are not statistically significantly different from one another. The HSD test results are presented in Tables 13, 14, 15.
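For reference, a minimal sketch of such an analysis with statsmodels is shown below, assuming the fold-level results are arranged in a long-format DataFrame scores_df with columns AUPRC, Size, Classifier, and Technique (the column names and the main-effects-only model are assumptions).

```python
# Minimal sketch (illustrative): three-factor ANOVA on AUPRC followed by a
# Tukey HSD grouping for one factor, using a long-format DataFrame `scores_df`.
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.multicomp import pairwise_tukeyhsd

model = ols("AUPRC ~ C(Size) + C(Classifier) + C(Technique)", data=scores_df).fit()
print(sm.stats.anova_lm(model, typ=2))   # main-effects ANOVA table

# Tukey HSD at alpha = 0.01, e.g. for the Technique factor (repeat for Size and Classifier):
print(pairwise_tukeyhsd(scores_df["AUPRC"], scores_df["Technique"], alpha=0.01))
```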

Based on the HSD tests, it is evident that feature selection with a subset size of 15 or 10 yields superior performance in AUPRC compared to smaller subset sizes. This suggests that constructing models with a feature subset size of 15 or 10 is advantageous: the reduced feature set leads to faster model training times and improved outcomes. Among the five classifiers, RF demonstrated the highest AUPRC, followed by XGBoost and ET, while DT showed relatively poorer performance. Table 15 indicates that the importance-value-based feature selection method significantly outperforms the SHAP-value-based feature selection method across all feature subset sizes and learners.

As mentioned earlier, SHAP is an external tool, and the computational time for SHAP feature selection depends on several factors, including the model's complexity, the number of features, the dataset size, and the number of instances for which SHAP values must be computed. The cost of computing SHAP values is generally higher than that of the built-in feature importance provided by decision-tree-based classifiers. Therefore, we conclude that using built-in feature importance to select feature subsets may be more suitable for models with a large number of features and a large dataset.

Table 12 ANOVA for Size, Classifier and Technique as factors of performance in terms of AUPRC
Table 13 HSD test groupings after ANOVA of AUPRC for the Size factor
Table 14 HSD test groupings after ANOVA of AUPRC for the Classifier factor
Table 15 HSD test groupings after ANOVA of AUPRC for the Technique factor

Conclusion

The challenge of dealing with high dimensionality in machine learning significantly affects the evaluation of model performance. This study specifically concentrates on the comparison of two feature selection techniques: identifying the most crucial features through SHAP values and relying on the model’s intrinsic feature importance list. Using the Credit Card Fraud Detection Dataset, we generate multiple training datasets. We employ five classifiers with distinct feature subset sizes, applying both feature selection methods to each classifier. Our results indicate that, on the whole, feature selection methods based on importance values outperform those based on SHAP values across the classifiers used in this study and various feature subset sizes.

However, notable variations arise with XGBoost models: XGBoost surpasses SHAP-XGBoost for a feature subset size of 3, while SHAP-XGBoost outperforms XGBoost for a feature subset size of 10. In the case of CatBoost, CatBoost outperforms SHAP-CatBoost for feature subset sizes smaller than 15. It is important to note that calculating SHAP feature importance introduces an additional step in the experimental methodology. According to our findings, the return on investment for implementing SHAP may be relatively low, particularly for large datasets where built-in feature selection methods are already available. Additionally, the considerable computational expense associated with SHAP may render it impractical for handling Big Data. For future research, we plan to explore these two feature selection methods across diverse application domains.

Availability of data and materials

Not applicable.

References

  1. Hancock JT, Khoshgoftaar TM, Johnson JM. A comparative approach to threshold optimization for classifying imbalanced data. In: The International Conference on Collaboration and Internet Computing (CIC), Atlanta, GA, USA, 2022. pp. 135–142. IEEE.

  2. Wang H, Liang Q, Hancock JT, Khoshgoftaar TM. Enhancing credit card fraud detection through a novel ensemble feature selection technique. In: 2023 IEEE International Conference on Information Reuse and Integration (IRI), Bellevue, WA, USA, 2023. pp. 121–126.

  3. Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst. 2017;30.

  4. Waspada I, Bahtiar N, Wirawan PW, Awa BDA. Performance analysis of isolation forest algorithm in fraud detection of credit card transactions. Khazanah Informatika Jurnal. 2022.

  5. Wang H, Hancock JT, Khoshgoftaar TM. Improving medicare fraud detection through big data size reduction techniques. In: 2023 IEEE International Conference on Service-Oriented System Engineering (SOSE), Athens, Greece; 2023. pp. 208–217.

  6. Chen T, Guestrin C. Xgboost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining—KDD ’16. 2016.

  7. Breiman L. Classification and regression trees. 2017.

  8. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. Catboost: unbiased boosting with categorical features. Adv Neural Inf Process Syst. 2018;31.

  9. Geurts P, Ernst D, Wehenkel L. Extremely randomized trees. Mach Learn. 2006;63(1):3–42.

  10. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

  11. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y. Lightgbm: a highly efficient gradient boosting decision tree. Adv Neural Inf Process Syst. 2017;30:3146–54.

  12. Kaggle: Credit card fraud detection. https://www.kaggle.com/mlg-ulb/creditcardfraud. 2018.

  13. Leevy JL, Khoshgoftaar TM, Hancock JT. Evaluating performance metrics for credit card fraud classification. In: 2022 IEEE 34th International Conference on Tools with Artificial Intelligence (ICTAI), 2022. pp. 1336–1341.

  14. Rtayli N, Enneya N. Selection features and support vector machine for credit card risk identification. Procedia Manuf. 2020;46:941–8.

  15. González S, García S, Ser JD, Rokach L, Herrera F. A practical tutorial on bagging and boosting based ensembles for machine learning: algorithms, software tools, performance study, practical perspectives and opportunities. Inf Fusion. 2020;64:205–37.

  16. Hancock JT, Khoshgoftaar TM, Johnson JM. Evaluating classifier performance with highly imbalanced big data. J Big Data. 2023;10(42).

  17. Rosley N, Tong G-K, Ng K-H, Kalid SN, Khor K-C. Autoencoders with reconstruction error and dimensionality reduction for credit card fraud detection. J Syst Manag Sci. 2022;12(6):70–80.

  18. Kursa MB, Rudnicki WR. Feature selection with the Boruta package. J Stat Softw. 2010;36(11):1–13.

  19. Liu Y, Liu Z, Luo X, Zhao H. Diagnosis of Parkinson’s disease based on SHAP value feature selection. Biocybern Biomed Eng. 2022;42(3):856–69.

  20. Sakar CO, Serbes G, Gunduz A, Tunc H, Nizam H, Sakar B, Tütüncu M, Aydin T, Isenkul M, Apaydin H. A comparative analysis of speech signal processing algorithms for Parkinson’s disease classification and the use of the tunable q-factor wavelet transform. Appl Soft Comput. 2019;74:255–63.

  21. Marcilio WE, Eler DM. From explanations to feature selection: assessing SHAP values as feature selection mechanism. In: 2020 33rd SIBGRAPI Conference on Graphics, Patterns and Images (SIBGRAPI), Los Alamitos, CA, USA, 2020. pp. 340–347.

  22. Hancock JT, Khoshgoftaar TM. Gradient boosted decision tree algorithms for Medicare fraud detection. SN Comput Sci. 2021;2(4):268.

  23. Muaz A, Jayabalan M, Thiruchelvam V. A comparison of data sampling techniques for credit card fraud detection. Int J Adv Comput Sci Appl (IJACSA). 2020;11(6):477–85.

  24. Shapley L. A value for n-person games. Contributions to the Theory of Games, 1953. pp. 307–317.

  25. Kushwah JS, Kumar A, Patel S, Soni R, Gawande A, Gupta S. Comparative study of regressor and classifier with decision tree using modern tools. Mater Today Proc. 2022;56(6):3571–6.

  26. Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;29(5):1189–232.

  27. Gupta A, Nagarajan V, Ravi R. Approximation algorithms for optimal decision trees and adaptive tsp problems. Math Oper Res. 2017;42(3):876–96.

  28. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.

  29. Witten IH, Frank E, Hall MA. Data mining: practical machine learning tools and techniques. 2011.

  30. Lundberg S, et al. SHAP. https://github.com/slundberg/shap/tree/v0.41.0. Accessed 9 July 2023.

  31. Oliphant T. Python for scientific computing. Comput Sci Eng. 2007;9(3):10–20.

  32. Jain R. The art of computer systems performance analysis: techniques for experimental design, measurement, simulation, and modeling. 1991.

  33. Iversen GR, Norpoth H. Analysis of Variance, vol. 1. Newbury Park: Sage; 1987.

  34. Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949;99–114.

Acknowledgements

The authors would like to thank the various members of the Data Mining and Machine Learning Laboratory, Florida Atlantic University, for their assistance with the reviews.

Funding

Not applicable.

Author information

Contributions

HJW contributed to the manuscript. QXL and JTH conducted experiments, and JTH contributed to the manuscript. TMK provided oversight of experiments, coordinated research, and contributed to the manuscript.

Corresponding author

Correspondence to Huanjing Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Cite this article

Wang, H., Liang, Q., Hancock, J.T. et al. Feature selection strategies: a comparative analysis of SHAP-value and importance-based methods. J Big Data 11, 44 (2024). https://doi.org/10.1186/s40537-024-00905-w
