A novel sensitivity-based method for feature selection

Sensitivity analysis is a popular feature selection approach employed to identify the important features in a dataset. In sensitivity analysis, each input feature is perturbed one at a time and the response of the machine learning model is examined to determine the feature's rank. However, existing perturbation techniques may lead to inaccurate feature ranking because of their sensitivity to perturbation parameters. This study proposes a novel approach in which the input features are perturbed using a complex step. The implementation of complex-step perturbation in the framework of deep neural networks as a feature selection method is provided in this paper, and its efficacy in determining important features for real-world datasets is demonstrated. Furthermore, filter-based feature selection methods are employed, and their results are compared with those obtained from the proposed method. While the results obtained for the classification task indicated that the proposed method outperformed other feature ranking methods, in the case of the regression task it was found to perform comparably to other feature ranking methods.

feature selection algorithm is integrated into the learning algorithm [5,9,13]. Examples of embedded methods include decision trees, random forests, and support vector machine recursive feature elimination (SVM-RFE). Compared to filter-based approaches, the embedded approach yields higher accuracy because of its interaction with a specific classification model. Comprehensive descriptions and comparisons of these three methods are available in the literature [4,5,14-19].
In hybrid methods, multiple primary feature selection methods are applied consecutively [6]. For instance, Liu et al. [20] proposed a hybrid feature selection method in which mutual information was first applied to identify the relevant features from the feature set, and the wrapper method was then applied to choose the best subset of features from the relevant features. Ensemble feature selection methods use an aggregate of feature subsets of diverse base classifiers [6]. For instance, Hoque et al. [21] proposed an Ensemble Feature Selection-Feature Selection (EMI-FS) in which information gain, gain ratio, ReliefF, symmetric uncertainty, and Chi-square were employed as base filter methods to obtain relevant subsets of features, which were subsequently combined to extract the optimal subset. In the integrative feature selection method, external knowledge is integrated into feature selection [6]. For example, Cindy et al. [7] proposed an integrative gene selection approach in which gene rankings are determined by considering both the statistical significance of a gene in the dataset and the biological background information acquired through research. In this paper, we restrict our scope to the embedded feature selection methods that incorporate feed-forward neural networks/multi-layer perceptrons as the learning models.
Multi-layer Perceptron (MLP) is a basic type of neural network that learns a function $g: \mathbb{R}^q \to \mathbb{R}^m$ by training on a dataset, where $q$ is the number of inputs and $m$ is the number of outputs. MLPs have been employed for feature selection by various researchers in the past. For instance, Setiono and Liu [22] developed a neural network feature selector method based on backward elimination, wherein weights of low magnitude in the network were driven to zero by adding a penalty term to the error function. Sindhwani et al. [23] presented a maximum output information algorithm for feature selection. Liefeng Bo [24] proposed MLP Embedded Feature Selection (MLP-EFS), in which each feature is multiplied by a corresponding scaling factor. By applying a truncated Laplace prior to the scaling factors, feature selection is integrated into MLP-EFS.
In addition to the methods mentioned above, sensitivity analysis of MLPs and support vector machines (SVM) has also been carried out to perform feature selection. For instance, Ruck et al. [25] developed a technique that analyzes the weights in an MLP to determine essential features. Gasca et al. [26] proposed a saliency measure that estimates the input features' relative contribution to the output neurons. Utans et al. [27] proposed 'sensitivity-based pruning (SBP)' to remove irrelevant input features from a nonlinear regression model. Acir et al. [28] implemented the perturbation method in the framework of SVM to perform feature selection for the classification of electrocardiogram (ECG) beats. Sensitivity analysis examines the change in the target output when one of the input features is perturbed, i.e., the first-order derivatives of the target variable with respect to the input features are evaluated. Herein, we refer to this first-order derivative as the feature sensitivity metric. The higher the magnitude of the feature sensitivity metric, the higher the importance of the input feature. In general, the feature sensitivity metric is computed either with the backpropagation algorithm (for MLP) or with finite difference schemes [29-32]. Numerical differentiation techniques such as the finite difference approximation (FDA) (see Eq. 1) and the central finite difference approximation (CFDA) (see Eq. 2) can yield inaccurate derivatives [33,34] when the step size is chosen inappropriately: a large step size increases the truncation error, whereas a very small step size leads to round-off errors, which are referred to as subtractive cancellation errors. To address the choice of step size, Juana et al. [35] introduced an iterative perturbation method that auto-tunes the step size for SVM.
Finite difference approximation (FDA):

$$ g'(x_k) \approx \frac{g(x_1, \ldots, x_k + h, \ldots, x_q) - g(x_1, \ldots, x_k, \ldots, x_q)}{h} \tag{1} $$

Central finite difference approximation (CFDA):

$$ g'(x_k) \approx \frac{g(x_1, \ldots, x_k + h, \ldots, x_q) - g(x_1, \ldots, x_k - h, \ldots, x_q)}{2h} \tag{2} $$

where $g(\cdot)$ is the function mapping the inputs to the output variable, $h$ is the step size, and $g'(\cdot)$ is the approximation of the first partial derivative of $g(\cdot)$ with respect to the input $x_k$. The feature $x_k$ is perturbed in both cases to obtain the first derivative, as seen in Eqs. (1) and (2).

In this paper, a novel complex-step sensitivity analysis-based feature selection method, referred to as CS-FS, is proposed. It incorporates a complex-step perturbation of the input feature to compute the feature sensitivity metric and identify the important features. It evaluates first-order derivatives of analytical quality without the need for extra computations in neural network or SVM machine learning models. A brief overview of the complex-step perturbation approach is provided in section "Overview of complex-step perturbation approach (CSPA)", and its implementation in the framework of feed-forward neural networks (FFNN) to perform feature selection is described in section "Complex-step feature selection method". The details of the datasets are provided in section "Numerical experiments", the efficacy of the proposed method is demonstrated on real-world datasets in section "Results", and the summary and future work are provided in section "Summary and future work".

Overview of complex-step perturbation approach (CSPA)
CSPA, originally referred to as complex-step derivative approximation (CSDA), was proposed by Lyness and Moler [36] to evaluate the first-order derivative of analytic functions. A simplified mathematical derivation for computing the first-order derivative of a scalar function using complex-step perturbation was then provided by Squire and Trapp [37], and is as follows.
Consider a holomorphic function $f(\cdot)$, which is infinitely differentiable. The Taylor series expansion of $f(\cdot)$ evaluated at the complex perturbed point $x_0 + ih$ is expressed as

$$ f(x_0 + ih) = f(x_0) + ih\,f'(x_0) - \frac{h^2}{2!} f''(x_0) - \frac{i h^3}{3!} f'''(x_0) + \cdots \tag{3} $$

where $h$ is the step size and $i^2 = -1$.
By taking the imaginary component of $f(x_0 + ih)$ and truncating the higher-order terms in the Taylor series, the first-order derivative can be expressed as

$$ f'(x_0) = \frac{\mathrm{Imag}\big(f(x_0 + ih)\big)}{h} + O(h^2) \tag{4} $$

where $\mathrm{Imag}(\cdot)$ denotes the imaginary component and $O(h^2)$ is the second-order truncation error. It is evident from Eq. 4 that the first-order derivative evaluated using the CSPA technique is not prone to subtractive cancellation errors (see Eq. 1 and Eq. 2), because no subtraction is involved. Furthermore, because $h$ can be made very small without incurring cancellation, the truncation error $O(h^2)$ can be rendered negligible. A simple example illustrating the accuracy of CSPA over finite difference schemes can be found elsewhere [38,39]. Fields in which CSPA is currently gaining attention for performing sensitivity analysis include aerospace [40-43], computational mechanics [38,39,44], and estimation theory (e.g., second-order Kalman filter) [45].
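The behaviour described above can be checked numerically. The following is a minimal sketch, not taken from the paper: the test function `f` and the evaluation point `x0 = 1.0` are arbitrary choices made here for illustration. It compares the absolute error of the forward difference (Eq. 1), the central difference (Eq. 2), and the complex-step estimate (Eq. 4) as the step size shrinks.

```python
import cmath
import math

def f(x):
    # Illustrative analytic test function (an assumption of this sketch, not from the paper).
    return cmath.exp(x) * cmath.sin(x)

def exact_derivative(x):
    # Closed-form derivative of exp(x) * sin(x), used as the reference value.
    return math.exp(x) * (math.sin(x) + math.cos(x))

def forward_difference(func, x0, h):
    # Eq. 1: subtracts two nearly equal numbers when h is small.
    return (func(x0 + h) - func(x0)).real / h

def central_difference(func, x0, h):
    # Eq. 2: also relies on a subtraction of nearby function values.
    return (func(x0 + h) - func(x0 - h)).real / (2.0 * h)

def complex_step(func, x0, h):
    # Eq. 4: only the imaginary part is read off, so no subtraction occurs.
    return func(complex(x0, h)).imag / h

x0 = 1.0
truth = exact_derivative(x0)
errors = {}
for h in (1e-4, 1e-8, 1e-12, 1e-16, 1e-20):
    errors[h] = (
        abs(forward_difference(f, x0, h) - truth),
        abs(central_difference(f, x0, h) - truth),
        abs(complex_step(f, x0, h) - truth),
    )
    print(f"h={h:.0e}  FDA={errors[h][0]:.2e}  "
          f"CFDA={errors[h][1]:.2e}  CSPA={errors[h][2]:.2e}")
```

In double precision, `x0 + 1e-20` is indistinguishable from `x0`, so both finite difference estimates collapse to zero at that step size, whereas the complex-step estimate remains accurate.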

Complex-step feature selection method
In the proposed method, we implement a complex-step perturbation in the framework of feed-forward neural networks (FFNN) to illustrate the task of feature selection. Note that this could be extended to other ML models, such as SVM, whose decision function is holomorphic. The higher the change in the magnitude of the output variable $y \in \mathbb{R}$ of the FFNN with respect to the input feature $x_k \in \mathbb{R}$, the higher the importance of the feature $x_k$. For a multivariate function, the extended form of CSPA can be expressed as

$$ g'(x_k) \approx \frac{\mathrm{Imag}\big(g(x_1, \ldots, x_k + ih, \ldots, x_q)\big)}{h} \tag{5} $$

where $g(\cdot)$ is the function mapping the input features to the output target variable and $g'(\cdot)$ is the first-order derivative approximation of $g(\cdot)$ with respect to the $k$th input feature $x_k$.
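The multivariate form in Eq. 5 amounts to perturbing one input at a time along the imaginary axis while keeping the others real. A minimal sketch follows; the helper name `complex_step_partials`, the step size `h = 1e-20`, and the example function `g` are assumptions introduced here for illustration only.

```python
import numpy as np

def complex_step_partials(g, x, h=1e-20):
    """Approximate dg/dx_k for every k by perturbing one feature at a time with ih."""
    x = np.asarray(x, dtype=float)
    partials = np.empty(x.size)
    for k in range(x.size):
        x_perturbed = x.astype(complex)
        x_perturbed[k] += 1j * h  # x_k -> x_k + ih, all other features stay real
        partials[k] = np.imag(g(x_perturbed)) / h
    return partials

def g(x):
    # Hypothetical smooth mapping R^3 -> R, used only to exercise the helper.
    return x[0] ** 2 + np.sin(x[1]) * x[2]

x = np.array([1.0, 0.5, 2.0])
partials = complex_step_partials(g, x)
print(partials)
```

For this particular `g`, the analytic partial derivatives are $2x_1$, $x_3\cos(x_2)$, and $\sin(x_2)$, which provides a direct check of the estimate.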

Feature selection for regression using complex-step sensitivity
The proposed feature selection method for the regression task involves four steps (see Fig. 1). In the first step, an FFNN is configured and trained for a given dataset. Configuring the FFNN is a trial-and-error process that involves finding the appropriate number of neurons and hidden layers in a network. A neural network is said to be configured when it is capable of learning a mathematical mapping between the input features and the associated target variable such that it could be generalized to the unseen data instances.
In the second step, one of the input features, $x_k$, is chosen at a time and is perturbed with an imaginary step $ih$. The feedforward operation is then performed with the perturbed feature on the trained FFNN, and the results in the output layer are obtained. In the third step, the imaginary components of the output neurons' results are extracted for each perturbed feature and are divided by the step size $h$ (see Eq. 5), i.e., the first-order derivative of the target output with respect to the input feature is evaluated. Note that steps 2 and 3 are repeated for all instances in the dataset, and the average absolute magnitude of the first-order derivative of the target output with respect to the input feature is evaluated. For example, if $y$ is the target output variable and $x_{jk}$ is the $k$th feature in the $j$th observation that is complex-step perturbed by $ih$, then the first-order derivative of the target output with respect to the input feature, averaged over all instances of the dataset, is expressed as (see Eq. 6)

$$ \overline{\left|\frac{\partial y}{\partial x_k}\right|} = \frac{1}{N} \sum_{j=1}^{N} \left| \frac{\mathrm{Imag}\big(y(x_{j1}, \ldots, x_{jk} + ih, \ldots, x_{jq})\big)}{h} \right| \tag{6} $$

where $N$ denotes the number of instances in the dataset, $k = 1, \ldots, q$ indicates the input feature, and $j$ represents the observation number in the dataset.

In the fourth and final step, the rank of each input feature is determined based on the magnitude of the first-order derivatives evaluated, as shown in Eq. 6. The feature with a higher magnitude of the first-order derivative is assigned a higher rank, and vice versa. Note that for training the feed-forward neural network, a backpropagation algorithm in conjunction with the Levenberg-Marquardt optimization technique is employed in this study [46].
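Steps two to four can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the network weights are random stand-ins for a trained FFNN, the hidden activation is tanh (chosen here because it is holomorphic, whereas the study uses ReLU), and the names `mlp_forward` and `cs_feature_sensitivity` are hypothetical.

```python
import numpy as np

def mlp_forward(X, weights, biases):
    """Forward pass of a small FFNN: tanh hidden layers, linear output (assumed here)."""
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(a @ W + b)
    return a @ weights[-1] + biases[-1]

def cs_feature_sensitivity(X, weights, biases, h=1e-20):
    """Mean absolute dy/dx_k over all N instances, perturbing one feature at a time."""
    n_instances, q = X.shape
    sensitivity = np.empty(q)
    for k in range(q):
        X_perturbed = X.astype(complex)
        X_perturbed[:, k] += 1j * h                      # step 2: x_jk -> x_jk + ih
        y = mlp_forward(X_perturbed, weights, biases)    # forward pass only, shape (N, 1)
        derivative = np.imag(y[:, 0]) / h                # step 3: Imag(y) / h per instance
        sensitivity[k] = np.mean(np.abs(derivative))     # average absolute magnitude
    return sensitivity

# Random stand-in for a trained network with q = 4 inputs and one hidden layer.
rng = np.random.default_rng(0)
q = 4
weights = [rng.normal(size=(q, 5)), rng.normal(size=(5, 1))]
biases = [rng.normal(size=5), rng.normal(size=1)]
weights[0][2, :] = 0.0  # feature index 2 is disconnected, so it should rank last
X = rng.normal(size=(50, q))

sensitivity = cs_feature_sensitivity(X, weights, biases)
ranking = np.argsort(-sensitivity)  # step 4: most sensitive feature first
print(sensitivity, ranking)
```

Because the feature at index 2 has no connection to the hidden layer in this toy network, its sensitivity is zero and it is placed last in the ranking. Only forward passes are used; no gradients are backpropagated.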

Feature selection for classification using complex-step sensitivity
Fig. 1 Steps involved in the complex-step sensitivity for regression task

Unlike regression, a modification to step 3 (i.e., evaluating the first-order derivative of the target output with respect to the perturbed input feature) is needed in the proposed method when feature selection is performed on the classification task. The need for modification could be attributed to two reasons: (1) discrete output in the output layer and (2) multiple first-order derivatives yielded by the feed-forward neural network output layer (SoftMax layer) (see Fig. 2). Considering the fact that the inputs fed to the SoftMax activation neurons in the output layer are not discrete, the first-order derivatives of such inputs could still be evaluated. These first-order derivatives provide information about the importance of the input features. If $z_r$ represents the net function of the $r$th neuron in the SoftMax layer, then the first-order derivative of the net function $z_r$ with respect to the $k$th feature $x_k$ is expressed as (see Eq. 7)

$$ \frac{\partial z_r}{\partial x_k} \approx \frac{\mathrm{Imag}\big(z_r(x_1, \ldots, x_k + ih, \ldots, x_q)\big)}{h} \tag{7} $$

where $r = 1, \ldots, m$ and $m$ indicates the number of class labels. To quantify the change in the target output with respect to the $k$th input feature $x_k$, the average of the first-order derivatives obtained for all neurons in the output layer is determined. This average magnitude is referred to as the saliency ($S_k$) of the $k$th input feature [25] and is expressed as (see Eq. 8):

$$ S_k = \frac{1}{m} \sum_{r=1}^{m} \left| \frac{\partial z_r}{\partial x_k} \right| \tag{8} $$

where $r$ denotes the neuron in the SoftMax output layer, $m$ represents the number of class labels, and $z_r$ represents the net function of the $r$th neuron in the SoftMax layer. The rank of each input feature is then determined based on the magnitude of the first-order derivatives for each perturbed feature $x_k$, as shown in Eq. 8.

Fig. 2 Steps involved in the complex-step sensitivity for the classification task
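The classification variant differs only in where the derivative is read: at the net inputs of the SoftMax layer rather than at its discrete output. The sketch below makes several assumptions that are not stated in the text: random weights stand in for a trained network, tanh replaces the ReLU activation used in the study, the saliency is additionally averaged over all instances (carried over from the regression procedure), and the names `softmax_net_inputs` and `cs_saliency` are hypothetical.

```python
import numpy as np

def softmax_net_inputs(X, weights, biases):
    """Net inputs to the SoftMax layer, i.e. the values before SoftMax is applied."""
    a = X
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.tanh(a @ W + b)
    return a @ weights[-1] + biases[-1]  # shape (N, m), one column per class label

def cs_saliency(X, weights, biases, h=1e-20):
    """Saliency of each feature: mean |d net_r / d x_k| over output neurons and instances."""
    n_instances, q = X.shape
    saliency = np.empty(q)
    for k in range(q):
        X_perturbed = X.astype(complex)
        X_perturbed[:, k] += 1j * h  # perturb only the k-th feature with ih
        net = softmax_net_inputs(X_perturbed, weights, biases)
        derivatives = np.imag(net) / h            # shape (N, m)
        saliency[k] = np.mean(np.abs(derivatives))
    return saliency

# Random stand-in for a trained classifier: q = 5 features, m = 3 class labels.
rng = np.random.default_rng(1)
q, hidden, m = 5, 6, 3
weights = [rng.normal(size=(q, hidden)), rng.normal(size=(hidden, m))]
biases = [rng.normal(size=hidden), rng.normal(size=m)]
weights[0][4, :] = 0.0  # feature index 4 is disconnected by construction
X = rng.normal(size=(40, q))

saliency = cs_saliency(X, weights, biases)
ranking = np.argsort(-saliency)  # most salient feature first
print(saliency, ranking)
```

Reading the derivative before the SoftMax nonlinearity keeps the quantity continuous, which is what makes the ranking possible despite the discrete class output.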

Numerical experiments
In this section, numerical experiments are performed to demonstrate the effectiveness of the proposed method.

Datasets
Three real-world datasets each for regression and classification problems are employed to demonstrate the proposed method's efficacy. The datasets are obtained from the UCI open-source data repository [47]. For regression problems, the body fat percentage dataset, abalone dataset, and wine quality dataset are chosen, and for the classification task, a vehicle dataset, segmentation dataset, and breast cancer dataset are chosen. One of the main reasons for choosing these datasets is that they are commonly adopted in the feature selection literature. In addition, the results obtained from some of the chosen datasets, such as body fat percentage, wine quality, and segmentation, are easily interpretable and aid in verifying the proposed method. While most of the chosen datasets have descriptive features that are continuous in nature, the proposed method can be extended to datasets consisting of discrete input features. The descriptive features and target variables for each dataset are as follows.

Configuring feed-forward neural networks
Feed-forward neural networks (FFNN) with three hidden layers (HL) are configured to train on the regression and classification datasets. A configuration of 20, 10, and 5 neurons in the first, second, and third hidden layers, respectively, is employed to train on the regression datasets, whereas a configuration of 60, 40, and 20 neurons is employed to train on the classification datasets. A Rectified Linear Unit (ReLU) nonlinear function is used as the activation function for all the configurations [53]. Note that different architectures and model parameters yield different results if a suitable configuration is not adopted. In this study, various trial configurations of increased complexity (i.e., more hidden neurons and hidden layers) were examined before choosing a suitable configuration. Herein, the suitable configuration refers to the model architecture for which no further improvement in performance was observed with an increase in the complexity of the architecture. For training, validating, and testing the chosen configurations, the datasets are randomly partitioned in a 70:15:15 ratio, respectively. Note that in the case of the classification task, the partition ratio is maintained consistently for each class label, i.e., 70:15:15 of training, validation, and testing data is chosen from each class label. To ensure that the chosen configurations yield repeatable results, the training operation is performed 100 times with the same partition ratio but with the instances randomly reselected in every iteration. The performance metrics, namely mean squared error (MSE) and accuracy, are evaluated for the regression and classification datasets, respectively, for the chosen configurations. The average MSE for the body fat percentage, abalone, and wine quality datasets is determined to be 20.41, 4.6, and 0.53, respectively. The average accuracy for the vehicle, segmentation, and breast cancer datasets is determined to be 75%, 80%, and 90%, respectively.
The addition of more hidden layers, or of more neurons in each hidden layer, to the chosen configurations was found to yield similar MSEs or accuracies, and such configurations are hence not considered in this study.
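The per-class 70:15:15 partition described above can be sketched as follows. This is one possible way to realize it, not the procedure used in the study: the helper `stratified_split`, its rounding rule, and the toy label vector are assumptions made for illustration. Varying `seed` across repetitions would give a different random selection of instances each time.

```python
import numpy as np

def stratified_split(labels, ratios=(0.70, 0.15, 0.15), seed=0):
    """Split instance indices into train/validation/test, keeping the ratio per class label."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    train, val, test = [], [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))  # shuffle this class only
        n_train = int(round(ratios[0] * idx.size))
        n_val = int(round(ratios[1] * idx.size))
        train.extend(idx[:n_train])
        val.extend(idx[n_train:n_train + n_val])
        test.extend(idx[n_train + n_val:])                    # remainder goes to the test set
    return np.array(train), np.array(val), np.array(test)

# Toy label vector with three imbalanced classes (100, 60, and 40 instances).
labels = np.repeat([0, 1, 2], [100, 60, 40])
train_idx, val_idx, test_idx = stratified_split(labels, seed=0)
print(len(train_idx), len(val_idx), len(test_idx))
```

Assigning the remainder of each class to the test set guarantees that every instance is used exactly once, even when the class size is not divisible by the ratio.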

Results
Following the determination of the FFNN configuration, the rank of the features in each dataset is evaluated using the proposed method. Furthermore, other feature ranking methods are also considered in this study for the sake of comparison. The open-source software WEKA is employed for this purpose. While feature ranking methods such as the Pearson correlation coefficient, ReliefF, and mutual information are used for the regression task, symmetric uncertainty, information gain, gain ratio, ReliefF, and chi-square are employed for the classification task. The efficacy of all feature ranking methods is then assessed by evaluating the performance of the FFNN, wherein the size of the input layer is increased by one feature in each succession. In other words, the performance of the FFNN with only the top-most feature is first assessed, and the process is then repeated by including the second most important feature, and so on.
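The incremental assessment described above can be sketched as a loop over nested feature subsets. The sketch is illustrative only: `incremental_evaluation` is a hypothetical helper, the synthetic data are generated here, and `evaluate` fits a least-squares linear model as a cheap stand-in for retraining the FFNN, reporting the training MSE.

```python
import numpy as np

def incremental_evaluation(ranking, evaluate):
    """Score the model on the top-1, top-2, ..., top-q ranked features."""
    return [evaluate(list(ranking[:n])) for n in range(1, len(ranking) + 1)]

# Synthetic regression data in which only features 1 and 3 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 1] + 1.0 * X[:, 3] + 0.01 * rng.normal(size=200)

def evaluate(feature_idx):
    # Stand-in for retraining the FFNN on the selected features:
    # a least-squares linear fit with an intercept, scored by training MSE.
    A = np.column_stack([X[:, feature_idx], np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((A @ coef - y) ** 2))

ranking = [1, 3, 0, 2]  # assumed ranking, most important feature first
mse_curve = incremental_evaluation(ranking, evaluate)
print(mse_curve)
```

With a good ranking, the error curve flattens once the informative features have been included, which is the behaviour used to compare ranking methods.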

Regression
From Table 3, it can be inferred that all four feature ranking methods yielded feature 6 (Abdomen) as the most important feature and feature 10 (Ankle) as the least relevant feature for determining the percentage of body fat. While the top six features determined using the Pearson correlation coefficient, ReliefF, and mutual information methods are noticed to be similar, the proposed method yielded different feature ranks. Furthermore, the MSE for the body fat dataset with each feature's inclusion is evaluated for all four feature ranking methods and is shown in Fig. 3a. From Fig. 3a, it is evident that the overall trend of the MSE of the FFNN decreases with the inclusion of each feature. While the proposed method was found to yield lower MSE with only the seven top-most features, the mutual information method required eleven features to yield lower MSE for the body fat dataset. In other words, the filter-based approach was found to be ineffective at determining a subset of important features that could reduce the MSE. According to the proposed method, the following features are found to be least important, as they do not contribute further to the reduction of MSE: (5) Chest (cm), (7) Hip (cm), (9) Knee (cm), (10) Ankle (cm), (11) Biceps (cm), (12) Forearm (cm).
In the case of the abalone dataset, the least relevant features are determined to be the same by all four feature ranking methods, i.e., feature 1 (female), feature 2 (infant), and feature 3 (male) are identified to be the least relevant (see Table 3). While the ranks of the remaining seven features were found to vary, feature 10 (shell weight) and feature 7 (whole weight) were common in the top four features for all feature ranking methods, including the proposed method. Similar to the body fat dataset, the MSE of the FFNN with the inclusion of each feature is determined for all feature ranking methods and is shown in Fig. 3b. From Fig. 3b, it can be inferred that the trends of ReliefF and the proposed method are similar. Both ReliefF and the proposed method identified feature 5 (diameter), feature 6 (height), feature 7 (whole weight), and feature 10 (shell weight) as the top four features that yield the lowest MSE. In other words, ReliefF was found to be the most effective among the filter-based methods. According to the proposed method, the following features are found to be least important, as they do not contribute further to the reduction of MSE: (1) Female, (2) Infant, (3) Male, (4) Length (gms.), (8) Shucked weight (gms.), (9) Viscera weight (gms.).

Interestingly, in the wine quality dataset, all four feature ranking methods yielded different ranks for the features (see Table 3). However, feature 11 (alcohol) is determined to be one of the top two features by all four feature ranking methods. Furthermore, feature 6 (free sulfur dioxide) is determined to be common among the first four features determined by all feature ranking methods except mutual information. The MSE of the FFNN with each feature's inclusion is determined for all feature ranking methods and is shown in Fig. 3c. The trend obtained in Fig. 3c reveals that all feature ranking methods performed more or less similarly.

Fig. 3 Comparison of the complex-step sensitivity method with other feature selection methods for regression task: (a) body fat, (b) abalone, (c) wine quality

Table 4 Important features identified by various feature selection methods for classification task (ranked in the descending order of their importance)

Summary and future work
A novel complex-step sensitivity analysis-based feature selection method is proposed in this study for regression and classification tasks. The step-by-step process involved in implementing the proposed method in the framework of FFNN is described, and its efficacy on real-world datasets is demonstrated. Three real-world datasets, namely the body fat percentage, abalone, and wine quality datasets, are chosen for the regression task, and three datasets, namely the vehicle, segmentation, and breast cancer datasets, are chosen for the classification task. While the proposed method was found to outperform other popular feature ranking methods for the classification datasets (vehicle, segmentation, and breast cancer), it was found to perform comparably to the other methods in the case of the regression datasets (body fat, abalone, and wine quality). Average MSEs of 20.41, 4.6, and 0.53 were observed for the body fat, abalone, and wine quality datasets, respectively, and average accuracies of 75%, 80%, and 90% were observed for the vehicle, segmentation, and breast cancer datasets, respectively. Furthermore, the most relevant and the irrelevant features are identified for all the employed datasets. It is also important to note that the proposed method possesses the advantage of performing sensitivity analysis through the forward propagation of the FFNN, i.e., no backpropagation is required for evaluating the derivatives.
In future work, the authors intend to extend the proposed method to multiple-output regression problems. In addition, the authors would also like to investigate the influence of different activation functions (e.g., Sigmoid, tanh, Softplus, Leaky ReLU, etc.). Other supervised ML classification algorithms will be employed, and the efficacy of the proposed method with them will be examined. Note that the complete dataset may often not be required for training the FFNN when the size of the dataset is large. Hence, the influence of the number of instances on the determination of the important features will also be studied. Furthermore, the proposed method will be extended to datasets that consist of both discrete and continuous features and that include redundant features.