Novel sensitivity method for evaluating the first derivative of the feed-forward neural network outputs

Evaluating the exact first derivative of a feedforward neural network (FFNN) output with respect to the input feature is pivotal for performing the sensitivity analysis of the trained neural network with respect to the inputs. In this paper, a novel method is presented that computes the analytical quality first derivative of a trained feedforward neural network output with respect to the input features without the need for backpropagation. To this end, the complex step derivative approximation is illustrated, and its implementation in the framework of the feedforward neural network is described. Artificial datasets are generated, and the efficacy of the proposed method for both regression and classification tasks is demonstrated. The results obtained for the regression task indicated that the proposed method is capable of obtaining analytical quality derivatives, and in the case of the classification task, the least relevant features could be identified.

In FFNNs the first derivative (i.e., ∂y ∂x k ) of the output y with respect to the k th input x k is evaluated employing the backpropagation algorithm, which involves the application of derivative chain rule [8][9][10][11]. Application of chain rule in this context is similar to the one employed during the training of an FFNN where ∂E ∂w ij is evaluated for backpropagating the error with respect to the parametric weights w ij of the network [12][13][14][15]. The goal of this work is to evaluate the derivatives of the FFNN outputs with respect to the inputs without the need for backpropagation employing numerical differentiation techniques.
Finite difference schemes are employed for evaluating numerical derivatives [16][17][18]. In finite difference schemes, the input features are perturbed one at a time (e.g. x k ) with a finite step size ( h) and the change in the output of a trained FFNN is obtained. Popularly employed finite difference schemes include finite difference approximation (FDA) (see Eq. 1) and central finite difference approximation (CFDA) (see Eq. 2) methods, which are given as follows.
Finite difference approximation (FDA) Central finite difference (CFDA) where x = (x 1 , x 2 , . . . x k , . . . x q ) ′ ∈ R q×1 are the inputs, q is the number of inputs, f (.) is the function mapping the inputs to the output variable and, f ′ (.) is the first partial derivative approximation of f (.) with respect to the input x k . However, finite difference schemes are prone to subtractive cancellation errors [19,20]. Subtractive cancellation errors are caused by subtracting two close numbers whose difference could be in the order of the precision of the calculations. This scenario is inevitable in the case of finite difference schemes due to the subtractive operation as seen in the numerators of Eq. 1 and Eq. 2 and the use of very low h values to lower the truncation errors [19]. With this, an additional computational step to evaluate the ideal h value to minimize the truncation error without increasing the subtractive cancellation error is necessary when finite difference schemes are evaluated. A novel differentiation scheme is necessary to avoid this additional step and to achieve analytical quality derivatives by minimizing both truncation and subtractive cancellation errors. In this study, a novel method for determining the analytical quality first derivative of feedforward neural network outputs is proposed and implemented. To this end, the concept of complex-step derivative approximation (CSDA) is described, and its ability to circumventing the subtractive cancellation errors associated with other numerical differentiation techniques is illustrated in "Complex-step derivative approximation (CSDA)" section. Implementing CSDA in the framework of FFNN for regression and classification tasks is demonstrated in "Implementation of CSDA in feed-forward neural networks" section, and future areas of improvement are mentioned in "Summary and future work" section. (1)

Complex-step derivative approximation (CSDA)
CSDA is a numerical differentiation technique proposed by Lyness and Moler [21]. CSDA was successfully implemented in various fields of engineering, including aerospace [22][23][24][25], computational mechanics [26][27][28], estimation theory (e.g., secondorder Kalman filter) [29], etc., for performing sensitivity analysis and evaluating the first-order derivatives. In this section, the mathematical description of CSDA to estimate analytical quality first-order derivative of a single scalar variable scalar function is provided [30]. Let f be an analytic function of a complex variable z . Also, assume that f is real on the real axis. Then f has a complex Taylor series expansion which is expressed as where, h is the step size and i 2 = −1 . By taking the imaginary component of f (x + ih) , dividing it by the step size and truncating the higher-order terms in the Taylor series, the CSDA for the first derivative can be expressed as where Imag (*) denotes the imaginary component and O(h 2 ) is the second-order truncation error. It is interesting to note that there are no subtractive operations in Eq. 4, which are inevitable in the finite difference approximations (see Eq. 1 and Eq. 2). The absence of subtractive operations in the numerator ensures that the CSDA is not prone to subtractive cancellation errors. Hence, a very small value of h can be chosen in order to eliminate the truncation errors without the fear of subtractive cancellation errors. A simple example is provided next, which illustrates the accuracy of CSDA over finite difference schemes.

Illustrative example
Consider a smooth function f (x) provided in Eq. 5. The exact first-order derivative of the function computed at x = π 4 is given as 2.65580797029498.
The numerical first-order derivative of the above function is evaluated using all three approximation methods, namely, finite difference approximation (Eq. 1), central finite difference approximation (Eq. 2), and CSDA (Eq. 4). The step size h employed for the purpose of computation ranged from 10 −1 to 10 −16 . The absolute error (ε ) for each step size is then evaluated using Eq. 6, and the results are shown in Fig. 1.
where f ′ (x) is the approximate first derivative at x = π 4 for a chosen step size h and, f ′ (x) is the exact first derivative of function f (x) at x = π 4 .
(3) Fig. 1, it can be noticed that the absolute error decreased initially for both FDA and CFDA with the reduction in the step size. However, for step sizes less than h = 10 −8 for FDA and h = 10 −5 for CFDA, the absolute error was found to increase. The increase in the absolute error after a certain step size can be attributed to the subtractive operation in the numerator of finite difference schemes. On the contrary, in the case of CSDA, the absolute error was not only found to decline with a reduction in step size but approached a double float precision (~ 10 −16 ) with a further decrease in the step size beyond h = 10 −7 . In other words, no subtractive cancellation errors were observed, and hence analytical quality derivatives with errors reaching the precision employed were obtained.

Implementation of CSDA in feed-forward neural networks
Obtaining a closed-form expression in feedforward neural networks (FFNN) is not only challenging but also a tedious task. Nevertheless, CSDA can be implemented in the framework of the feedforward neural network (FFNN) for evaluating the variation of the output variable y ∈ R with respect to the change in an input x k ∈ R , where the subscript k represents the k th input. The extended form of CSDA (see Eq. 4) applied to a multivariate function can be expressed as are the input features, q is the number of input features, f (.) is the function mapping the input features to the output target variable and, f ′ (.) is the first-order derivative approximation of f (.) with respect to the input feature x k . Implementation of CSDA in FFNN involves three steps (see Fig. 2): (1) configure and train the FFNN for a given dataset, (2) perturb the input feature x k one at a time (see Eq. 7) with an imaginary step size of ih ( where h ≪ 10 −8 ) and perform the feedforward operation on the trained FFNN and (3) obtain the output neuron's imaginary component 1.E-13 1.E-10 1.E-07 1.E-04 1.E-01 1.E-16 1.E-13 1.E-10 1.E-07 1.E-04 1.E-01

Normalized Error
Step Size h with respect to the perturbed input and divide this component with the step size (h) . Configuring the FFNN is a trial-and-error process that involves finding the appropriate number of neurons and hidden layers in a network. A network is said to be configured when it is capable of learning an approximate mathematical mapping between the input features and the associated target variable such that it could be generalized to the unseen data instances. Guidelines for choosing trial configurations of FFNN can be found elsewhere [31]. For training the feedforward neural network, the backpropagation algorithm, in conjunction with the Levenberg-Marquardt optimization technique, is employed in this study [32]. Note that the code for implementing the CSDA in FFNN was written and executed in the MATLAB ® environment.

Illustrative example
For illustrating the effectiveness of the CSDA in computing the first order derivative of FFNN, a single variable function (see Eq. 8) commonly employed in CSDA literature is chosen. A single hidden layer with 100 neurons is configured to train the FFNN and the first order derivative is obtained at x = π 4 for step size of h = 10 −15 . Both FDA and CFDA are also employed on the same trained FFNN and first order derivative is obtained for same step size. The results along with the exact solution is provided in Table 1. From the Table 1 it is evident that the proposed methods result in least error (i.e., 2.9e−5) when compared to existing methods FDA (i.e., 0.145) and CFDA (i.e., 2.2e−3).
Furthermore the derivatives are evaluated for all the x values using CSDA, FDA and CFDA and is provided in Fig. 3. Comparison of exact solution and the first order  derivatives evaluated using CSDA, FDA and CFDA. From Fig. 3 it can inferred that the proposed CSDA method predicts the analytical quality derivative that coincides with the exact solution. However in the case of FDA and CFDA the derivatives are found to be inaccurate due to subtractive cancellation errors.
In what follows, the implementation of CSDA is demonstrated for regression and classification tasks using artificial datasets consisting of more than one variable.

Regression
The process of generating artificial datasets (from a known analytic function) for performing regression is described in this subsection. The first-order derivative results are then obtained from the CSDA implemented FFNN (see Eq. 7) and are compared with the exact analytical derivatives of the known function.

Datasets and FFNN Configurations
Three different single scalar-valued functions are employed in this study to generate artificial datasets for the regression task (see Table 2). While the first two functions R 1 , R 2 have 3 input features x 1 , x 2 and x 3 , the third function R 3 is chosen to have 4 input features x 1 , x 2 , x 3 and x 4 , wherein the feature x 4 represents the uniformly distributed random noise added to the function R 2 . Since the added noise x 4 has no significant contribution in evaluating the output of the function R 3 , the mean of the first-order derivative with respect to x 4 computed using CSDA would be expected to be zero. In other words, the purpose of adding noise is to verify the proposed method's ability to identify the least relevant feature. The input features employed in the dataset are real-valued and are independent of each other. In total, 2000 instances are randomly generated for each R 2 : y = sin(πx 1 ) + e x2 + x 2 3 : ∂y ∂x1 = π cos(π x 1 ); ∂y ∂x2 = e x2 ; ∂y ∂x3 = 2x 3 R 3 : y = sin(πx 1 ) + e x2 + x 2 3 + 0.00001x 4 ∂y ∂x4 = 1e − 5 dataset from a uniform distribution of the feature values. The range of the values chosen for each input feature for all three datasets is summarized in Table 3. These randomly generated input features are then substituted in the respective functions R 1 , R 2 andR 3 to obtain the associated target variables y for each dataset.
For obtaining a suitable FFNN configuration for each dataset, numerous trial configurations with varying numbers of neurons and hidden layers were examined beforehand. The trial configuration that resulted in a mean squared error (MSE) less than 1e−6 on the validation dataset is chosen as the suitable configuration for training the datasets. The final configuration of FFNN that was adapted to train dataset 1 is 1st hidden layer (HL) (8 neurons)-2nd HL (5 neurons); dataset 2 is 1st HL (10 neurons)-2nd HL (5 neurons); and dataset 3 is 1st HL (10 neurons)-2nd HL (5 neurons). Note that a soft plus function (see Fig. 4a) ( ln(1 + exp(�)) , where, is the net input function of a neuron) is used as an activation function for all the neurons in the hidden layers. The MSE of trained FFNN associated with dataset 1, dataset 2, and dataset 3 are determined to be 8.2e−7, 5.6e−8, and 4.3e−7, respectively.

Comparison of CSDA-FFNN output and the exact analytical derivative
CSDA is implemented on the trained FFNNs to evaluate the change in the predicted output variable y with respect to the input feature x j where j = 1, 2, and 3 for dataset 1 and dataset 2; and j = 1, 2, 3, and4 for dataset 3. Note that in CSDA implemented FFNN, the predicted output ( y ) is a complex variable. According to Eq. 7, only the imaginary component of y is required for obtaining the first-order derivative. More precisely, if g 1 , g 2 and g 3 indicates the approximate function (mapping x to y ) learned by FFNN for dataset 1, 2 and 3 , respectively, then the first-order derivative of g 1 , g 2 and g 3 with respect to the feature x j are computed as Table 3 Range of input features for generating regression dataset

Function
Range of input features where, q = 3 for dataset 1 and dataset 2, and q = 4 for dataset 3. Since there are 2000 instances in each dataset, the number of first-order derivatives evaluated with respect to each feature x j is also 2000. The comparison between the first-order derivative evaluated (for all 2000 instances) using CSDA implemented FFNN, and the exact analytical derivatives are provided in Figs. 5 and 6. From Fig. 5, it is evident that the derivatives of the approximation function g 1 (for dataset 1) evaluated with respect to features x 1 , x 2 and x 3 using CSDA are in good agreement with the exact analytical derivatives ∂R 1 ∂x 1 , ∂R 1 ∂x 2 and ∂R 1 ∂x 3 , respectively. Among all the data points for features x 1 , x 2 and x 3 , the maximum absolute error ( ε ) (see Eq. 6) was found to occur at x 1 = 1, x 2 = 0.006074 and x 3 = −1 .
Similarly, from Fig. 6 (a)-(c), it is evident that the derivatives of the approximation function g 2 evaluated with respect to x 1 , x 2 , and x 3 using CSDA are also in good agreement with the exact analytical derivatives ∂R 2 ∂x 1 , ∂R 2 ∂x 2 and ∂R 2 ∂x 3 . Among all the data points for features x 1 , x 2 and x 3 , the maximum absolute error ( ε ) was found to occur at x 1 = 0.9855, x 2 = 1 and x 3 = 0.0005 As mentioned earlier, in the case of function R 3 (see Fig. 6d) where the input feature x 4 is least relevant, the first derivative with respect to all values of x 4 are found to be scattered above and below the exact analytical derivative which is zero.

Classification
Unlike the regression task, evaluating the derivatives in the case of the classification task may not be feasible since the output of the FFNN is discrete (e.g., SoftMax activation function outputs). However, considering the fact that the inputs fed to the SoftMax activation neurons in the output layer are not discrete, the first-order derivatives of such inputs could still be evaluated. These first-order derivatives will aid in providing information about the importance of the input features. In this subsection, the process of generating an artificial dataset for demonstrating the implementation of CSDA for classification tasks is described, and its significance in determining the top features is illustrated.

Dataset and CSDA implementation
A binary class artificial dataset with input features x 1 , x 2 and x 3 is generated such that all the instances belonging to class label 1 are enclosed within a cylinder of unit radius, and the rest of the instances belonging to class label 2 are outside the cylinder (see Fig. 7). In total, 1000 instances are generated for each class label. Note that the feature x 3 is randomly chosen from a uniformly distributed noise with a zero mean, which has the least relevance in determining the class label. The purpose of including feature x 3 is to demonstrate that the proposed approach has the ability to identify the least significant features. The parametric equations used to generate the datasets are.
where r 1 ∼ U (0, 1) and θ 1 ∼ U (0, 2π ) where r 2 ∼ U (1, 2) and θ 2 ∼ U (0, 2π ) It is important to note that all three input features are independent of one another. Similar to the regression task, numerous trail configurations with a varying number of neurons and hidden layers were examined beforehand to obtain a suitable FFNN configuration, i.e., a configuration that has prediction accuracy > 98%. The configuration of FFNN that was chosen to train the dataset is 1st HL (8 neurons)-2nd HL (5 neurons). Note that Rectified Linear Unit (ReLU) (see Fig. 4(b)) ( max(0, �) , where is the net input function for a neuron) is used as an activation for all the neurons in the hidden layers and SoftMax function is used as an activation function for the neurons in the output layer.
The first derivative of the two net input functions in the output layer (i.e. o 1 and o 2 ) in FFNN with respect to input features x 1 , x 2 and x 3 are obtained for all the data points using CSDA, and the sum of their absolute values (i.e., the sum of all 2000 data points) are provided in Table 4. Considering that the first derivative (i.e. ∂� om ∂x j ) with respect to each input feature x j represents the proxy measure of its significance, the least relevant feature can be determined. In other words, the input feature that results in the lowest magnitude of the first derivative will be considered as the least relevant feature. From Table 4, it can be observed that the input feature x 3 has the lowest Class Label 1 : x 1 = r 1 cos(θ 1 ); x 2 = r 1 sin(θ 1 ); x 3 ∼ U(0, 0.0001) Class Label 2 : x 1 = r 2 cos(θ 2 ); x 2 = r 2 sin(θ 2 ); x 3 ∼ 0.0001 * U(0, 1) magnitude when compared to features x 1 and x 2 . Therefore feature x 3 can be said to be the least relevant feature. In order to verify if the feature x 3 is irrelevant, the FFNN is trained again with the exclusion of feature x 3 and the confusion matrix is shown in Table 5. From the confusion matrix, it is evident that the exclusion of feature x 3 does not influence the accuracy of classification. Furthermore the precision and recall were also determined i.e., 0.99 and 0.98 respectively, and were noticed to be uninfluenced by the exclusion of feature x 3 .

Summary and future work
In this study, a novel method is proposed to compute the analytical quality first derivative of the FFNN output with respect to the input features. The major drawback of popularly used finite difference approximation schemes is highlighted, and the concept of CSDA is introduced in the context of neural network differentiation. The significance of CSDA in circumventing subtractive cancellation errors is then illustrated using a simple example wherein analytical quality first derivative was obtained along with a normalized relative error approaching the double float precision. A step-by-step procedure involved in extending CSDA to FFNN is provided, and its implementation in regression and classification tasks is demonstrated by employing artificial datasets generated from known functions. FFNN's are configured and trained using the trial-and-error process for all the artificial datasets. The first derivative results that are obtained from the CSDA implemented FFNN output for the regression task was found to be in good comparison with the exact analytical derivatives of the known function. Owing to the discrete output in the classification task, the CSDA was used only to obtain the derivatives of net input function in the output neurons, and the least relevant features are identified. Note that the advantage of the proposed method is that it evaluates the derivative only in one feedforward operation and does not require the evaluation of derivatives used in the backpropagation. Hence, the derivatives can be obtained even after the neural networks are deployed for specific applications.
It is important to note that feed-forward neural network was trained with the artificially generated data to learn these analytical functions so that the first-order derivatives could be verified. Standard regression/classification datasets were not included in this study since the closed form expression of the function that maps the input variables to the target output is unknown in feed-forward neural networks which may make it difficult to verify the analytical quality derivative. However, the authors are currently working on implementing the proposed method on standard datasets. Furthermore, in the future work, the authors intend to extend the proposed method to the multiple output regression problems and investigate the influence of different activation functions. In addition, the derivatives evaluated using the proposed method will be employed to perform feature selection on real datasets and validate using the currently available methods.