Selecting critical features for data classification based on machine learning methods

Feature selection becomes prominent, especially in data sets with many variables and features. It eliminates unimportant variables and improves both the accuracy and the performance of classification. Random Forest has emerged as a useful algorithm that can handle the feature selection problem even with a large number of variables. In this paper, we use three popular datasets with a large number of variables (Bank Marketing, Car Evaluation Database, Human Activity Recognition Using Smartphones) to conduct the experiment. There are four main reasons why feature selection is essential: to simplify the model by reducing the number of parameters, to decrease the training time, to reduce overfitting by enhancing generalization, and to avoid the curse of dimensionality. In addition, we evaluate and compare the accuracy and performance of each classification model: Random Forest (RF), Support Vector Machines (SVM), K-Nearest Neighbors (KNN), and Linear Discriminant Analysis (LDA). The model with the highest accuracy is considered the best classifier. In practice, this paper adopts Random Forest to select the important features for classification. Our experiments present a comparative study of the RF algorithm from different perspectives. Furthermore, we compare the results on each dataset with and without essential feature selection by the RF methods varImp(), Boruta, and Recursive Feature Elimination (RFE) to obtain the best accuracy and kappa. Experimental results demonstrate that Random Forest achieves better performance in all experiment groups.


Important features study
Variable importance analysis with RF has received a lot of attention from many researchers, but there remain some open issues that need a satisfactory answer. For instance, Andy Liaw and Matthew Wiener use RF for classification and regression problems, working in the R language [14]. Other research combines RF and KNN on the HAR dataset using Caret [15]. Moreover, [16] introduces RF methods to diabetic retinopathy (DR) classification analyses. These research results suggest that RF methods could be a valuable tool to diagnose DR and evaluate its progression. Grömping [17] compares two approaches (linear model and random forest) and finds both striking similarities and differences, some of which can be explained whereas others remain a challenge. The investigation improves understanding of the nature of variable importance in RF. RF has been discussed as a robust learner in several domains [18,19]. Feature selection aims at finding the most relevant features of a problem domain. It is beneficial in improving computational speed and prediction accuracy [20]. In [21], a comparative analysis on the Human Activity Recognition (HAR) dataset, based on machine learning methods with different characteristics, is conducted to select the best classifier among the models. This study showed that the RF approach has high precision in each category and is considered the best classifier [22]. Further, a combination of RF, SVM (Support Vector Machine), and tuned SVM regression to improve model performance can be found in [23]. The experiment shows that selecting the best features is essential to improving model performance [24]. Feature selection is useful in all disciplines, for instance in ecology, climate, health, and finance. Table 1 describes the applications of feature selection in detail.
In this experiment, the model-specific importance metric of Random Forest from the R package was used. For each tree, the prediction accuracy on the out-of-bag portion of the data is recorded. Then the same is done after permuting each predictor variable. The difference between the two accuracies is then averaged over all trees and normalized by the standard error. We use the train() function of the caret package to fit the desired model, and then the varImp() function to determine the feature importance by RF.
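The permutation idea behind this importance measure can be illustrated with a minimal Python sketch (the experiments themselves use R; Python is used here only for illustration). It is a simplification under stated assumptions: it measures the accuracy drop of a single fitted model on one data set, rather than averaging over the out-of-bag samples of every tree and normalizing by the standard error. The toy data and the `predict` stand-in are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, seed=0):
    """Accuracy drop observed after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        rng.shuffle(column)  # break the link between feature j and the label
        permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
        importances.append(baseline - accuracy(predict, permuted, labels))
    return importances


# Toy data (hypothetical): the label depends only on feature 0; feature 1 is noise.
rows = [[0, 5], [1, 3], [0, 8], [1, 1], [0, 2], [1, 9], [0, 4], [1, 7]]
labels = [row[0] for row in rows]


def predict(row):
    # Hypothetical stand-in for a fitted model: it reads feature 0 only.
    return row[0]


importances = permutation_importance(predict, rows, labels)
```

Because the stand-in model never reads feature 1, permuting that column cannot change its accuracy, so its importance is exactly zero, while permuting feature 0 can only lower the accuracy.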
Recursive Feature Elimination (RFE) offers an accurate way to identify the prominent variables before we input them into a machine learning algorithm. Guyon et al. [74] proposed RFE and applied it to cancer classification using SVM. RFE first employs all features to build an SVM model. Next, it ranks the contribution of each feature to the SVM model in a ranked feature list. Finally, RFE eliminates the irrelevant features that contribute little to the SVM model. RFE is a powerful algorithm for feature selection, and its result depends on the specific learning model [75,76].
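The elimination loop described above can be sketched in a few lines of Python. This is an illustrative sketch, not the implementation used in the experiments: the `rank_features` callback is a hypothetical placeholder for refitting a model (an SVM in [74]) on the remaining features and returning their importance scores, and the toy scores are invented for the example.

```python
def recursive_feature_elimination(features, rank_features, n_keep):
    """Drop the lowest-ranked feature, refit, and repeat until n_keep remain.

    rank_features(subset) is expected to refit a model on `subset` and return
    a dict that maps each feature name to an importance score.
    """
    remaining = list(features)
    eliminated = []
    while len(remaining) > n_keep:
        scores = rank_features(remaining)
        weakest = min(remaining, key=lambda name: scores[name])
        remaining.remove(weakest)
        eliminated.append(weakest)
    return remaining, eliminated


# Hypothetical fixed scores standing in for a refitted model's importances.
TOY_SCORES = {"x1": 0.9, "x2": 0.6, "x3": 0.5, "x4": 0.1, "x5": 0.2}


def toy_ranker(subset):
    return {name: TOY_SCORES[name] for name in subset}


kept, dropped = recursive_feature_elimination(list(TOY_SCORES), toy_ranker, n_keep=3)
```

With a real learner the scores change after every refit, which is why the ranking is recomputed inside the loop rather than once at the start.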
Boruta is a feature selection and feature ranking algorithm based on the RF algorithm. Its benefits are that it decides the significance of each variable and supports the statistical selection of important variables. We can also control the strictness of the algorithm by adjusting the p value, which defaults to 0.01. maxRuns is the maximum number of times the algorithm is run. The higher the maxRuns, the more selective we get in choosing the variables. The default value is 100. To confirm the feature selection, our experiment follows the Boruta package in the R programming language [77]. This package is a wrapper built around the RF classification algorithm and relies on the RF method to determine significant features. It tries to capture all the interesting and important features in each dataset with respect to an outcome variable. The algorithm performs a top-down search for relevant features by comparing the importance of the original attributes with the importance achievable at random, estimated from permuted copies of the attributes.
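How the p value turns repeated runs into a decision can be sketched as follows. This is a simplified Python illustration, not the Boruta package itself: it assumes that a "hit" is a run in which a feature's importance exceeded the best randomized (shadow) importance, tests the hit count against a Binomial(runs, 0.5) null, and omits the multiple-testing correction and the step-by-step removal of features that the package performs. The function names are hypothetical.

```python
from math import comb


def binomial_tail_ge(hits, runs):
    """P(X >= hits) for X ~ Binomial(runs, 0.5)."""
    return sum(comb(runs, k) for k in range(hits, runs + 1)) / 2 ** runs


def binomial_tail_le(hits, runs):
    """P(X <= hits) for X ~ Binomial(runs, 0.5)."""
    return sum(comb(runs, k) for k in range(0, hits + 1)) / 2 ** runs


def boruta_style_decision(hits, runs, p_value=0.01):
    """Classify a feature from the number of runs in which it beat the shadows."""
    if binomial_tail_ge(hits, runs) < p_value:
        return "confirmed"   # significantly more hits than chance
    if binomial_tail_le(hits, runs) < p_value:
        return "rejected"    # significantly fewer hits than chance
    return "tentative"       # undecided within the available runs


decisions = [boruta_style_decision(h, 20) for h in (20, 10, 0)]
```

A feature that is still undecided when the run limit is reached stays tentative, which is why a larger number of runs leaves fewer tentative attributes.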

Classifiers method
Random Forests (RF) consist of a combination of decision trees. RF improves the classification performance of a single tree classifier by combining the bootstrap aggregating (bagging) method with randomization in the selection of the variables used to partition the data at each node during the construction of a decision tree [78]. A decision tree with M leaves divides the feature space into M regions $R_m$, $1 \le m \le M$. For each tree, the prediction function f(x) is defined as:
$$f(x) = \sum_{m=1}^{M} c_m \, I(x \in R_m)$$
where M is the number of regions in the feature space, $R_m$ is the region corresponding to leaf m, $c_m$ is the constant assigned to region m, and $I(x \in R_m)$ equals 1 if x lies in $R_m$ and 0 otherwise. The final classification decision is made by the majority vote of all trees.
K-Nearest Neighbor (KNN) [79,80] works on the assumption that the instances of each class are surrounded mostly by instances from the same class. It is given a set of training instances in the feature space and a scalar k. An unlabelled instance is classified by assigning the label that is most frequent among the k training samples nearest to that instance. Among the many measures that can be used for the distance between instances, the Euclidean distance is the most frequently used for this purpose [81]. Previous research on KNN can be found in [82][83][84]. The distance metric used in this method is the Euclidean distance between two instances x and y with n features:
$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$
Linear Discriminant Analysis (LDA) [85] is usually used as a dimensionality reduction technique in the pre-processing step for classification and machine learning applications. The goal is to project a dataset into a lower-dimensional space with good class separability, in order to avoid over-fitting and to reduce computational costs. LDA is usually used to discover a linear combination of features or variables, and this combination is beneficial for dimensionality reduction. LDA separates the classes in the dataset because the distance between the training data within a class is made shorter [86].
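The k-nearest-neighbour rule with Euclidean distance described above can be sketched in Python as follows. The training points, labels, and query are hypothetical, and ties between classes are broken by whichever label was counted first, which is an arbitrary choice made for this sketch.

```python
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_predict(train_points, train_labels, query, k):
    """Label `query` by majority vote among its k nearest training points."""
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical two-class toy data.
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0), (6.0, 6.0), (7.0, 7.5), (6.5, 8.0)]
train_labels = ["a", "a", "a", "b", "b", "b"]

prediction = knn_predict(train_points, train_labels, (1.2, 1.4), k=3)
```

Sorting all training points is the simplest way to find the neighbours; it costs O(n log n) per query, which is acceptable for a sketch but not for large data sets.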
The purpose of LDA is to maximize the between-class measure while minimizing the within-class measure. Let $C_i$ be the class containing the state binary vectors x corresponding to the i-th activity class. The linear discriminant features are then obtained by solving the generalized eigenvalue problem:
$$S_W^{-1} S_B \, w = \lambda w$$
where $S_B$ is the between-class scatter matrix and $S_W$ is the within-class scatter matrix [87]. The number of reduced variables will be at most N-1, where N is the number of classes, because there are only N points to estimate $S_B$.
Support Vector Machines (SVM) is a machine learning algorithm. In recent years, many studies have introduced SVM as a powerful method for classification. An overview can be found in [88][89][90][91], and SVM can also be used for regression [30,92]. Other research describes that SVM uses a high-dimensional space to find a hyperplane that performs binary classification with a minimal error rate [93,94]. The problem in SVM is to separate the two classes with a function obtained from the available training data [36,95,96]. The aim is to produce classifiers that will work well on unseen data. The hyperplane function in SVM separates the input vectors into two regions with maximal margin. There are several alternative dividing lines that arrange a set of objects into two classes, and the technique seeks the optimal classifier function that separates the two sets of data from the two different categories. In this case, the separating function sought is linear.
The linear separating function is
$$f(x) = w^{T} x + b$$
with $w, x \in \mathbb{R}^{n}$ and $b \in \mathbb{R}$, where w and b are the parameters whose values are sought. The best hyperplane is located in the middle between the two sets of objects from the two classes. Finding the best hyperplane is equivalent to maximizing the margin, that is, the distance between the two sets of objects from the two categories. The samples that lie on the margin, closest to the hyperplane, are called support vectors. The technique thus attempts to find the best classifier (hyperplane) function among all candidate functions.
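A minimal Python sketch of the linear decision function and its margin is given below. It assumes the usual canonical scaling, in which the support vectors satisfy f(x) = +1 or f(x) = -1, so that the margin width equals 2/||w||; this scaling and the parameter values are assumptions of the sketch, not values from the experiments, and no training is performed.

```python
from math import sqrt


def decision_value(w, b, x):
    """f(x) = w^T x + b for a linear separating function."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w, b, x):
    """Assign class +1 or -1 according to the side of the hyperplane."""
    return 1 if decision_value(w, b, x) >= 0 else -1


def margin_width(w):
    """Distance between the hyperplanes f(x) = +1 and f(x) = -1."""
    return 2.0 / sqrt(sum(wi * wi for wi in w))


# Hypothetical parameters; a trained SVM would supply w and b.
w, b = [1.0, 1.0], -3.0

label_above = classify(w, b, [3.0, 2.0])   # f(x) = 2.0
label_below = classify(w, b, [0.0, 1.0])   # f(x) = -2.0
width = margin_width([3.0, 4.0])           # ||w|| = 5
```

Since the margin is inversely proportional to ||w||, maximizing the margin is the same as minimizing ||w||, which is what the optimization problem of SVM does.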

Classification and Regression Training (Caret) Package
The Caret package has several functions that streamline the model building and evaluation process. The package draws on 30 other packages and contains functions that shorten the model training process for classification and complex regression problems. Caret loads packages as needed and assumes that they are installed. If a modelling package is missing, there is a prompt to install it. The package provides tools for data splitting, pre-processing, feature selection, model tuning using resampling, and variable importance estimation, as well as other functionality [97,98]. A classification tree algorithm is a nonparametric approach. It is a classification method that does not depend on distributional assumptions and is able to explore complex data structures with many variables. The data structure can be inspected visually [99], and the results of a classification tree are easy to interpret.
The trees in a Random Forest are of two types, regression trees and classification trees. When RF is used for classification, its trees are classification trees; when it is used for regression, they are regression trees. In a classification tree the response variable is categorical, whereas in a regression tree the response variable is continuous. Classification trees are rules for predicting the class of an object from the values of its predictor variables. Trees are formed through repeated partitioning of the data, in which the class and the predictor values of each observation in the sample data are known. Each partition (split) of the data is expressed as a node in the resulting tree.

Research workflow
Figure 1 describes the workflow of this research. The experiment consists of several steps. First, we collect the datasets from the University of California Irvine (UCI) machine learning repository. This work uses three popular datasets (Bank Marketing, Car Evaluation Database, Human Activity Recognition Using Smartphones) to conduct the experiment. Second, our work applies the feature selection methods RF, Boruta, and RFE to select essential features. Next comes the comparison of different machine learning models, namely RF, SVM, KNN, and LDA, for classification analysis. Selecting an optimal subset of features from a feature set is a combinatorial problem, which cannot be solved when the dimension is high without certain assumptions or compromises that yield only suboptimal solutions. Here our experiment uses a recursive approach to address the problem. Different models have different strengths in classification data analysis. We compare four classifiers with various features and select the best classifier based on the accuracy of each classifier. The whole work has been done in R [97,98], a free software programming language developed specifically for statistical computing and graphics.

Model performance evaluation
The performance is evaluated based on the calculation of accuracy. Accuracy is how often the trained model is correct, and it is derived from the confusion matrix. A confusion matrix is a summary of the prediction results on a classification problem [100]. A classification system is expected to classify all data correctly, but in practice no classification system is entirely free of error. The error takes the form of assigning new objects to the wrong class (misclassification). The confusion matrix is a table recording the results of the classification work.
The confusion matrix in Table 2 has the following four outcomes [101]. A true positive (TP) occurs when an observation from the positive class is predicted to be positive. A false negative (FN) occurs when an observation from the positive class is predicted to be negative. A false positive (FP) occurs when an observation from the negative class is predicted to be positive. Lastly, a true negative (TN) occurs when an observation from the negative class is predicted to be negative. The performance of a classifier can be assessed by precision and recall. Precision is the proportion of positive predictions that are correct. Recall, or True Positive Rate, is the proportion of positive observations that are correctly predicted. Accuracy is the percentage of correct predictions over all observations in the data. Apart from the confusion matrix, the quality of a classifier's predictions can be assessed from the Receiver Operating Characteristic (ROC) curve [102,103] and the Area Under the Curve (AUC) [104]. The confusion matrix shows how many observations from each class are predicted correctly and how many are classified incorrectly. The accuracy and the prediction error rate are then calculated using the equations below [105]:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Error rate} = \frac{FP + FN}{TP + TN + FP + FN}$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives.
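As a worked illustration of these definitions, the following Python sketch counts the four outcomes of a binary confusion matrix and derives accuracy, error rate, precision, and recall from them. The label vectors are hypothetical toy data, not results from the experiments.

```python
def confusion_counts(actual, predicted, positive):
    """Return (tp, fp, fn, tn) for a binary problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1  # positive observation predicted negative
        elif p == positive:
            fp += 1  # negative observation predicted positive
        else:
            tn += 1
    return tp, fp, fn, tn


def summary_metrics(tp, fp, fn, tn):
    """Accuracy, error rate, precision, and recall from the four counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical labels for eight observations.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]

counts = confusion_counts(actual, predicted, positive="yes")
metrics = summary_metrics(*counts)
```

Here six of the eight observations are classified correctly, so the accuracy is 0.75 and the error rate is 0.25, while precision and recall are both 2/3.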

Dataset descriptions
This experiment uses three datasets publicly available from the UCI machine learning repository. All three are classification datasets, and they differ in their total number of instances and features. Table 3 describes each dataset. We use the Bank Marketing dataset published in 2012 with 45,211 instances and 17 features, the Car Evaluation Database published in 1997 with 1728 instances and six features, and the Human Activity Recognition Using Smartphones dataset published in 2012 with 10,299 instances and 561 features. The ability to mine intelligence from such data and, more generally, from big data has become crucial for economic and scientific gains [106,107]. Feature descriptions and explanations for each dataset can be seen in Tables 4, 5, 6, and 7. The set of variables estimated from the 3-axial signals in the X, Y, and Z directions can be seen in Table 6. Additional vectors obtained by averaging the signals in a signal window sample can be seen in Table 7.
Feature selection by RF, Boruta, and RFE for the Bank Marketing dataset is displayed in Figs. 2, 3, 4, and 5. First, in RF, the splitting process at each parent node is based on the goodness of split criterion, which is derived from an impurity function. The splitting rule used is the twoing criterion. The goodness of split is an evaluation of the split s at node t. A split s divides node t into a left child $t_L$ and a right child $t_R$, which receive the proportions $P_L$ and $P_R$ of the objects in t. With the impurity function i, the goodness of split is the decrease in impurity:
$$\Phi(s, t) = \Delta i(s, t) = i(t) - P_L \, i(t_L) - P_R \, i(t_R)$$
It means that the split is made to create two new nodes with a smaller diversity (greater homogeneity) than the initial node (parent node). Splitting node t using split s produces a new classification tree whose tree impurity is smaller than that of the previous classification tree. The splitting criterion is based on the greatest value of the goodness of split $\Phi(s, t)$. Discrete attributes have only two branches for each node, so every possible value for the node must be partitioned into two parts. Each combination forms a candidate split, and the split chosen for the root node and for the other nodes is the one with the highest goodness of split value. Before the goodness of split can be computed for a continuous attribute, a threshold must be found for that attribute. Split points are obtained by taking the average of two adjacent attribute values after the values have been sorted. For a continuous attribute, the cases are divided into those with an attribute value less than or equal to the threshold (A ≤ v) and those with an attribute value greater than the threshold (A > v).
In Random Forest, resampling is done with tenfold cross-validation, and the best accuracy is obtained at mtry = 2.
It means that two variables are drawn at random from our data set as split candidates at each node of a tree. The same is done for the next tree, and so on until the specified number of trees has been built; the estimates are then averaged to identify the most important variables, with a kappa of 0.3444818. Figure 2 shows that seven variables are important enough to be used: duration, balance, age, poutcomesucess, pdays, campaign, and housingyes. These variables are then used to form the model. We use cross-validation to assess the accuracy obtained with these variables, which can be seen in Fig. 3, and we run Boruta in Fig. 4.
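The split-point search for a continuous attribute and the goodness of split described above can be sketched in Python as follows. The sketch uses the Gini index as the impurity function, which is a substitution made for brevity (the text names the twoing criterion), and the attribute values and labels are hypothetical.

```python
def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))


def candidate_thresholds(values):
    """Midpoints between adjacent distinct values after sorting."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def goodness_of_split(values, labels, threshold):
    """Impurity decrease i(t) - P_L * i(t_L) - P_R * i(t_R) for A <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    p_left = len(left) / len(labels)
    p_right = len(right) / len(labels)
    return gini(labels) - p_left * gini(left) - p_right * gini(right)


def best_split(values, labels):
    """Threshold with the greatest goodness of split."""
    return max(
        candidate_thresholds(values),
        key=lambda t: goodness_of_split(values, labels, t),
    )


# Hypothetical continuous attribute with a clean class boundary.
values = [20, 25, 30, 40, 50, 60]
labels = ["no", "no", "no", "yes", "yes", "yes"]

thresholds = candidate_thresholds(values)
chosen = best_split(values, labels)
gain = goodness_of_split(values, labels, chosen)
```

Because every candidate threshold lies strictly between two observed values, both child nodes are always non-empty, so the impurity of each child is well defined.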
Moreover, these experiments run KNN with k = 5, 7, and 9, resampled using tenfold cross-validation. The best value is k = 9, with an accuracy of 0.8841308 and a kappa of 0.2814066. The same is done for SVM by comparing the cost values C = 0.25, 0.50, and 1; the best result is obtained at C = 1 with sigma 0.2547999, reaching an accuracy of 0.8993641 and a kappa of 0.355709. Finally, we run LDA with tenfold cross-validation, which obtains an accuracy of 0.898037 and a kappa of 0.4058678. These experimental results are fully explained in Tables 8 and 9. Figure 5 displays the selection of 7 features based on RF + RF, RF + SVM, and RF + KNN. The KNN accuracy increases as the number of neighbors grows. In the random selection of predictors, the best result is obtained with a large number of predictors. Furthermore, in RF + SVM, the best accuracy is obtained with a cost close to 1.
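Since kappa is reported alongside accuracy throughout, a short Python sketch of Cohen's kappa may help to read the two numbers together. Kappa compares the observed agreement with the agreement expected by chance from the class frequencies, so a classifier can reach a high accuracy on imbalanced data and still obtain a low kappa. The label vectors are hypothetical, and returning 0.0 when the chance agreement equals 1 is a convention assumed for this sketch.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    n = len(actual)
    observed = sum(1 for a, p in zip(actual, predicted) if a == p) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Agreement expected by chance, from the marginal class frequencies.
    expected = sum(
        actual_counts[c] * predicted_counts[c] for c in actual_counts
    ) / (n * n)
    if expected == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is an assumed convention
    return (observed - expected) / (1.0 - expected)


# Hypothetical balanced-ish example: accuracy 0.75.
actual = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
predicted = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]
kappa = cohen_kappa(actual, predicted)

# Hypothetical imbalanced example: always predicting the majority class.
actual_imbalanced = ["no"] * 9 + ["yes"]
predicted_majority = ["no"] * 10
kappa_majority = cohen_kappa(actual_imbalanced, predicted_majority)
```

In the second example the accuracy is 0.9 but kappa is 0, because predicting the majority class for every observation agrees with the truth no more often than chance would.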

Car dataset
At the simulation stage of the Car dataset in Random Forest, we use 1384 samples, 4 predictors, and 4 classes (acc, good, unacc, vgood). In the resampling stage, mtry values of 2, 7, and 12 were evaluated. The best result is at mtry = 7, with an accuracy of 0.9436328 and a kappa of 0.8784367. In modeling with KNN, the optimal model is obtained at k = 5, with an accuracy of 0.7969389 and a kappa of 0.5683084. Furthermore, for SVM with tenfold cross-validation resampling, the tuning parameter "sigma" was held constant at a value of 0.07348688, and C = 0.5 reaches an accuracy of 0.8346161 and a kappa of 0.6319634. Lastly, LDA achieves an accuracy of 0.8431124 and a kappa of 0.6545901. These results are fully explained in Tables 10 and 11. Feature selection by RF, Boruta, and RFE for the Car Evaluation dataset can be seen in Figs. 6, 7, and 8. Figure 9 portrays the selection of 4 features based on RF + RF, RF + SVM, and RF + KNN. In this case, a greater number of attributes does not guarantee high accuracy. This is shown by the final value used for the RF + RF model, which was mtry = 7. In RF + SVM, the tuning parameter sigma was held constant at a value of 0.07348688. Accuracy was used to select the optimal model using the largest value. The final values used for the model were sigma = 0.07348688 and C = 0.5.
haphazardly. To select features, we iteratively fit Random Forests, at each iteration building a new forest after discarding the variables with the smallest variable importance. Figure 11 illustrates the Random Forest for creating a classification tree. This process is recursive partitioning, which means that the splitting process is repeated for each child node produced by the previous split. The splitting continues until no further split is possible. The term partition means that the sample data are broken down into smaller parts or partitions. Figure 12 describes the importance measure for each variable of the HAR dataset.
Boruta performed 99 iterations in 1.04146 h. In this process, 404 attributes were confirmed important (V1, V10, V100, V101, V103, and 399 more), 58 attributes were confirmed unimportant (V102, V107, V111, V128, V148, and 53 more), and 100 tentative attributes were left (V104, V105, V110, V112, V115, and 95 more). This work employs the varImp(fit.rf) function to generate the important features by RF. Next, to select important features by RFE, our experiment uses the RFE function with parameters such as rfeControl(functions = rfFuncs, method = "cv", number = 10). Moreover, we
Figure 13 represents the selection of 6 features based on RF + RF, RF + SVM, and RF + KNN. As with the car dataset, the best number of predictors in the HAR dataset is 2, so selecting many predictors does not guarantee high accuracy. The RF + SVM result selects cost = 1, which improves accuracy accordingly. Finally, for RF + KNN, the best number of neighbors appears to be 7.

Evaluation performance and discussion
The contribution of this simulation study is to provide insights into each experimental dataset: the Bank Marketing dataset in Tables 8 and 9, the Car Evaluation dataset in Tables 10 and 11, and the Human Activity Recognition Using Smartphones dataset in Tables 12 and 13. We use 80% of the data for training and 20% for testing in each experiment. To compare the accuracy, this work sets metric = "Accuracy". At the same time, we compare the accuracy of the different classifiers using trainControl(method = "cv", number = 10) and a different method parameter for each experiment (method = "lda", method = "knn", method = "svmRadial", and method = "rf"). The hyperplane function for classification in this study is determined by optimizing the margin.
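The resampling scheme, an 80/20 split followed by tenfold cross-validation on the training part, can be sketched in Python as follows. This is an unstratified sketch with a hypothetical seed and helper names; it illustrates the idea only and does not reproduce the partitioning performed by the caret package, so the resulting sample counts may differ from those reported for the experiments.

```python
import random


def train_test_split(n_rows, train_fraction=0.8, seed=0):
    """Shuffle row indices and cut them into training and test parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    cut = int(round(n_rows * train_fraction))
    return indices[:cut], indices[cut:]


def k_fold_indices(indices, k=10):
    """Yield (training, validation) index lists, one pair per fold."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield training, validation


# 1728 is the number of instances in the Car Evaluation dataset.
train_idx, test_idx = train_test_split(1728)
splits = list(k_fold_indices(train_idx, k=10))
```

Every training row serves as validation data exactly once across the ten folds, and the 20% test part is never touched during tuning.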
Additionally, the problem is formulated as a Quadratic Programming (QP) problem by solving an optimization function. The optimization function is simplified by transforming it into the Lagrange function. This function yields a hyperplane that separates the data according to class. The calculation is intended to find the values of the Lagrange multipliers (α) and of b. Error values are obtained for each classification performance measurement with several pairs of parameter values (C parameter and kernel parameter). These values are tried in order to determine which pair of parameter values is best for the classification in this study, and the error value is recorded for each predetermined pair of cost (C) and kernel parameters. Besides the feature selection methodology, [107] also describes how to estimate error rates. Furthermore, [108] investigates the use of random forest for the classification of microarray data (including multi-class problems) and proposes a new method of gene selection in classification problems based on random forest. To evaluate the prediction error of all methods, we use the bootstrap method proposed by Efron and Tibshirani [109]. Their experiment shows that a particular bootstrap method substantially outperforms cross-validation in a catalogue of 24 simulation experiments. Besides providing point estimates, it also considers estimating the variability of an error rate estimate [110]. The bootstrap method uses a weighted average of the resubstitution error (the error when a classifier is applied to the training data) and the error on samples not used to train the predictor.
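The weighted average described above can be sketched in Python as follows. The sketch assumes the fixed weights 0.368 and 0.632 of the classic .632 bootstrap estimator and averages the out-of-bag error per resample; both are simplifying assumptions, since the weights are not stated in the text and the method in [109] may differ. The one-nearest-neighbour learner and the toy data are hypothetical.

```python
import random


def error_rate(predict, xs, ys):
    """Fraction of misclassified observations."""
    return sum(1 for x, y in zip(xs, ys) if predict(x) != y) / len(xs)


def fit_one_nn(train_x, train_y):
    """Hypothetical learner: one-nearest-neighbour on a single numeric feature."""
    def predict(x):
        nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
        return train_y[nearest]
    return predict


def bootstrap_632(fit, xs, ys, n_boot=50, seed=0):
    """Return (estimate, resubstitution error, out-of-bag error)."""
    rng = random.Random(seed)
    n = len(xs)
    resub = error_rate(fit(xs, ys), xs, ys)
    oob_errors = []
    for _ in range(n_boot):
        sample = [rng.randrange(n) for _ in range(n)]  # draw with replacement
        in_bag = set(sample)
        held_out = [i for i in range(n) if i not in in_bag]
        if not held_out:
            continue  # rare: every observation was drawn
        model = fit([xs[i] for i in sample], [ys[i] for i in sample])
        oob_errors.append(
            error_rate(model, [xs[i] for i in held_out], [ys[i] for i in held_out])
        )
    if not oob_errors:
        raise ValueError("no bootstrap resample left any observation out")
    oob = sum(oob_errors) / len(oob_errors)
    return 0.368 * resub + 0.632 * oob, resub, oob


# Hypothetical one-dimensional data with two well-separated classes.
xs = [1.0, 1.2, 1.4, 1.6, 1.8, 3.0, 3.2, 3.4, 3.6, 3.8]
ys = ["a"] * 5 + ["b"] * 5

estimate, resub, oob = bootstrap_632(fit_one_nn, xs, ys)
```

The one-nearest-neighbour learner makes the motivation visible: its resubstitution error is always zero, so that error alone would be far too optimistic, and the out-of-bag term corrects for it.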
Tables 8, 10, and 12 describe the classification accuracy of the different classifiers with the different feature selection methods Boruta, RFE, and RF. The results show that the RF method has high accuracy in all experiment groups. According to Table 8, the RF method has an accuracy of about 90.88% with all features (16 features) and 90.99% with 7 features. In Table 10, the RF method reaches 93.31% accuracy with 6 features and 93.36% accuracy with 4 features. In the next experiment, in Table 12, the RF method reaches 98.57% accuracy with 561 features and 93.26% accuracy with only 6 features. In general, accuracy tends to decrease as the number of features is restricted, but good accuracy can still be obtained if the important features are selected by a feature selection method. In data mining, Random Forest is a prediction model used for both classification and regression. Decision trees are used to identify the strategies most likely to achieve a goal. Random Forest is a widespread technique in data mining, and the RF + RF combination additionally achieves high accuracy. The advantages of using decision trees as a classification tool include: (1) RF is easy to understand. (2) RF can handle both nominal and continuous attributes. (3) RF can represent discrete classification values adequately. (4) RF is a nonparametric method, so it does not require distributional assumptions.
Recently, the rise of big data has posed some difficulties for the traditional feature selection task. Meanwhile, some unique characteristics of big data also bring new possibilities for feature selection research [111]. The latest advances combine feature selection with deep learning, especially Convolutional Neural Networks (CNN), for classification tasks. Examples include applications in bioinformatics, such as the classification of neurodegenerative disorders using the Principal Components Analysis (PCA) algorithm [112,113] and brain tumor segmentation [114] using three-planar superpixel-based statistical and textural feature extraction. Further examples are remote sensing imagery classification using a fusion of CNN and RF [115], software fault prediction [116] using enhanced binary moth flame optimization as a feature selection method, and text classification based on independent feature space search [117].