Determining threshold value on information gain feature selection to increase speed and prediction accuracy of random forest

Feature selection is a pre-processing technique used to remove unnecessary features and speed up the algorithm's work. One such technique calculates the information gain value of each feature in the dataset and selects features against a threshold on that value. However, the threshold is usually chosen freely or set to 0.05. Therefore, this study proposes determining the threshold from the standard deviation of the information gain values generated by the features in the dataset. The threshold determination was tested on 10 original datasets and on the same datasets transformed by FFT and IFFT, all classified using Random Forest. Processing the transformed dataset with the proposed threshold resulted in lower accuracy and longer execution time compared to the same process with Correlation-Base Feature Selection (CBFS) and the standard 0.05 threshold. Similarly, the accuracy is lower when using transformed features. The study showed that processing the original dataset with the standard deviation threshold resulted in better Random Forest classification accuracy. Furthermore, using the transformed features with the proposed threshold, excluding the imaginary numbers, leads to a faster average time than the three methods compared.


Introduction
The growth of data increases dimensionality and computational cost, which are addressed by feature selection and feature extraction, two different techniques [1,2]. Several studies have shown the ability of the feature extraction process to build a new feature set [3][4][5][6].
in feature weight scoring and to determine the maximum entropy value. However, as a basic technique, IG is still open to further research and development in feature selection. Elmaizi [18] proposed a new approach based on IG for image classification and dimension reduction. Similarly, Jadhav [19] proposed feature selection based on IG ranking, while Singer [20] developed a model known as Weighted Information-Gain (WIGR), which defines proportionally weighted entropy.
All dataset features in IG were scored, and the selected features were defined by a value limit known as the threshold (cutoff). The threshold is often set freely as required, or a value of 0.05 [21,22] is used when a study requires good accuracy at a lower level. Tsai and Sung calculated the average of each frequency to obtain the threshold value of the final feature subset [23]. Preliminary studies determined the acquired threshold value according to the standard deviation of the IG rate. Furthermore, each feature's weighting result was calculated, while the threshold value was determined using the standard deviation. However, in this study, the standard deviation is used to express the diversity of the IG value distribution, with Information Gain chosen because of its ability to measure the information possessed by each feature. The IG method is used in decision trees to maximize the richness of information. This study used a simpler method with original and transformed datasets, while the transformation of each feature's IG value was performed using the FFT to accelerate the algorithm's performance. In addition, this study comprises ten datasets with more than 100 features (high-dimensional datasets) compared to others with fewer features. Several studies do not consider the speed of execution in Random Forest usage.
Random Forest is a tree-based machine learning algorithm, which leverages the power of multiple decision trees for making decisions [24]. Features in Random Forest calculations are selected more than once through a random process that requires a very long computational time. Moreover, the features selected to construct a decision tree may not be informative.
Fast Fourier Transform (FFT) is an algorithm applied to increase execution speed. This recursive method involves dividing the original vector into two parts, calculating their individual FFTs, and combining the results. Several studies stated that FFT enhances execution and is used in feature extraction methods. For instance, Herff [25] stated that FFT could be used on time series datasets collected sequentially, such as clinical data [26,27]. The application of the FFT algorithm to the dataset will not change the data, because the IFFT algorithm returns the dataset to its original values. Prasetiyowati et al. [8] analyzed this method and obtained better accuracy and execution time than with the original dataset. Therefore, based on the results of these studies, FFT is applied in this study. Besides being applied to the dataset, FFT is also applied to feature extraction [28,29] and selection [30]. Gowid et al. [30] used this process to develop robust, fast, and automated feature selection algorithms for mechanical systems. Based on these studies, the FFT algorithm is also used to perform feature selection. Data and Information Gain values for each feature are transformed as a signal wave with various values. This study differs from previous research in that it examines whether the dataset and features transformed by FFT and IFFT produce better accuracy and average speed values compared to the Correlation-Base Feature Selection model and the threshold of 0.05. Also, FFT and IFFT were generally used on image or signal datasets, while in this study both are used for non-image information.
This study follows previous research on the use of feature selection to increase the Random Forest method performance on high dimensions [31]. It also examines the speed and accuracy evaluation of Random Forest performance by selecting features in the transformation data [8].
The key contributions of this paper are as follows:
- Propose a feature selection method for Random Forest.
- The proposed feature selection method is Information Gain, using a threshold with a standard deviation calculation.
- Compare the mean Random Forest accuracy and speed obtained with the standard deviation threshold, Correlation-Base Feature Selection, and the threshold of 0.05.
- Compare the mean Random Forest accuracy and speed obtained using the original dataset and the dataset transformed through FFT and IFFT.
- Compare the mean Random Forest speed and accuracy using features transformed with FFT.
The research is divided into several sections. The 2nd, 3rd, 4th, 5th, and 6th sections are the related work, the proposed method, the research results, discussion, and conclusion, respectively.

Information gain
Information Gain (IG) is an entropy-based selection method [32], which involves the calculation from the output data grouped by feature A, denoted as gain(y, A). The Information Gain gain(y, A) is represented as

$$\mathrm{gain}(y, A) = \mathrm{entropy}(y) - \sum_{c \in \mathrm{vals}(A)} \frac{|y_c|}{|y|}\,\mathrm{entropy}(y_c) \tag{1}$$

where vals(A) is the set of possible values of attribute A, and y_c is the subset of y in which A takes the value c. The first term of Eq. (1) is the total entropy of y, and the second term is the entropy that remains after the data are segregated based on feature A.
Studies are still carried out on the development of Information Gain to date. An instance is a study conducted by Elmaizi [18], which proposed a new approach based on IG for hyperspectral image classification and dimensional reduction. The hyperspectral band selection was used to select the most informative bands and remove irrelevant and noisy bands. The comparison results showed that the information retrieval filter approach was superior, reduced computational costs, and enhanced classification accuracy. Moreover, the datasets used were two hyperspectral images obtained from the Indian Pines AVIRIS and the Pavia University. This study is in contrast with the research carried out by Jadhav et al. [19], which proposed feature selection through ranking based on IG. The technique used was known as the Information Gain Directed Feature Selection algorithm (IGDFS), which is a method that makes use of feature ranking, based on data acquisition through the GA wrapper (GAW), and three classic machine learning algorithms: KNN, Naïve Bayes, and Support Vector Machine (SVM). Furthermore, this method reduces computation costs and improves classification accuracy. It only uses 3 datasets with less than 100 features, namely the German (20 features), the Australian (14 features), and the Taiwan (24 attributes) credit datasets. Singer [20] proposed a model that defined proportionally weighted entropy known as Weighted Information-Gain (WIGR). The method used was measured through a weighted entropy function that was defined proportionally with different target class values. Singer's study used 12 datasets with less than 100 features (min. 7 and max. 32).
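As an illustration of Eq. (1), the following is a minimal sketch, assuming that feature values are discrete (numeric features would first need discretization, which is not shown). The function names and the toy data are hypothetical and are not taken from the study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * log2(n / total) for n in counts.values())


def information_gain(feature_values, labels):
    """gain(y, A) = entropy(y) - sum_c |y_c| / |y| * entropy(y_c)."""
    total = len(labels)
    # Group the class labels by the value the feature takes.
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Toy data (hypothetical): one feature separates the classes perfectly,
# the other carries no information about them.
labels = ["yes", "yes", "no", "no"]
perfect = ["a", "a", "b", "b"]
useless = ["a", "b", "a", "b"]
print(information_gain(perfect, labels))
print(information_gain(useless, labels))
```

With two balanced classes the total entropy is 1 bit, so the perfectly separating feature obtains a gain of 1 and the uninformative feature a gain of 0.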

Threshold
The threshold, also known as the cutoff, is the value used as a reference for selecting features in IG. The threshold value is determined independently or set to 0.05. Tsai and Sung averaged each frequency to obtain the threshold value of the final feature subset [23]. Tsai's idea allows the determination of the threshold value using standard deviation.
The diversity of a data group is determined by subtracting the group mean from each data value, squaring the differences, and summing the results. This measure is known as the standard deviation and describes the difference between the measured data and the average value. In this study, the data group is the IG value for each feature in a dataset, and the standard deviation is obtained using Eq. (2).

$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}} \tag{2}$$

where S is the standard deviation, x̄ is the average value of the IG, x_i is the i-th value of x, and n is the number of features used in the dataset.

Random forest
Random Forest is an extension of the decision tree approach developed by Breiman [33][34][35] and Cutler. It is a tree-based machine learning algorithm that harnesses the power of multiple decision trees in decision-making. The feature selection in the Random Forest calculation is performed randomly and more than once, and requires a very long computation time. As a result, this random process allows the selected feature to be non-informative.
Several preliminary studies examined feature selection for Random Forest. For instance, Yunming's [28] study proposed a stratified sampling method to select feature subspace for Random Forest with high dimensional data as well as strong and weak informative features.

Fast fourier transform and inverse fast fourier transform
An effective means of converting time-domain signals to the frequency domain is the Fast Fourier Transform (FFT) [36], with the Inverse Fast Fourier Transform (IFFT) algorithm used to convert data back to the original domain. Furthermore, the test of the transformed data using FFT and IFFT is applied to high-dimensional and regular datasets with fewer than 100 features. Hamid used the FFT algorithm to enhance the classification results through extraction and signal processing [37]. Additionally, Herff [25] stated that clinical data is time series information often processed and collected sequentially and presented in a continuous waveform using the FFT algorithm. Prasetiyowati et al. [8] also researched the application of the FFT using the Correlation-Base Feature Selection. The result showed that the transformed dataset produces better average accuracy and time values than the original. The transformed dataset is returned using IFFT. In addition, other studies used the FFT algorithm to perform feature extraction [28,29,37]. For instance, Ansari used the FFT to extract features in the EEG dataset [28], with better accuracy than other classifiers. Meanwhile, in another study, Gowid et al. used FFT for feature selection in mechanical systems [30], with a detection accuracy of 100%.
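The FFT is a fast algorithm for calculating the Discrete Fourier Transform (DFT), defined as

$$X[k] = \sum_{n=0}^{N-1} x[n]\, W_N^{kn}, \qquad k = 0, 1, \ldots, N-1$$

where X[k] and x[n] are complex numbers and W_N^{kn} is the twiddle factor, with N the order and kn the index. The twiddle factor is defined as

$$W_N^{kn} = e^{-j 2\pi kn / N}$$

where j is the imaginary unit, the index n is the time variable t in discrete form, and k is the frequency index of the transformation pair. The Inverse Fast Fourier Transform (IFFT) is the opposite of FFT, a fast algorithm for calculating the IDFT (Inverse Discrete Fourier Transform). The IFFT can also be calculated using the direct FFT algorithm and complex conjugates:

$$x[n] = \frac{1}{N} \sum_{k=0}^{N-1} X[k]\, W_N^{-kn}$$

where x[n] is the inverse of X[k], using the opposite sign in the exponent and multiplied by a factor of 1/N.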
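The recursive structure of the FFT and the conjugate-based IFFT can be sketched as follows. This is a minimal illustration, assuming an input length that is a power of two; it is not the implementation used in the study, and the function names and the sample row are hypothetical. How rows of other lengths were handled is not stated in the text.

```python
import cmath


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    # Split into even- and odd-indexed halves and transform each.
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Twiddle factor W_N^k = exp(-j * 2*pi*k / N).
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def ifft(X):
    """Inverse FFT via the forward FFT and complex conjugates, scaled by 1/N."""
    n = len(X)
    conj = fft([v.conjugate() for v in X])
    return [v.conjugate() / n for v in conj]


# Hypothetical data row: transform it and bring it back.
row = [1.0, 2.0, 3.0, 4.0]
spectrum = fft(row)
restored = ifft(spectrum)
print(spectrum)
print([round(v.real, 10) for v in restored])
```

The round trip returns the original values up to floating-point error, which is the property the text relies on when it states that the IFFT returns the dataset to its original data.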

The proposal methods
The random search for features in Random Forest allows the selected feature to be non-informative. Therefore, a feature search method is needed before executing Random Forest to ensure the features are informative, speed up execution, and increase accuracy.
Information Gain, calculated using Eq. 1, is the proposed feature search method, with a threshold based on the standard deviation value.
After determining the Information Gain value for each feature, the next step is to obtain the standard deviation value from the data using Eq. 2, where x̄ is the average value of the Information Gain (IG) and n is the number of features used in the dataset. Because the mean must be determined before the standard deviation, the mean value of the Information Gain is obtained as a by-product of computing the standard deviation, which also acts as the threshold. The Information Gain value for each feature is calculated, and the standard deviation is determined as the threshold limit. Furthermore, a feature with an IG value equal to or above the threshold value is selected as an informative feature for later use in the Random Forest. The overall framework of the proposed method is shown in Fig. 1.
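The selection rule described above can be sketched in a few lines. This is an illustration only: it assumes the sample standard deviation (n − 1 denominator, as computed by `statistics.stdev`), and the feature names and IG scores are hypothetical values, not results from the study.

```python
from statistics import mean, stdev


def select_by_std_threshold(ig_scores):
    """Keep features whose IG is >= the standard deviation of all IG values."""
    values = list(ig_scores.values())
    threshold = stdev(values)  # assumption: sample standard deviation (n - 1)
    selected = [name for name, ig in ig_scores.items() if ig >= threshold]
    return selected, threshold, mean(values)


# Hypothetical IG scores, not taken from the paper's datasets.
ig_scores = {"f1": 0.40, "f2": 0.10, "f3": 0.05, "f4": 0.25}
selected, threshold, avg = select_by_std_threshold(ig_scores)
print(selected, round(threshold, 4), round(avg, 4))
```

For these scores the mean is 0.2 and the threshold is about 0.158, so only `f1` and `f4` are kept for the subsequent Random Forest step.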

Workflow
This research consisted of 3 stages. In the first, the experiment uses the original data. In the second, the dataset is transformed using FFT and IFFT. In the third, the features are transformed using FFT. The results of the first two stages are then compared with those of the feature transformation using FFT. These steps are shown in the pseudo-code of the research stage algorithm, including the process of calculating speed and accuracy.
Data were obtained by collecting 10 existing datasets from the UCI repository and Kaggle, 3 of which were used in previous studies [8]. The ten datasets were checked for missing values, and any missing value was filled with zero [38][39][40]. The next step involved checking whether the dataset needed transformation, in which case the dataset was transformed using the FFT algorithm and returned using IFFT. Furthermore, the time and accuracy required to execute each dataset were calculated. However, when the dataset needed no transformation, the time and accuracy of the Random Forest prediction were calculated immediately. The pseudo-code of the research stages and of the speed and accuracy calculations is listed in Algorithms 1 and 2. The calculation of the accuracy and time values requires the following steps:
1. Collection of the dataset.
2. Selection of the IG features through the Ranker method, using the Weka machine learning tools (version 3.9.2).
3. Calculating the standard deviation of the IG values over all features. The value obtained is used as the threshold. All features possessing a value greater than or equal to the threshold are selected, and those having a lesser value are discarded.

The Random Forest prediction process was carried out using 10 randomly selected seed values, namely 1, 33, 57, 70, 153, 251, 300, 457, 505, and 700. The aforementioned steps were conducted for each dataset.
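The averaging of accuracy and execution time over the ten seeds can be sketched as follows. This is a hypothetical harness, not the study's Weka procedure: `train_and_score` stands in for any routine that trains and evaluates a Random Forest for a given seed, and `dummy_train_and_score` is a placeholder that returns made-up accuracies so the sketch runs on its own.

```python
import time
from statistics import mean

SEEDS = [1, 33, 57, 70, 153, 251, 300, 457, 505, 700]  # seeds listed in the text


def average_over_seeds(train_and_score, seeds=SEEDS):
    """Run one experiment per seed; return (mean accuracy, mean seconds)."""
    accuracies, durations = [], []
    for seed in seeds:
        start = time.perf_counter()
        accuracies.append(train_and_score(seed))
        durations.append(time.perf_counter() - start)
    return mean(accuracies), mean(durations)


def dummy_train_and_score(seed):
    """Placeholder for a Random Forest run; returns a fake accuracy."""
    return 0.9 if seed % 2 else 0.8


mean_accuracy, mean_seconds = average_over_seeds(dummy_train_and_score)
print(round(mean_accuracy, 4), mean_seconds)
```

Timing each run separately and averaging afterwards keeps the reported time comparable across feature selection methods, since every method is measured over the same seeds.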
After the selection process using the Random Forest algorithm, the test results with the proposed threshold were compared with the prediction outcomes using Correlation-Base Feature Selection (CBFS) and the threshold 0.05 technique.In addition, the comparisons were performed using the original and transformed datasets.

Data experiment
The experimental environment is a computer with an Intel processor of 1.60 GHz, 1800 MHz, 4 cores, and 8 logical processors, with 12 GB RAM and a 1000 GB hard drive. This study used multivariate datasets with real, categorical, and integer characteristics, with text values converted to numeric. Almost all datasets have complete data except Dermatology, which had missing values. It is necessary to pre-process this dataset by entering a zero value for the missing data. This study selected some Life, Physical, and Business datasets in the UCI Machine Learning Repository [41]. The Life area datasets consist of EEG Eye, Cancer [42], Contraceptive Method, Dermatology, Divorce [43], Epilepsy [44], and SCADI [45,46]. The Physical area datasets consist of the Electrical Grid and Urban Land Cover datasets [47,48]. The dataset for the Business area is CNAE-9.
As described in the workflow, this research consisted of 3 stages. In the first, the experiment uses the original data. In the second, the dataset is transformed using FFT and IFFT. In the third, the features are transformed using FFT. The results of the first two stages are then compared with those of the feature transformation using FFT.
The third stage produces complex numbers (real and imaginary parts). To determine the threshold, the authors used two methods:
1. Calculating the standard deviation using real and imaginary numbers.
2. Calculating the standard deviation using only real numbers (without imaginary).

Result
This study used the Random Forest method and k-fold cross-validation with k = 10. Each dataset used ten randomly generated seed values, namely 1, 33, 57, 70, 153, 251, 300, 457, 505, and 700. Furthermore, tests were carried out on the original dataset and on the dataset transformed through FFT and IFFT. Before transforming datasets with FFT, the information for each feature must be numeric, with no missing values. This study used IG as the feature selection technique, with the IG value of each feature used to calculate the standard deviation as the threshold. Feature selection was applied to the datasets, which were then analyzed using the Random Forest technique. The average accuracy and time required were then compared across three methods, namely the standard deviation threshold method, the CBFS method, and the 0.05 threshold method. This comparison also applies to features that have been transformed and use the threshold method based on standard deviation.
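To make the validation scheme concrete, the following is a minimal sketch of seeded k-fold index splitting. It is a plain, unstratified illustration; the cross-validation built into Weka may partition the data differently, and the function name `k_fold_indices` is hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=1):
    """Yield (train_indices, test_indices) for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute samples so that fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Every sample appears in exactly one test fold.
splits = list(k_fold_indices(25, k=10, seed=33))
print(len(splits), [len(test) for _, test in splits])
```

Changing the seed changes the shuffle and therefore the folds, which is why results are averaged over several seed values.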

Accuracy and speed dataset

EEG dataset
The first test was conducted on the EEG Eye dataset, which had 14,980 instances with 14 features. Through IG using the proposed threshold, the standard deviation value is 0.0171, the average value is 0.0289, and 10 features are selected. The resulting average accuracy value was higher (90.15%), outperforming the rates generated by the CBFS and the 0.05 threshold. However, the average execution time was faster when using the Correlation-Base Feature Selection, with a difference in average time of 0.74 secs.
Meanwhile, on the test using the transformed dataset and the proposed threshold, the resulting standard deviation value was 0.0171, with 10 selected features. The resulting average accuracy value was higher (90.14%), again outperforming the rates generated by the CBFS and the threshold value of 0.05. However, the average time needed was longer (9.41 secs) compared to the threshold value of 0.05, which only required 4.96 secs, a difference of 4.45 secs.

Cancer dataset
The second test used the Cancer dataset, which had 569 instances with 31 features. The highest average test accuracy on this dataset was generated by the threshold of 0.05, at 96.63% with 26 selected features. The CBFS produced 12 selected features with an average accuracy of 95.79%. The proposed threshold had an average accuracy of 94.39% using 15 features, with an average value of 0.4008 and a standard deviation of 0.2384. However, in terms of the average time required, the proposed threshold was superior to the two methods being compared, at 0.07 secs.
Furthermore, in the trial using the transformed Cancer dataset, the threshold of 0.05 had the highest average accuracy value of 96.68%, with 26 selected features. In terms of the average time required, the proposed threshold had the same result as the CBFS, which was 0.08 secs with 15 selected features. However, its average accuracy value was only 94.41%, with a standard deviation of 0.2385.

Contraceptive method dataset
The third trial was carried out on the Contraceptive Method dataset, which had 1,473 instances with 9 features. Using the proposed threshold, a standard deviation value of 0.0324 and an average value of 0.0383 were obtained, with four selected features. The average accuracy value generated at this threshold was higher (51.64%), outperforming the rates generated by the CBFS and 0.05. However, for the average time required, the three methods being compared had the same result, which was 0.25 secs.
Also, the trial on the transformed dataset using the proposed threshold yielded the highest average accuracy value (51.74%) compared to the CBFS algorithm and 0.05, using only 4 features and a standard deviation of 0.0342. The average time required for execution was faster when using 0.05, which only took 0.21 secs, compared to the CBFS algorithm and the proposed threshold.

Dermatology dataset
The fourth trial was conducted on the Dermatology dataset, which had 366 instances with 33 features. Using the proposed threshold, the average accuracy value was higher compared to that of the CBFS method and 0.05. The average value is 0.4205, with a standard deviation of 0.2363 and 26 selected features. For the average time required, the proposed threshold method only took 0.04 secs. The trials conducted on the Dermatology dataset were different from those carried out on the other datasets. The Dermatology dataset had missing values, making it unable to be directly transformed using FFT and IFFT. Therefore, in this test, all missing values were filled with a value of 0 (zero) prior to transformation.
After the transformation, tests were carried out to obtain the average time and accuracy value, as required.Moreover, the CBFS produced the highest average accuracy value (97.70%), compared to the other two methods.Also, in terms of the average time required, the 0.05 and proposed threshold, had the same result of 0.07 secs with a standard deviation of 0.2359.

Divorce dataset
The fifth test was carried out on the Divorce dataset, which had 170 instances with 54 features. Using the 0.05 and the proposed threshold, the average accuracy value was the same (97.65%), with a standard deviation of 0.1896 and a mean value of 0.6559. The 0.05 and proposed thresholds selected 54 and 52 features, respectively. However, the average time required was less at the proposed threshold, which was only 0.02 secs.
During the tests on the transformed Divorce dataset, the average accuracy value using the 0.05 threshold was the highest (97.71%), with the required time being the same as that of the CBFS, which was 0.02 secs. For the proposed threshold, the average accuracy value differed only slightly from that of 0.05, at 97.65% with 53 selected features and a standard deviation of 0.1920.

Electrical grid dataset
The sixth trial was carried out on the Electrical Grid dataset, which had 10,000 instances with 14 features. Using the CBFS, 9 features were selected, with an average accuracy value of 100%. With the 0.05 and the proposed threshold, the average accuracy was the same (100%); the mean value is 0.1009 with a standard deviation of 0.2546. Moreover, 5 features were selected for the 0.05 threshold, while the proposed threshold selected only 1. However, the average time was faster than using the method with the proposed threshold, which was at 0.17 secs.
The tests on the transformed Electrical Grid dataset discovered that, using the CBFS, the average accuracy value was higher than the other two methods, at 85.64% with 9 selected features. The average time required was less when using a threshold of 0.05, which was 2.57 secs. The average time needed for the proposed threshold was slightly longer (3.44 secs), with a standard deviation of 0.0334.

CNAE-9 dataset
The seventh test was conducted on the CNAE-9 dataset, which had 1,080 instances with 857 features. The mean accuracy of the proposed method is 88.05%, with 65 features selected, a mean value of 0.0121, and a standard deviation of 0.0402. This average accuracy value is higher than that of the CBFS and the threshold of 0.05. However, the average time needed was less when using the CBFS algorithm (0.27 secs), a difference of 0.31 secs compared to the average time required by the proposed threshold.
The tests on the transformed CNAE-9 dataset observed that, using 0.05, the average accuracy value produced was higher (90.69%) compared to the CBFS algorithm and the proposed threshold, with 57 selected features. This was slightly higher than that of the proposed threshold, a difference of 0.2%, with a standard deviation of 0.0402. The average time needed was less when using the CBFS algorithm, at only 0.27 secs.

Urban land cover dataset
The eighth trial was carried out on the Urban Land Cover dataset, which had 168 instances and 148 features. Using the CBFS, the average accuracy value was 87.68%, with the number of features at 148. In terms of the average time needed, this method had a value of 0.06 secs. However, the mean accuracy value of the proposed threshold is 84.76%, with 57 features selected, and the average required time is 0.07 secs. The mean value is 0.4883, and the standard deviation is 0.4536.
The tests on the transformed Urban Land Cover dataset observed that the average accuracy value at the 0.05 and proposed threshold had the same rate (69.73%), with 178 selected features and a standard deviation of 0.0078. As regards the average time required, the CBFS had a lesser value (21.52 secs) compared to the 0.05 and proposed threshold.

Epilepsy dataset
The ninth trial was conducted on the Epilepsy dataset, which had 11,500 instances and 179 features.At the 0.05 and the proposed threshold, the average accuracy value had the same result of 69.60% with 178 features selected, the mean value is 0.2939, and a standard deviation of 0.0078.Moreover, the average accuracy value was observed to be higher than the Correlation-Base Feature Selection.However, for the average time required, the CBFS method had a lesser value than the two compared methods, which was 15.72 secs.
The tests on the transformed Epilepsy dataset discovered that the average accuracy value at the 0.05 and proposed threshold was higher than the CBFS algorithm, at 69.84% with 178 features selected and a standard deviation value of 0.0078. Meanwhile, the average time needed was less when using the CBFS algorithm, at only 18.20 secs.

SCADI dataset
The tenth trial was carried out on the SCADI dataset, which had 70 instances and 206 features.In the Correlation-Base Feature Selection, 19 features were selected with an average accuracy of 84.14%.The average accuracy value is better than that generated by the proposed threshold, namely 83.86% with 64 selected features, an average value of 0.1793, and a standard deviation of 0.2118.However, for the average time needed at the proposed threshold, the result was faster, taking only 0.01 secs.
The tests on the transformed SCADI dataset observed that the average accuracy value when using the CBFS was higher than that of the other two methods, at 85.86% with 16 features selected. The average time required for Correlation-Base Feature Selection, 0.05, and the proposed threshold was the same (0.02 secs).

Accuracy and speed on dataset transformation
Tests on the transformed features using the proposed threshold found that the average accuracy value is no better than with the original or transformed dataset. Only on the Electrical Grid dataset is the average accuracy of the transformed features with the proposed threshold the same as the other methods, at 100%. Meanwhile, in 90% of the trials, the transformed features with the proposed threshold yield a shorter average time.

The features elimination
In this study, the elimination of the feature in each method is quite diverse.
The feature elimination for the IG method with the proposed threshold on the transformed features showed a high reduction. The reduction rate was 7.14% to 99.44% for feature transformation using imaginary numbers and 7.14% to 99.44% without imaginary numbers. This means the highest reduction occurs when the features are transformed and the IG method is used with the proposed threshold.
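A reduction rate of this kind can be computed as the share of features removed. The formula below is an assumption, since the text does not state how the rate was calculated, and the feature counts are illustrative values chosen because they reproduce the reported percentages.

```python
def reduction_rate(total_features, selected_features):
    """Percentage of features eliminated by a selection step (assumed formula)."""
    return 100.0 * (total_features - selected_features) / total_features


# Illustrative counts only; the text does not tie the bounds to these datasets.
print(round(reduction_rate(14, 13), 2))   # 7.14
print(round(reduction_rate(14, 1), 2))    # 92.86
print(round(reduction_rate(179, 1), 2))   # 99.44
```

Under this reading, a rate of 99.44% corresponds to keeping a single feature out of 179, and 7.14% to dropping a single feature out of 14.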

Discussion
Based on the results of the trials conducted using 10-fold cross-validation, the following was obtained.

Original dataset
1. The proposed threshold was compared with the Correlation-Base Feature Selection. The average trial accuracy using the original dataset showed that, on 60% of the datasets, the proposed threshold method produced higher values than the CBFS algorithm, with 10% having the same rates.
Eight out of ten datasets have a standard deviation value higher (>) than the mean value.All datasets need a lower average time than the Correlation-Base Feature Selection method and the Information Gain method with a threshold of 0.05.

The feature elimination
For the proposed threshold method on features transformed without imaginary numbers, the features were reduced by 92.86% to 99.44%. Fig. 6 compares the features eliminated in each method.
The average accuracy and time required for the entire dataset in this experiment are shown in Tables 1 and 2. Similarly, the comparisons are shown in Figs. 2, 3, 4, and 5.

Conclusion
From the trials on the original dataset, the following was concluded.
1. The original dataset's average accuracy value showed that the proposed threshold method produced higher values than the CBFS and 0.05 techniques. Moreover, 40% of the original datasets resulted in higher average accuracy values, with 30% having rates similar to the 0.05 method. Therefore, 70% of the tested datasets produced higher average values than the other two methods.
2. Using the proposed threshold value, 50% of the datasets resulted in less average execution time with better mean accuracy than the CBFS and 0.05 techniques. Furthermore, 10% of the datasets yielded the same average time as the 0.05 method. Therefore, 60% of the tested datasets led to a shorter average time than the CBFS and 0.05 techniques.
3. Using the transformed features with the proposed threshold without imaginary numbers, 70% of the trials produced a faster average time, though the average accuracy value obtained is insignificant.
4. Based on the experiment, if the standard deviation value is less than or equal to (<=) the mean value, then the accuracy value is superior.
5. The proposed threshold method with transformed features without imaginary numbers significantly reduces features, by 92.86% to 99.44%, thereby accelerating the execution process.

The datasets transformed using FFT and IFFT showed the following results:
1. The average accuracy value from the proposed threshold was insignificant compared to the CBFS and 0.05 techniques. The proposed threshold produced a higher average accuracy value on only 30% of the datasets.
2. The proposed threshold yielded an insignificant average time during transformation, with 70% of the transformed datasets taking longer. Therefore, only 30% of the datasets required less time than the 0.05 method and CBFS.
3. Based on the experiment, if the standard deviation value is higher (>) than the mean value, then the time needed is less (faster).
4. Executing Random Forest with feature selection on the transformed datasets increased the average accuracy value by between 0.01 and 2.61% on 70% of them. Furthermore, 60% took less time than the original datasets, with 10% requiring the same time. The difference in the time needed was between 0.01 and 3.05 secs.

The trial results showed that several things need to be considered, as follows. Transformations cannot be applied directly to datasets with incomplete data; pre-processing is required so that the dataset is complete. Transformed features can be proposed for further research by combining feature selection and extraction methods such as Principal Component Analysis (PCA), Neural Network, or Singular Value Decomposition (SVD).
The implementation of FFT and IFFT in the dataset needs to be considered, especially in the IG method with the proposed threshold.
Execution time (speed) and accuracy are inversely related, which means that a preference is required. If a superior average accuracy value is needed, IG with the standard deviation threshold can be used on the original dataset. Meanwhile, to increase the average speed, the transformed features can be used.

$$\mathrm{entropy}(y) = -\sum_{i} P_i \log_2 P_i$$

$$\mathrm{gain}(y, A) = \mathrm{entropy}(y) - \sum_{c \in \mathrm{vals}(A)} \frac{|y_c|}{|y|}\,\mathrm{entropy}(y_c)$$

4. Performing the selection process by removing features having an IG value below that of the threshold.
5. The results of the feature selection were used to carry out the Random Forest prediction process through 10-fold cross-validation.

Fig. 4 Comparison for the mean value of transformation dataset accuracy

Fig. 2 Comparison of average accuracy values (series: Accuracy SD FFT - Not IMG, Accuracy SD FFT - With IMG, Accuracy SD, Accuracy Threshold 0.05, Accuracy CBF)

Comparison of the average time required for the transformation dataset

Table 1
Test results for accuracy and time required

Table 2
Test results for accuracy and time of the entire transformation dataset