Optimizing IoT intrusion detection system: feature selection versus feature extraction in machine learning

Internet of Things (IoT) devices are widely used but also vulnerable to cyberattacks that can cause security issues. To protect against this, machine learning approaches have been developed for network intrusion detection in IoT. These often use feature reduction techniques, such as feature selection or feature extraction, before feeding data to models, which helps make detection efficient enough for real-time needs. This paper thoroughly compares feature extraction and feature selection for IoT network intrusion detection in a machine learning-based attack classification framework. It examines performance metrics such as accuracy, F1-score, and runtime on the heterogeneous IoT dataset named Network TON-IoT, using both binary and multiclass classification. Overall, feature extraction gives better detection performance than feature selection when the number of features is small. Moreover, extraction achieves a smaller degree of feature reduction than selection and is less sensitive to changes in the number of features. However, feature selection requires less model training and inference time than its counterpart, and it has more room to improve accuracy than extraction as the number of features changes. These findings hold for both binary and multiclass classification. The study provides guidelines for selecting appropriate intrusion detection methods for particular scenarios; such a comparison and such recommendations on the heterogeneous TON-IoT dataset have previously been overlooked. Overall, the research presents a thorough comparison of feature reduction techniques for machine learning-driven intrusion detection in IoT networks.


Introduction
The Internet of Things (IoT) refers to the technology of connecting everyday objects and devices to the internet. IoT is growing and changing quickly, with the goal of linking things like wireless sensors, smart cameras, televisions, and other smart home devices online [1]. The number of Internet-connected IoT devices is rising rapidly, with over 2 billion connected in 2017. Experts predict there will be over 7.5 billion IoT devices generating 73.1 zettabytes of data by 2025 [2]. While IoT devices are becoming widespread and assisting people in many areas, they often have very limited security capabilities, despite the huge growth of IoT and the large amounts of data it creates. In summary, IoT adoption is surging, connecting billions of devices and generating massive data; however, IoT devices typically lack strong security protections even as their use proliferates.
Due to the security limitations of IoT devices, it is crucial to create network intrusion detection systems (NIDS) that can quickly and dependably detect and prevent attacks on IoT networks [3]. For this purpose, many machine learning techniques have been developed for intrusion detection in IoT, along with public datasets of network traffic [4]. However, these datasets frequently contain numerous irrelevant or redundant features, which negatively impacts the complexity and accuracy of machine learning models [5]. A common approach to develop efficient NIDS is through feature reduction, which decreases the dimensionality of network traffic data fed into the machine learning model. This helps lower computational costs and latency while enhancing model generalization.
Two of the most common techniques are feature selection and feature extraction, which help address the issues caused by excessive features. Feature selection selects a subset of the most informative features from the original set [6]. It reduces dimensionality while retaining the semantic interpretability of the selected features. In contrast, feature extraction transforms the original features into a new low-dimensional space via mathematical projection [7]. While it can effectively reduce dimensionality, the extracted features lose their intuitive meanings. In the realm of IoT security, feature selection enables the creation of lightweight and efficient IDS by judiciously choosing a subset of the most relevant original features. On the other hand, feature extraction techniques offer a valuable means to transform and distill the essence of the original feature set, reducing overall data dimensionality while retaining critical information. By optimizing the efficiency and interpretability of intrusion detection models, both feature selection and feature extraction become indispensable tools for enhancing the cybersecurity posture of IoT ecosystems, ensuring effective threat detection in a manner tailored to the limitations and intricacies of IoT devices and networks.
While existing works have focused on using feature selection, feature extraction, or hybrid methods of the two to improve certain performance metrics for NIDS [8,9], there remains a research gap in comprehensively comparing these two methods, especially on modern IoT datasets [10]. Very few studies have evaluated the trade-offs between detection accuracy and computational complexity under the same experimental settings. However, such a comparison is essential to provide guidelines for choosing the appropriate feature reduction technique based on the IoT system constraints and intrusion detection requirements.
Therefore, this research aims to conduct an in-depth investigation of feature selection and feature extraction for building lightweight NIDS tailored to IoT environments. We focus on comparing the two techniques because they take contrasting approaches to reducing dimensionality, and may have different advantages and limitations in the context of IoT-based NIDS [11]. The findings can provide data-driven insights to guide the selection of feature reduction methods for optimal efficiency and detection performance in IoT network protection systems. In summary, our work addresses the gap in comparative studies on feature reduction techniques for machine learning-driven NIDS on IoT data. By benchmarking feature selection and extraction head-to-head, we derive valuable guidelines for striking the right balance between detection accuracy and complexity in IoT environments.
This comparative study reveals that feature selection and feature extraction have different strengths and weaknesses for building lightweight NIDS on IoT data. Our experiments demonstrate that when more features are retained, feature selection generally achieves higher detection accuracy while demanding less training and inference time. Conversely, as the number of features decreases, feature extraction excels over feature selection. Additionally, examining the F1-scores for different attack classes under various feature quantities using various machine learning classifiers provides deeper insight into the detection capabilities of both methods. This analysis reveals that feature extraction not only shows less sensitivity to changes in the number of reduced features, but also demonstrates the ability to detect a wider array of attack types compared to feature selection. Moreover, both methods favor the Decision Tree classifier considering both classification metrics and runtime performance, which makes it more suitable for NIDS in IoT networks. Based on these observations, we present a detailed theoretical guide, elaborated in Table 20 within the "Result verification statistically" section, to aid in the selection of the most appropriate intrusion detection method for distinct scenarios.
The key contributions in this paper are provided as follows.
1) A comprehensive performance evaluation between feature selection and feature extraction, involving performance metrics and runtime on an IoT dataset, is conducted.
2) A 3-phase machine learning pipeline framework, involving data preprocessing, feature reduction, and classification with multiple machine learning classifiers, is created for performance evaluation.
3) The NIDS for IoT is tested using a public IoT dataset, named Network TON-IoT [10], to build models and compare the performance of the two feature reduction methods.
The subsequent sections are structured as follows: "Related works" section explores previous studies linked to this research, "Methodology" section explains the proposed methodology, "Experimental setup and analysis" section details the experimental setup and analysis, "Result and analysis" section displays the outcomes and discussions of two feature reduction techniques, and lastly, "Conclusion" section concludes this paper.

Related works
In this section, related studies on NIDSs that were implemented using feature reduction methods are discussed.
In the realm of NIDS, feature selection has been widely used to reduce the complexity of the original traffic data. Many studies employ filter-based feature selection methods to select features that are discriminative with respect to the target class. For instance, in study [8], a Mutual Information (MI)-based approach was proposed to select features for NIDS; the study compared linear and non-linear, specifically correlation-based and MI-based, feature selection techniques, with the MI-based approach outperforming the correlation-based one in attack detection accuracy. After that, Ambusaidi et al. [12] introduced a feature selection algorithm that utilized MI in combination with a variant of the support vector machine classifier. This approach exhibited enhanced accuracy and decreased model complexity compared to prior methods on datasets such as KDD Cup 99, NSL-KDD [10], and Kyoto 2006+ [13].
In study [14], the authors conducted an analysis of the UNSW-NB15 dataset [15] for NIDS. A filter-based feature reduction technique using the XGBoost machine learning algorithm was applied to select features. In the same way, Disha and Waheed [16] designed feature ranking based on Gini impurity with Random Forest (RF) to analyze the classification performance of NIDS using the recent TON-IoT dataset, but did not give much consideration to the computational cost of the feature reduction process. However, most of these networking datasets are outdated as benchmark datasets for evaluating classification models in NIDS for IoT security.
Furthermore, many studies use wrapper-based feature selection to find the best feature subsets to improve classification performance. Shafiq et al. [17] introduced a feature selection method called CorrAUC, a wrapper-based FS algorithm that employs the area under the curve (AUC) metric to choose effective features for machine learning (ML) algorithms. The method was tested on the Bot-IoT dataset [18] with four ML algorithms; it effectively selected informative features, but had lower precision for certain attacks such as keylogging.
In addition, various techniques employing heuristic optimization algorithms, such as genetic algorithms (GA), as a search strategy to identify optimal feature subsets are detailed in [19][20][21]. These methods demonstrated lower false alarm rates compared to baseline approaches, using datasets like UNSW-NB15 and KDD99. In study [22], the researchers utilized the Pigeon Inspired Optimizer (PIO) for the feature selection process, binarizing the continuous PIO and contrasting it with the conventional approach for binarizing continuous swarm intelligence algorithms. The evaluation was conducted on datasets including KDDCUP99, NSL-KDD, and UNSW-NB15, showcasing outcomes with a high detection rate and accuracy while minimizing false alarms. In addition, some studies designed lightweight models to fit the characteristics of IoT networks: Liu et al. [23] proposed Particle Swarm Optimization (PSO) with a one-class Support Vector Machine (SVM) [24], and an optimized PSO for feature selection with LightGBM, to build lightweight models for detecting attacks. However, it is worth noting that these feature selection strategies often come at a high computational cost, especially when relying on GA, PSO, or machine learning-based classifiers, which has a negative impact on resource-constrained IoT systems and networks.
Moreover, many studies employed hybrid feature selection methods to improve the performance of attack classifiers while reducing overfitting in the model training task.
In study [25], the authors utilized association rule mining and central attribute values for feature selection, with results on the UNSW-NB15 dataset outperforming those on NSL-KDD. In addition, some studies employed ensemble feature selection techniques to find the significant features. For example, Moustafa et al. [26] employed an ensemble intrusion detection technique, which combined DT, ANN, and NB as the base learners to learn the optimal features from statistical flow features, while Leevy et al. [27] employed information gain, information gain ratio, and Chi-squared (Chi2) feature ranking techniques for feature selection. However, the computational cost is overlooked in the pursuit of improving the performance metrics.
As a response to this challenge, researchers investigated a correlation-based feature selection method that offers a more computationally efficient solution for NIDS, considering the correlation among features [6]. This approach was initially applied to the KDD99 and UNSW-NB15 datasets in [28]. More recently, Moustafa et al. [26] proposed an improved correlation-based method for multivariate correlation-based network anomaly detection systems; moreover, Gavel et al. [29] employed a correlation-based fitness function with ant lion optimization to select features using the AWID dataset for wireless networks. Zhou et al. [30] chose the optimal features by removing redundant features and selecting the most informative ones based on a correlation threshold. These works lead to a substantial improvement in NIDS accuracy, albeit with increased complexity. In light of the need for real-time and low-latency attack detection solutions, this study places greater emphasis on the correlation-based feature selection method.
Unlike feature selection, which maintains a subset of the initial features, feature extraction in Network Intrusion Detection Systems (NIDS) focuses on condensing the original features into a lower-dimensional vector while preserving much of the information, and it has been applied in various research domains. In image processing and pattern recognition, feature extraction involves transforming raw data, such as images, into a reduced and more meaningful representation [31]. The primary goal is to capture essential information that is relevant for subsequent analysis, classification, or recognition tasks. For example, Miseikis et al. [32] employed a multi-objective convolutional neural network to extract features, identify, and precisely localize a robot in 2D camera images, allowing flexibility in camera movement and providing accurate 3D position estimates for the robot base and joints. Aggarwal [33] explored the use of the Grey-Level Co-occurrence Matrix (GLCM) feature extractor in classifying brain tumor MRI images with a random forest classifier. The results indicate that GLCM features with optimal parameters can achieve promising accuracy in capturing significant texture components. Various methods, such as principal component analysis (PCA), linear discriminant analysis (LDA), and neural-network-based autoencoders (AE), have been utilized for reducing dimensions in NIDS.
For example, in [34], the KDD99 dataset's dimensionality was greatly reduced by PCA, improving NIDS performance and accuracy while handling attack classification via support vector machines. Various PCA variants, such as hierarchical PCA neural networks using the 1998 DARPA dataset [35] and kernel PCA with genetic algorithms [36], have been adopted for intrusion detection to improve precision for less common attacks. Applications of PCA to recent network traffic datasets like UNSW-NB15 and CICIDS2017 can be found in [37,38]. Additionally, LDA has been utilized as a feature reduction method in NIDS to notably decrease computational complexity, as seen in [39]. In [40,41], a combination of PCA and LDA was employed to build a two-layer dimension reduction approach, effectively reducing dimensionality and detecting low-frequency malicious activities on the NSL-KDD dataset.
To improve the efficiency of feature extraction in NIDS, various research works have applied AE-based neural networks. In particular, Yan and Han [7] introduced a stacked sparse AE approach to build a non-linear mapping of high-dimensional to low-dimensional data on the NSL-KDD dataset. Khan et al. [42] employed a deep stacked AE to reduce the number of features for both binary and multiclass classification, achieving higher accuracy than previous methods. Several AE-based networks built on long short-term memory (LSTM), including variational LSTM [43] and bidirectional LSTM [44], have been developed for dimensionality reduction in NIDS, addressing imbalance and high-dimensionality problems effectively. However, it is worth noting that AE-based methods, derived from deep neural networks, entail higher computational costs in both training and testing compared to the statistical PCA and LDA algorithms.
To mitigate the computational cost issue, a network pruning algorithm was recently proposed in [45] to build a lightweight detection model, significantly reducing the complexity of AE structures for feature extraction in NIDS, using the UNSW-NB15 and CICIDS datasets. Moreover, in [46], a network design integrates an autoencoder (AE) with convolutional and recurrent neural networks to extract spatial and temporal features without human intervention.
A wide range of studies has thus employed various feature reduction or dimensionality reduction techniques, which can be classified into two categories, namely feature selection and feature extraction, to build lightweight detection models for NIDS. However, few studies conduct a comprehensive comparison of the performance and efficiency of the two methods, particularly on IoT data. For example, Aminanto et al. [9] combined AE-based feature extraction and supervised machine learning feature selection to learn representations of the original features, without comparing the performance of the two, while [47] only conducted such a comparison on the traditional networking dataset UNSW-NB15.
It is important to highlight that most of the previously mentioned studies have concentrated on either enhancing the detection accuracy or reducing the computational complexity of NIDS. They accomplished this by utilizing machine learning classifiers and feature engineering methods such as FS and FE to minimize data complexity. Nonetheless, the existing literature lacks a comprehensive comparison between these two feature reduction methods on current IoT network datasets. Our study endeavors to fill this gap.
In particular, we initiate the creation of a machine learning-driven NIDS framework utilizing diverse IoT data, emphasizing the feature reduction evaluation phase. Within this context, we identify feature selection through the correlation matrix and feature extraction using PCA as promising approaches for practical low-latency NIDS operations. We then perform an extensive assessment using the contemporary TON-IoT dataset derived from a heterogeneous IoT network, comparing detection performance measures, including accuracy, precision, recall, and F1-score, as well as runtime intricacies such as feature reduction time, model training time, and inference time for these methodologies. Our evaluation encompasses both binary and multiclass classification while maintaining consistency in the quantity of selected or extracted features.

Methodology
In this section, we put the FS or FE technique as a modular component into the pipeline of a machine learning-based network intrusion detection system (NIDS) and compare the two according to the final performance metrics of the classification models. The framework of the methodology, shown in Fig. 1, can be divided into three phases: data pre-processing, feature reduction, and classification. A detailed explanation of the three-phase workflow of the proposed model is provided in the following, particularly for the two feature reduction methods.

Dataset
Below is the key information about the TON-IoT Network dataset, which will be employed in our experiments detailed in the "Experimental setup and analysis" section; a comprehensive discussion of data preprocessing for this dataset follows. The TON-IoT dataset was generated from heterogeneous data sources, collected from telemetry datasets of IoT and IIoT sensors, operating system datasets of Windows and Ubuntu, and network traffic datasets. It was first introduced in [48]. The dataset comprises 22,339,021 instances and includes two target classes: the "label" class, containing normal and attack data, and another class with ten categories, namely normal and nine attack types (Backdoor, DDoS, DoS, Injection, Password, Ransomware, Scanning, XSS, and MITM). The features fall into the groups Connection, Statistical, DNS, SSL, HTTP, Violation, and Labeling, holding a total of 45 features in the original data. However, in this research, our data analysis involves the "Train_Test_Network.csv" dataset, comprising both training and testing sets and totaling 461,043 records. Table 1 displays the distribution of labels for the binary class and types for the multi-class case, while Table 2 shows the dataset's respective features.

Phase 1 data preprocessing
Data preprocessing refers to the process of transforming raw data into a clean, consistent, and meaningful format that can be used for analysis. It plays a vital role in ensuring the quality and suitability of data for machine learning-based classification models in IoT security [49]. Thus, to achieve accurate and reliable results, proper preprocessing of IoT datasets is crucial. As described in the methodology framework, this phase comprises feature elimination, missing value handling, duplicates removal, non-numerical feature encoding, and normalization. In addition, data splitting is implemented to split the original data into a training set and a test set; the training set is used for the subsequent normalization, feature reduction, and model training processes, while the test set is set aside for final model prediction for both binary and multi-class classification.

Feature elimination
To maintain the generalization of models intended for real-scenario classification in IoT networks, the features that represent identifiers of the test environment in which the data was generated are eliminated. The "ts" feature represents the timestamp of each connection, while "src_ip", "src_port", "dst_ip", and "dst_port" identify each instance; none of these features are significant as predictors for the subsequent model training [50]. Therefore, after eliminating these unnecessary features, 38 features are left in the dataset, excluding the two labels.
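As an illustrative sketch (not the authors' code), this elimination step could be done in Python with pandas; the toy frame below only mimics the identifier columns named above, and its values are assumed:

```python
import pandas as pd

# Toy frame with the identifier columns named in the text (assumed values).
df = pd.DataFrame({
    "ts": [1556000000.0, 1556000001.0],
    "src_ip": ["192.168.1.2", "192.168.1.3"],
    "src_port": [1337, 49152],
    "dst_ip": ["10.0.0.1", "10.0.0.2"],
    "dst_port": [53, 443],
    "proto": ["udp", "tcp"],
    "label": [0, 1],
})

# Drop environment-specific identifiers so models cannot memorize them.
ID_COLUMNS = ["ts", "src_ip", "src_port", "dst_ip", "dst_port"]
df = df.drop(columns=ID_COLUMNS)
print(list(df.columns))  # ['proto', 'label']
```

Dropping identifiers before any fitting also prevents the model from keying on test-bed artifacts such as fixed source addresses.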

Missing value handling
All "-" values among the features mean "not available" from the perspective of networking domain knowledge. For example, a "-" value in the connection feature "service" means the instance does not have a service value.
Similarly, instances with a "-" value in the DNS features are not DNS-capable instances. In the same way, "-" values in the remaining SSL, HTTP, and Violation features mean those instances do not support the SSL, HTTP, or Violation capability. Thus, we replace "-" with the value "n/a", meaning not available for this feature, and will create a corresponding new feature named "<feature_name>_n/a", as detailed in the "Non-numerical features encoding" step.
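A minimal sketch of this replacement, assuming a toy "service" column (the values are illustrative, not from the dataset):

```python
import pandas as pd

# Toy "service" column (assumed values): "-" marks a missing service.
service = pd.Series(["http", "-", "dns", "-"], name="service")

# Replace the networking placeholder "-" with an explicit "n/a" category;
# a later encoding step can then emit a dedicated "service_n/a" indicator.
service = service.replace("-", "n/a")
print(service.tolist())  # ['http', 'n/a', 'dns', 'n/a']
```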

Duplicates removal
After investigating the dataset, we found 11,071 duplicated rows, so a mechanism to handle them is needed. Because duplicate instances do not contribute meaningful information to the model building process, we directly drop them, keeping only the unique instances. The remaining dataset contains 449,972 rows of unique instances.
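In pandas, for example, deduplication is a one-liner; the toy frame and its values below are assumed for illustration:

```python
import pandas as pd

# Toy frame (assumed values) with one exact duplicate row.
df = pd.DataFrame({"proto": ["tcp", "tcp", "udp"], "dst_port": [80, 80, 53]})

# Exact duplicates add no information to model fitting, so keep unique rows.
df = df.drop_duplicates().reset_index(drop=True)
print(len(df))  # 2
```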

Non-numerical features encoding
Non-numerical data cannot be used directly in the model training process, yet the Network TON-IoT dataset at this stage has 38 features, including 15 numerical features and 23 non-numerical features. Since there are many categorical features in the dataset, we need to convert the non-numerical features into numerical ones so that the subsequent reduction and machine learning algorithms can process the data. Label encoding and one-hot encoding are common methods for handling categorical variables in machine learning, and the choice between them depends on the specific dataset and the ML algorithm used. Label encoding is simpler and more space-efficient, but it may introduce an arbitrary order to categorical values. One-hot encoding avoids this issue by creating binary columns for each category, but it can lead to high-dimensional data [51]. In our work, we implement different encoding schemes considering the characteristics of the various features in the dataset.
We employ the one-hot encoding method for the connection features "proto", "service", and "conn_state" because they all have distinct and finite values. For instance, the "proto" feature has the values "icmp", "tcp", and "udp", which are encoded as the features "proto_icmp", "proto_tcp", and "proto_udp" with binary values 0 or 1. The same scheme is applied to the features "service" and "conn_state"; the only difference is that "service_n/a" and "conn_state_n/a" will also be generated, since these original features contain the "n/a" value. Regarding DNS features such as "dns_query", one-hot encoding or direct label encoding may not be the best course of action due to the feature's numerous possible values. As a result, we employ a binary encoding approach for this feature, indicating whether or not the instance has a DNS request. For the features "dns_AA", "dns_RD", "dns_RA", and "dns_rejected", we convert the non-numerical values into numerical ones using a one-hot encoder.
Regarding the weird features, the non-numerical ones are "weird_name", "weird_addl", and "weird_notice". For "weird_name", binary encoding will be used, while for "weird_addl" and "weird_notice", one-hot encoding will be used.
Consequently, as presented in Fig. 2, the number of features increases from the initial 38 to 77 once the aforementioned features are encoded, many of which are not particularly useful in classifying attacks. To reduce the complexity of the machine learning models during the classification stage, it is therefore necessary to condense such a vast number of attributes into a small number. Different encoding schemes are used for the non-numerical features above based on the features' qualities in the dataset, enabling the transformed data to proceed to the next phase.
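The mixed scheme above can be sketched in Python with pandas; this is an illustrative example with an assumed two-column frame, combining one-hot encoding for the low-cardinality "proto" feature with a binary presence flag for the high-cardinality "dns_query" feature:

```python
import pandas as pd

# Toy frame (assumed values) mixing a low-cardinality and a
# high-cardinality categorical feature.
df = pd.DataFrame({
    "proto": ["tcp", "udp", "icmp"],
    "dns_query": ["example.com", "n/a", "test.org"],
})

# One-hot encode the low-cardinality "proto" feature.
df = pd.concat(
    [df, pd.get_dummies(df["proto"], prefix="proto", dtype=int)], axis=1
)

# "dns_query" has too many distinct values for one-hot encoding, so collapse
# it to a binary "DNS request present" flag instead.
df["dns_query_present"] = (df["dns_query"] != "n/a").astype(int)

df = df.drop(columns=["proto", "dns_query"])
print(sorted(df.columns))
# ['dns_query_present', 'proto_icmp', 'proto_tcp', 'proto_udp']
```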

Data splitting
Data splitting divides the original data into two sets: a training set, which is used for model training, and a test set, which is used for the final performance evaluation of the trained model. To avoid data leakage in the subsequent data transformation steps, such as normalization and feature reduction, and in the following machine learning process, data splitting is implemented before them [51].
Moreover, in order to verify the effectiveness of the trained model, the class proportions of the test set are kept nearly the same as those of the training set to simulate the real scenario of IoT networks. Thus, we use a stratified splitting scheme to split the dataset into training and test data with a proportion of 80:20, in which 80% of the dataset is used for model training and the remaining 20% for model evaluation. As a result of the data splitting, the distribution and the specific number of instances of the normal/attack classes and the ten classes, for binary classification and multi-classification purposes respectively, are shown in Figs. 3 and 4.
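A stratified 80:20 split of this kind can be sketched with scikit-learn's `train_test_split`; the toy imbalanced labels below are assumed for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumed): 90 normal (0) vs 10 attack (1) instances.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# Stratified 80:20 split keeps the class ratio identical in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(len(y_train), len(y_test), int(y_test.sum()))  # 80 20 2
```

Without `stratify=y`, a rare attack class could end up under- or over-represented in the test split, distorting the evaluation.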

Normalization
Normalization is used to keep the scale of the features without bias towards features with large values. In machine learning, two commonly used feature scaling techniques are normalization and standardization. The studies [25,47] used the normalization technique to scale the features; thus, we use min-max scaling to normalize the data. The data in this experiment were normalized to the range of 0 to 1 using min-max scaling. The normalization formula is shown as Eq. (1):

X_normalized = (X - X_min) / (X_max - X_min)    (1)

where X is the original value of the data point and X_normalized is the normalized value of the data point. X_min is the minimum value of the variable, while X_max is the maximum value of the variable in the dataset.
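Min-max scaling as in Eq. (1) can be sketched with scikit-learn's `MinMaxScaler`; the single-feature values are assumed, and, consistent with the leakage-avoiding split above, the scaler is fitted on the training split only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy single-feature data (assumed values for illustration).
X_train = np.array([[10.0], [20.0], [30.0]])
X_test = np.array([[25.0]])

# Fit X_min/X_max on the training split only, then apply Eq. (1) to both
# splits, so no statistics leak from the test set.
scaler = MinMaxScaler(feature_range=(0, 1)).fit(X_train)
X_train_scaled = scaler.transform(X_train)  # maps 10->0.0, 20->0.5, 30->1.0
X_test_scaled = scaler.transform(X_test)    # maps 25->0.75
```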
As demonstrated in Algorithm 1, we developed our preprocessing technique based on the preceding procedures discussed above in the phase 1 data preprocessing.

Feature selection
A series of feature selection techniques has been implemented in NIDS for IoT security, such as Gini impurity [3], Chi-square [4], Information Gain [5], Mutual Information [25], and Feature Correlation [29,33,50]. In this work, we focus on employing feature correlation to pick informative features within a given threshold range, because it has been found to attain competitive detection accuracy and complexity when compared to other selection counterparts. In a correlation-based feature selection approach, the correlation between each feature and the target variable is typically calculated. In this methodology, we implement correlation-based feature selection based on the Pearson correlation coefficient technique [46], selecting features that are not correlated with each other to reduce multicollinearity. Threshold values on the correlation score are set iteratively until the full number of features of the dataset is covered, to build classifiers using five machine learning models, which will be explained in phase 3.
The Pearson correlation coefficient (PCC) represents a straightforward linear correlation approach used to evaluate feature interdependencies. Employing this correlation-based technique, our objective is to avoid selecting features that are highly correlated with other features. This selection process relies on the correlation matrix computed from the preprocessed training set following the steps outlined in the "Phase 1 data preprocessing" section. The PCC between features f1 and f2 is calculated as follows:

PCC(f1, f2) = cov(f1, f2) / (σ_f1 σ_f2)    (2)

where cov is the covariance, computed as the mean of (f1 - M_f1)(f2 - M_f2), σ is the standard deviation, and M_f1 and M_f2 indicate the means of f1 and f2, respectively.
The Pearson correlation coefficient is a measure of the linear correlation between two variables, ranging from -1 to 1. A coefficient of 1 indicates a perfect positive correlation, a coefficient of -1 indicates a perfect negative correlation, and a coefficient of 0 indicates no correlation. The closer the coefficient is to 1 or -1, the stronger the correlation between the variables.
The average correlation score of each feature with the others is then calculated based on Algorithm 2. The average correlation scores provide a summary measure of the overall correlation tendency of each feature with respect to all other features in the dataset. A higher average score indicates a feature that, on average, tends to be positively correlated with other features, while a lower average score suggests a feature with weaker or more varied correlations. The assumption behind the features to be selected is the independence of each feature: features with weak or no correlation may be more independent, potentially contributing unique information to the model [46].

Algorithm 2 Calculating the average correlation score for each feature in phase 2

Once the average score of each feature is calculated, the next step is to define the threshold or range of the average score, in order to select different numbers of features as benchmarks for comparison with feature extraction, followed by model training and validation. The criteria for the range are based on two aspects: one is to start from a small feature subset and increase the number of features until the full set is covered; the other is that the number of selected features is based on the threshold or range of average scores relative to the overall average scores of the features, which will be detailed and visualized in the "Features selected based on correlation thresholds" section.
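A minimal sketch of this average-correlation scoring, assuming synthetic data and a hypothetical threshold value (the real experiments sweep the threshold over the dataset's features):

```python
import numpy as np
import pandas as pd

# Synthetic training frame (assumed): f2 nearly duplicates f1, f3 is independent.
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "f1": base,
    "f2": base + rng.normal(scale=0.05, size=200),
    "f3": rng.normal(size=200),
})

# Absolute Pearson correlation of every feature pair, ignoring self-correlation.
corr = df.corr(method="pearson").abs()
np.fill_diagonal(corr.values, np.nan)

# Average correlation score of each feature with all other features.
avg_score = corr.mean(axis=1)

# Keep features whose average score falls below a chosen threshold, i.e. the
# more independent ones carrying more unique information.
threshold = 0.3
selected = avg_score[avg_score < threshold].index.tolist()
print(selected)  # here only the weakly correlated feature 'f3' survives
```

On real data the threshold would be varied to produce feature subsets of different sizes, as described above.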
Furthermore, feature correlation only needs to be computed during the training phase; during the testing phase, we directly pick the selected features from the original high-dimensional test data set to generate the reduced-dimensional test data in Fig. 1's feature reduction module. In contrast, in feature extraction, the PCA technique is applied to both the training and test data sets to reduce dimensionality, and this is counted as the run duration of the feature reduction operation.

Feature extraction
A range of feature extraction methods has been used in the IoT security domain, including PCA [39], LDA [52], and AE [53], with PCA and AE standing out as the most widely used extraction methods applied in NIDS for IoT security. Unlike feature selection, which maps chosen features directly to those in the original dataset, these feature extraction techniques use a projection matrix or an Autoencoder-based neural network learned from a training dataset to condense the high-dimensional data into lower-dimensional data. It should be noted that the AE approach often carries the higher computational complexity associated with deep neural networks (DNN), resulting in greater latency than PCA. Consequently, this study focuses exclusively on the PCA-based feature extraction approach, a choice driven by the needs of resource-constrained IoT devices and low-latency NIDS protecting IoT networks from cyber threats.
Principal Component Analysis (PCA) is a powerful dimensionality reduction technique used to transform high-dimensional data into a lower-dimensional representation while retaining the most important information. The mechanism of PCA is based on the calculation of the eigenvectors and eigenvalues of the data's covariance matrix:

C = (1/N) X^T X

where C represents the covariance matrix of the standardized data X consisting of N samples, and X^T is the transpose of X. The matrix C captures the relationships between features in the data.
The projection matrix W_k is a key component in PCA and represents the transformation matrix that projects the original data onto the first k principal components:

W_k = [V_1, V_2, V_3, ..., V_k]

where V_1, V_2, V_3, ..., V_k are the eigenvectors corresponding to the top k eigenvalues of the covariance matrix, and each column of W_k corresponds to a principal component.
Using W_k, the projection of the data onto the first k principal components is

X_proj = X W_k

where X_proj is the projected data.
The data can then be reconstructed from the first k principal components as

X_rec = X_proj W_k^T

which illustrates how well the original data can be approximated using the reduced set of principal components.
After that, the scree plot value for each extracted feature is calculated as

Explained variance ratio_i = λ_i / Σ_{j=1}^{d} λ_j

where λ_i is the i-th eigenvalue of the covariance matrix and Σ_{j=1}^{d} λ_j is the sum of all eigenvalues. The scree plot values indicate the proportion of total variance explained by each principal component, and the explained variance per component is obtained by arranging the eigenvalues in decreasing order.
In the training phase, we commence by preparing the dataset for PCA. This step involves fitting the PCA model to the training data, capturing the principal components, and simultaneously transforming the data accordingly. The algorithm first standardizes the data by subtracting the mean and dividing by the standard deviation for each feature, ensuring all features share the same scale, and then computes the covariance matrix C. Eigenvalues and eigenvectors are derived from C, and by sorting the eigenvalues in descending order, the most significant components are selected. A projection matrix is constructed from these eigenvectors, enabling the data to be projected onto a lower-dimensional subspace. The result is a compact representation of the data that captures the essential information while reducing dimensionality, making it useful for simplifying machine learning models. For comparison purposes, the number of components defined in PCA is exactly the same as the number of features selected by feature selection in each iteration.
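The training-phase steps above (standardize, covariance matrix C, eigendecomposition, projection matrix W_k, scree values) can be sketched in NumPy as follows; the synthetic data and function name are our own:

```python
import numpy as np

def fit_pca(X, k):
    """PCA via eigendecomposition of C = (1/N) X_std^T X_std."""
    mean, std = X.mean(axis=0), X.std(axis=0)
    X_std = (X - mean) / std
    C = (X_std.T @ X_std) / X_std.shape[0]
    eigvals, eigvecs = np.linalg.eigh(C)      # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]         # re-sort descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    W_k = eigvecs[:, :k]                      # projection matrix
    explained = eigvals / eigvals.sum()       # scree plot values
    return W_k, explained, mean, std

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=500)   # add a redundant feature
W_k, explained, mean, std = fit_pca(X, k=2)
X_proj = ((X - mean) / std) @ W_k                # X_proj = X W_k
print(X_proj.shape)
```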
In the testing phase, it is common practice in machine learning workflows to fit the PCA model on the training data and then use the learned transformation to transform both the training and test datasets. This approach ensures that the same transformation is applied consistently to both sets of data, maintaining the relationship between the principal components [39]. When dealing with test data, the pre-fitted PCA transformation is applied using the transform method, ensuring consistency across both training and test datasets. In addition, unlike the run time of feature selection on the test set, which is ignored in the performance evaluation, the run time of PCA on test data is counted as part of the whole feature reduction run time in phase 3.
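With scikit-learn, the fit-on-train / transform-both practice described above looks like the following sketch; synthetic data stands in for the TON-IoT sets:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
X_train, X_test = rng.normal(size=(400, 10)), rng.normal(size=(100, 10))

scaler = StandardScaler().fit(X_train)              # fit on training data only
pca = PCA(n_components=4).fit(scaler.transform(X_train))

# The same learned transformation is applied to both sets.
X_train_red = pca.transform(scaler.transform(X_train))
X_test_red = pca.transform(scaler.transform(X_test))
print(X_train_red.shape, X_test_red.shape)
```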
As demonstrated in Algorithm 3, we developed and analyzed our feature reduction algorithm based on the procedures discussed above for the phase 2 feature reduction.
Algorithm 3 Feature reduction in phase 2

Phase 3 attack classification
In this phase, for the classification tasks, we choose the following five classic machine learning models, those most frequently utilized in recent works on NIDS, to implement a comprehensive comparison among different classifiers between the two feature reduction methods. The specific hyperparameters of the models are explained in "Experimental setup and analysis" section.

Decision tree (DT)
The decision tree classifier is a widely used machine learning model that builds a tree-like structure of decisions based on the features of the data [54]. It works by recursively splitting the data into subsets based on the feature that best separates them, typically using measures like Gini impurity or information gain. The main benefit of decision trees lies in their interpretability and ability to handle both numerical and categorical data. The primary goal of the decision tree algorithm is to use a cost function to find the optimal splits. Decision trees are applicable to IoT network intrusion detection due to their transparency and the ease of understanding which features are critical for detecting attacks. The Gini impurity is used in this work as the splitting criterion, selecting a feature for splitting at each stage of the tree training, as Eq. (11) illustrates:

Gini(D) = 1 − Σ_{i∈C} P(i)^2

where D is the training dataset, C is the set of class labels, and P(i) is the proportion of samples with class label i. The Gini impurity is 0 when only one class is present.
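The Gini impurity of Eq. (11) can be computed in a few lines; the toy labels are illustrative:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini(D) = 1 - sum_i P(i)^2 over the class labels i present in D."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

print(gini_impurity(["attack"] * 10))                   # single class -> 0.0
print(gini_impurity(["attack"] * 5 + ["normal"] * 5))   # balanced -> 0.5
```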

Random forest (RF)
The random forest classifier is an ensemble method that builds multiple decision trees and combines their outputs to make predictions [55]. Each tree is trained on a random subset of the data with replacement (bootstrap samples) and a random subset of features. This ensemble approach reduces overfitting and improves prediction accuracy. The main benefit of random forests is their robustness and ability to handle high-dimensional data. However, they may not provide as much interpretability as single decision trees. The algorithm behind random forests aggregates the results from multiple decision trees. Random forests can be particularly useful for IoT intrusion detection, as they offer a good balance between accuracy and interpretability. The Gini impurity as presented in Eq. (11) is also used as the split criterion.

k-Nearest neighbors (kNN)
The k-nearest neighbors classifier is a simple instance-based learning model that classifies data points based on the majority class among their k nearest neighbors in feature space [56]. It operates on the assumption that similar data points share the same class label. The main benefit of kNN is its simplicity and effectiveness for non-linear data. However, it can be sensitive to the choice of the distance metric and the value of k. The algorithm calculates distances (e.g., Euclidean distance) between data points to find the k nearest neighbors. In IoT intrusion detection, kNN can be useful when there is a need to adapt quickly to new attack patterns and anomalies. Due to its widespread usage as a distance metric, the Euclidean distance was selected, defined in Eq. (12) as:

d(x, y) = sqrt( Σ_{i=1}^{n} (x_i − y_i)^2 )

where d(x, y) is the Euclidean distance between the two samples, x_i and y_i are the i-th components of the first and second samples, and n denotes the number of features.
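Eq. (12) in code, with a classic 3-4-5 example:

```python
import math

def euclidean(x, y):
    """d(x, y) = sqrt(sum_i (x_i - y_i)^2), Eq. (12)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean([0, 0], [3, 4]))  # 5.0
```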

Naive Bayes (NB)
The Naive Bayes classifier is a probabilistic model based on Bayes' theorem, which calculates the probability of a data point belonging to a specific class given its feature values [57]. It assumes that features are conditionally independent, which is a simplifying, albeit "naive," assumption. Naive Bayes is computationally efficient, particularly for text classification tasks, and can handle high-dimensional data. However, its performance may suffer if the independence assumption is violated. The algorithm calculates class probabilities using Bayes' theorem. In IoT intrusion detection, Naive Bayes can be useful when computational resources are limited and quick training and classification are needed. Bayes' theorem is expressed in Eq. (13):

P(L|X) = P(X|L) P(L) / P(X)

where P(L|X) is the posterior probability of class L, P(L) is the prior probability, P(X|L) is the likelihood function, and P(X) is the evidence probability; these parameters are estimated from the training set.
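A tiny numeric sketch of Eq. (13), with P(X) obtained via the law of total probability; the prior and likelihood values are made up for illustration:

```python
# P(L|X) = P(X|L) P(L) / P(X)
def posterior(prior, likelihood, evidence):
    return likelihood * prior / evidence

p_attack = 0.2                  # hypothetical prior P(L = attack)
p_x_given_attack = 0.9          # hypothetical likelihood P(X|attack)
p_x_given_normal = 0.1
# Law of total probability: P(X) = sum over classes of P(X|L) P(L).
p_x = p_x_given_attack * p_attack + p_x_given_normal * (1 - p_attack)
print(round(posterior(p_attack, p_x_given_attack, p_x), 4))
```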

Multi-layer perceptron (MLP)
The Multi-Layer Perceptron classifier is a type of artificial neural network that consists of multiple layers of interconnected nodes (neurons) [58]. It can learn complex nonlinear relationships in data through a process called backpropagation. MLPs are highly flexible and can approximate any continuous function, making them suitable for various tasks. However, they require a larger amount of data for training and careful tuning of hyperparameters to prevent overfitting. The algorithm involves feedforward and backpropagation steps, where weights are updated to minimize the error between predicted and actual outputs. In IoT intrusion detection, MLPs can be applied when the data is highly complex and feature engineering has been performed effectively.
As shown in Algorithm 4, we created the above five classifiers, trained them on 80% of the dataset samples, and tested them on the remaining 20% for the performance evaluation between feature selection and feature extraction.
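A condensed sketch of this training/testing workflow with the five scikit-learn classifiers and an 80/20 split; synthetic data replaces the TON-IoT features, and the library defaults stand in for the hyperparameters of Table 4:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the preprocessed (and feature-reduced) dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "DT": DecisionTreeClassifier(random_state=0),
    "RF": RandomForestClassifier(random_state=0),
    "kNN": KNeighborsClassifier(),
    "NB": GaussianNB(),
    "MLP": MLPClassifier(max_iter=500, random_state=0),
}
scores = {name: accuracy_score(y_te, m.fit(X_tr, y_tr).predict(X_te))
          for name, m in models.items()}
print(sorted(scores))
```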

Experimental setup and analysis
We present an extensive set of experiments examining the performance of the NIDS using the feature selection and extraction methods outlined in "Methodology" section. This evaluation involves a variety of machine learning-based classification models. Our comparison covers performance metrics such as accuracy, precision, recall, F1-score, and MCC, elaborated in "Performance evaluation" section. Both binary and multiclass classification are evaluated, and we also explore model training and inference times to assess detection method efficiency. Additionally, our thorough comparison of FS and FE methods offers valuable insights into their impact on performance metrics, including a comparison with and without feature reduction, providing guidance on selecting the appropriate detection techniques for specific IoT network scenarios.

Experimental setup
Table 3 details the computing platform, hardware, operating system, and software utilized for constructing the NIDS framework in this work.

Performance evaluation
We analyze the following metrics in order to evaluate performance comprehensively: accuracy, precision, recall, F1-score, MCC, model training time, and inference time. True positive (TP), true negative (TN), false negative (FN), and false positive (FP) are the four terms used to define these measurements. To assess per-class classification capability, confusion matrices based on these four quantities are also presented in this study. The F1-score is determined from precision and recall, of which it is the harmonic mean. The Matthews Correlation Coefficient (MCC) takes into account true and false positives and negatives, providing a balanced measure of classification performance; it ranges from −1 to 1, where 1 indicates perfect prediction, 0 indicates no better than random chance, and −1 indicates total disagreement between prediction and observation. The performance metrics, Eqs. (14)-(18), are as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1-score = 2 · Precision · Recall / (Precision + Recall)
MCC = (TP · TN − FP · FN) / sqrt((TP + FP)(TP + FN)(TN + FP)(TN + FN))

As to the evaluation of model efficiency, since both feature selection and feature extraction must go through the data-preprocessing stage, we do not take this step into account and instead focus on the run time of feature reduction, model training on the training set, and model prediction on the test set. In particular, the feature reduction time consists of the time needed to compute the features (Feature Calculation) and to choose the reduced features (Feature Selection) until the dataset containing the reduced features is produced, which is then fed into the machine learning models, as given by Eq. (19):

Feature Reduction Time = Time_Feature Calculation + Time_Feature Selection

and the model training time is given by Eq. (20): Training Time = Time_Model Training.
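Eqs. (14)-(18) as straightforward Python, starting from the four counts; the example counts are arbitrary:

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall, F1, and MCC from TP/TN/FP/FN (Eqs. 14-18)."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)      # harmonic mean of precision and recall
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return acc, prec, rec, f1, mcc

acc, prec, rec, f1, mcc = binary_metrics(tp=80, tn=90, fp=10, fn=20)
print(round(acc, 4), round(f1, 4), round(mcc, 4))
```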

Hyperparameter settings of classifiers
To perform the binary and multiclass classification tasks, we employ five machine learning models from the Python Scikit-learn library: Decision Tree (DT), Random Forest (RF), k-Nearest Neighbours (kNN), Gaussian Naive Bayes (NB), and Multi-layer Perceptron (MLP). The hyperparameter settings for each model are described in Table 4.

Features selected based on correlation thresholds
The correlation score matrix is computed using the Pearson correlation algorithm, and the result is presented in Fig. 5. It provides a comprehensive view of the relationships and dependencies among the different variables. The accompanying heatmap visually enhances the interpretability of these correlations, using a color spectrum to emphasize the strength and direction of the relationships. The average correlation score of each feature is calculated in the next step based on the scores in this matrix.
As seen in Fig. 6, which displays the average correlation score of each feature, we manually define the threshold ranges based on the average scores in the figure. In order to cover all sizes of feature subsets, we manually select the features with the lowest average correlation scores, growing the subset until the full set of features is covered. Thus, we choose the features based on the proposed ranges of the average correlation score. Following the selection of a certain number of features, the same number of principal components is computed and chosen for the performance comparison with PCA-based feature extraction.
As a result, Table 5 presents the lists of 9, 22, 33, 47, and 77 (full) selected features, along with the corresponding average correlation thresholds used to reach those numbers of selected features, enabling a comprehensive comparison and a better understanding of the two feature reduction methods.
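A minimal sketch of threshold-based selection over the average scores; the feature names and score values are hypothetical, not those of Table 5:

```python
import pandas as pd

def select_by_threshold(avg_scores, threshold):
    """Keep features whose average correlation score is at or below the
    threshold, i.e. the most independent features (a sketch of the paper's
    manually defined ranges)."""
    return sorted(avg_scores[avg_scores <= threshold].index)

# Hypothetical average correlation scores for four features.
avg_scores = pd.Series({"src_bytes": 0.05, "dst_bytes": 0.12,
                        "duration": 0.30, "proto": 0.55})
print(select_by_threshold(avg_scores, threshold=0.2))
```

Raising the threshold admits more (and more correlated) features, matching the paper's progression from 9 up to the 77 full features.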

Features extracted based on PCA
We extract the same number of features as selected by feature selection for evaluation and comparison. We present the explained variance score of each extracted feature under the different schemes (9, 22, 33, 47, and 77 extracted features), i.e., the percentage of the total variance in the original dataset captured by each principal component; in other words, it quantifies the amount of information each principal component retains from the original features. In Fig. 7, after performing PCA on the dataset, the principal components are ordered by the amount of variance they explain. In each scree plot, the first principal component explains the most variance, the second the second most, and so on, with bar charts presented for 9, 22, 33, 47, and 77 extracted features, respectively, matching the numbers of selected features for evaluation and comparison purposes. In addition, the explained variance is expressed as a ratio or percentage of the total variance; higher explained variance ratios indicate more significant contributions of principal components in capturing the dataset's variability. This helps in making informed decisions about the number of components to retain for subsequent tasks, such as model training and validation.
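The per-component explained variance ratios discussed here can be obtained directly from scikit-learn's PCA, as in this sketch on synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 9))
pca = PCA(n_components=9).fit(X)

ratios = pca.explained_variance_ratio_   # the scree plot values
print(np.all(np.diff(ratios) <= 0))      # components ordered by variance
print(round(float(ratios.sum()), 6))     # all components together explain everything
```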

Binary classification
Initially, we explore the performance and runtime of the feature selection and extraction methods in binary classification, presented in Tables 6, 7, 8, 9, and 10. For every feature number scheme, five iterations are carried out in order to reach an affirmative conclusion, and the average result is computed from the outcomes of the iterations. These tables showcase the performance metrics and times for 9, 22, 33, 47, and 77 (full) features selected or extracted, respectively. The highlighted values, in bold and red, denote the superior outcomes for both feature selection and extraction.

Regarding classification performance, we first explore the impact of an increasing number of features on both the FS and FE methods. Expanding the number of features appears to enhance the performance of the FS model, while the increase shows no obvious effect on the FE model. Figure 8 illustrates that as the number of features increases, the performance of FS models generally improves from 9 features to the 77 full features. In contrast, the performance of FE models remains nearly constant, with the exception of the kNN model, which displays optimal performance with 33 features, as indicated in Fig. 9. While the performance of the best FS models improves as the number of features increases across Tables 6, 7, 8, 9, and 10, the performance of certain models, like the decision tree, decreases significantly across Tables 8, 9, and 10. This trend aligns with the expectation that as the number of selected features increases, more irrelevant or noisy features may emerge, potentially degrading detection performance.

Furthermore, the classification performance of FE surpasses that of FS, particularly for small numbers of features. The comparison between the two methods in Fig. 10 reveals that when the number of reduced features is relatively small (9, 22, 33, and 47), the classification performance of FE notably outperforms that of FS. This advantage is particularly pronounced for the cases with 9 and 22 features. For instance, as demonstrated in Table 6, using the DT classifier, the highest accuracy and F1-score of FE are 86.54% and 85.62%, respectively, while FS exhibits lower performance with 80.73% accuracy and 78.44% F1-score using the same classifier. However, as the number of features increases, for instance to 47 and the 77 full features in Tables 9 and 10, the advantage of FE gradually diminishes relative to FS. In the full-feature case, both FS and FE favor the RF classifier for the best performance metrics, and FS exceeds FE in terms of accuracy and F1-score, with 88.22% and 87.69%, respectively, surpassing FE's 87.04% and 86.14% under the same RF classifier.
As for which models the two feature reduction methods prefer, the favored models of FS and FE differ as the number of reduced features changes. Tables 6 and 7 demonstrate that with FS, the DT classifier consistently delivers the highest accuracy, precision, recall, and F1-score. This is followed by the MLP model, which becomes more favorable in Tables 8 and 9, and ultimately the RF model emerges as the optimal choice with the full feature set in Table 10. In contrast, the FE method exhibits a different pattern, initially favoring DT with 9 features in Table 6, transitioning to MLP with 22 features in Table 7, and then showing stronger performance with the kNN classifier at 33 features, which marks the peak performance point. Subsequently, the RF classifier becomes the preferred choice for the FE method in Tables 9 and 10.
As for run time performance, we first investigate the run time of the two feature reduction methods, shown in Fig. 11. The runtime efficiency of FE surpasses that of FS, especially with a smaller number of features (specifically 9, 22, and 33). However, the disparity in runtime between the two methods narrows as the number of reduced features increases. This occurs because the FS algorithm demands extra computation to derive the average correlation score for every feature, making it more time-consuming than the PCA-based feature extraction, which compresses high-dimensional data into a lower-dimensional format, as detailed in "Methodology" section.
Moreover, as for model training time, FS takes less time than FE with small numbers of features, such as 9 and 22, for all the models according to Tables 6 and 7, and particularly for the training time of DT and RF across all feature settings in Tables 6, 7, 8, 9, and 10. However, the training times of kNN, NB, and MLP under feature selection exceed those under feature extraction as the number of reduced features grows to 33, 47, and 77, except for MLP with the 77 full features, where training under FS takes 70.78 s versus 86.44 s under feature extraction.
The inference time comparison between models using FS and FE shows a consistent superiority in favor of FE for all feature settings except the case with 22 features. Notably, the DT classifier remains the optimal choice for both feature reduction methods in minimizing inference time, and it stands out among the classifiers for reducing both training and inference times. The kNN classifier exhibits the shortest training time but a considerably longer inference time, while the NB classifier, despite its weaker accuracy, demonstrates modest computational efficiency.
Finally, to better understand the attack detection performance of FS and FE, we evaluate and compare the class-wise F1-scores (normal and attack) using the best FS and FE model in each feature setting, covering 9, 22, 33, 47, and the 77 full features. As Fig. 12 shows, the F1-scores of both normal and attack traffic improve little with the increasing number of features for either feature selection or extraction, which demonstrates the ability of feature reduction to achieve good performance with lower run time for the resulting models. Moreover, the F1-score of normal traffic is clearly higher than that of attack traffic for all feature settings because of the class imbalance between the normal and attack cases in the training set; this is expected, since class imbalance handling is not the focus of our work and was not implemented.
In addition, we find that FE can achieve its highest performance with fewer features: FS needs 33 features for both DT and MLP to achieve the highest F1-score, while FE achieves the same performance using 9 features with the DT classifier and 22 features with the MLP classifier. Moreover, according to Table 11, FE is less sensitive than FS to the differing feature counts across models such as DT and MLP, whereas the performance of FS varies significantly. However, the F1-score of FS for both DT and MLP improves markedly as the number of features increases from 9 to 33, particularly for the attack class, compared with that of FE, which shows that performance can improve when more informative features are added. Based on the outstanding models highlighted in Table 11, we further present their confusion matrices in Fig. 13. We can see that FS outperforms FE slightly on normal traffic classification, while providing less capability to recognize attack traffic.

Multiclass classification
Next, we investigate the performance and computational time of both FS and FE methods for multi-class classification, using Tables 12, 13, 14, 15, and 16. As in the binary classification, five iterations are performed for each feature number scheme in order to obtain an affirmative result, and the average result is calculated from the results of each cycle. The tables highlight the superior results for both feature selection and extraction in bold and italics, following the same criteria applied in the binary classification outlined in "Binary classification" section.
The performance of multi-class classification is significantly lower than that of binary classification: for example, with 9 selected features, the accuracy and F1-score of the best trained model, the decision tree, are 72.65% and 29.39% in multi-class classification, versus 80.73% and 78.44% in binary classification. This is due to the more complex sub-class distribution of the data and less training data per class, especially for rare attack instances such as the MITM attack, which causes a low detection rate compared with the normal/attack binary models. However, we concentrate on the performance comparison of the two feature reduction methods under multi-class classifiers. We first investigate how performance changes with an increasing number of features for both feature selection and feature extraction. As in binary classification, increasing the number of reduced features improves the performance of the feature selection model, while there is no obvious effect on the feature extraction model. As shown in Figs. 14 and 15, when the number of reduced features increases, the classification performance of feature selection generally improves, particularly from 9 to 47 selected features, while that of feature extraction shows no significant improvement, particularly in accuracy. In addition, more features can also degrade performance for both methods: the F1-score of the FS model decreases when the number of features increases from 47 to the 77 (full) features, while that of the FE model decreases starting from 33 features, since additional noisy or irrelevant features are expected to worsen detection performance.
Moreover, the classification performance of FE is much better than that of FS, especially for small numbers of features, mirroring the binary classification results. As shown in Fig. 16, comparing the two feature reduction methods, the classification performance of feature extraction is much better than that of feature selection when the number of features is relatively small, such as 9 and 22. For instance, the highest accuracy and F1-score of the DT model with feature extraction are 77.04% and 40.66%, respectively, while those of feature selection are lower at 72.65% and 29.39%. However, as increasingly more features are added, from 33 and 47 to the 77 full features, the performance gap between FS and FE is no longer significant, and only the precision of FS is higher than that of FE at 33, 47, and the 77 full features.
In addition, the favored models for multi-class classification using FS and FE differ from the binary scenario, where MLP is the best classifier for feature selection and kNN the best for feature extraction with 33 features. As shown in Figs. 14 and 15, DT outperforms the other classifiers for both FS and FE when the number of features is relatively small (9, 22, and 33 features), while MLP achieves higher performance than the other models when the number of features increases to 47 and the 77 full features. This is because DT, a less complex tree-based model, can handle data with limited features, whereas MLP, a more complex neural network, can handle data with more features.
As for run time performance, we first investigate the run time of the two feature reduction methods. Since the two feature reduction algorithms are unchanged between binary and multi-class classification, the feature reduction run time for multi-class classification is the same as for binary classification, as explained in "Binary classification" section. Moreover, as for the model training time, as in binary classification, FS takes less time than FE when the number of features is relatively small, such as 9, 22, and 33, for all the DT models according to Tables 12, 13, and 14. However, when MLP is the best-performing model at 47 and the 77 full features, the model training time of FE is significantly lower than that of FS, according to Tables 15 and 16. As for the inference time of the best-performing models, DT is clearly more efficient than MLP for both feature selection and feature extraction across all feature settings. The inference time of FS is lower than that of FE when a limited number of features is used, such as 9; in contrast, FS takes more time than FE as the number of features increases. Finally, our comparison focuses on the F1-scores for detecting individual attack types, covering 10 attack classes and 1 normal class (as outlined in "Methodology" section) across Tables 17, 18, and 19. These tables outline the performance of FS and FE using 9, 22, 33, and 47 features, and the 77 full features. Within these tables, our emphasis remains on the DT and MLP classifiers for FE and FS, respectively, to achieve optimal detection performance, as previously discussed. Observations from Tables 17 and 18 indicate that FE generally outperforms FS across most classes, except for the injection attack in Table 19. Notably, both methods achieve higher F1-scores for specific classes such as DDoS, Normal, Scanning, and XSS in contrast to the other classes.
Remarkably, the multiclass classification accuracy of FE proves less affected by the number of reduced features than that of FS. A significant finding is FS's inability to accurately detect any MITM samples, even with the best models, across all numbers of features, whereas FE using the best classifiers can successfully detect them. Further exploration includes a comparison of the per-class F1-scores between the two feature reduction methods using the same Decision Tree and MLP classifiers in Tables 18 and 19, respectively. Table 19 shows that, like FS, FE with the same MLP classifier is unable to detect any MITM attack samples accurately. These tables demonstrate that FE, employing the same classifier, tends to identify a broader range of attack types than FS. For example, FS with the DT classifier fails to detect Injection, Password, and Scanning attacks with 9 features, while its counterpart successfully identifies these attacks across all five feature numbers. The disparity arises from FE's ability to extract crucial information from all available features, enabling detection of a wider range of attack types, unlike the FS approach, which relies predominantly on a subset of selected features highly correlated with specific attack types.
Additionally, we examine the confusion matrices of the outstanding models in Fig. 17, based on the models that stand out in the tables. In contrast to the binary classification result, we can observe that FE marginally beats FS in the normal traffic identification task. In particular, DT_FE_33 and MLP_FE_22 are both capable of identifying normal traffic more accurately than DT_FS_47 and MLP_FS_47, respectively. In terms of attack categorization, DT_FE_33 is more capable than MLP_FE_22; on the other hand, MLP_FS_22 is less capable than MLP_FE_22 of classifying backdoor attacks, while MLP_FE_22 is less capable than MLP_FS_47 of recognizing DoS attacks. The absence of model hyperparameter tuning and class-imbalance optimization may account for the failure to detect every attack type. However, since the goal of this study is to distinguish between feature extraction and feature selection, such optimization to further improve classification performance is left for future work.
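To illustrate how per-class detection capability is read off a confusion matrix such as those in Fig. 17, the following sketch computes per-class recall from a tiny synthetic example. The labels, predictions, and scikit-learn-based workflow are illustrative assumptions, not the paper's actual data or code.

```python
# Hypothetical sketch: per-class detection ability from a confusion matrix.
# The traffic labels below are illustrative, not the TON-IoT data.
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = ["normal", "normal", "ddos", "ddos", "backdoor", "backdoor"]
y_pred = ["normal", "normal", "ddos", "backdoor", "backdoor", "backdoor"]
labels = ["normal", "ddos", "backdoor"]

cm = confusion_matrix(y_true, y_pred, labels=labels)
# Row i holds the true class i; the diagonal counts correct detections,
# so per-class recall is diagonal / row sum.
recall_per_class = cm.diagonal() / cm.sum(axis=1)
print(dict(zip(labels, recall_per_class)))
```

A class whose row sums to a nonzero count but whose diagonal entry is zero (recall 0) corresponds to the situation described above, where a model fails to detect any sample of that attack type.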

Statistical verification of results
In order to reach an affirmative conclusion, statistical verification is implemented using the t-test in this section. Two metrics are used for verification: the t-statistic and the p-value. The t-statistic is a measure of how many standard deviations a data point is from the mean of the distribution; a negative sign indicates that the average of one group is significantly below the average of the other group. The p-value is a measure of the evidence against a null hypothesis. In the context of a t-test, it represents the probability of obtaining the observed results (or more extreme) if the null hypothesis is true. A small p-value (typically less than the significance level, e.g., 0.05) suggests that we can reject the null hypothesis. The statistical test summary, Table 20, is based on the original data of the binary and multiclass classification results in Tables 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, and 16. The data in comparison Table 21 that have been validated using the t-test method are listed there; for information on the other conclusion features, see the tables and figures in the "Binary classification" and "Multiclass classification" sections. In summary, when evaluating binary and multiclass classification within the NIDS utilizing the Network TON-IoT dataset, FE emerges as offering not only superior classification performance but also reduced feature reduction time compared to its FS counterpart, particularly as the number of reduced features increases. The advantage of FE is notably higher and more consistent, especially with smaller quantities of features, such as 9 and 22. However, FS generally demonstrates shorter model training and inference times than feature extraction, which is significant for the lightweight model design of NIDS in IoT networks.
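The t-test verification step can be sketched as follows; the scipy-based implementation and the accuracy values are hypothetical stand-ins for the paper's actual Table 20 data, shown only to make the sign convention of the t-statistic concrete.

```python
# Minimal sketch of the statistical verification step, assuming an
# independent two-sample t-test over paired metric columns.
from scipy import stats

fe_accuracy = [0.97, 0.96, 0.98, 0.97, 0.96]  # hypothetical FE results
fs_accuracy = [0.93, 0.92, 0.94, 0.93, 0.92]  # hypothetical FS results

t_stat, p_value = stats.ttest_ind(fs_accuracy, fe_accuracy)
# A negative t-statistic: the FS group mean lies below the FE group mean;
# p < 0.05 rejects the null hypothesis of equal means.
print(f"t = {t_stat:.4f}, p = {p_value:.4f}")
```

With these illustrative numbers the statistic is negative and the p-value falls below 0.05, mirroring how a row such as "−3.6216, 0.0152, *" in Table 20 is interpreted.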
Among the five classifiers, the DT proves to be the optimal choice for enhancing the performance of both feature reduction methods, particularly when the number of features remains small, such as 9 and 22. Conversely, the neural network-based MLP exhibits superior performance for both reduction methods as the number of features increases, reaching values of 33, 47, and the 77 full features (refer to Figs. 14, 15). It is important to highlight that FE shows less sensitivity to changes in the number of reduced features compared to FS, a trend that remains consistent across both binary and multiclass classification. A detailed, comprehensive comparison between the two feature reduction methods in NIDS using a contemporary IoT dataset is provided in Table 21 for further insights.

Conclusions
IoT systems and networks always suffer from computational resource constraints, which impact the applicability of attack classification model training, validation, and deployment for cyber security in real IoT scenarios. Feature reduction is pivotal for constructing a cost-effective and lightweight model capable of classifying attacks in IoT scenarios. Specifically, the objective is to mitigate the challenges associated with resource constraints in IoT devices by reducing the number of features through a thorough evaluation of feature selection and feature extraction methods. In this study, we conducted a thorough comparison between two dimensionality reduction methods, FS and FE, using the contemporary and heterogeneous Network TON-IoT dataset for classification in NIDS. Our extensive analysis revealed that, when reducing to a small number of features (e.g., 9 or 22), FE not only achieved higher accuracy in attack detection but also required less time for dimensionality reduction. However, as the number of features increased (e.g., 33 or more), FS outperformed feature extraction. Therefore, FS demonstrated more potential with fewer features, whereas FE showed room for improvement with a larger number of features. Additionally, we observed that the effectiveness of FS declined significantly with an increased number of selected features, while FE consistently improved. Our study identified the MLP as the optimal classifier for FE, while the DT was the top performer for FS, providing the highest accuracy in attack detection. Both reduction methods favor the DT for lightweight classification models. Moreover, we found that FE is less sensitive to changes in the number of features and can detect a broader range of attack types compared to FS. Both methods exhibit a tendency to detect more attacks, especially abnormal classes, when a larger number of features are selected or extracted. These insights offer valuable guidance for choosing the most suitable intrusion detection method in specific IoT scenarios. It is important to note that our assessment concentrated on two specific feature reduction techniques using classic machine learning algorithms on the TON-IoT dataset. Future research aims to explore the applicability of these findings across a variety of IoT datasets with different applications, such as IoTNIDS, BoT-IoT, MQTT-IoT-IDS, and Edge-IIoTSet. Moreover, our future plans involve conducting an extensive evaluation of additional feature reduction methods within authentic IoT environments, aiming to narrow the gap between academic offline analysis and real-time analysis in practical IoT scenarios.

(1) Data preprocessing: During this phase, the data is processed by cleansing, partitioning, and normalization to standardize the data format. The dataset is divided into two sets: training for feature reduction and testing for final model prediction. A detailed description is presented in the "Phase 1 data preprocessing" section.
(2) Feature reduction: This critical stage employs FS or FE techniques to identify the most crucial attributes, thereby reducing data dimensionality. The transformed data from both methods is then utilized in subsequent classification tasks. The "Phase 2 feature reduction" section offers an in-depth description of the feature reduction methods.
(3) Classification modeling: Various machine learning models, involving Decision Tree, Random Forest, k-Nearest Neighbors, Naive Bayes, and Multi-Layer Perceptron, are employed to validate the impact of the two feature reduction methods. These models perform binary and multiclass classification, offering a comprehensive comparison based on multiple performance metrics.
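The three phases above can be sketched end to end as below. This is a minimal illustration assuming a correlation-based selector for FS, PCA for FE, and synthetic data in place of Network TON-IoT; the paper's exact algorithms and hyperparameters may differ.

```python
# Hypothetical sketch of the three-phase framework: preprocessing,
# feature reduction (FS vs. FE), and classification with the same DT.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Phase 1: normalization and train/test partitioning.
X = MinMaxScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

k = 9  # target number of reduced features

# Phase 2a: feature selection - keep the k features most correlated with y.
corr = np.abs([np.corrcoef(X_tr[:, i], y_tr)[0, 1] for i in range(X_tr.shape[1])])
selected = np.argsort(corr)[-k:]

# Phase 2b: feature extraction - project onto k principal components.
pca = PCA(n_components=k).fit(X_tr)

# Phase 3: train the same DT classifier on both reduced representations.
acc_fs = DecisionTreeClassifier(random_state=0).fit(
    X_tr[:, selected], y_tr).score(X_te[:, selected], y_te)
acc_fe = DecisionTreeClassifier(random_state=0).fit(
    pca.transform(X_tr), y_tr).score(pca.transform(X_te), y_te)
print(f"DT accuracy - FS: {acc_fs:.3f}, FE: {acc_fe:.3f}")
```

The design point this mirrors is that both reduction paths feed the identical downstream classifier, so any performance gap is attributable to the reduction method rather than the model.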

Fig. 1
Fig. 1 Framework of proposed NIDS for comparison of feature reduction methods

Fig. 3 Fig. 4
Fig. 3 Proportions of the normal/attack classes in training and test set of TON-IoT

Algorithm 4
Normal/attack classification in phase 3
The model training time refers to the training time of each classification model, Model Training, as in Eq. (20). Meanwhile, the inference time means the prediction time of the machine learning classifiers, Model Testing, in the testing phase, as in Eq. (21). In particular, the run time of feature reduction involves the transformation of the training set and the test set using the corresponding feature selection or feature extraction algorithm, respectively.
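The training and inference time measurements referred to by Eqs. (20) and (21) can be sketched as follows; the use of time.perf_counter, the DT model, and the synthetic data are illustrative choices, not the paper's exact measurement procedure.

```python
# Sketch of measuring model training time (Eq. 20) and per-sample
# inference time (Eq. 21); dataset and timer choice are illustrative.
import time
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = DecisionTreeClassifier(random_state=0)

t0 = time.perf_counter()
clf.fit(X, y)                       # Eq. (20): training time, in seconds
train_time_s = time.perf_counter() - t0

t0 = time.perf_counter()
clf.predict(X)                      # Eq. (21): inference time
infer_time_ms = (time.perf_counter() - t0) * 1000 / len(X)  # ms per sample
print(f"train: {train_time_s:.4f} s, inference: {infer_time_ms:.6f} ms/sample")
```

This matches the units used in the result tables: seconds for feature reduction and training times, milliseconds per data sample for inference.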

Fig. 5 Fig. 6
Fig. 5 Pearson correlation score matrix of the features in network ToN-IoT

Fig. 10 Fig. 11
Fig. 10 The performance comparison of FS and FE models for binary classification

Fig. 16
Fig. 16 The performance comparison of FS and FE models for multi-class classification

Fig. 17
Fig. 17 The confusion matrix of the outstanding models between FS and FE in multi-classification

Table 2
Features description in network TON_IoT

Table 3
Hardware and software specifications of the implementation environment

Table 4
Hyperparameter settings of each model

Table 5
Correlation threshold with the features selected

Table 6
FS vs. FE for binary classification with 9 features

Table 7
FS vs. FE for binary classification with 22 features

Table 8
FS vs. FE for binary classification with 33 features
Values encompass the highest accuracy, precision, recall, F1-score, MCC, and the lowest feature reduction, training, and inference times within each table column. Time values for feature reduction and training are measured in seconds (s), while inference time per data sample is measured in milliseconds (ms).

Table 9
FS vs. FE for binary classification with 47 features

Table 10
FS vs. FE for binary classification with 77 (full) features

Table 12
FS vs. FE for multiclass classification with 9 features

Table 13
FS vs. FE for multiclass classification with 22 features

Table 14
FS vs. FE for multiclass classification with 33 features

Table 15
FS vs. FE for multiclass classification with 47 features

Table 16
FS vs. FE for multiclass classification with 77 (full) features

Table 17
Class-wise F1-score comparison between FS and FE in multiclass classification

Table 18
Class-wise F1-score comparison between FS and FE in multiclass classification of the same DT

Table 19
Class-wise F1-score comparison between FS and FE in multiclass classification of the same MLP

Table 20
Summary of t-test verification

No. Data T-statistic P-values Significant
including the run time of both model building and inference) compared to other models for both FS and FE − 3.6216 0.0152 *

Table 21
Comparison between FE and FS in various scenarios
of attack class degrades when number of features increases (binary) ✓
16 Higher F1-score in detecting DDoS, normal, scanning and XSS classes ✓
17 Higher F1-score in detecting injection classes ✓
18 More potential to improve performance when the number of features is small ✓
19 More potential to improve performance when the number of features is large ✓