Detecting cybersecurity attacks across different network features and learners

Machine learning algorithms efficiently trained on intrusion detection datasets can detect network traffic capable of jeopardizing an information system. In this study, we use the CSE-CIC-IDS2018 dataset to investigate the effect of ensemble feature selection on the performance of seven classifiers. CSE-CIC-IDS2018 is big data (about 16,000,000 instances), publicly available, modern, and covers a wide range of realistic attack types. Our contribution is centered around answers to three research questions. The first question is, “Does feature selection impact performance of classifiers in terms of Area Under the Receiver Operating Characteristic Curve (AUC) and F1-score?” The second question is, “Does including the Destination_Port categorical feature significantly impact performance of LightGBM and CatBoost in terms of AUC and F1-score?” The third question is, “Does the choice of classifier: Decision Tree (DT), Random Forest (RF), Naive Bayes (NB), Logistic Regression (LR), CatBoost, LightGBM, or XGBoost, significantly impact performance in terms of AUC and F1-score?” The answers to these research questions are all affirmative and provide valuable, practical information for the development of an efficient intrusion detection model. To the best of our knowledge, we are the first to use an ensemble feature selection technique with the CSE-CIC-IDS2018 dataset.

downloadable from the cloud. Nine files consist of 79 independent variables, and the remaining file consists of 83 independent variables.
Machine learning is greatly facilitated by the high number of features in CSE-CIC-IDS2018. Machine learning algorithms typically outperform traditional statistical methods in classification tasks [4,5]. However, the threshold settings of some learners may not be appropriately set for imbalanced data, thus rendering these algorithms inefficient at distinguishing majority and minority classes in a highly imbalanced environment. The learners will consequently fail to properly model the distribution of the positive (minority) class and become biased in favor of the negative (majority) class. Therefore, one must employ metrics that safeguard against this outcome. The two metrics used in this study, F1-score and Area Under the Receiver Operating Characteristic (ROC) Curve (AUC), are suitable for evaluating classifier performance on imbalanced datasets [6,7]. We note that class imbalance is more noticeable in big data as the number of majority class instances is disproportionately high in that environment [8,9].
The ensemble feature selection [10,11] approach in this paper is tailored toward improving classifier performance by using a relevant subset of variables from CSE-CIC-IDS2018. It is worth noting that feature selection also provides data clarity and reduces computation requirements. In our study, we utilize both supervised and filter-based [12] feature ranking techniques, and the last stage of our ensemble approach is the selection of common features from these techniques.
The specific properties of big data can make classification more challenging for learners trained on the 2018 dataset. These properties include volume, variety, velocity, variability, value, and complexity [8]. Traditional methods may have difficulty handling the high data volume, the diversity of data formats, the speed of data originating from different sources, data flow inconsistencies, the filtering of important data, and data linking and transformation.
Classifier performance in our case study is based on the training and testing of the following learners: Decision Tree (DT) [13], Random Forest (RF) [14], Naive Bayes (NB) [15], Logistic Regression (LR) [16], CatBoost [17], LightGBM [18], and XGBoost [19]. These learners are selected because they cover several Machine Learning (ML) model families and are viewed favorably in terms of performance [20]. The seven classifiers are further discussed in the "Classifier development and metrics" section. To the best of our knowledge, this study is the most comprehensive analysis on CSE-CIC-IDS2018 to date. Our work uniquely uses the 2018 dataset to investigate the effect of ensemble feature selection on the performance of seven classifiers. Our contribution is defined by our responses to three research questions. The first question is, "Does feature selection impact performance of classifiers in terms of AUC and F1-score?" The second question is, "Does including the Destination_Port categorical feature significantly impact performance of LightGBM and CatBoost in terms of AUC and F1-score?" And our third question is, "Does the choice of classifier: DT, RF, NB, LR, CatBoost, LightGBM, or XGBoost, significantly impact performance in terms of AUC and F1-score?" The answers to these research questions provide valuable and practical information for the development of an efficient intrusion detection model.
The remainder of this paper is organized as follows: "Related work" provides an overview of literature that manipulates features of CSE-CIC-IDS2018; "Methodology" section describes the cleaning process of the 2018 dataset, our unique ensemble approach for feature selection, the classifiers and metrics used in the study, and the training and testing procedure for these classifiers; "Results and discussion" section presents and discusses our empirical results; "Conclusion" section concludes our paper with a summary of the work presented and suggestions for related future work.

Related work
In this section, we highlight studies that modify features of CSE-CIC-IDS2018 to improve classification results. However, to the best of our knowledge, none of these studies use an ensemble feature selection approach.
To address the high class imbalance of the 2018 dataset, Hua [21] uses an undersampling and embedded feature selection approach with a LightGBM classifier. Undersampling [22] randomly removes majority class instances to alter class distribution. During the data cleaning stage, missing values and useless features were removed, resulting in a modified set of 77 features. String labels were converted to integer labels, which were then one-hot encoded. In addition to LightGBM, six other learners were evaluated in this research work: Support Vector Machine (SVM) [23], RF, Adaboost [24], Multilayer Perceptron (MLP) [25], Convolutional Neural Network (CNN) [26], and Naive Bayes. Learners were implemented with Scikit-learn [27] and TensorFlow [28]. The train to test data ratio was 70 to 30, and XGBoost was used to perform feature selection. LightGBM had the best performance of the group, with an optimum accuracy of 98.37% when the sample size was three million and the top ten features were selected. For this accuracy, the precision and recall were 98.14% and 98.37%, respectively. LightGBM also had the second fastest training time among the classifiers.
In another related work of research [29], five learners were evaluated on two datasets (CSE-CIC-IDS2018 and ISOT HTTP Botnet [30]) to determine the best botnet classifier. The ISOT HTTP Botnet dataset contains malicious and benign instances of Domain Name System (DNS) traffic. The learners in the study include RF, DT, k-Nearest Neighbor (k-NN) [31], Naive Bayes, and SVM. Feature selection was performed using various techniques, including the feature importance method [32] of RF. Subsequent to feature selection, CSE-CIC-IDS2018 had 19 independent attributes while ISOT HTTP had 20, with destination port number, source port number, and transport protocol among the selected features. The models were implemented with Python and Scikit-learn. About 80% of botnet instances were used for training, where five-fold cross-validation was applied. The remaining botnet instances served as the testing set. For optimization, the Grid Search algorithm [33] was used. With regard to CSE-CIC-IDS2018, the RF and DT learners scored an accuracy of 99.99%. Tied to this accuracy, the precision was 100% and the recall was 99.99% for both learners. The RF and DT learners also had the highest accuracy for ISOT HTTP (99.94% for RF and 99.90% for DT).
Li et al. [34], in a third related study, apply clustering and feature selection to CSE-CIC-IDS2018. This unsupervised learning study involves online real-time detection with an autoencoder classifier. An autoencoder encodes data in a way that usually results in dimensionality reduction [35]. For preprocessing, "Infinity" and "NaN" values were replaced by 0, and the data was subsequently divided into sparse and dense matrices, normalized by L2 regularization. A sparse matrix has a majority of elements with value 0, while a dense matrix has a majority of elements with non-zero values. The model was built within a Python environment. The best features were selected by RF, and the train to test data ratio was set as 85 to 15. The Affinity Propagation (AP) clustering [36] algorithm was subsequently used on 25% of the training dataset to group features into subsets, which were sent to the autoencoder. Recall rates for all attack types for the proposed model were compared with those of another autoencoder model called KitNet [37]. Several attack types for both models had a recall of 100%. Only the proposed model was evaluated with the AUC metric, with several attack types yielding a score of 1. Based on detection time results, the authors showed that their model has a faster detection time than KitNet.
Fitni and Ramli [38] adopt an ensemble model approach to compare seven single learners for integration into a classifier unit. The seven learners are as follows: RF, Gaussian Naive Bayes [39], DT, Quadratic Discriminant Analysis [40], Gradient Boosting, and Logistic Regression. The models were built with Python and Scikit-learn. During preprocessing, samples with missing or infinite values were removed. Records that were actually a repetition of the header rows were also removed. The dataset was then divided into training and testing validation sets in an 80-20 ratio. Feature selection [41], a technique for selecting the most important features of a predictive model, was performed using Spearman's rank correlation coefficient [42] and the Chi-squared test [43], resulting in the selection of 23 features. After the evaluation of the seven learners with these features, Gradient Boosting, Logistic Regression, and DT emerged as the top performers for use in the ensemble model. Accuracy, precision, and recall scores for this model were 98.80%, 98.80%, and 97.10%, respectively, along with an AUC of 0.94.
Finally, D'hooge et al. include both CICIDS2017 and CSE-CIC-IDS2018 in a study investigating how efficiently the results of an intrusion detection dataset can be generalized [44]. CICIDS2017 is the predecessor of the 2018 dataset. For performance evaluation, the authors used 12 supervised learning algorithms from various families: DT, RF, Bag [45], gradient-boosted decision tree (GBDT), Extratree [46], Adaboost, XGBoost, k-NN, Ncentroid [47], linearSVC [48], RBFSVC [49], and Logistic Regression. The models were built with the Scikit-learn and XGBoost modules in Python. The authors used feature scaling, which is different from feature selection. Feature scaling attempts to normalize the feature space of all attributes. Results show that the tree-based classifiers yielded the best performance, and among them, XGBoost ranked first with many perfect values for F1-score and AUC. D'hooge et al. hinted overfitting might have been a problem and "further analysis" was warranted. We note that their source code indicated hyperparameter values of max-depth = 35 for some of their tree-based learners. Such values are prone to overfitting. For intrusion detection, the authors concluded that a model trained on one dataset (CICIDS2017) cannot generalize to another dataset (CSE-CIC-IDS2018).
In summary, the related works exhibit shortcomings with nearly perfect classification performance values typically associated with overfitting. We discovered additional shortcomings, such as errors in preparation (e.g. using Destination_Port as a numeric value instead of categorical value) and in data cleaning. Ambiguous specifications are also an issue with regard to reproducibility of the studies.

Data cleaning
Removing certain fields from CSE-CIC-IDS2018 was our first step in the data cleaning stage. We dropped the Protocol field because it is redundant, since the Dst Port (Destination_Port) field mostly contains equivalent Protocol values for each Destination_Port value. We dropped the Timestamp field because we did not want the learners to discriminate between attack predictions based on time, especially with stealthier attacks in mind. In other words, the learners should be able to distinguish attacks regardless of whether they are high volume or slow and stealthy. Dropping the Timestamp field also allows us the convenience of combining or dividing the datasets in ways more compatible with our experimental frameworks.
We removed 59 records that were actually a repetition of the header rows. These were easily found and removed by filtering records based on a white list of valid label values.
The fourth downloaded file, "Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv", differed from the other 9 files of the 2018 dataset. This file contained 4 extra columns: Flow ID, Src IP, Src Port, and Dst IP. We dropped these 4 additional fields.
Certain fields contained negative values which did not make sense, and so we dropped those instances with negative values for the Fwd_Header_Length, Flow_Duration, and Flow_IAT_Min fields. In particular, the negative values from the Fwd_Header_Length field occur with extreme values in other fields. These extreme values skew statistics that are sensitive to outliers.
Eight fields contained values of zero for every instance. Prior to the start of machine learning, we filtered out the following list of fields: We also excluded the Init_Win_bytes_forward and Init_Win_bytes_backward fields, since about half of the total instances contained negative values for these two fields. Similarly, we did not use the Flow_Duration field as some of those values were unreasonably low with zero values. The Flow Bytes/s and Flow Packets/s fields contained some "Infinity" and "NaN" values (with less than 0.6% of the records containing these values). We dropped these instances (total of 95,760) where either Flow Bytes/s or Flow Packets/s contained "Infinity" or "NaN" values.
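The row-level filters described in this section can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the rows are small hypothetical dictionaries, the label whitelist is a placeholder with only two values, and the helper names (is_clean, make_row) are introduced here for illustration.

```python
import math

# Hypothetical whitelist of valid label values; the real dataset has more.
VALID_LABELS = {"Benign", "Bot"}
# Fields named in the text whose negative values mark a record for removal.
NON_NEGATIVE_FIELDS = ("Fwd_Header_Length", "Flow_Duration", "Flow_IAT_Min")
# Fields named in the text that may hold "Infinity" or "NaN".
RATE_FIELDS = ("Flow Bytes/s", "Flow Packets/s")


def is_clean(row):
    """Return True if a record survives the three row-level filters."""
    # 1. Repeated header rows carry a column name instead of a real label.
    if row["label"] not in VALID_LABELS:
        return False
    # 2. Drop records with negative values in the listed fields.
    if any(float(row[field]) < 0 for field in NON_NEGATIVE_FIELDS):
        return False
    # 3. Drop records whose rate fields are infinite or not a number.
    return all(math.isfinite(float(row[field])) for field in RATE_FIELDS)


def make_row(label, header_len="20", bytes_per_s="12.5", packets_per_s="0.5"):
    """Build a toy record; values are strings, as read from a CSV file."""
    return {
        "label": label,
        "Fwd_Header_Length": header_len,
        "Flow_Duration": "100",
        "Flow_IAT_Min": "5",
        "Flow Bytes/s": bytes_per_s,
        "Flow Packets/s": packets_per_s,
    }


rows = [
    make_row("Benign"),                                # kept
    make_row("Label"),                                 # repeated header row
    make_row("Bot", header_len="-8"),                  # negative header length
    make_row("Benign", bytes_per_s="Infinity", packets_per_s="NaN"),
]
cleaned = [row for row in rows if is_clean(row)]
```

The label check runs first so that a repeated header row is rejected before any numeric conversion is attempted on its non-numeric values.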

Ensemble feature selection
After data cleaning, the number of independent features in CSE-CIC-IDS2018 is reduced to 66. We then adopt an ensemble approach for feature selection. Ensemble feature selection is derived from the concept of ensemble learning, which demonstrates that the combination of multiple learning approaches outperforms a single approach for the classification of instances. This intuitive concept has been extended from an ensemble of learners to an ensemble of feature ranking techniques, where distinct feature ranking methods are integrated to provide one ranking.
In our case study, we use seven ranking techniques to generate seven lists of features and subsequently process the resulting lists to select features. The ranking techniques assign a number to each of the 66 usable features in the 2018 data. We use this number to place an ordering on the features. When we apply a feature ranking technique, we select, at most, the top 20 highest ranked features. Our decision to use 20 features was motivated by two factors. First, we wanted to select a list of features long enough that there would be a good chance for different rankings to have elements in common. Second, due to hyper-parameter tuning to avoid overfitting, we set the maximum depth of CatBoost at 5, which causes CatBoost to construct constituent DTs that use only 14 features. Therefore, to keep CatBoost's ranking relevant, we settled on taking a maximum of 20 features from other rankers.
We employ both filter-based and supervised ranking techniques. Filter-based feature ranking techniques create a list of features by a statistic that we calculate for each variable. Supervised feature ranking techniques leverage the structures of constituent DTs of ensemble classifiers to generate a feature importance list. We refer to these ordered lists as "rankings." Selected features appear in at least four out of the seven ranking techniques.

Filter-based techniques
The filter-based feature ranking techniques we use are based on the Information Gain (IG) (also known as Mutual Information) [50], Gain Ratio (GR) [51], and Chi-Squared (CS) [52] statistics. We use the value of the statistic calculated for each feature to filter the list of all features to a reduced list.
To calculate IG and GR statistics, we use the "info_gain" and "info_gain_ratio" functions from the info_gain Python library. To calculate the CS statistic, we use the "chi2" function that is included as part of the Scikit-learn library. One may employ the same method to rank features of a dataset with any of these 3 functions. We do not supply configuration parameters to the IG, GR, or CS functions when we invoke them. Each of the three functions accepts two arrays of data for input. We employ all 66 usable features of the 2018 dataset as the source of data for the first input array, and the "label" value of the 2018 data for the second input array. All three functions also return a list, l, of numbers where we use the value of the ith element of the list to determine its rank, r, relative to the values of the other elements in the list. To be concrete, we create a list l′ of pairs (r_i, i) from l, and then sort l′ in decreasing order of r_i. After sorting l′, we truncate it at the 20th element. If the feature ranking technique assigns an importance of zero to a feature, we do not include it in the list of ranked features. For instance, we find CatBoost assigns an importance greater than 0 to fewer than 20 features. Some readers may have reservations about the applicability of IG, GR, or CS to categorical or numeric features. We apply IG, GR, and CS feature selection techniques to CSE-CIC-IDS2018 network traffic data in a manner similar to Singh et al. in [53]. In their study, Singh et al. apply these techniques to the KDD CUP 1999 network traffic dataset. This dataset is similar to the 2018 dataset in that it contains numeric and categorical features. Therefore, we are comfortable applying these filter-based feature ranking techniques to the 2018 dataset. Table 2 contains the rankings for the three filter-based feature ranking techniques.
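The sort-and-truncate step described above can be sketched as follows. This is an illustrative sketch only: the function name top_k_ranking, the toy scores, and the feature names f0 to f3 are hypothetical, and the score list stands in for whatever a ranking function returns.

```python
def top_k_ranking(scores, feature_names, k=20):
    """Rank features by decreasing score, skip zero scores, keep at most k."""
    # Build the list of (score, index) pairs from the raw score list.
    pairs = [(score, index) for index, score in enumerate(scores)]
    # A feature with an importance of zero is not ranked at all.
    pairs = [(score, index) for score, index in pairs if score != 0]
    # Sort by decreasing score; Python's sort is stable, so ties keep
    # their original relative order.
    pairs.sort(key=lambda pair: pair[0], reverse=True)
    # Truncate at the k-th element and map indices back to feature names.
    return [feature_names[index] for _, index in pairs[:k]]


toy_scores = [0.10, 0.00, 0.75, 0.30]
toy_names = ["f0", "f1", "f2", "f3"]
top_two = top_k_ranking(toy_scores, toy_names, k=2)
all_ranked = top_k_ranking(toy_scores, toy_names)
```

Because zero-scored features are dropped before truncation, a ranker can return fewer than k features, which matches the behavior reported for CatBoost.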
Through filter-based feature ranking, we obtain three out of the seven lists used to select features for our models. The remaining four lists of features are obtained with supervised feature selection techniques that are discussed in the next subsection.

Supervised feature ranking techniques
For the supervised feature ranking techniques, we use the feature importance lists from the RF, CatBoost, XGBoost, and LightGBM Python libraries. CatBoost, LightGBM, and XGBoost are Python libraries of their own. The RF implementation we use is part of the Scikit-learn Python library. Here, we discuss how we use elements in common to the implementations of RF, CatBoost, XGBoost, and LightGBM to employ them in ranking techniques. All four libraries have classifier objects. These objects have initialization (constructor) functions. One may pass configuration options to the initialization functions. Please see Tables 3, 4, 5, and 6 for the configuration options we use for each classifier object. Each object also has a "fit" function. After the classifier's fit function is successfully invoked, the classifier object has a list attribute "feature_importances_". We use the feature_importances_ list in the same way we use the list of values l returned by the functions for filter-based ranking techniques discussed in the previous subsection. Hereafter, we refer to the feature selection technique of using the feature importance values from CatBoost, LightGBM, XGBoost, and RF classifiers by the names of the classifiers, where not ambiguous.
As discussed in the previous subsection, CSE-CIC-IDS2018 has one categorical feature: Destination_Port. We found this feature has 53,760 possible values in the 2018 dataset. Hence, we concluded that finding an appropriate encoding technique for this feature is outside the scope of our study. However, CatBoost and LightGBM have built-in support for categorical features, so we include Destination_Port as a candidate for ranking for CatBoost and LightGBM, but not for XGBoost or Random Forest. All supervised ranking techniques yield 20 features, except CatBoost. Due to implementation details and the hyper-parameter settings we use, CatBoost will not construct DTs with a number of nodes sufficient to utilize 20 or more features in the training data. Hence, we find CatBoost provides rankings with fewer than 20 features. Tables 7 and 8 contain the top 20 features in all supervised rankings, except CatBoost, which ranks only 14 features.
We use supervised ranking techniques to generate 4 out of 7 rankings, and filter-based ranking techniques to generate the remaining 3 out of 7 rankings. We use the 7 rankings to conduct ensemble feature selection. In the following subsection, we cover the specifics of our feature selection techniques.

Feature selection
After obtaining the 7 rankings, we apply feature selection techniques that select the features appearing in at least k out of 7 rankings, where k has the value 4, 5, 6, or 7. Hence, we have 4 feature selection techniques, based on an ensemble of 7 feature ranking techniques. We refer to the set of features that appear in at least 4 out of 7 rankings as "feature group 1." This is our first ensemble feature selection technique. Since feature group 1 contains the Destination_Port categorical feature, which some learners that we use cannot consume directly, "feature group 1A" is the set of all features in feature group 1, but excluding Destination_Port. We believe Destination_Port is a valuable categorical feature, but it is only usable for 5 out of 7 of our rankers (CS, IG, GR, CatBoost, and LightGBM). This puts Destination_Port at a disadvantage for getting selected as a feature.
In later experiments, we would like to know how classifiers that can handle categorical features perform with or without Destination_Port. Therefore, for every set of features that a feature selection technique produces, we add or remove Destination_Port as necessary to end up with two sets of features: one that has Destination_Port, and one that does not. For CSE-CIC-IDS2018, when we inspect the 7 rankings for features in common, we find 15 features in feature group 1. For the remaining groups, we follow a naming convention similar to that of the feature selection technique where 4 out of 7 ranking techniques agree on a feature. Namely, if the name of the result of a feature selection technique ends in 'A', it does not contain Destination_Port, and if it does not end in 'A', it contains Destination_Port. Since Destination_Port does not appear in 5 out of 7 rankings, "feature group 2" is the result of applying the feature selection technique where 5 out of 7 rankers agree on a feature, then augmenting the result with Destination_Port. "Feature group 2A" is the set of features that appear in 5 out of 7 rankings. Similarly, we have "feature group 3", "feature group 3A", "feature group 4", and "feature group 4A" that we form in a manner similar to feature group 2 or feature group 2A. Feature group 3 and feature group 3A correspond to the case where 6 out of 7 rankings agree on a feature. Feature group 4 and feature group 4A correspond to the case where 7 out of 7 rankings agree on a feature. Hence, our 4 feature selection techniques net 8 groups of features depending on whether we include or exclude Destination_Port. Please refer to Tables 9, 10, 11, and 12 to see the resulting feature groups for all rounds of feature selection.
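The agreement rule and the with/without Destination_Port variants can be sketched as follows. This is a hedged illustration, not the code used in the study: the function names are hypothetical, the seven toy rankings are invented, and f1 and f2 are placeholder feature names.

```python
from collections import Counter


def select_by_agreement(rankings, k):
    """Return the features that appear in at least k of the given rankings."""
    votes = Counter()
    for ranking in rankings:
        votes.update(set(ranking))  # count each feature once per ranking
    return {feature for feature, count in votes.items() if count >= k}


def with_and_without_port(features, port="Destination_Port"):
    """Return the (with port, without port) variants of a feature set."""
    without = set(features) - {port}
    return without | {port}, without


# Seven toy rankings standing in for CS, IG, GR, CatBoost, LightGBM,
# XGBoost, and RF; the contents are invented for illustration.
toy_rankings = [
    ["Flow_IAT_Max", "f1", "Destination_Port"],
    ["Flow_IAT_Max", "f1", "Destination_Port"],
    ["Flow_IAT_Max", "f1", "Destination_Port"],
    ["Flow_IAT_Max", "f1", "Destination_Port"],
    ["Flow_IAT_Max", "f2"],
    ["Flow_IAT_Max", "f2"],
    ["Flow_IAT_Max", "f1"],
]

# Groups 1 to 4 correspond to k = 4, 5, 6, 7; the "A" variant of each
# group excludes Destination_Port.
groups = {}
for number, k in enumerate((4, 5, 6, 7), start=1):
    with_port, without_port = with_and_without_port(
        select_by_agreement(toy_rankings, k)
    )
    groups[str(number)] = with_port
    groups[f"{number}A"] = without_port
```

In this toy data, Destination_Port reaches only four votes, so it is selected by the rule for group 1 and added by hand for groups 2 through 4, mirroring the procedure in the text.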
Table 9 Features appearing in at least 4 out of 7 rankings of CSE-CIC-IDS2018, referred to as "feature group 1"; we remove Destination_Port to form "feature group 1A"
* Indicates Destination_Port not included in feature group 1A
Table 11 Features appearing in at least 6 out of 7 rankings of CSE-CIC-IDS2018, referred to as "feature group 3A"; we add Destination_Port to form "feature group 3"
Table 12 Features appearing in 7 out of 7 rankings of CSE-CIC-IDS2018, referred to as "feature group 4A"; we add Destination_Port to form "feature group 4"
* Indicates Destination_Port not included in feature group 4A
Flow_IAT_Max
* Destination_Port
After selecting features listed in Tables 9, 10, 11, and 12, the datasets are suitable for training and testing classifiers. The reader should not attach any significance to the order of features in Tables 9, 10, 11, and 12.
We create a total of 11 datasets. Four are the result of applying the four feature selection techniques, and another four are the result of adding or removing the Destination_Port categorical feature as needed. In order to assess the impact of feature selection, we require two more datasets, one that contains all 66 usable features, and a similar dataset with all features except Destination_Port. We call these datasets "all features" and "all features A", respectively. Finally, we have one dataset that contains only the Destination_Port feature, which we call "Destination_Port only". In the next subsection we review classifiers that we train and test with the 2018 dataset.

Classifier development
After feature selection, we use the resulting datasets as input to 7 classifiers: DT, RF, NB, LR, CatBoost, XGBoost, and LightGBM. A DT is a simple representation of observed data. The tree can be easily visualized, with nodes or leaves representing class labels and branches representing observations. RF is an ensemble approach that builds multiple decision trees. The classification results are calculated by combining the results of the individual trees, typically using majority voting. NB uses Bayes' theorem of conditional probability to determine the probability that an instance belongs to a particular class. It is considered "naive" because of the strong assumption of independence between features. LR uses a sigmoidal, or logistic, function to generate values in [0,1] that can be interpreted as class probabilities. LR is similar to linear regression but uses a different hypothesis class to predict class membership. CatBoost, XGBoost, and LightGBM are GBDTs [54], ensembles of sequentially trained DTs. CatBoost uses Ordered Boosting, which imposes an order on the samples that CatBoost uses to fit constituent decision trees. XGBoost uses a sparsity-aware algorithm and a weighted quantile sketch. Sparsity is the quality of having many missing or zero values, while a weighted quantile sketch uses approximate tree learning [55] to support merge and prune operations. LightGBM uses Gradient-based One-Side Sampling and Exclusive Feature Bundling to handle large numbers of data instances and features. One-Side Sampling ignores a substantial portion of data instances with small gradients, while Exclusive Feature Bundling groups mutually exclusive features to reduce variable count.
As stated in the discussion on feature ranking, LightGBM and CatBoost handle encoding of categorical features automatically, so we take advantage of that, and use Destination_Port as a feature for LightGBM and CatBoost. Since we pass the array of features to LightGBM as a Pandas DataFrame [56], we indicate to LightGBM that Destination_Port is a categorical feature by setting the data type of the Destination_Port column to "category". In order to direct CatBoost to treat Destination_Port as a categorical feature, when we call the CatBoost classifier's initialization (constructor) function, we set the cat_features parameter to the value of a one-element list containing the string "Destination_Port".
We do one set of experiments where Destination_Port is the only feature. For these experiments we use CatBoost, LightGBM, and Scikit-learn's NB classifier for categorical data, CategoricalNB [57].
Before we train and test our classifiers, we initialize them with certain parameters. The settings of these parameters were selected based on experimentation. We list these initialization parameters in Tables 13, 14, 15, 16, and 17. We do not provide tables for initialization parameters for Naive Bayes or Logistic regression constructors because we did not set any for those two classifiers.

Classifier metrics
Our work records the confusion matrix (Table 18) for a binary classification problem, where the class of interest is usually the minority class and the opposite class is the majority class, i.e. positives and negatives, respectively. A related list of simple performance metrics [58] is explained as follows:
• True Positive (TP) is the number of positive instances correctly identified as positive.
• True Negative (TN) is the number of negative instances correctly identified as negative.
• False Positive (FP) is the number of negative instances incorrectly identified as positive.
• False Negative (FN) is the number of positive instances incorrectly identified as negative.
Based on these fundamental metrics, other performance metrics are derived as follows:
• Recall, also known as True Positive Rate (TPR) or sensitivity, is equal to TP/(TP + FN).
• Precision, also known as positive predictive value, is equal to TP/(TP + FP).
• Specificity, also known as True Negative Rate (TNR), is equal to TN/(TN + FP).
In our study, we used more than one performance metric to better understand the challenge of evaluating machine learning models with severely imbalanced data. The metrics are explained below:
• F1-score (traditional), also known as the harmonic mean of precision and recall, is equal to 2 · Precision · Recall/(Precision + Recall).
• AUC is equal to the area under the Receiver Operating Characteristic (ROC) curve, which graphically shows recall versus (1 - specificity) across all classifier decision thresholds. From this curve, the AUC obtained is a single value that ranges from 0 to 1, with a perfect classifier having a value of 1.
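The metric definitions above can be checked with a short worked example. The sketch below is illustrative only: the function names and the toy counts and scores are hypothetical, and the AUC is computed through its standard equivalent formulation as the fraction of (positive, negative) pairs in which the positive instance receives the higher score, with ties counted as one half.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Derive recall, precision, specificity, and F1 from confusion counts.

    Assumes every denominator is non-zero.
    """
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    specificity = tn / (tn + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
    }


def auc_from_scores(labels, scores):
    """Compute AUC as the share of positive/negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    # A correctly ordered pair counts 1, a tied pair counts 0.5.
    wins = sum(
        (p > n) + 0.5 * (p == n) for p in positives for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy imbalanced case: 10 positives and 90 negatives.
metrics = confusion_metrics(tp=8, fp=2, tn=88, fn=2)
# Toy scores: three of the four positive/negative pairs are ordered correctly.
toy_auc = auc_from_scores([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise loop is quadratic in the number of instances, so it is suited to small illustrations rather than to a dataset of this size.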

Classifier training and testing
We train and test all classifiers using stratified five-fold cross-validation (CV). For training and testing any model, we use the "label" column of CSE-CIC-IDS2018 as the label. For each of the datasets that are appropriate for the classifiers, we do 10 iterations of five-fold CV for all classifiers. Only CatBoost and LightGBM have built-in support for categorical features, so all datasets are appropriate for CatBoost and LightGBM. However, we felt finding the optimal encoding technique as a preprocessing step is out of scope for this work, so it is not appropriate for NB, LR, DT, RF or XGBoost to be trained and tested with datasets that contain the Destination_Port categorical feature. We are interested in models that use solely the Destination_Port, and therefore we perform some experiments with the CategoricalNB classifier. CategoricalNB is appropriate for datasets that contain only categorical features; therefore, the "Destination_Port only" dataset is appropriate for CategoricalNB. For each unique combination of dataset and classifier, since we do ten iterations of five-fold CV, we record 50 measurements of AUC and F1-score values. The AUC and F1-score figures we report are mean values of 50 measurements. After training and testing all models with the various datasets, we group the performance metrics according to experimental factors at different levels, to perform ANalysis Of VAriance (ANOVA) [59] tests. When the outcomes of the ANOVA tests indicate that factors explain variance in performance metrics, we perform Tukey's Honestly Significant Difference (HSD) [60] tests to determine which levels of factors are significantly different, and which are associated with the highest mean AUC and F1-score values. We then use the outcomes of the ANOVA and Tukey's HSD tests to answer research questions Q1, Q2, and Q3. The first results we report are the mean AUC and F1-scores for CatBoost, LightGBM, and CategoricalNB with their respective datasets.
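The arithmetic behind the 50 measurements (10 iterations of five folds) can be illustrated with a small pure-Python sketch of repeated stratified splitting. The function names and the toy labels are hypothetical, and the text does not state which routine produced the splits in the study; a library routine such as Scikit-learn's RepeatedStratifiedKFold would be a natural alternative.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_splits=5, seed=0):
    """Split indices into n_splits test folds that preserve class ratios."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold receives its share.
        for position, index in enumerate(indices):
            folds[position % n_splits].append(index)
    return folds


def repeated_stratified_cv(labels, n_splits=5, n_repeats=10):
    """Yield (train_indices, test_indices) for every fold of every repeat."""
    for repeat in range(n_repeats):
        # A different seed per repeat gives a different shuffle each time.
        for test in stratified_folds(labels, n_splits, seed=repeat):
            held_out = set(test)
            train = [i for i in range(len(labels)) if i not in held_out]
            yield train, sorted(test)


# Toy imbalanced labels: 10 positives (1) and 90 negatives (0).
toy_labels = [1] * 10 + [0] * 90
splits = list(repeated_stratified_cv(toy_labels))
```

Each (train, test) pair would yield one AUC and one F1-score measurement, and the reported figure for a dataset and classifier combination is the mean over all 50 pairs.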
We present these results in Table 19. We report results on ANOVA and Tukey's HSD tests in the next section, but mention them here since the data in the tables below are used in the tests later.
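The evaluation protocol described above (ten iterations of stratified fivefold CV, giving 50 AUC and 50 F1-score measurements per classifier and dataset) can be sketched as follows. This is a minimal illustration using only the Python standard library, not the code used in the study: the data are synthetic, the threshold rule in `fit_threshold` is a hypothetical stand-in for a real classifier, and all function names are assumptions made for this example. In practice a library routine such as scikit-learn's `RepeatedStratifiedKFold` would serve the same purpose.

```python
import random
import statistics


def stratified_folds(labels, n_splits, rng):
    """Split indices into n_splits folds that each keep roughly the class ratio."""
    folds = [[] for _ in range(n_splits)]
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)  # deal indices round-robin
    return folds


def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def f1_score(y_true, y_pred):
    """F1 for the positive class: 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def fit_threshold(xs, ys):
    """Toy stand-in for a classifier: midpoint between the two class means."""
    mean_pos = statistics.mean(x for x, y in zip(xs, ys) if y == 1)
    mean_neg = statistics.mean(x for x, y in zip(xs, ys) if y == 0)
    return (mean_pos + mean_neg) / 2


def repeated_stratified_cv(xs, ys, n_splits=5, n_repeats=10):
    """Return one AUC and one F1 value per (repeat, fold) pair."""
    aucs, f1s = [], []
    for repeat in range(n_repeats):
        folds = stratified_folds(ys, n_splits, random.Random(repeat))
        for k, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
            threshold = fit_threshold([xs[i] for i in train_idx],
                                      [ys[i] for i in train_idx])
            y_test = [ys[i] for i in test_idx]
            scores = [xs[i] for i in test_idx]
            aucs.append(auc(y_test, scores))
            f1s.append(f1_score(y_test, [int(s > threshold) for s in scores]))
    return aucs, f1s


# Synthetic, imbalanced toy data (900 negatives, 100 positives); not CSE-CIC-IDS2018.
data_rng = random.Random(0)
ys = [0] * 900 + [1] * 100
xs = ([data_rng.gauss(0.0, 1.0) for _ in range(900)]
      + [data_rng.gauss(2.0, 1.0) for _ in range(100)])

aucs, f1s = repeated_stratified_cv(xs, ys)
print(len(aucs), "measurements per metric")
print("mean AUC:", statistics.mean(aucs), "SD AUC:", statistics.stdev(aucs))
print("mean F1: ", statistics.mean(f1s), "SD F1: ", statistics.stdev(f1s))
```

The 50 per-fold values, rather than only their means, are what feed the ANOVA and Tukey's HSD tests, which is why the sketch returns the full lists.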
For reasons given earlier, we do not train or test CategoricalNB on datasets with numeric and categorical features. However, it is appropriate to train and test CatBoost and LightGBM with datasets consisting of categorical and numeric features. We report the outcome of the first such experiments in Table 20.
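To illustrate why a categorical Naive Bayes model fits a dataset whose only feature is Destination_Port, the following sketch implements the idea for a single categorical feature with Laplace smoothing. It is an assumption-laden toy, not the CategoricalNB implementation used in the experiments: the class name `OneFeatureCategoricalNB`, the smoothing parameter `alpha`, and the port and label values are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class OneFeatureCategoricalNB:
    """Naive Bayes over one categorical feature, with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # smoothing strength (assumed value)

    def fit(self, values, labels):
        self.classes_ = sorted(set(labels))
        self.categories_ = sorted(set(values))
        self.class_counts_ = Counter(labels)
        # value_counts_[c][v] = number of training rows with class c and value v
        self.value_counts_ = defaultdict(Counter)
        for v, y in zip(values, labels):
            self.value_counts_[y][v] += 1
        return self

    def predict_proba(self, value):
        """Posterior P(class | value) as a dict keyed by class label."""
        n_total = sum(self.class_counts_.values())
        n_cat = len(self.categories_)  # categories seen during fit
        log_post = {}
        for c in self.classes_:
            prior = self.class_counts_[c] / n_total
            likelihood = ((self.value_counts_[c][value] + self.alpha)
                          / (self.class_counts_[c] + self.alpha * n_cat))
            log_post[c] = math.log(prior) + math.log(likelihood)
        # Normalise in log space for numerical stability.
        top = max(log_post.values())
        unnorm = {c: math.exp(lp - top) for c, lp in log_post.items()}
        total = sum(unnorm.values())
        return {c: u / total for c, u in unnorm.items()}


# Hypothetical destination ports and labels (1 = attack, 0 = benign).
ports = [80, 80, 80, 443, 443, 22, 22, 22, 22, 3389]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 0, 1]

model = OneFeatureCategoricalNB().fit(ports, labels)
print(model.predict_proba(22))
print(model.predict_proba(80))
```

Because the model only counts how often each port value co-occurs with each class, it needs no numeric encoding of Destination_Port, which is the property that makes it suitable for the one-feature dataset.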

Table 19 Mean performance of CatBoost, LightGBM and CategoricalNB in terms of AUC and F1-score on a one-feature dataset of Destination_Port only
Best metrics are highlighted in italics; SD AUC is the standard deviation of AUC and SD F1 is the standard deviation of the F1-score

In Tables 21, 22, 23 and 24 we report the results of further experiments involving CatBoost and LightGBM. In these tables, we train and test models on data with features in feature groups 2, 3, 4, and all features.
In Tables 25, 26, 27, 28 and 29 we report performance of the 7 classifiers CatBoost, LightGBM, DT, LR, NB, RF, and XGBoost as we train and test them on datasets with

Results and discussion
We inspect Tables 19 through 29, then use the same data from which we obtained the mean values in them to conduct ANOVA and Tukey's HSD tests to answer research questions Q1, Q2, and Q3. The confidence level we use for all tests is 99%. For ANOVA tests, we build models where the dependent variable is the experiment outcome (AUC or F1-score), and the independent variables (classifier, feature selection technique, etc.) are factors in the experiments. We include box plots of the data used to build the ANOVA models to aid in understanding the groupings that the Tukey's HSD tests identify.
Research question Q1: Does feature selection impact performance of classifiers in terms of AUC and F1-score?
Our first step in answering Q1 is to inspect Tables 20 through 29. In Table 20, we see the LightGBM model yields an AUC value of 0.96694 and an F1-score of 0.95880 when trained on the 15 features (including Destination_Port) from feature group 1. However, in Table 24, we see that, when trained with all features, LightGBM yields an AUC value of 0.96890 and an F1-score of 0.96134. We see analogous patterns of similar or better performance for other classifiers and datasets in Tables 20 through 29. Therefore, we perform two-factor ANOVA tests with classifiers and datasets as the factors, and AUC or F1-score as the dependent variable. In all cases, with one exception, the p-values for the ANOVA tests are zero, so we conclude that classifier and dataset are significant factors affecting the outcome of experiments. The exception is for experiments involving the dataset with one feature of Destination_Port only. For this one-feature dataset, the classifier choice is not significant. We report the results of these experiments with the one-feature Destination_Port only dataset in Table 19, where we are forced to report results to 8 decimal places instead of the usual 5 to show any difference in performance when we use different classifiers. Otherwise, Tukey's HSD tests are appropriate for both classifier and dataset factors.
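The two-factor ANOVA described above can be sketched in a few lines for a balanced design, with classifier and dataset as the factors and AUC as the dependent variable. The sketch below is only an illustration under stated assumptions: the AUC values are invented, the function name `two_way_anova` is hypothetical, and an interaction term is included even though the text does not say whether the study's models contained one. Converting each F statistic to a p-value requires the F distribution from a statistics library and is omitted here.

```python
from statistics import mean


def two_way_anova(cells):
    """F statistics for a balanced two-factor design with replication.

    cells maps (level_a, level_b) -> list of measurements; every cell must
    hold the same number (at least 2) of measurements.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(next(iter(cells.values())))
    if n < 2 or any(len(v) != n for v in cells.values()):
        raise ValueError("balanced design with replication required")
    if len(cells) != len(levels_a) * len(levels_b):
        raise ValueError("every combination of levels needs a cell")

    grand = mean(x for v in cells.values() for x in v)
    mean_a = {a: mean(x for b in levels_b for x in cells[(a, b)]) for a in levels_a}
    mean_b = {b: mean(x for a in levels_a for x in cells[(a, b)]) for b in levels_b}
    cell_mean = {k: mean(v) for k, v in cells.items()}

    # Sums of squares for each factor, the interaction, and the residual.
    ss_a = len(levels_b) * n * sum((m - grand) ** 2 for m in mean_a.values())
    ss_b = len(levels_a) * n * sum((m - grand) ** 2 for m in mean_b.values())
    ss_cells = n * sum((m - grand) ** 2 for m in cell_mean.values())
    ss_ab = max(ss_cells - ss_a - ss_b, 0.0)  # clamp tiny negative rounding error
    ss_err = sum((x - cell_mean[k]) ** 2 for k, v in cells.items() for x in v)

    df_a, df_b = len(levels_a) - 1, len(levels_b) - 1
    df_ab = df_a * df_b
    df_err = len(levels_a) * len(levels_b) * (n - 1)
    ms_err = ss_err / df_err
    return {
        "F_a": (ss_a / df_a) / ms_err,
        "F_b": (ss_b / df_b) / ms_err,
        "F_ab": (ss_ab / df_ab) / ms_err,
        "df": {"a": df_a, "b": df_b, "ab": df_ab, "error": df_err},
    }


# Invented AUC measurements: factor a = classifier, factor b = dataset.
toy_auc = {
    ("clf_A", "group_X"): [0.90, 0.92],
    ("clf_A", "group_Y"): [0.94, 0.96],
    ("clf_B", "group_X"): [0.80, 0.82],
    ("clf_B", "group_Y"): [0.84, 0.86],
}
result = two_way_anova(toy_auc)
print(result)
```

A large F statistic for a factor means that differences between its level means are large relative to the within-cell variation, which is the condition under which a follow-up Tukey's HSD test on that factor becomes informative.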
The dataset factor (feature group 1, 1A, etc.) in an experiment is equivalent to the application of a feature selection technique. To gauge the effect of feature selection and classifier choice, we conduct Tukey's HSD tests at a 99% confidence level for the dataset and classifier factors. We see in Figs. 1 and 3 that performance in terms of AUC and F1-score is influenced by the classifier. As reflected in Figs. 2 and 4, and according to the groupings the Tukey's HSD test yields, there is no significant difference in performance in terms of AUC for group a, which consists of the feature selection technique where . This is an ideal result since it implies we obtain similar performance with a smaller dataset. However, in terms of F1-score, we do not obtain the ideal result, but one where performance in terms of F1-score is similar. We see in Fig. 4 that the F1-scores for classifiers trained on all features in feature group 1A are very close to the F1-scores that classifiers trained with data from feature group 1 yield. In fact, the adjusted p-value for the Tukey's HSD test for the difference in F1-score for feature groups 1 and all features is 0.0100414, which lies marginally above the 0.01 threshold that corresponds to a 99% confidence level. We cite this adjusted p-value as another reason to claim that performance in terms of F1-score for classifiers trained with feature group 1A is similar to, or better than, the performance of classifiers trained with all features from CSE-CIC-IDS2018. However, results for the feature selection techniques 2, 3, 4, 2A, 3A, or 4A do not support the same conclusion.
Only CatBoost and LightGBM have built-in support for categorical features. Therefore, we deem it out of scope to address encoding techniques for the Destination_Port categorical feature in the 2018 dataset. As a result, we perform separate experiments to assess the impact of feature selection to further answer research question Q1. We conduct ANOVA to determine if the classifier and feature selection technique have an impact on the results for AUC and F1-score. Since p-values for the classifier and feature selection technique factors are nearly zero for the ANOVA tests, we conduct Tukey's HSD tests to check the levels for factors that yield the best performance. Box plots of results, grouped by factors analyzed in ANOVA and HSD tests, are depicted in Figs. 5, 6, 7 and 8.
In Figs. 5 and 7, we see the performance of LightGBM or CatBoost trained on feature group 1 is similar to the performance of LightGBM or CatBoost trained on all [...] Tukey's HSD test yields a mean AUC of 0.96640 (a difference of 0.00641). Likewise, the mean F1-scores for CatBoost and LightGBM are similar for models trained on feature group 1 and all features. In this case the Tukey's HSD adjusted mean F1-score [...]
Research question Q1 Answer: Yes, our ensemble feature selection technique yields performance similar to, or better than, using all features. More specifically, the variant of our technique where 4 out of 7 classifiers agree on a feature is the criterion for feature selection that yields performance similar to, or better than, using all features.
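The agreement criterion named in the answer to Q1 (a feature is kept when 4 out of 7 rankers agree on it) can be illustrated with a small voting sketch. The details here are assumptions made for illustration only: this section does not specify how each ranker produces its list or how many top features are considered, so the ranker names, the placeholder feature names, the `top_k` cutoff, and the function `select_by_agreement` are all hypothetical.

```python
from collections import Counter


def select_by_agreement(rankings, top_k, min_votes):
    """Keep features that appear in the top_k of at least min_votes rankers."""
    votes = Counter()
    for ranked in rankings.values():
        votes.update(set(ranked[:top_k]))  # each ranker votes once per feature
    return sorted(f for f, count in votes.items() if count >= min_votes)


# Seven hypothetical rankers, each listing placeholder features best-first.
rankings = {
    "ranker_1": ["a", "b", "c", "d"],
    "ranker_2": ["a", "c", "b", "e"],
    "ranker_3": ["b", "a", "e", "c"],
    "ranker_4": ["a", "d", "b", "f"],
    "ranker_5": ["e", "a", "f", "b"],
    "ranker_6": ["c", "a", "d", "e"],
    "ranker_7": ["f", "e", "a", "d"],
}

selected = select_by_agreement(rankings, top_k=3, min_votes=4)
print(selected)  # → ['a', 'b']
```

Raising `min_votes` makes the selection stricter and the resulting feature set smaller, which is the trade-off between the different agreement variants.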
Research question Q2: Does including the Destination_Port categorical feature significantly impact performance of LightGBM and CatBoost in terms of AUC and F1-score?
To answer research question Q2, we use results of experiments where the classifier (CatBoost or LightGBM) is one factor, and whether the dataset includes the Destination_Port feature is another factor. We perform ANOVA tests on the results of experiments grouped by these factors. The p-values associated with classifier and dataset factors for the ANOVA tests are both zero. Therefore, Tukey's HSD tests are appropriate. We report the results of those tests in Figs. 9, 10, 11 and 12.
It is interesting to note that the ranges of values of both AUC and F1-score are smaller when we use a dataset that includes destination port. So, not only do the ANOVA and HSD tests confirm that including Destination_Port is a significant factor in the performance of models for identifying attacks, but our results here also show greater stability in the values of results. These results enable us to answer our second research question.
Research question Q2 Answer: Yes, including the Destination_Port feature has a significant impact on performance in terms of AUC and F1-score.
Research question Q3: Does the choice of classifier: RF, DT, NB, LR, CatBoost, LightGBM, or XGBoost, significantly impact performance in terms of AUC and F1-score?
To answer the third research question, we note that all ANOVA tests we conduct show that classifier is a significant factor in experiments; the p-values associated with the classifier factor are 0. So, we perform Tukey's HSD tests to determine how classifiers may be grouped in terms of their performance. The groupings enable a conclusion to be drawn on how the choice of classifier impacts the outcome of an experiment. We refer to groupings we report from conducting Tukey's HSD tests in Figs. 1, 3, 5, 7, 9, and 11. Taking a closer look at Figs. 1 and 3, we see that our HSD test results indicate that some classifiers do not have significantly different performance. For example, in Fig. 1 we report that the classifier performance of LightGBM, RF, and XGBoost in terms of AUC is not significantly different. However, the same HSD test indicates that the seven classifiers fall into four distinct groups. In Figs. 5, 7, 9, and 11, the Tukey's HSD test results reported indicate the classifier is an important factor in the outcome of experiments, in terms of AUC or F1-score. We also note that LightGBM is consistently in the group that the HSD test identifies with the best value of AUC or F1-score in all cases.
Research question Q3 Answer: Yes, the choice of classifier significantly impacts performance.

Conclusion
The results in Tables 19 through 24, as well as the results from Tukey's HSD tests depicted in Figs. 1 through 11, and the answer to research question Q1 show that the feature selection technique that produces feature group 1A performs similarly to, or better than, using all features. These results demonstrate that our ensemble feature selection technique should be used with classifiers to detect anomalies in CSE-CIC-IDS2018, since training a model with the reduced feature set consumes fewer computing resources.
We may also draw conclusions from the results of the ANOVA and Tukey's HSD tests to answer research questions Q2 and Q3. Test results for research question Q2 indicate that Destination_Port is a useful feature for classifiers. Hence, we conclude one should encode it for use with a classifier, if the classifier does not handle categorical features automatically. Test results for research question Q3 reveal that LightGBM performs similarly to, or better than, any other classifier on CSE-CIC-IDS2018, even when we do not use Destination_Port as a feature for LightGBM.

Fig. 12 Box plots of F1-score grouped by whether the dataset contains the Destination Port feature; Tukey's HSD test indicates including Destination Port produces significantly different results
Since our current study is limited to comparing CatBoost and LightGBM when we include Destination_Port as a categorical feature, we have an opportunity for future research to investigate whether another classifier might yield better performance in conjunction with a technique for encoding Destination_Port. There is also an opportunity to evaluate classifier performance with other network intrusion detection datasets. Another subject we have not broached here that deserves attention deals with techniques for addressing class imbalance, such as Random Undersampling (RUS) [61].