
A new feature popularity framework for detecting cyberattacks using popular features

Abstract

We propose a novel feature popularity framework and introduce it to the cybersecurity domain. Feature popularity has not yet been used in machine learning or data mining, and we implement it with three web attacks from the CSE-CIC-IDS2018 dataset: Brute Force, SQL Injection, and XSS. Feature popularity is based upon ensemble Feature Selection Techniques (FSTs) and allows us to more easily understand common and important features between different cyberattacks. Three filter-based and four supervised learning-based FSTs are used to generate feature subsets for each of our three web attack datasets, and then our feature popularity framework is applied. Classification performance with feature popularity subsets is mostly similar to that obtained when “all features” are evaluated (with feature popularity subsets performing better in 5 out of 15 experiments). Our feature popularity technique effectively builds an ensemble of ensembles by first building an ensemble of FSTs for each dataset, and then building another ensemble across a dataset agreement dimension. The Jaccard similarity is also employed with our feature popularity framework in order to better identify which attack classes should (or should not) be grouped together when applying feature popularity. The four most popular features across all three web attacks in this experiment are: Flow_Bytes_s, Flow_IAT_Max, Fwd_IAT_Std, and Fwd_IAT_Total. When only these four features are used as input to our models, classification performance is not seriously degraded. This feature popularity framework granted us new and previously unseen insights into the web attack detection process with CSE-CIC-IDS2018 big data, even though we had intensely studied this dataset before. We realized these four particular features cannot properly identify our three web attacks, as they operate mainly in the time dimension and are NetFlow features from layers 3 and 4 of the OSI model. Conversely, our three web attacks operate in the application layer (layer 7) of the OSI model and should not leave signatures in these four features. Feature popularity produces easier-to-explain models which give domain experts better visibility into the problem, and can also reduce the complexity of implementing models in real-world systems.

Introduction

With consumers spending over $600 billion on e-commerce in the United States during 2019 [1], cybersecurity is becoming increasingly important to help defend against attackers. Machine learning can be employed to help detect cyberattacks. Feature selection is a common technique used by machine learning practitioners. Its benefits include improved classification efficiency, since models trained with fewer features require less computing resources, and in some cases it can even improve classification performance.

Another benefit of feature selection, in the context of cybersecurity, is that it can help practitioners better understand the attack detection process. This is possible because feature selection identifies the most important features during the model building process. Gaining a better understanding of the most important features not only helps during the machine learning model building process, but it can be even more helpful when machine learning models are deployed and implemented in real-world products and systems.

Various Feature Selection Techniques (FSTs) can generate very different feature lists on the same dataset (a phenomenon we observe throughout this study). However, finding “common features” between different FSTs can sometimes yield an even better feature subset through the diversity of different FSTs acting in an ensemble [2, 3]. While ensemble FSTs can sometimes improve classification performance, they might still be desirable even with minor degradations in classification performance. Reasons for using ensemble FSTs despite minor decreases in performance include: feature stability [4], reducing model complexity for real-world implementations, and simply providing models which are easier to understand and more explainable.

Feature similarity is a concept which is implicit in the ensemble FST process. For example, an ensemble FST can find similar (or “common”) features between the different FSTs by identifying features which appear in common among the “Top N” Feature Importance Lists from different FSTs. However, we extend this concept and introduce the notion of “feature popularity” [5] by also finding common features across different datasets. For example, in cybersecurity there are different types of cyberattacks and we can generate different datasets which are based upon those different attacks.

To explore feature popularity, we utilize the CSE-CIC-IDS2018 dataset created by Sharafaldin et al. [6]. CSE-CIC-IDS2018 is a more recent version of the popular CIC-IDS2017 dataset [7], which was also created by Sharafaldin et al. The CSE-CIC-IDS2018 dataset contains over 16 million instances, comprising normal instances as well as the following attack families: web attack, Denial of Service (DoS), Distributed Denial of Service (DDoS), brute force, infiltration, and botnet.

The CSE-CIC-IDS2018 dataset is big data, as it contains over 16 million instances. While big data has not been formally defined in terms of the number of instances, one study [8] considers only 100,000 instances to be big data. Other studies [9, 10] have considered 1,000,000 instances to be big data. Since CSE-CIC-IDS2018 contains more than 1,000,000 instances, we consider it to be big data as well.

Given its richness in containing many different attack labels, CSE-CIC-IDS2018 is a good dataset for investigating feature popularity. To do so, we only evaluate the following three different web attacks from CSE-CIC-IDS2018: Brute Force (BF), Cross-Site Scripting (XSS), and SQL Injection (which we commonly refer to only as “SQL” throughout this document). Basically, we generate three new datasets by combining each of these three attack labels with all of the normal traffic from the full CSE-CIC-IDS2018 dataset. We then compare feature popularity results between these three different web attack datasets.

Brute Force web attacks correspond to brute force login attacks targeting web pages. Next, the XSS web attack refers to attackers injecting malicious client-side scripts into susceptible web pages, targeting the web users who view those pages. Finally, the SQL Injection web attack represents a code injection technique where attackers craft special sequences of characters and submit them to web page forms in an attempt to directly query the back-end database of that website. The feature popularity techniques which we introduce to CSE-CIC-IDS2018 in this study allow us to visually explain and quantify common features across these three different web attacks.

We selected web attacks to implement feature popularity because they are important to cybersecurity practitioners and they still commonly appear in the Open Web Application Security Project (OWASP) “Top 10 Web Application Security Risks” [11]. Also, CSE-CIC-IDS2018 has three different web attack labels, which allows us to partition the data into three new datasets in order to implement feature popularity. In other words, with feature popularity we can find the most popular features across these three different web attacks by applying the framework to these three newly created datasets.

The remaining sections of this paper are organized as follows. The "Related work" section surveys existing literature related to feature popularity with CSE-CIC-IDS2018 data. Then, the "Methodologies" section describes the data preparation, classifiers, FSTs, performance metrics, and feature popularity techniques applied in our experiments. Next, we provide a walk-through example of feature popularity in the "Creating feature popularity lists with web attack datasets" section. The "Results and discussion" section provides our results and analysis. Finally, the "Conclusion" section concludes our work.

Related work

Sarhan et al. [12] focus on how to explain models with feature selection and the eXplainable Artificial Intelligence (XAI) method using the CSE-CIC-IDS2018 dataset. Their motivation in using this XAI method is to be able to more easily explain the attack detection process through a better understanding of what the most important features are after applying the feature selection process. To score the most important features, they assign a Shapley value to each of the features. The Shapley value “is the weighted average of the respective contribution of a feature value” (which is essentially the amount a feature contributes towards making a prediction).

Random Forest and Deep Feed Forward classifiers are employed by Sarhan et al. to score the top 20 features of CSE-CIC-IDS2018 with Shapley values. These two different classifiers produce two very different lists of top 20 features for each of the classifiers. For example, the top ranked feature from the Deep Feed Forward classifier is Bwd_Packets_s, while the Random Forest classifier ranks this same feature as the 16th best feature overall. It is difficult to ascertain how similar the two feature subsets are between the two different classifiers. Their research does not compare the feature similarity between these two different feature subsets like our research does. They do not benchmark the classification performance of only using the top 20 features versus all of the features, but their main motivation is to better explain and interpret the classification models by understanding the most important features used to generate those models.

Leevy et al. [13] apply an ensemble feature selection technique to the full CSE-CIC-IDS2018 dataset, and employ binary classification by merging the multiple attacks into one attack label. The ensemble feature selection technique considers seven different FSTs, of which three are filter-based FSTs and four are supervised FSTs. Different feature subsets are generated by finding common features among the seven different FSTs where a certain number of the FSTs agree. This ensemble FST concept is similar to the current study, but the main difference is that the earlier study only considers one attack dataset, while the current study extends the approach by also finding common features across multiple attack datasets. In other words, the current study not only finds common features for a single attack dataset, but also finds common features across multiple attacks. Also, the current study introduces the Jaccard similarity for quantifying feature subset similarity between different attacks.

Fitni et al. [14] compare two different feature selection techniques with the full CSE-CIC-IDS2018 dataset and map the multiple attacks to a binary classification problem with only attack and normal labels. Their two feature selection techniques are Chi-Squared (top 22 features) and Spearman’s rank correlation coefficient (top 23 features), and they compare these two FSTs with Logistic Regression and Decision Tree classifiers. The Spearman’s rank correlation coefficient technique performed better with F1 scores of 0.983 and 0.974, as compared to the Chi-Squared results of F1 scores of 0.791 and 0.974. Also, the full feature set performed best with the Decision Tree yielding an Area Under the Receiver Operating Characteristic Curve (AUC) score of 0.975.

Based on the results of these two classifiers (Logistic Regression and Decision Tree), Fitni et al. performed further experimentation using seven classifiers with only the Spearman’s rank correlation coefficient FST and the full feature set. They do a nice job displaying their feature subsets which provides good insight into the attack detection process (similar to the XAI motivations of Sarhan et al. [12]). However, their research does not consider feature popularity concepts like our study does.

Beechey et al. [15] apply feature selection to the Goldeneye and Slowloris Denial of Services attacks from CSE-CIC-IDS2018. Their dataset only considers one day out of the ten days of available network traffic from CSE-CIC-IDS2018 with 1,048,575 instances, while our experiment considers all ten days of normal traffic encompassing over 13 million instances. Eight feature selection techniques were employed with at least six classifiers, and their Table 6 indicates perfect AUC scores for several combinations of FSTs and classifiers (sometimes overfitting can be associated with perfect classification). While Beechey et al. and others [16] apply feature ranking techniques to CSE-CIC-IDS2018, they do not explore the notion of feature popularity between different attacks or common features between different FSTs.

We thoroughly surveyed Google Scholar to find related works to CSE-CIC-IDS2018 and our feature popularity research, and we searched for terms like “feature popularity”, “CSE-CIC-IDS2018 feature similarity”, and “CSE-CIC-IDS2018 feature selection”. First, Google Scholar did not provide any results significantly related to our “feature popularity” focus from any application domain, and so we are the first to conceive the feature popularity concept. Second, after reviewing more than 261 CSE-CIC-IDS2018 works at the time of this writing, only the works by Sarhan et al. [12] and Leevy et al. [13] had aspects which were remotely similar to our feature popularity research. While the rest of the CSE-CIC-IDS2018 corpus did contain some feature selection aspects, we only included [14, 15] as those had the most compelling details of feature ranking with CSE-CIC-IDS2018.

The XAI method highlighted by Sarhan et al. provides good insights, and we agree that a better understanding and explanation of feature subsets is important to the attack detection process when implementing machine learning models in the real world. Moreover, this is especially important when considering different attacks like we have done, as different attacks can generate very different feature subsets. To the best of our knowledge, ours is the first study to define the concept of feature popularity, especially in the context of network intrusion detection. In addition, it is the first study to utilize the Jaccard similarity metric for examining feature similarity between different attack types.

Methodologies

Data preparation

In this section, we describe how we prepared and cleaned the dataset files used in our experiments. Properly documenting these steps is important in being able to reproduce experiments.

We dropped the “Protocol” and “Timestamp” fields from CSE-CIC-IDS2018 during our preprocessing steps. The “Protocol” field is somewhat redundant, as the “Dst Port” (Destination_Port) field largely determines the corresponding “Protocol” value. Additionally, we dropped the “Timestamp” field because we did not want the learners to discriminate attack predictions based on time, especially with more stealthy attacks in mind. In other words, the learners should be able to discriminate attacks regardless of whether the attacks are high volume or slow and stealthy. Dropping the “Timestamp” field also gives us the convenience of combining or dividing the datasets in ways more compatible with our experimental frameworks. Additionally, a total of 59 records were dropped from CSE-CIC-IDS2018 because header rows were repeated in certain days of the dataset. These duplicates were easily found and removed by filtering records against a white list of valid label values.

The fourth downloaded file named “Thuesday-20-02-2018_TrafficForML_CICFlowMeter.csv” was different than the other nine files from CSE-CIC-IDS2018. This file contained four extra columns: “Flow ID”, “Src IP”, “Src Port”, and “Dst IP”. We dropped these four additional fields. Also of note is that this one particular file contained nearly half of all the records for CSE-CIC-IDS2018. This fourth file contained 7,948,748 records of the dataset’s total 16,232,943 records.

Certain fields contained invalid negative values, so we dropped instances with negative values in the “Fwd_Header_Length”, “Flow_Duration”, and “Flow_IAT_Min” fields (a total of 15 records were dropped from CSE-CIC-IDS2018 for negative values in these fields). Negative values in these fields were producing extreme values that can skew classifiers which are sensitive to outliers.

Eight fields contained constant values of zero for every instance. In other words, these fields did not contain any value other than zero. Before starting machine learning, we filtered out the following list of fields (which all had values of zero):

  1. Bwd_PSH_Flags
  2. Bwd_URG_Flags
  3. Fwd_Avg_Bytes_Bulk
  4. Fwd_Avg_Packets_Bulk
  5. Fwd_Avg_Bulk_Rate
  6. Bwd_Avg_Bytes_Bulk
  7. Bwd_Avg_Packets_Bulk
  8. Bwd_Avg_Bulk_Rate

We also excluded the “Init_Win_bytes_forward” and “Init_Win_bytes_backward” fields because they contained negative values. Since about half of the total instances contained negative values for these two fields, filtering out all of those instances would have removed a very large portion of the dataset. Similarly, we did not use the “Flow_Duration” field, as some of its values were unreasonably low, including values of zero.

The “Flow Bytes/s” and “Flow Packets/s” fields contained some “Infinity” and “NaN” values (less than 0.6% of the records contained these values). We dropped the instances where either “Flow Bytes/s” or “Flow Packets/s” contained “Infinity” or “NaN” values. After carefully and manually inspecting the entire CSE-CIC-IDS2018 dataset for such values, there was too much uncertainty as to whether they were valid records. When sorted from minimum to maximum on these fields, neighboring records were very different where “Infinity” was found. Similar to Zhang et al. [17], we did attempt to impute values for these columns by taking the maximum value of the column and adding one. In the end, we abandoned this imputation approach and dropped the 95,760 records from CSE-CIC-IDS2018 containing any “Infinity” or “NaN” values.
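As a rough illustration of these cleaning steps, the following Python sketch uses pandas. The file locations, column names (e.g., "Flow Byts/s"), and label whitelist are assumptions that would need to be matched to the actual downloaded CSV headers; this is a sketch of the procedure described above, not the authors' code.

import glob
import numpy as np
import pandas as pd

# Assumed paths and column names; adjust to the actual CSE-CIC-IDS2018 CSV headers.
CSV_GLOB = "CSE-CIC-IDS2018/*.csv"
VALID_LABELS = {"Benign", "Brute Force -Web", "Brute Force -XSS", "SQL Injection"}

df = pd.concat((pd.read_csv(f, low_memory=False) for f in glob.glob(CSV_GLOB)),
               ignore_index=True)

# Drop fields excluded in this study (the four extra columns only exist in one file).
df = df.drop(columns=["Protocol", "Timestamp", "Flow ID", "Src IP", "Src Port", "Dst IP"],
             errors="ignore")

# Remove repeated header rows by keeping only rows with a whitelisted label value
# (restricted here to the labels used in this study).
df = df[df["Label"].isin(VALID_LABELS)]

# Drop instances with invalid negative values in selected fields.
for col in ["Fwd Header Len", "Flow Duration", "Flow IAT Min"]:
    df = df[pd.to_numeric(df[col], errors="coerce") >= 0]

# Drop rows where "Flow Byts/s" or "Flow Pkts/s" are Infinity or NaN.
df = df.replace([np.inf, -np.inf], np.nan)
df = df.dropna(subset=["Flow Byts/s", "Flow Pkts/s"])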

Table 1 Entire CSE-CIC-IDS2018 Dataset by files/days (only web attacks and normal traffic are used in our experiments)
Table 2 Web attacks used in this experiment from CSE-CIC-IDS2018

We also excluded the Destination_Port categorical feature which contains more than 64,000 distinct categorical values. Since Destination_Port has so many values, we determined that finding an optimal encoding technique was out of scope for this study. For each of the three web attacks in Table 2, we dropped all the other attack instances and kept all the normal instances from all ten days in Table 1 (except for those instances which we removed as indicated earlier in this section). Each of the three final datasets for our individual web attacks ended up having roughly 13 million instances as specified in Table 2. For full descriptions of all 79 features of the downloaded CSE-CIC-IDS2018 dataset, please refer to its website [18].

Classifiers

Five different classifiers are utilized in this experiment: Decision Tree (DT), Random Forest (RF), CatBoost (CB), LightGBM (LGB), and XGBoost (XGB). These classifiers serve two different purposes in our experiment: first, to implement feature selection, and second, to train and test our models. In other words, a classifier could first be used to apply feature selection, and then later that same (or a different) classifier could be used to train and test our model.

For all models trained and tested in this study, stratified 5-fold cross validation [19] is used. Stratified means that each training and test fold is split so that each class is proportionately represented across all folds. Splitting in a stratified manner is especially important when dealing with high levels of class imbalance, as randomness can inadvertently skew the results between the cross-validation folds [20].

To treat the severe class imbalances encountered throughout this experiment, we employ random undersampling (RUS). Every model trained throughout this experiment first applies RUS at a 1:1 sampling ratio during the model training process (sampling is not applied during the model testing process). Further details for how we apply RUS can be found in [21].

To account for randomness, each stratified 5-fold cross validation was repeated 10 times. Therefore, all of our AUC results are the mean values from 50 measurements (5 folds x 10 repeats). All classifiers from this experiment are implemented with Scikit-learn [22] and their respective Python modules. Next, our five classifiers are described.
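The following is a minimal sketch of this evaluation protocol, assuming a cleaned NumPy feature matrix X and binary labels y (1 = attack, 0 = normal). It uses scikit-learn and the imbalanced-learn package for RUS, and is not the authors' original code; Random Forest is shown as one of the five classifiers.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from imblearn.under_sampling import RandomUnderSampler

def mean_auc(X, y, n_splits=5, n_repeats=10, seed=0):
    aucs = []
    for repeat in range(n_repeats):
        skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed + repeat)
        for train_idx, test_idx in skf.split(X, y):
            # RUS at a 1:1 ratio is applied to the training fold only.
            rus = RandomUnderSampler(sampling_strategy=1.0, random_state=seed + repeat)
            X_train, y_train = rus.fit_resample(X[train_idx], y[train_idx])
            clf = RandomForestClassifier(n_estimators=5, max_depth=6)
            clf.fit(X_train, y_train)
            scores = clf.predict_proba(X[test_idx])[:, 1]
            aucs.append(roc_auc_score(y[test_idx], scores))
    # Mean of 50 AUC measurements (5 folds x 10 repeats).
    return np.mean(aucs)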

Decision Tree is a learner which builds branches of a tree by splitting on features based on a cost [23]. The algorithm will attempt to select the most important features to split branches upon, and iterate through the feature space by building leaf nodes as the tree is built. Cost functions utilized to evaluate splits in the branches are Entropy and Gini impurity [24].

Random Forest is an ensemble of independent decision trees. Each instance is initially classified by every individual decision tree, and the instance is then finally classified by consensus among the individual trees (e.g., majority voting) [25]. Diversity among the individual decision trees can improve overall classification performance, and so bagging is introduced to each of the individual decision trees to promote diversity. Bagging (bootstrap aggregation) [26] is a technique to sample the dataset with replacement to accommodate randomness for each of the decision trees.

CatBoost [27] is based on gradient boosting, and is essentially another ensemble of tree-based learners. It utilizes an ordered boosting algorithm [28] to overcome prediction shifting difficulties which are common in gradient boosting. CatBoost has native built-in support for categorical features.

LightGBM, or Light Gradient Boosted Machine [29], is another learner based on Gradient Boosted Trees (GBTs) [30]. To optimize and avoid the need to scan every instance of a dataset when considering split points, LightGBM implements Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) algorithms [31]. LightGBM also offers native built-in support for categorical features.

XGBoost is another ensemble based on GBTs. To help determine splitting points, XGBoost utilizes a Weighted Quantile Sketch algorithm [32] to improve upon where split points should occur. Additionally, XGBoost employs a sparsity-aware algorithm to help with sparse data to determine default tree directions for missing values. Categorical features are not natively supported by XGBoost, and must be encoded outside of the learner with a technique such as One Hot Encoding (OHE) [33].

Unless specified otherwise here, default values were utilized for the classifiers’ hyperparameters. To prevent overfitting with Decision Tree, max_depth = 5 was assigned. For Random Forest, both n_estimators = 5 and max_depth = 6 were set to prevent overfitting. When using CatBoost, both iterations = 4 and max_depth = 5 were assigned to prevent overfitting, while thread_count = 8 was set to take advantage of parallel processing functionality. Finally, with XGBoost both n_estimators = 4 and max_depth = 5 were set to prevent overfitting, and n_jobs = 8 was assigned to leverage parallel processing functionality. Also, objective = “binary:logistic” was set to specify the objective function for binary classification with XGB.
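For reference, one way to express these configurations in Python is sketched below. This is a sketch only: any hyperparameter not listed above is left at its library default, and the verbose flag for CatBoost is our own addition to silence training output.

from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier

classifiers = {
    "DT": DecisionTreeClassifier(max_depth=5),
    "RF": RandomForestClassifier(n_estimators=5, max_depth=6),
    "CB": CatBoostClassifier(iterations=4, max_depth=5, thread_count=8, verbose=0),
    "LGB": LGBMClassifier(),  # library defaults
    "XGB": XGBClassifier(n_estimators=4, max_depth=5, n_jobs=8,
                         objective="binary:logistic"),
}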

Feature selection techniques and rankers

At its core, our new feature popularity framework utilizes underlying feature selection techniques to build our ensembles. Any FST can be utilized which produces ranking lists of the most important features, also known as Feature Importance Lists. In this study, all seven of the FSTs we employ utilize their respective Python libraries [34] to generate their respective Feature Importance Lists.

These Feature Importance Lists are based upon feature importance [35] rankings. Effectively, each FST produces a list of features sorted and ranked by their importance. While we selected only the top 20 features from the seven rankers in this study, future work could experiment with different cutoff points for Feature Importance Lists and vary the number of features to be included. This experiment utilizes the following four supervised learning-based FST rankers and three filter-based FST rankers.

Supervised learning-based FST rankers

The four supervised learning-based FST rankers in this study are implemented with four of our classifiers: RF, CB, LGB, and XGB. These four classifiers are used for two purposes in this study. First, they are used to generate their respective Feature Importance Lists (rankings of the most important features). Second, they are also used later in the experiment to train and test our machine learning models. The Decision Tree classifier is also used to train and test our machine learning models, but DT is not used in the FST process to build Feature Importance Lists. Future work can consider experimenting with different underlying FSTs within the feature popularity framework.

These supervised learning-based FST rankers offer a simple feature selection technique in the sense that they are implemented with the Feature Importance Lists from their respective RF, CB, LGB, and XGB Python libraries. For example, to generate the top 20 feature rankings with the RF supervised learning-based FST ranker, its Python library generates a Feature Importance List and we select the top 20 features from that list. Essentially, the Feature Importance Lists generated from these supervised learning-based FST rankers effectively serve as a feature selection technique. Note that these supervised learning-based FST rankers sometimes produce fewer than 20 features, and in those cases their Feature Importance Lists will contain fewer than 20 features.
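A minimal sketch of such a supervised learning-based ranker is shown below, assuming a pandas DataFrame X of features and a label vector y; the helper name top_n_features is ours, not the authors'.

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def top_n_features(model, X, y, n_top=20):
    # Fit the learner and rank features by its feature_importances_ attribute.
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    # Features with zero importance are excluded, so fewer than n_top may be returned.
    importances = importances[importances > 0]
    return importances.sort_values(ascending=False).head(n_top).index.tolist()

# Example: Feature Importance List from the RF ranker.
rf_list = top_n_features(RandomForestClassifier(n_estimators=5, max_depth=6), X, y)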

Filter-based FST rankers

Three filter-based FST rankers are applied in this study: Chi Squared, Information Gain, and Information Gain Ratio. Filter-based techniques utilize independent algorithms or statistical measurements when selecting features, and filter techniques are easily identified since they do not use learning algorithms in the feature selection process. Feature Importance Lists are generated from these three filter-based FST rankers in the same manner as the supervised learning-based FST rankers. These three FSTs are described below.

The Chi Squared technique scores each feature according to its statistical dependence on the class, and the “chi2” function [36] from Python is used to calculate this list of ranked features. Information Gain, which is also known as Mutual Information, evaluates features to determine which one maximizes the information gained [37]. Information Gain Ratio is similar to Information Gain, but Information Gain Ratio decreases its bias when the number of branching features is high [38]. In this experiment, both Information Gain and Information Gain Ratio are implemented with the “info_gain” and “info_gain_ratio” Python functions from [39]. Further details for how we implement these three filter-based FST rankers can be found in [13].
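The Chi Squared ranker can be sketched as follows with scikit-learn (chi2 requires non-negative feature values); the info_gain and info_gain_ratio rankers are used analogously through their own library and are not reproduced here. The function name chi2_top_n is ours.

import pandas as pd
from sklearn.feature_selection import chi2

def chi2_top_n(X, y, n_top=20):
    # chi2 returns a chi-squared statistic and a p-value for each feature.
    scores, _p_values = chi2(X, y)
    ranked = pd.Series(scores, index=X.columns).sort_values(ascending=False)
    return ranked.head(n_top).index.tolist()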

Classification performance metrics

AUC is a metric which measures the area under the Receiver Operating Characteristic (ROC) curve. The ROC curve is a plot of the True Positive Rate (TPR) along the y-axis versus the False Positive Rate (FPR) along the x-axis. The area under this ROC curve corresponds to a numeric value ranging from 0.0 to 1.0, where an AUC value of 1.0 corresponds to a perfect classification system. An AUC value of 0.5 represents a classifier performing as well as a random guess, similar to flipping a coin. The AUC metric scores how effective a classification system is in terms of comparing TPR to FPR across all classification thresholds [40].

Feature popularity metrics

The Jaccard similarity metric allows us to obtain quantitative scores for how similar feature sets are between each other. If we have a set of features from one dataset and another set of features from a different dataset, then we can score how similar these feature sets are with the Jaccard similarity. The Jaccard similarity of two sets is the ratio of the size of the intersection of the sets to the size of the union of the sets. That is, for two sets A and B, their Jaccard similarity is:

$$\begin{aligned} \text{Jaccard similarity} = \frac{|A \cap B|}{|A \cup B|} \end{aligned}$$

We calculate the Jaccard similarity for pairs of different web attacks, and present those Jaccard similarity scores in tables in the "Results and discussion" section. These tables give the reader a quantitative score indicating the degree to which different FSTs agree between pairs of different web attacks. In other words, the Jaccard similarity gives us a quantitative sense for how similar feature subsets are for one web attack dataset as compared to a different web attack dataset after an ensemble of FSTs has been applied to each dataset.
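As a tiny illustration, the Jaccard similarity of two feature subsets can be computed as follows; the two example subsets are hypothetical and chosen only to show the calculation.

def jaccard_similarity(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Hypothetical FST Agreement Lists for two web attacks.
bf_features = {"Flow_Bytes_s", "Flow_IAT_Max", "Fwd_IAT_Total"}
sql_features = {"Flow_Bytes_s", "Flow_IAT_Max", "Fwd_IAT_Std"}
print(jaccard_similarity(bf_features, sql_features))  # 2 common / 4 total = 0.5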

New feature popularity framework

We present a new feature popularity framework which can identify popular features across different cyberattack datasets. For example, if we have three different web attacks we can identify the most popular features across these three different web attacks. Identifying the most popular features across different attack datasets enables application domain experts to gain new insights into the problem. By using fewer features (which are also the most important), models can be more explainable and less complex to implement and deploy. We introduce this new feature popularity framework to the cybersecurity domain, but it can be applied to any application domain which has multiple class labels in a dataset.

At a high level, our feature popularity framework is based upon an ensemble of ensembles with respect to the underlying feature selection techniques and datasets. First, Feature Importance Lists are generated from the rankings of various FSTs (such as filter-based and supervised FSTs) for each dataset. Second, an ensemble of FSTs is built for each dataset, where a specified number of FSTs must agree for a feature to be included in an “FST Agreement List”. An FST Agreement List thus contains the features of one particular dataset for which a specified number of FSTs are in agreement. Third, a final ensemble across the different datasets constructs a “Feature Popularity List”, where a feature is only included in this list if it appears in a minimum number of FST Agreement Lists from the previous step. The following notations are defined to more easily provide examples of our feature popularity framework:

$$\begin{aligned}&n_T = \text{Number of top (cutoff) features per FST ranker}\\&fst_T = \text{Total number of FSTs}\\&fst_A = \text{Agreement criteria of FSTs}\\&ds_T = \text{Total number of datasets}\\&ds_A = \text{Agreement criteria of datasets} \end{aligned}$$

Figure 1 indicates the three main dimensions for our feature popularity framework. The “Feature Importance Dimension” is the first step of constructing the Feature Importance Lists across each dataset for each FST. Next, FST Agreement Lists are built in the “FST Agreement Dimension” for each dataset where a specified number of FSTs must agree (which is denoted by \(\text {fst}_{\mathrm{A}}\)). Finally, Feature Popularity Lists are generated in the “Dataset Agreement Dimension” where a feature can only be included into a list if it appears in a minimum number of datasets (which is denoted by \(\hbox {ds}_{\mathrm{A}}\)) for each level of \(\text {fst}_{\mathrm{A}}\). In the next section, we provide step by step examples to illustrate our new feature popularity framework.

Fig. 1 Feature Popularity Dimensions

Creating feature popularity lists with web attack datasets

In this section, we visually illustrate our feature popularity framework with tables of features being built at each step along with pseudocode listings to describe the process. For these examples, three datasets are constructed from CSE-CIC-IDS2018 as indicated in Table 2 for three different web attacks: Brute Force, SQL Injection, and XSS. For the pseudocode listings, we initialize variables in Listing 1 where we define our variables for our earlier notations: \(\hbox {n}_{T}\), \(\hbox {fst}_{\mathrm{T}}\), \(\text {fst}_{\mathrm{A}}\), \(\hbox {ds}_{\mathrm{T}}\), and \(\hbox {ds}_{\mathrm{A}}\).

Listing 1 Pseudocode for initializing the feature popularity variables
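The original pseudocode figure is not reproduced here; as a hedged stand-in for Listing 1, the following Python sketch initializes these variables with the values used in this study.

# Sketch of Listing 1: feature popularity parameters used in this study.
n_T = 20                      # top (cutoff) features per FST ranker
fst_T = 7                     # total number of FSTs
fst_A_values = [4, 5, 6, 7]   # FST agreement criteria considered
ds_T = 3                      # total number of datasets
ds_A_values = [2, 3]          # dataset agreement criteria considered

datasets = ["BF", "SQL", "XSS"]
fsts = ["ChiSquared", "InfoGain", "InfoGainRatio", "RF", "CB", "LGB", "XGB"]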

Create feature importance lists (Step 1)

Creating Feature Importance Lists is the first step of the process for building Feature Popularity Lists. There is nothing new in this step for machine learning practitioners who are familiar with feature selection. In this step, we simply generate Feature Importance Lists, which are standard in many machine learning Python modules (such as the ones we include here for supervised learning-based and filter-based FSTs).

The idea is to use these FSTs as rankers and include only the top 20 features in our Feature Importance Lists (since \(\hbox {n}_{\mathrm{T}}\)=20). We note that other experiments could easily use a different cutoff point for the number of top features to be included in these lists. Future work could also consider incorporating weighted numerical scores of feature importance into our feature popularity framework.

In our example, a total of 21 Feature Importance Lists will be constructed, since we consider three datasets (\(\hbox {ds}_{\mathrm{T}}\)=3) and seven FSTs (\(\hbox {fst}_{\mathrm{T}}\)=7). The three web attack datasets are: BF, SQL, and XSS. Also, as described in the "Methodologies" section, these seven FSTs are employed: Chi-Squared, Information Gain, Information Gain Ratio, Random Forest, XGBoost, CatBoost, and LightGBM. Listing 2 contains an example of the pseudocode to iterate over the three datasets and seven FSTs to create these 21 Feature Importance Lists. These 21 Feature Importance Lists are displayed in Tables 14, 15, 16, 17, 18, 19 of the Appendix.

Listing 2 Pseudocode for creating the Feature Importance Lists (Step 1)
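Continuing the variables from the sketch above, Step 1 could look roughly like the following; load_dataset() and rank_top_features() are placeholder helpers standing in for the data loading and the FST ranker calls described in the "Methodologies" section.

# Sketch of Listing 2 (Step 1): one Feature Importance List per (dataset, FST) pair,
# i.e. ds_T * fst_T = 21 lists in this example.
feature_importance_lists = {}
for ds in datasets:
    X, y = load_dataset(ds)  # placeholder: returns the features and labels for dataset ds
    for fst in fsts:
        top_features = rank_top_features(fst, X, y, n_top=n_T)  # placeholder ranker call
        feature_importance_lists[(ds, fst)] = top_features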

Create FST agreement lists (Step 2)

Generating FST Agreement Lists is the second step of the process for building Feature Popularity Lists. These FST Agreement Lists are built by identifying common features across the seven FSTs (denoted by \(\hbox {fst}_{\mathrm{T}}\)) for each dataset, where a specified threshold of these FSTs must agree (denoted by \(\text {fst}_{\mathrm{A}}\)). For example, if \(\text {fst}_{\mathrm{A}}\)=4 then we build a FST Agreement List where at least four of the seven FSTs must agree in order for common features to appear in that particular list for that dataset.

Table 3 contains the four FST Agreement Lists for only the Brute Force web attack. There are four different FST Agreement Lists as we consider four different levels for \(\text {fst}_{\mathrm{A}}\): 4, 5, 6, and 7. Each numeric superscript in this table indicates the number of FSTs agreeing (\(\text {fst}_{\mathrm{A}}\)) on that particular feature for that particular web attack. For example, the first listed feature Bwd_IAT_Mean of the Brute Force web attack is found to have exactly four of seven (“4/7”) FSTs agree that it is a common feature between those FSTs for that web attack. In other words, Bwd_IAT_Mean occurs in exactly “4/7” Feature Importance Lists for the Brute Force web attack found in Tables 14 and 15 from the Appendix. Likewise, the Fwd_IAT_Total feature is found in all “7/7” FSTs for the Brute Force web attack. All of the other superscripts in Table 3 are found the same way by parsing their respective Feature Importance Lists and finding the number of FSTs which agree upon a particular feature in a given web attack.

Table 3 4 FST Agreement Lists for only Brute Force web attack (superscripts indicate \(\text {fst}_{\mathrm{A}}\), which is the number of FSTs agreeing on that feature)

Next, Table 4 provides one consolidated listing of 12 different FST Agreement Lists. Each of the three web attack columns actually contains four different FST Agreement Lists, where these four lists simply differ by \(\text {fst}_{\mathrm{A}}\). We consolidated everything into one table so that these lists are easier to parse in the third (later) step. For example, the “4/7” FST Agreement List for BF contains all 11 features displayed in the BF column of Table 4, as it is inclusive of the FST Agreement Lists where 5, 6, and 7 FSTs agree as well. Table 3 is simply a more detailed example of the four different FST Agreement Lists for the Brute Force column of Table 4. These tables are the simplest way to visually illustrate how FST Agreement Lists are created.

Table 4 12 FST Agreement Lists for 3 Web Attacks with each having 4 levels of FST Agreement (numeric superscripts indicate \(\text {fst}_{\mathrm{A}}\), the number of FSTs agreeing on that feature for that attack)
Listing 3 Pseudocode for creating the FST Agreement Lists (Step 2)

Listing 3 illustrates the pseudocode for building FST Agreement Lists. First, we iterate over each dataset (\(\hbox {ds}_{\mathrm{T}}\)=3) and build a specified number of FST Agreement Lists for each dataset. In our example, we selected \(\text {fst}_{\mathrm{A}}\)=4 to be the minimum agreement criterion for our ensemble FST; otherwise, too many features would have been included in our FST Agreement Lists. In other words, it would be undesirable to select \(\text {fst}_{\mathrm{A}}\)=3, as many of those FST Agreement Lists would contain more than 20 features (\(\hbox {n}_{\mathrm{T}}\)). We also selected \(\text {fst}_{\mathrm{A}}\)=7 to be our maximum agreement criterion. The inner loop of Listing 3 then iterates over each of our four values of \(\text {fst}_{\mathrm{A}}\)={4,5,6,7}, effectively building four FST Agreement Lists for each of our three datasets.
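A rough Python sketch of this step, continuing the variables from the earlier sketches, is given below; it is an illustration of the described logic rather than the authors' pseudocode.

from collections import Counter

# Sketch of Listing 3 (Step 2): for each dataset, count how many of the fst_T Feature
# Importance Lists contain each feature, then build one FST Agreement List per
# agreement level fst_a.
fst_agreement_lists = {}
for ds in datasets:
    counts = Counter()
    for fst in fsts:
        counts.update(feature_importance_lists[(ds, fst)])
    for fst_a in fst_A_values:
        # A feature qualifies when at least fst_a of the fst_T FSTs agree on it,
        # so the fst_a=4 list is inclusive of the 5, 6, and 7 agreement lists.
        fst_agreement_lists[(ds, fst_a)] = sorted(
            feature for feature, count in counts.items() if count >= fst_a)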

Create feature popularity lists (Step 3)

Finally, creating the actual Feature Popularity Lists is the third and final step of the process. The FST Agreement Lists from the prior step are the main input of this step. The best way to understand this process is through a specific example like the following. Table 5 indicates the two Feature Popularity Lists where four of seven FSTs agree (\(\text {fst}_{\mathrm{A}}\)=4 and \(\hbox {fst}_{\mathrm{T}}\)=7). This table contains two different lists, as we have one list where two of three web attack datasets agree (\(\hbox {ds}_{\mathrm{A}}\)=2 and \(\hbox {ds}_{\mathrm{T}}\)=3), and a second list where three of three datasets agree (\(\hbox {ds}_{\mathrm{A}}\)=3 and \(\hbox {ds}_{\mathrm{T}}\)=3).

Table 5 2 Feature Popularity Lists where 4/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=4 and \(\hbox {ds}_{\mathrm{A}}\)={2,3})

The “2/3 Datasets & 4/7 FSTs Agree” Feature Popularity List in Table 5 is generated by parsing the three FST Agreement Lists from the prior step where \(\text {fst}_{\mathrm{A}}\)=4. We include a feature in this new Feature Popularity List when at least two of the three datasets agree that it is a popular feature. For example, the Bwd_IAT_Mean feature is found in exactly two of the three FST Agreement Lists for web attacks where \(\text {fst}_{\mathrm{A}}\)=4, and so it is included in the “2/3 Datasets & 4/7 FSTs Agree” list. However, Bwd_IAT_Mean is not included in the “3/3 Datasets & 4/7 FSTs Agree” list because it does not occur in all three FST Agreement Lists for web attacks where \(\text {fst}_{\mathrm{A}}\)=4. Flow_Bytes_s is included in both the “2/3 Datasets & 4/7 FSTs Agree” and “3/3 Datasets & 4/7 FSTs Agree” Feature Popularity Lists because it occurs in all three of the FST Agreement Lists where \(\text {fst}_{\mathrm{A}}\)=4. In other words, the “3/3 Datasets” lists are a subset of the “2/3 Datasets” lists.

Table 6 2 Feature Popularity Lists where 5/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=5 and \(\hbox {ds}_{\mathrm{A}}\)={2,3})

Next, in Table 6, Flow_Bytes_s also appears in both the “2/3 Datasets & 5/7 FSTs Agree” and “3/3 Datasets & 5/7 FSTs Agree” Feature Popularity Lists, and it is one of the most “popular” features. The “5/7 FSTs Agree” lists are likewise a subset of the “4/7 FSTs Agree” lists. This is one of the nice characteristics of the feature popularity technique: we can keep applying more (or less) restrictive criteria for \(\text {fst}_{\mathrm{A}}\) and \(\hbox {ds}_{\mathrm{A}}\) until our Feature Popularity Lists contain a suitable number of features.

Feature Popularity Lists are mostly empty for \(\text {fst}_{\mathrm{A}}\)=6 and \(\text {fst}_{\mathrm{A}}\)=7 in Tables 7 and 8. The only feature which appears is Flow_IAT_Max, and it appears in the least restrictive of these four Feature Popularity Lists, where \(\text {fst}_{\mathrm{A}}\)=6 and \(\hbox {ds}_{\mathrm{A}}\)=2. So, we can say that Flow_IAT_Max is the most popular feature overall. However, we did not consider running machine learning models with only one input feature, as we thought that was too few for the purposes of this experiment. Overall, we construct eight Feature Popularity Lists with four levels of FST agreement criteria (\(\text {fst}_{\mathrm{A}}\)={4,5,6,7}) and two levels of dataset agreement criteria (\(\hbox {ds}_{\mathrm{A}}\)={2,3}). However, we only conduct machine learning experiments with the four Feature Popularity Lists from Tables 5 and 6, as the other four Feature Popularity Lists were either empty or contained only one feature.

Table 7 2 Feature Popularity Lists where 6/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=6 and \(\hbox {ds}_{\mathrm{A}}\)={2,3})
Table 8 2 Feature Popularity Lists where 7/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=7 and \(\hbox {ds}_{\mathrm{A}}\)={2,3})

In Listing 4 we illustrate this last step with pseudocode, where the FST Agreement Lists from the prior step are used as the main input of building the Feature Popularity Lists. First, we iterate over all four values of our FST agreement criteria (\(\text {fst}_{\mathrm{A}}\)={4,5,6,7}). Next in the inner loop, we iterate over our two values of dataset agreement criteria (\(\hbox {ds}_{\mathrm{A}}\)={2,3}). For each value of \(\hbox {ds}_{\mathrm{A}}\), we are finding common (popular) features among the three datasets (\(\hbox {ds}_{\mathrm{T}}\)=3) for the FST Agreement Lists for particular values of \(\text {fst}_{\mathrm{A}}\). For example, with \(\text {fst}_{\mathrm{A}}\)=4 and \(\hbox {ds}_{\mathrm{A}}\)=2, we will find common (popular) features where 4 of 7 FSTs agree and 2 of 3 datasets agree. In this case, we will find common (popular) features among two (\(\hbox {ds}_{\mathrm{A}}\)=2) of the three FST Agreement Lists (\(\hbox {ds}_{\mathrm{T}}\)=3) having \(\text {fst}_{\mathrm{A}}\)=4. From the second step of our example, only three FST Agreement Lists were generated where \(\text {fst}_{\mathrm{A}}\)=4 (one for each dataset).

Listing 4 Pseudocode for creating the Feature Popularity Lists (Step 3)
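A corresponding Python sketch of this final step, continuing the variables from the earlier sketches, is given below; as before, it illustrates the described logic and is not the authors' pseudocode.

from collections import Counter

# Sketch of Listing 4 (Step 3): for each FST agreement level fst_a, count how many of
# the ds_T FST Agreement Lists contain each feature, then keep the features meeting
# each dataset agreement criterion ds_a.
feature_popularity_lists = {}
for fst_a in fst_A_values:
    counts = Counter()
    for ds in datasets:
        counts.update(fst_agreement_lists[(ds, fst_a)])
    for ds_a in ds_A_values:
        feature_popularity_lists[(fst_a, ds_a)] = sorted(
            feature for feature, count in counts.items() if count >= ds_a)

# For example, feature_popularity_lists[(5, 3)] is the "3/3 Datasets & 5/7 FSTs Agree" list.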

Results and discussion

Dataset similarity

First, results are provided for Jaccard similarity scores between the FST Agreement Lists of the three different web attacks: Brute Force, SQL Injection, and XSS. Jaccard similarity scores are provided between these three web attacks for the following four different levels of FST Agreement criteria: \(\text {fst}_{\mathrm{A}}\)={4,5,6,7}. Tables 9, 10, 11, 12 include the Jaccard similarity scores for the FST Agreement Lists of these three web attacks and varying levels of \(\text {fst}_{\mathrm{A}}\).

From Tables 9 and 10, where \(\text {fst}_{\mathrm{A}}\)=4 and \(\text {fst}_{\mathrm{A}}\)=5 respectively, we can easily observe that SQL and XSS have the most features in common between their respective subsets. However, we cannot easily determine which pair of web attacks has the fewest features in common, since a different pair has the lowest score at \(\text {fst}_{\mathrm{A}}\)=4 than at \(\text {fst}_{\mathrm{A}}\)=5. Regardless, these scores are close enough to each other for our purposes of generating Feature Popularity Lists, as we were able to obtain a desirable number of popular features in Tables 5 and 6 (where these lists had fewer than 20 features but more than 2-3 features). In the next section, we employ machine learning to determine whether these lists of fewer features cause serious performance degradation.

For the Jaccard similarity scores of Tables 11 and 12, where six or seven FSTs must agree for the FST Agreement Lists between the three different web attacks, we notice a sharp dropoff in Jaccard similarity scores. This is also evidenced in the Feature Popularity Lists for those FST agreement criteria in Tables 7 and 8, which are mostly empty. In this regard, Jaccard similarity scores may help us identify more desirable FST agreement criteria thresholds to employ (by looking for steep dropoffs in scores).

Table 9 Jaccard similarities by dataset for FST Agreement Lists where 4/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=4)
Table 10 Jaccard similarities by dataset for FST Agreement Lists where 5/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=5)
Table 11 Jaccard similarities by dataset for FST Agreement Lists where 6/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=6)
Table 12 Jaccard similarities by dataset for FST Agreement Lists where 7/7 FSTs Agree (\(\text {fst}_{\mathrm{A}}\)=7); na values indicate there is no feature 7 out of 7 rankers agree on for at least one dataset in a pair

Similarly, Jaccard similarity scores could also help us understand which types of cyberattacks might be good candidates for generating Feature Popularity Lists together. Conversely, very low Jaccard similarity scores between certain cyberattacks could indicate they are not good candidates to group together within the same Feature Popularity Lists, and that different classes of attacks might be better suited to separate Feature Popularity Lists. For example, if Denial of Service attack types obtained very divergent Jaccard similarity scores for their FST Agreement Lists as compared to web attack types, then separate Feature Popularity Lists could be created for each class of attack as appropriate.

This is an introductory study of our feature popularity framework with only three different attacks. However, applying these techniques to dozens or even hundreds of different types of cyberattacks might be even more helpful, as the Jaccard similarity scores of their respective FST Agreement Lists could be used to properly group different cyberattacks into different Feature Popularity Lists. Future work can extend the feature popularity framework to many different types of cyberattacks by grouping them into different Feature Popularity Lists. Collectively, additional Feature Popularity List groupings for different cyberattacks might even improve classification performance. At a minimum, it would provide better insights into the application domain problem with easier to explain models.

Feature popularity performance

In this section, classification performance results are provided for both before and after we apply our new feature popularity framework. Overall, we observe that classification performance is not degraded too much with our Feature Popularity Lists which have fewer features. In some cases, classification performance is even improved. Regardless of classification performance, employing Feature Popularity Lists is a powerful framework which enabled us to uncover previously unseen insights into the attack detection process with CSE-CIC-IDS2018 data.

Table 13 provides the classification performance results with four Feature Popularity Lists and “All Features” for five classifiers (CB, DT, LGB, RF, and XGB) with three different web attacks: BF, SQL, and XSS. For the FST column, “All Features” refers to the full feature set of 66 features (before any feature selection technique is applied), and the four Feature Popularity Lists comprise: “2/3 & 4/7 Agree”, “2/3 & 5/7 Agree”, “3/3 & 4/7 Agree”, and “3/3 & 5/7 Agree”. With these Feature Popularity Lists, the first fractional term refers to \(\hbox {ds}_{\mathrm{A}}\) (specifying how many datasets agree) and the second fractional term refers to \(\text {fst}_{\mathrm{A}}\) (specifying how many FSTs agree). Classification performance is presented in terms of AUC for these five different levels of Feature Popularity Lists in the FST column across the five classifiers for each of the three web attacks. The three different web attacks are represented as three columns in the table. “SD AUC” refers to the standard deviation for each AUC score. Top AUC scores are indicated in bold for each combination of: FST, classifier, and web attack.

Overall, when visually inspecting Table 13, we can see that classification performance for the Feature Popularity Lists is not seriously degraded. For 5 of the 15 classifier and web attack combinations, Feature Popularity Lists have higher scores than “All Features”. Moreover, the best AUC score among the Feature Popularity Lists is never more than 0.02 AUC lower than the “All Features” score. In particular, even the least restrictive “2/3 & 4/7 Agree” Feature Popularity List does not perform more than 0.02 AUC below the “All Features” score, and the other Feature Popularity Lists are simply subsets of this “2/3 & 4/7 Agree” list (with fewer features due to more restrictive agreement criteria). Throughout most of this experiment, classification performance is only mildly degraded by employing Feature Popularity Lists, and performance is even improved in several cases.

Table 13 Classification performance for 4 feature popularity lists (plus all features), 3 web attacks, and 5 classifiers

Cybersecurity analysis and insights

A major benefit of feature popularity is providing domain experts with new insights from models which are more explainable. Our feature popularity framework led us to major discoveries into the web attack detection process within the CSE-CIC-IDS2018 dataset, even though we had intensely researched this dataset in prior work [21]. Based on our survey of other CSE-CIC-IDS2018 studies, none of them have identified these insights into the web attack detection process as of the date of this writing.

Our most restrictive “3/3 & 5/7 Agree” Feature Popularity List (\(\hbox {ds}_{\mathrm{A}}\)=3 and \(\text {fst}_{\mathrm{A}}\)=5) only includes the following four features from Table 6: Flow_Bytes_s (flow bytes per second), Flow_IAT_Max (maximum time between two packets in the flow), Fwd_IAT_Std (standard deviation of the time between two packets sent in the forward direction), and Fwd_IAT_Total (total time between two packets sent in the forward direction). Using only these four input features, our machine learning models from Table 13 achieved favorable classification performance which was nearly as good as with the “All Features” dataset. All four of these features are mainly based upon the time dimension. From a cybersecurity analyst’s perspective, these four features do not truly signal SQL Injection or XSS web attacks. In other words, detection of these two web attacks should not be based primarily upon temporal features. For the third and only other web attack label in CSE-CIC-IDS2018, Brute Force, it is less clear whether detection should be based primarily on time-based features, and so we discuss it separately.

Attack characteristics of SQL Injection and XSS web attacks are mainly found in the application layer (7) of the OSI model [41], as the payloads for these two web attacks operate at protocols which are in layer 7 of the OSI model. The four features (Flow_Bytes_s, Flow_IAT_Max, Fwd_IAT_Std, and Fwd_IAT_Total) are features based on NetFlows [42, 43] and are operating at lower layers 3 and 4 of the OSI model. Overall, these four features are not indicating attack signatures for these web attacks, because their attack fingerprints occur in the application layer (7) of the OSI model.

For example, Flow_Bytes_s does not signal a SQL Injection or XSS web attack. The Flow_Bytes_s feature merely indicates the number of bytes per second in a network flow. Normal web traffic can just as easily produce lower, similar, or higher values for the Flow_Bytes_s feature as compared to SQL Injection or XSS web attack traffic. In other words, Flow_Bytes_s does not properly discriminate normal web traffic from SQL Injection or XSS web attack traffic.

One small and brief web request representing normal traffic could just as easily have similar values for Flow_Bytes_s as a slow and stealthy SQL Injection or XSS web attack. The same logic applies to moderate velocity normal traffic as compared to moderate velocity SQL Injection and XSS web attack traffic for the Flow_Bytes_s feature. While it could be argued that very high velocity traffic for the Flow_Bytes_s feature could signal web attacks such as SQL Injection or XSS, this is simply not the case in the CSE-CIC-IDS2018 dataset, as high velocity attack traffic does not exist for these two web attack labels. In CSE-CIC-IDS2018, the SQL Injection label only encompasses 87 instances and the XSS label only encompasses 230 instances. Plus, such an approach would not detect slow and stealthy web attacks.

The other three features (Flow_IAT_Max, Fwd_IAT_Std, and Fwd_IAT_Total) have the same problems as Flow_Bytes_s in discriminating between normal web traffic and SQL Injection and XSS web attacks. These features all signal information from layers 3 and 4 of the OSI model, not layer 7. Plus, these four features are heavily focused on the time dimension. SQL Injection and XSS web attacks do not typically have characteristics which are based on temporal features (especially when executed in a slow and stealthy fashion by attackers seeking to avoid detection). Instead of detecting these classes of web attacks based on time, other attack characteristics could be used, such as those found in the application layer. Better examples of attack characteristics for these classes of web attacks are parsing text payloads for malicious sequences of characters or monitoring error logs (both are in the application layer of the OSI model).

Then, the question arises of what could be signaling such good detection of SQL Injection and XSS web attacks within the CSE-CIC-IDS2018 dataset. We can only speculate on this question, as this question deserves further research. One possibility could be unintentional contamination in the data collection process, where the machine learning models are detecting patterns that are discriminating between attack and normal traffic based on temporal patterns of the data collection and not the underlying signatures of the web attacks. Future work can further investigate this phenomenon.

With regard to the Brute Force web attacks, the same argument holds as for the SQL Injection and XSS web attacks: these four features (Flow_Bytes_s, Flow_IAT_Max, Fwd_IAT_Std, and Fwd_IAT_Total) do not necessarily signal a web attack. It is true that these four features might signal a Brute Force web attack during a very extreme scenario of massive web traffic spikes. An example would be a Brute Force attack which resembles a Denial of Service attack, where the attacker causes a massive flood of web traffic. However, this approach would not detect Brute Force web attacks which are more slow and stealthy in nature. Many attackers seek to evade detection, and only using these four features would effectively miss one of the most important classes of attacker adversaries (those seeking to avoid detection).

Most importantly, the CSE-CIC-IDS2018 dataset contains only 611 Brute Force web attack labels as compared to over 2 million “Normal” labels for the two days of data collection covering web attacks. Given that Brute Force web attack labels account for roughly 0.03% of the traffic for those two days, our machine learning models are not detecting some sort of “flood” type of Brute Force web attack. Instead, they are likely detecting other patterns related to the data collection, which requires further research.

Even for Brute Force web attacks, the higher application layer (7) of the OSI model contains better attack characteristics than the lower network layers 3 and 4 (where NetFlow features reside). The OWASP Top 10 [11] contains two items on handling Brute Force web attacks at the application layer. First, “OWASP A2:2017-Broken Authentication” [44] indicates that web applications should “limit or increasingly delay failed login attempts” and “log all failures and alert administrators when credential stuffing, brute force, or other attacks are detected”. Second, “OWASP A10:2017-Insufficient Logging & Monitoring” [45] highlights that “exploitation of insufficient logging and monitoring is the bedrock of nearly every major incident”. Essentially, properly designed web applications would thwart “flood” types of Brute Force web attacks by increasingly delaying failed logins, and sensors based on application layer logs would still be best equipped to detect Brute Force web attacks which are slower and stealthier.

Even though we obtained respectable classification results for web attacks in this study, our newly conceived feature popularity framework allowed us to realize that the features our models relied upon did not make sense from a cybersecurity analyst’s perspective. When looking at all 79 independent features of the downloaded CSE-CIC-IDS2018 dataset, it can be difficult for a cybersecurity analyst to ascertain whether those NetFlow-based features are good candidates for detecting web attacks. Even after generating a myriad of Feature Importance Lists in Tables 14, 15, 16, 17, 18, 19, this remained unclear, as the lists were very divergent from each other. After employing feature popularity, which enabled us to produce more explainable models, we could ascertain that our top four features (Flow_Bytes_s, Flow_IAT_Max, Fwd_IAT_Std, and Fwd_IAT_Total) did not properly characterize the web attack signatures in question. Overall, future research can further answer the question of whether NetFlow-based features are even good candidates for detecting web attacks occurring at the application layer of the OSI model.

Conclusion

Feature popularity is a novel framework that we introduce in this study, and we implement it with CSE-CIC-IDS2018 big data and the following three web attacks: Brute Force, SQL Injection, and XSS. These three web attacks are partitioned into three separate datasets so that we can employ feature popularity. For our underlying feature selection techniques, we use three filter-based rankers and four supervised learning-based rankers: Chi Squared, Information Gain, Information Gain Ratio, CB, LGB, RF, and XGB.

First, we generate Feature Importance Lists, where the top 20 features are produced for each of our three web attack datasets by each of our seven FST rankers. Second, we create FST Agreement Lists, which find the common features across each dataset’s Feature Importance Lists according to varying levels of FST agreement criteria (\(\text {fst}_{\mathrm{A}}\)) among our seven FSTs. Third, we build Feature Popularity Lists from the FST Agreement Lists according to varying levels of dataset agreement criteria (\(\hbox {ds}_{\mathrm{A}}\)={2,3}) and FST agreement criteria (\(\text {fst}_{\mathrm{A}}\)={4,5,6,7}). Our feature popularity technique effectively builds an ensemble of ensembles by first building an ensemble of FSTs for each dataset, and then building another ensemble across a dataset agreement dimension.
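To make this three-step pipeline concrete, the following minimal Python sketch illustrates only the counting logic under stated assumptions: the Feature Importance Lists are assumed to already exist as plain lists of feature names, and the dataset names, FST names, and toy feature lists shown are hypothetical placeholders rather than this study's actual code or data.

from collections import Counter

def fst_agreement_list(fils_for_dataset, fst_a):
    # FST Agreement List (FAL): features appearing in at least fst_a of a
    # dataset's Feature Importance Lists (an ensemble across the FSTs).
    counts = Counter(f for fil in fils_for_dataset.values() for f in set(fil))
    return {feat for feat, n in counts.items() if n >= fst_a}

def feature_popularity_list(fils_by_dataset, ds_a, fst_a):
    # Feature Popularity List (FPL): features whose FAL membership spans at
    # least ds_a datasets (a second ensemble across the dataset dimension).
    fals = [fst_agreement_list(fils, fst_a) for fils in fils_by_dataset.values()]
    counts = Counter(f for fal in fals for f in fal)
    return sorted(feat for feat, n in counts.items() if n >= ds_a)

# Hypothetical toy input: fils_by_dataset[dataset][fst] = top-N feature names.
fils_by_dataset = {
    "BruteForce":   {"ChiSquared": ["Flow_Bytes_s", "Flow_IAT_Max"],
                     "XGB":        ["Flow_Bytes_s", "Fwd_IAT_Std"]},
    "SQLInjection": {"ChiSquared": ["Flow_Bytes_s", "Fwd_IAT_Total"],
                     "XGB":        ["Flow_Bytes_s", "Flow_IAT_Max"]},
    "XSS":          {"ChiSquared": ["Flow_Bytes_s", "Flow_IAT_Max"],
                     "XGB":        ["Flow_Bytes_s", "Fwd_IAT_Std"]},
}

# Features agreed upon by at least 2 FSTs within each dataset and by all 3 datasets.
print(feature_popularity_list(fils_by_dataset, ds_a=3, fst_a=2))  # ['Flow_Bytes_s']

In the actual experiment, the FST agreement criteria range over {4,...,7} across the seven FSTs, and the dataset agreement criteria range over {2,3} across the three web attack datasets.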

We also introduce the use of the Jaccard similarity score with our FST Agreement Lists to quantify how similar (or dissimilar) various pairs of FST Agreement Lists are across different FST agreement criteria (\(\text {fst}_{\mathrm{A}}\)). Employing Jaccard similarity scores in this manner becomes more important as more datasets are considered with the feature popularity framework. These Jaccard similarity scores can help decide which classes of attacks should be grouped together with feature popularity, and which should be separated into different feature popularity groupings.
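As a brief illustration of this comparison, the Jaccard similarity between two FST Agreement Lists is simply the size of their intersection divided by the size of their union; the two lists below are hypothetical placeholders, and this sketch is not the study's exact tooling.

def jaccard_similarity(features_a, features_b):
    # Jaccard similarity: |intersection| / |union| of two feature sets.
    a, b = set(features_a), set(features_b)
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Hypothetical FST Agreement Lists for two attack datasets at the same fst_A level.
fal_sql_injection = {"Flow_Bytes_s", "Flow_IAT_Max", "Fwd_IAT_Std"}
fal_xss           = {"Flow_Bytes_s", "Flow_IAT_Max", "Fwd_IAT_Total"}

print(jaccard_similarity(fal_sql_injection, fal_xss))  # 2 shared / 4 total = 0.5

Pairs of attack classes with high Jaccard scores are natural candidates for a shared feature popularity grouping, while pairs with low scores may be better served by separate groupings.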

Classification performance did not seriously degrade with Feature Popularity Lists (which contain fewer features) as compared to the “All Features” list, and in 5 out of 15 cases the Feature Popularity Lists even fared better. AUC scores degraded by no more than 0.02 for the best of the Feature Popularity List groupings compared to the “All Features” list. Classification performance was evaluated with the following five classifiers: CB, DT, LGB, RF, and XGB. Overall, LightGBM performed the best, and classification performance held especially well for Feature Popularity Lists with LGB.

Not only does feature popularity produce models which are easier to understand and implement, but it can also provide new insights to application domain experts. Even though we had been working intensely with web attacks from CSE-CIC-IDS2018, we did not arrive at these new realizations until working with the models produced by feature popularity and the results from its most popular features. With our feature popularity experiment and underlying FSTs, the four most popular features for CSE-CIC-IDS2018 web attacks are: Flow_Bytes_s, Flow_IAT_Max, Fwd_IAT_Std, and Fwd_IAT_Total.

When using only these four most popular features as input to our classifiers, we still achieved nearly the same favorable classification performance as compared to the “All Features” list. We realized these four features do not signal attack characteristics for our three web attacks: Brute Force, SQL Injection, and XSS. These four features are mainly based upon the time dimension and NetFlow-based attributes from layers 3 and 4 of the OSI model.
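As an illustration of this comparison (not the study's exact experimental protocol), a minimal LightGBM/scikit-learn sketch is shown below; it assumes a pandas DataFrame df holding the prepared CSE-CIC-IDS2018 features with a binary label column encoded as 0 = normal and 1 = attack, and the variable and column names are assumptions for illustration.

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

POPULAR_FEATURES = ["Flow_Bytes_s", "Flow_IAT_Max", "Fwd_IAT_Std", "Fwd_IAT_Total"]

def auc_for_features(df, feature_cols, label_col="Label"):
    # Train LightGBM on the chosen feature subset and report holdout AUC.
    # The label column is assumed to be encoded as 0 = normal, 1 = attack.
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[label_col], test_size=0.2,
        stratify=df[label_col], random_state=42)
    model = lgb.LGBMClassifier(n_estimators=200)
    model.fit(X_train, y_train)
    return roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# all_feature_cols: every independent feature column in the prepared dataset.
# auc_all  = auc_for_features(df, all_feature_cols)
# auc_top4 = auc_for_features(df, POPULAR_FEATURES)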

However, our three web attacks operate at the application layer (7) of the OSI model and should not be leaving signatures in these four features. Instead, something other than the attack signatures of these three web attacks is causing them to be correctly classified as attacks. Future work can evaluate whether unintentional contamination in the data collection patterns for CSE-CIC-IDS2018 is signalling these web attacks. Future work can also consider whether NetFlow-based features can legitimately signal web attack payloads from the application layer (7) of the OSI model in terms of forensic evidence.

Feature popularity is a powerful new framework which can be applied to any application domain. Any multi-class classification problem (containing more than two classes) can be decomposed with feature popularity so that the most popular features for its multiple classes can be discovered through ensembles across a new dataset agreement dimension (\(\hbox {ds}_{\mathrm{A}}\)) as well as an FST agreement dimension (\(\text {fst}_{\mathrm{A}}\)). The beauty of the feature popularity technique is that its agreement criteria for \(\hbox {ds}_{\mathrm{A}}\) and \(\text {fst}_{\mathrm{A}}\) can be tuned by a practitioner until the desired classification performance is achieved with a smaller set of more popular features, as sketched below.
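A simple tuning loop, sketched below under the same assumptions as the earlier snippets (reusing the hypothetical helpers feature_popularity_list and auc_for_features, with an AUC tolerance chosen by the practitioner), sweeps the (\(\hbox {ds}_{\mathrm{A}}\), \(\text {fst}_{\mathrm{A}}\)) grid and keeps the smallest Feature Popularity List whose performance stays within the tolerance of the “All Features” baseline.

from itertools import product

def tune_agreement_criteria(df, fils_by_dataset, all_feature_cols, tol=0.02):
    # Keep the smallest Feature Popularity List within `tol` AUC of the baseline.
    baseline_auc = auc_for_features(df, all_feature_cols)
    best_features, best_auc = all_feature_cols, baseline_auc
    for ds_a, fst_a in product((2, 3), (4, 5, 6, 7)):
        fpl = feature_popularity_list(fils_by_dataset, ds_a, fst_a)
        if not fpl:
            continue  # agreement criteria too strict: no popular features remain
        auc = auc_for_features(df, fpl)
        if baseline_auc - auc <= tol and len(fpl) < len(best_features):
            best_features, best_auc = fpl, auc
    return best_features, best_auc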

Future work can consider additional application domains as well as additional types of cyberattacks, as feature popularity is a flexible framework that can accommodate any application domain or cyberattack dataset with multiple class labels (more than binary classification). Other FSTs and classifiers can also be implemented with feature popularity. Additionally, different cutoff points for the “Top N” features of the Feature Importance Lists could be investigated. Feature stability could also be explored with feature popularity to determine whether the same most popular features persist as datasets evolve with new data over time, and whether this ensemble approach provides better feature stability than simpler feature selection techniques. Finally, future research could develop these feature popularity frameworks in an automated manner through open-source tools to allow easier exploration of popular features across different cyberattacks or datasets.

Availability of data and materials

Not applicable.

Abbreviations

XAI: eXplainable Artificial Intelligence

FST: Feature Selection Technique

RUS: Random undersampling

DT: Decision Tree

RF: Random Forest

CB: CatBoost

LGB: LightGBM

XGB: XGBoost

NB: Naive Bayes

LR: Logistic Regression

GBT: Gradient Boosted Tree

AUC: Area Under the Receiver Operating Characteristic Curve

ROC: Receiver Operating Characteristic

TPR: True Positive Rate

FPR: False Positive Rate

GOSS: Gradient-based One-Side Sampling

EFB: Exclusive Feature Bundling

BF: Brute Force

XSS: Cross-Site Scripting

FIL: Feature Importance List

FAL: FST Agreement List

FPL: Feature Popularity List

OWASP: Open Web Application Security Project

OHE: One Hot Encoding

References

  1. Young J. US ecommerce sales grow 14.9% in 2019. 2020. https://www.digitalcommerce360.com/article/us-ecommerce-sales/. Accessed 28 Nov 2020

  2. Saeys Y, Abeel T, Van de Peer Y. Robust feature selection using ensemble feature selection techniques. In: Joint European conference on machine learning and knowledge discovery in databases. Springer; 2008. p. 313–325

  3. Seijo-Pardo B, Porto-Díaz I, Bolón-Canedo V, Alonso-Betanzos A. Ensemble feature selection: homogeneous and heterogeneous approaches. Knowl Based Syst. 2017;118:124–39.

  4. Kalousis A, Prados J, Hilario M. Stability of feature selection algorithms: a study on high-dimensional spaces. Knowl Inf Syst. 2007;12(1):95–116.

  5. Zuech R, Hancock J, Khoshgoftaar TM. Feature popularity between different web attacks with supervised feature selection rankers. In: 2021 20th IEEE international conference on machine learning and applications (ICMLA). IEEE; 2021. p. 30–37

  6. Sharafaldin I, Lashkari AH, Ghorbani AA. Toward generating a new intrusion detection dataset and intrusion traffic characterization. In: ICISSP. 2018. p. 108–116

  7. CICIDS2017 Dataset. 2020. https://www.unb.ca/cic/datasets/ids-2017.html. Accessed 28 Aug 2020

  8. Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data. 2018;5(1):1–30.

  9. Soltysik RC, Yarnold PR. Megaoda large sample and big data time trials: separating the chaff. Optim Data Anal. 2013;2:194–7.

  10. Cao M, Chychyla R, Stewart T. Big data analytics in financial statement audits. Account Horiz. 2015;29(2):423–9.

  11. OWASP Top Ten webpage. 2020. https://owasp.org/www-project-top-ten/. Accessed 10 Aug 2021

  12. Sarhan M, Layeghy S, Portmann M. An explainable machine learning-based network intrusion detection system for enabling generalisability in securing iot networks. arXiv preprint arXiv:2104.07183 2021

  13. Leevy JL, Hancock J, Zuech R, Khoshgoftaar TM. Detecting cybersecurity attacks across different network features and learners. J Big Data. 2021;8(1):1–29.

  14. Fitni QRS, Ramli K. Implementation of ensemble learning and feature selection for performance improvements in anomaly-based intrusion detection systems. In: 2020 IEEE international conference on industry 4.0, artificial intelligence, and communications technology (IAICT). IEEE; 2020. p. 118–124.

  15. Beechey M, Kyriakopoulos KG, Lambotharan S. Evidential classification and feature selection for cyber-threat hunting. Knowl Based Syst. 2021;226:107120.

  16. Hua Y. An efficient traffic classification scheme using embedded feature selection and lightgbm. In: 2020 information communication technologies conference (ICTC). IEEE; 2020. p. 125–130.

  17. Zhang H, Huang L, Wu CQ, Li Z. An effective convolutional neural network based on smote and gaussian mixture model for intrusion detection in imbalanced dataset. Comput Netw. 2020;177: 107315.

  18. CSE-CIC-IDS2018 Dataset. 2020. https://www.unb.ca/cic/datasets/ids-2018.html. Accessed 28 Aug 2020

  19. Arlot S, Celisse A, et al. A survey of cross-validation procedures for model selection. Stat Surv. 2010;4:40–79.

  20. Kohavi R, et al. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: IJCAI, vol. 14. Montreal, Canada; 1995. p. 1137–1145.

  21. Zuech R, Hancock J, Khoshgoftaar TM. Investigating rarity in web attacks with ensemble learners. J Big Data. 2021;8(1):1–27.

  22. Scikit-learn website. 2020. https://scikit-learn.org/stable/. Accessed 30 Jan 2021

  23. Myles AJ, Feudale RN, Liu Y, Woody NA, Brown SD. An introduction to decision tree modeling. J Chemom J Chemom Soc. 2004;18(6):275–85.

  24. Raileanu LE, Stoffel K. Theoretical comparison between the gini index and information gain criteria. Ann Math Artif Intell. 2004;41(1):77–93.

  25. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

  26. Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.

  27. CatBoost home page. 2020. https://catboost.ai/. Accessed 28 Aug 2020

  28. Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. Catboost: unbiased boosting with categorical features. In: Advances in neural information processing systems. 2018. p. 6638–6648

  29. LightGBM GitHub website. 2020. https://github.com/microsoft/LightGBM. Accessed 28 Aug 2020

  30. Natekin A, Knoll A. Gradient boosting machines, a tutorial. Front Neurorobot. 2013;7:21.

  31. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu T-Y. Lightgbm: a highly efficient gradient boosting decision tree. In: Advances in neural information processing systems. 2017. p. 3146–3154

  32. Chen T, Guestrin C. Xgboost: a scalable tree boosting system. In: Proceedings of the 22nd Acm Sigkdd international conference on knowledge discovery and data mining. 2016. p. 785–794

  33. Guo C, Berkhahn F. Entity embeddings of categorical variables. 2016. arXiv preprint arXiv:1604.06737

  34. Scikit-learn Documentation—Feature Selection. 2020. https://scikit-learn.org/stable/modules/feature_selection.html. Accessed 16 Aug 2021

  35. Zien A, Krämer N, Sonnenburg S, Rätsch G. The feature importance ranking measure. In: Joint European conference on machine learning and knowledge discovery in databases. Springer; 2009. p. 694–709 .

  36. Scikit-learn Documentation - chi2 Feature Selection. 2020. https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.chi2.html. Accessed 16 Aug 2021

  37. Vergara JR, Estévez PA. A review of feature selection methods based on mutual information. Neural Comput Appl. 2014;24(1):175–86.

  38. Mohammad AH. Comparing two feature selections methods information gain and gain ratio on three different classification algorithms using arabic dataset. J Theor Appl Inf Technol 2018; 96(6)

  39. info_gain Pypi project. 2020. https://pypi.org/project/info-gain/. Accessed 16 Aug 2021.

  40. Bradley AP. The use of the area under the roc curve in the evaluation of machine learning algorithms. Pattern Recogn. 1997;30(7):1145–59.

  41. Day JD, Zimmermann H. The OSI reference model. Proc IEEE. 1983;71(12):1334–40.

  42. Lashkari AH, Draper-Gil G, Mamun MSI, Ghorbani AA. Characterization of tor traffic using time based features. In: ICISSp, 2017. p. 253–262

  43. Draper-Gil G, Lashkari AH, Mamun MSI, Ghorbani AA. Characterization of encrypted and vpn traffic using time-related features. In: Proceedings of the 2nd international conference on information systems security and privacy (ICISSP). 2016. p. 407–414

  44. OWASP A2:2017-Broken Authentication. 2020. https://owasp.org/www-project-top-ten/2017/A2_2017-Broken_Authentication. Accessed 10 Aug 2021

  45. OWASP A10:2017-Insufficient Logging & Monitoring. 2020. https://owasp.org/www-project-top-ten/2017/A10_2017-Insufficient_Logging%2526Monitoring. Accessed 10 Aug 2021

Acknowledgements

We would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University.

Funding

Not applicable.

Author information

Contributions

RZ prepared the manuscript and the primary literary review for this work. JH performed the statistical analyses. All authors provided feedback to TMK and helped shape the research. TMK introduced this topic to RZ, and helped to complete and finalize this work. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Richard Zuech.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

See Tables 14, 15, 16, 17, 18, 19.

Table 14 Top 20 features for Brute Force Web Attacks ranked by Filter-based techniques
Table 15 Top 20 features for Brute Force web attacks ranked by Supervised-based Feature Importance Lists
Table 16 Top 20 features for SQL Injection web attacks ranked by Filter-based techniques
Table 17 Top 20 features for SQL Injection web attacks ranked by Supervised-based Feature Importance Lists
Table 18 Top 20 features for XSS web attacks ranked by Filter-based techniques
Table 19 Top 20 features for XSS web attacks ranked by Supervised-based Feature Importance Lists

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Zuech, R., Hancock, J. & Khoshgoftaar, T.M. A new feature popularity framework for detecting cyberattacks using popular features. J Big Data 9, 119 (2022). https://doi.org/10.1186/s40537-022-00661-9

