IoT information theft prediction using ensemble feature selection

Abstract

Recent years have seen a proliferation of Internet of Things (IoT) devices and an associated security risk from an increasing volume of malicious traffic worldwide. For this reason, datasets such as Bot-IoT were created to train machine learning classifiers to identify attack traffic in IoT networks. In this study, we build predictive models with Bot-IoT to detect attacks represented by dataset instances from the information theft category, as well as dataset instances from the data exfiltration and keylogging subcategories. Our contribution centers on evaluating the effect of ensemble feature selection techniques (FSTs) on classification performance for these specific attack instances. A group, or ensemble, of FSTs will often perform better than the best individual technique. The classifiers we use are a diverse set of four ensemble learners (Light GBM, CatBoost, XGBoost, and random forest (RF)) and four non-ensemble learners (logistic regression (LR), decision tree (DT), Naive Bayes (NB), and a multi-layer perceptron (MLP)). The metrics used for evaluating classification performance are the area under the receiver operating characteristic curve (AUC) and the area under the precision-recall curve (AUPRC). For the most part, we determined that our ensemble FSTs do not affect classification performance but are beneficial because feature reduction eases the computational burden and provides insight through improved data visualization.

Introduction

The IoT is a network of physical objects with limited computing capability [1]. In recent years, there has been rapid growth in the use of these smart devices, as well as an increasing security risk from malicious network traffic. Several datasets have been created for the purpose of training machine learning classifiers to identify attack traffic. One of the more recent datasets for network intrusion detection is Bot-IoT [2].

The Bot-IoT dataset contains instances of various attack categories: denial-of-service (DoS), distributed denial-of-service (DDoS), reconnaissance, and information theft. The processed full dataset was generated by the Argus network security tool [3] and is available from an online repository as several comma-separated values (CSV) files. Bot-IoT has 29 features and 73,370,443 instances. Table 1 shows the categories and subcategories of IoT network traffic for the full dataset.

Table 1 Bot-IoT: full set

In this work, we identify normal and attack type classes of Bot-IoT. The attack types we identify are from the information theft category and from its data exfiltration and keylogging subcategories. Hence, we evaluate three datasets, one composed of normal and information theft attack instances, another composed of normal and data exfiltration attack instances, and a third composed of normal and keylogging attack instances. The keylogging [4] subcategory refers to the interception of information sent from input devices, while the data exfiltration [5] subcategory broadly refers to the unauthorized transfer of data from a computer.

Each of the three datasets has 9543 instances labeled as Normal traffic. In addition, the Information Theft dataset has 1587 instances labeled as the information theft attack type, the data exfiltration dataset has 118 instances labeled as the data exfiltration attack type, and the keylogging dataset has 1469 instances labeled as the keylogging attack type. Based on the minority-to-majority class ratios (i.e., ratios of attack-to-normal instances), all three datasets have a class imbalance. To address the class imbalance, we use the random undersampling (RUS) [6] technique.

In all experiments, we employ the following eight learners: CatBoost [7], Light GBM [8], XGBoost [9], RF [10], DT [11], LR [12], NB [13], and a MLP [14]. To gauge the performance of these classifiers, the AUC and AUPRC metrics are used. Also, during training for some experiments, we apply hyperparameter tuning.

The crux of this study involves the comparison of results from experimentation with and without ensemble feature selection for the information theft, data exfiltration, and keylogging attack types. Feature selection reduces computational burden (i.e., speeds up training time) and provides data clarity (i.e., facilitates visual detection of patterns). In some cases, feature selection can improve classification performance and mitigate the effects of class imbalance [6]. Ensemble FSTs incorporate multiple feature ranking techniques to produce a combined method that is more efficient than its individual components [15]. Our ensemble FSTs are derived from both supervised and filter-based feature ranking techniques [16].

As far as we are aware, this is the first paper that exclusively evaluates the effect of ensemble feature selection on the Bot-IoT information theft category and subcategories. Furthermore, our use of eight different classifiers boosts the validity of our results.

The remainder of this paper is organized as follows: "Bot-Iot developmental environment" section describes the developmental environment and tools used to create Bot-IoT; "Related works" section discusses related Bot-IoT literature; "Methodology" section discusses preprocessing and classification tasks; "Results and discussion" section presents and analyzes our results; and "Conclusion" section concludes with the main points of this paper. In addition, Appendix A provides a list of features and their definitions, Appendix B contains tables of selected features, and Appendix C contains tables of tuned parameter values.

Bot-Iot developmental environment

Created in 2018 by the University of New South Wales (UNSW), Bot-IoT was designed to be a realistic representation of botnet attacks on IoT devices. The dataset was developed in a virtualized environment, with ten virtual machines hosted by an ESXi [17] hypervisor that accessed the Internet through a packet-filtering firewall. Fig. 1 shows a schematic of the network within which Bot-IoT was created.

Fig. 1 Network diagram

The virtual machines were divided into three groups: IoT, Bots, and Supporting. Normal network traffic was generated by the Ostinato [18] tool, and IoT traffic was generated by the Node-Red [19] tool, along with several scripts run on a group of the virtual machines. Node-Red was used to simulate five IoT scenarios: a weather station, a smart refrigerator, motion activated lights, a smart garage door, and a smart thermostat. Attack traffic was generated using a variety of tools from the Kali Linux [20] suite, an open-source Linux distribution popular with penetration testers and others in the information security industry.

The IoT group of virtual machines comprised four machines running different operating systems. An Ubuntu server [21], which hosted some basic network services as well as the Message Queuing Telemetry Transport (MQTT) [22] protocol broker, was one of the most essential IoT virtual machines. In IoT networks, the MQTT protocol is used to share data between IoT devices and other clients using a publish/subscribe approach. It is utilized in a range of industries and provides a lightweight communications solution that uses little bandwidth. Mosquitto [23] is the MQTT broker used in this testbed. Fig. 2 provides an illustrative example of the MQTT broker involved in a data sharing task.

Fig. 2 MQTT example

The other three IoT virtual machines ran Ubuntu Mobile [24], Windows 7, and Metasploitable [25]. Ubuntu Mobile is a Linux-based mobile phone operating system derived from the Ubuntu desktop operating system. Windows 7 is a deprecated version of the Microsoft Windows operating system. Rapid7 created and published Metasploitable, a Linux-based virtual machine with many built-in vulnerabilities. The flaws were purposely built into Metasploitable to allow it to be utilized in penetration testing laboratories. These three virtual machines used Node-Red to act as IoT clients and devices, communicating with the Ubuntu server using the MQTT protocol. These four machines combined served as the attack surface for the Bots group of virtual machines.

The Bots group was made up of four virtual machines intended to simulate a botnet, which is a collection of machines controlled and used by a criminal actor. The Kali Linux operating system was installed on each of the four machines. They were equipped with a large toolkit that was used to conduct various assaults against the IoT group. The Theft attacks were carried out with the help of exploits and tools from the Metasploit penetration testing framework [26].

The Supporting group assisted in the generation and collection of data within the testbed. These machines consisted of an Ubuntu tap and a pfSense [27] firewall. The Ubuntu tap virtual machine sat on the same virtual network as the IoT and Bots groups. It utilized a conventional Ubuntu operating system, but was configured with a promiscuous network interface controller (NIC) to operate in promiscuous mode. The automatic filtering of network traffic not intended for the host is disabled when the NIC is set to promiscuous mode. This allows packet analyzers like tcpdump [28] to record all traffic passing through the NIC. All communication to and from any of the IoT or Bots group of virtual machines was captured by the Ubuntu tap virtual machine. The pfSense firewall, which is the other virtual machine in the Supporting group, had both a local area network (LAN) interface and a wide area network (WAN) interface. The pfSense device served as the primary gateway out of the virtual network.

Related works

In this section, we highlight works associated with detecting malicious traffic in Bot-IoT. To the best of our knowledge, none of the related works have exclusively focused on detecting instances of information theft with ensemble FSTs.

Koroniotis et al. [29] proposed a network forensics model that authenticates network data flows and uses a deep neural network (DNN) [30] based on particle swarm optimization (PSO) [31] to identify traffic anomalies. The authors used the 5% Bot-IoT dataset, which they split in an 80:20 ratio for training and testing. The logistic cost function [32] was utilized, as it has been shown to be efficient at separating normal from attack traffic. The cost function is defined by the following equation [29]:

$$\begin{aligned} -\frac{1}{m}\sum _{i=1}^{m}\left( y_i\log (\hat{y}_i)+(1-y_i)\log (1-\hat{y}_i)\right) \end{aligned}$$
(1)

To treat class imbalance, weights for normal traffic \(w_0\) and attack traffic \(w_1\) were incorporated into this cost function. The modified cost function is defined by the following equation [29]:

$$\begin{aligned} -\frac{1}{m}\sum _{i=1}^{m}\left( w_1y_i\log (\hat{y}_i)+w_0(1-y_i)\log (1-\hat{y}_i)\right) \end{aligned}$$
(2)
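
Note that Eq. (2) reduces to Eq. (1) when \(w_0 = w_1 = 1\). As an illustration only (not code from [29]), the following minimal NumPy sketch computes this class-weighted cross-entropy; the labels, probabilities, and weights shown are hypothetical.

```python
import numpy as np

def weighted_cross_entropy(y_true, y_pred, w0=1.0, w1=1.0, eps=1e-12):
    """Class-weighted binary cross-entropy as in Eq. (2).

    y_true: array of 0/1 labels; y_pred: predicted probabilities.
    w0 weights the normal (negative) class, w1 the attack (positive) class.
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)  # avoid log(0)
    losses = w1 * y_true * np.log(y_pred) + w0 * (1 - y_true) * np.log(1 - y_pred)
    return -np.mean(losses)

# Hypothetical example: weighting the minority (attack) class more heavily.
y_true = np.array([0, 0, 1, 1])
y_pred = np.array([0.1, 0.3, 0.8, 0.6])
print(weighted_cross_entropy(y_true, y_pred, w0=1.0, w1=5.0))
```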

The best scores obtained by the model for accuracy, precision, recall, and F-measure were 99.90%, 100%, 99.90%, and 99.90%, respectively. Model performance was evaluated against the reported performance of models from other studies. The authors state that their model outperformed these other models (NB, MLP, association rule mining (ARM) [33], DT, support vector machine (SVM) [34], recurrent neural network (RNN) [35], and long short-term memory (LSTM) [36]). Evaluating model performance from one study against reported model performance from a non-identical study is problematic, since there may be too many factors of variation in the external study to account for.

Using a convolutional neural network (CNN) [37] and CUDA deep neural network LSTM (cuDNNLSTM) (Footnote 1) hybrid, Liaqat et al. [38] set out to prove that their proposed model could outperform a deep neural network-gated recurrent unit (DNN-GRU) [39] hybrid, as well as a long short-term memory-gated recurrent unit (LSTM-GRU) [39] hybrid. The dataset sample contained 477 normal instances and 668,522 attack instances from Bot-IoT. During data preprocessing, the data was normalized, feature extraction was performed, and, to address class imbalance, the number of normal instances was up-sampled to 2,400. The hybrid CNN-cuDNNLSTM model was shown to be the best performer, with top scores of 99.99%, 99.83%, 99.33%, and 99.33% for accuracy, precision, recall, and F-measure, respectively. We point out that the use of up-sampling techniques can sometimes result in overfitted models [40]. For this study, the authors have not provided adequate information on their up-sampling technique and the model-building process for us to rule out the occurrence of overfitting. Therefore, we believe that their proposed model should also be evaluated on out-of-sample data to reaffirm their results.

Mulyanto et al. [41] proposed a cost-sensitive neural network based on focal loss [42]. The cross-entropy loss function [43], which is widely used in neural network classification models, is integrated with focal loss to reduce the influence of the majority class(es) for binary and multi-class classification. A CNN and a DNN served as the neural network classifiers for this approach. The networks were trained on the NSL-KDD [44], UNSW-NB15, and Bot-IoT datasets. The Bot-IoT dataset sample contained about 3,000,000 instances. For binary classification, the cost-sensitive neural networks based on focal loss outperformed neural networks where the synthetic minority oversampling technique (SMOTE) [45] was applied, and also outperformed plain neural networks. For Bot-IoT, top scores were obtained for the DNN cost-sensitive, focal-loss model: 99.83% (accuracy), 99.93% (precision), 96.89% (recall), and 98.30% (F-measure). We note that an inadequate amount of information is provided on the preprocessing stage for Bot-IoT.

Ge et al. [46] trained a feedforward neural network [47] and an SVM on Bot-IoT to evaluate performance for binary and multi-class classification. About 11,175,000 Bot-IoT instances were selected. After feature extraction and data preprocessing, the dataset was split in a 64:16:20 ratio for training, validation, and testing, respectively. To address class imbalance, higher weights were assigned to underrepresented classes. Class weights for the training data were obtained by dividing the packet count for each class by the total packet count and then inverting the quotient. Classification results show that the neural network outperformed the SVM. For binary classification with the neural network, the best score for accuracy was 100%, while the best scores for precision, recall, and F-measure were all 99.90%. While multi-class classification scores were provided for the SVM classifier, binary classification scores were not. Hence, the binary classification scores for the neural network cannot be compared with classification scores for the SVM. This detracts from the authors’ conclusion that the feedforward neural network is the better classifier.

Finally, to detect DDoS attacks, Soe et al. [48] trained a feedforward neural network on 477 normal instances and about 1,900,000 DDoS instances from Bot-IoT. The dataset was split in a ratio of 66:34 for training and testing, and the SMOTE algorithm was utilized to address class imbalance. After the application of SMOTE during data preprocessing, the training dataset contained about 1,300,000 instances for each class, while the test dataset contained 655,809 normal instances and 654,285 DDoS instances. The data was then normalized. Precision, recall, and F-measure scores were all 100%. We point out that there is a lack of information on data cleaning in the study. In addition, it is unclear why the authors believe that balancing the classes (positive to negative class ratio of 50:50) is the optimal solution. Also, the study relies on only one classifier.

We again note that after reviewing these five works, none were found to be solely based on Bot-IoT information theft detection. In addition, we use ensemble feature selection for building our predictive models.

Methodology

Data cleaning

As discussed in the next three paragraphs, there are six features in Bot-IoT that do not provide generalizable information [49].

The pkSeqID and seq features were removed because they are row or sequence identifiers and only provide information regarding order. Removing pkSeqID was obvious based on the definition provided by Koroniotis et al. [2]. However, removing seq was less so, as seq has been highly utilized in many studies [49]. Based on our consultation with the developers of the Argus network security tool and on our own investigation, we discovered that seq is a monotonically increasing sequence number pertaining to the records processed by the tool. With this clarification, we determined that it did not provide any additional relevant or generalizable information for our models.

The features stime and ltime are timestamps corresponding to the start packet time and last packet time for each instance. While they are useful for determining key features like duration or rate, this information is already provided by the features dur, rate, srate, and drate. With that information already present, we believe that both stime and ltime are unlikely to provide any additional information and may contribute to the overfitting of data by our models.

The saddr and daddr features pertain to the source and destination Internet Protocol (IP) addresses for each of the instances. While this information can provide highly relevant contextual information for security analysts, we chose to exclude it because private IP addresses can vary from network to network. Should a model learn to attribute a behavior based entirely or partially on the source or destination IP, it would be ineffective if tested against a dataset generated with different IP addresses.

We also discovered that many instances using the Internet Control Message Protocol (ICMP) have a hexadecimal value for the sport and dport features or are missing values for these features. Because of this, we changed all missing and hexadecimal ICMP values for sport and dport to -1 to indicate an invalid port value.
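
As an illustration of these cleaning steps, the following pandas sketch operates on a toy frame whose column names follow Bot-IoT; the example values, and the assumption that protocol names are lowercase, are ours rather than taken from the dataset documentation.

```python
import pandas as pd

# Toy rows with the Bot-IoT columns relevant to cleaning (values illustrative).
df = pd.DataFrame({
    "pkSeqID": [1, 2], "seq": [1, 2],
    "stime": [1.5261e9, 1.5261e9], "ltime": [1.5261e9, 1.5261e9],
    "saddr": ["192.168.100.147", "192.168.100.148"],
    "daddr": ["192.168.100.3", "192.168.100.3"],
    "proto": ["tcp", "icmp"],
    "sport": [80, "0x0303"], "dport": [443, None],
    "dur": [0.2, 1.1],
})

# Drop identifiers, timestamps, and IP addresses that do not generalize.
df = df.drop(columns=["pkSeqID", "seq", "stime", "ltime", "saddr", "daddr"])

# Replace missing or hexadecimal ICMP port values with -1 (an invalid port).
def to_valid_port(value):
    try:
        return int(value)      # decimal ports pass through unchanged
    except (TypeError, ValueError):
        return -1              # missing or hex-encoded values become -1

icmp = df["proto"] == "icmp"   # assumes lowercase protocol names
for col in ["sport", "dport"]:
    df.loc[icmp, col] = df.loc[icmp, col].apply(to_valid_port)
```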

Data transformations

The flgs_number, proto_number, sport, dport, and state_number categorical features were one-hot encoded. The one-hot encoding process, which was implemented with CatBoost’s Ordered Target Statistics (Footnote 2), transforms categorical features into dummy variables. We also performed feature scaling to provide a [0,1] normalized range for all numerical features.
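
The following sketch approximates this transformation step with plain pandas dummy variables and Scikit-learn min-max scaling; it does not reproduce the CatBoost encoder referenced in the footnote, and the toy column values and the treatment of the attack label are our assumptions.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy frame standing in for the cleaned Bot-IoT data (columns illustrative).
df = pd.DataFrame({
    "state_number": [1, 3, 3, 4],
    "sport": [-1, 80, 443, 80],
    "dur": [0.2, 5.0, 1.5, 0.7],
    "rate": [10.0, 0.5, 2.0, 4.0],
    "attack": [0, 1, 1, 0],
})

# One-hot encode the categorical columns into dummy variables.
df = pd.get_dummies(df, columns=["state_number", "sport"])

# Scale the numerical features (but not the class label) to the [0, 1] range.
numeric_cols = ["dur", "rate"]
df[numeric_cols] = MinMaxScaler().fit_transform(df[numeric_cols])
print(df.head())
```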

Data sampling

RUS is a technique for dealing with class imbalance in which members of the majority class are randomly discarded until the ratio of minority to majority class instances reaches a desired level. If an experiment involved RUS, we applied it to the training data only. The minority-to-majority class ratios we used RUS to obtain are those from [50], which the authors report yield strong results. These ratios are 1:1, 1:3, and 1:9. However, given the initial minority-to-majority class ratio in the Information Theft dataset (1587:9543), it was only possible to apply RUS for the 1:1 and 1:3 ratios.
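
A minimal sketch of RUS to a target minority-to-majority ratio is shown below, using imbalanced-learn's RandomUnderSampler as one possible realization; the toy data and random seed are illustrative.

```python
import numpy as np
from imblearn.under_sampling import RandomUnderSampler

# Toy imbalanced training data: 90 majority (0) and 10 minority (1) instances.
X_train = np.random.rand(100, 5)
y_train = np.array([0] * 90 + [1] * 10)

# sampling_strategy is the desired minority-to-majority ratio after sampling,
# e.g. 1/3 for the 1:3 ratio; only the training split is resampled.
rus = RandomUnderSampler(sampling_strategy=1/3, random_state=42)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print(np.bincount(y_rus))  # -> [30 10]
```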

Ensemble feature selection

A further preprocessing step we employed for some experiments is our ensemble FST. The FST is a two-step process. For the first step, we employed three filter-based feature ranking techniques. They are the information gain [51], information gain ratio [52], and Chi squared (Chi 2) [53] feature ranking techniques. We also ranked features according to four supervised learning-based feature ranking techniques. Feature importance lists from RF, CatBoost, XGBoost, and LightGBM served as the basis for the supervised feature ranking techniques.

We then took the 20 highest-ranked attributes from each ranking. We decided to use 20 features based on the results of previous studies [54]. Next, we searched for features occurring in a set number of the 7 rankings, where that number ranges from 4 to 7. Put another way, since we have 7 rankings, we require a majority of rankings to agree that a feature is among the 20 most important features in order to select it. This yielded 4 sets of selected features, which we called the 4 Agree, 5 Agree, 6 Agree, and 7 Agree datasets. The tables in Appendix B show the features selected by the supervised feature ranking techniques, the filter-based feature ranking techniques, and the 4, 5, 6, and 7 Agree techniques.
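
The agreement step can be sketched as follows, assuming each ranking technique has already produced an ordered list of feature names; the helper function agree_select and the toy rankings are hypothetical, shown only to illustrate the majority-agreement rule.

```python
from collections import Counter

def agree_select(rankings, top_n=20, min_agree=6):
    """Select features that appear in at least `min_agree` of the top-`top_n`
    lists produced by the individual ranking techniques."""
    counts = Counter()
    for ranking in rankings:           # each ranking: features ordered best-first
        counts.update(ranking[:top_n])
    return sorted(f for f, c in counts.items() if c >= min_agree)

# Toy example with three rankings and a 2-of-3 agreement threshold.
rankings = [["dur", "rate", "srate"], ["rate", "dur", "drate"], ["rate", "mean", "dur"]]
print(agree_select(rankings, top_n=3, min_agree=2))  # -> ['dur', 'rate']
```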

Classifiers and performance metrics

This study involves four ensemble classifiers (CatBoost, Light GBM, XGBoost, and RF) and four non-ensemble classifiers (DT, LR, NB, and an MLP). These classifiers belong to various families of machine learning algorithms and are widely considered to be reliable [55]. CatBoost, Light GBM, and XGBoost were implemented with their respective self-named Python libraries, while the other classifiers were implemented with Scikit-learn (Footnote 3).

CatBoost, Light GBM, and XGBoost are gradient-boosted decision trees (GBDTs) [56], which are ensembles of sequentially trained trees. An ensemble classifier combines weak algorithms, or instances of an algorithm, into a strong learner. CatBoost relies on ordered boosting to order the instances used for fitting DTs. Light GBM is characterized by gradient-based one-side sampling and exclusive feature bundling. One-side sampling disregards a significant fraction of instances with small gradients, and exclusive feature bundling groups mutually exclusive features to reduce the variable count. XGBoost utilizes a sparsity-aware algorithm and a weighted quantile sketch. Sparsity is characterized by zero or missing values, and a weighted quantile sketch uses approximate tree learning [57] to support pruning and merging tasks. RF is also an ensemble of DTs, but unlike the GBDTs, it uses a bagging technique [58].

DT is a non-parametric approach that offers a simplistic tree-like representation of observed data. Each internal node represents a test on a specified feature, and each branch represents the outcome of a particular test. Each leaf node represents a class label. LR uses a sigmoidal function to generate probability values. Predictions for class membership are centered on a specified probability threshold. NB uses conditional probability to determine class membership. This classifier is considered “naive” because it operates on the assumption that features are independent of each other. An MLP is a type of artificial neural network with fully connected nodes. It utilizes a non-linear activation function and contains an input layer, one or more hidden layers and an output layer.
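
For illustration, the eight learners can be instantiated with default hyperparameters as in the following sketch; the choice of GaussianNB as the NB variant and the max_iter setting for LR are our assumptions rather than details reported above.

```python
from catboost import CatBoostClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

# The four ensemble and four non-ensemble learners used in this study.
classifiers = {
    "CatBoost": CatBoostClassifier(verbose=0),
    "Light GBM": LGBMClassifier(),
    "XGBoost": XGBClassifier(),
    "RF": RandomForestClassifier(),
    "DT": DecisionTreeClassifier(),
    "LR": LogisticRegression(max_iter=1000),  # max_iter raised for convergence
    "NB": GaussianNB(),
    "MLP": MLPClassifier(),
}
```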

In our work, we use two performance metrics (AUC and AUPRC) to provide a more complete assessment of classifier performance. AUC is the area under the receiver operating characteristic (ROC) curve, which is a plot of true positive rate (TPR) versus false positive rate (FPR). This metric incorporates all classification thresholds represented by the curve [59] and is essentially a summary of overall model performance. AUPRC is the area under the precision-recall curve. This metric shows the trade-off between precision and recall across classification thresholds [60].
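
Both metrics can be computed from predicted class probabilities, for example with Scikit-learn, where average_precision_score serves as an estimate of AUPRC; the labels and scores below are toy values.

```python
from sklearn.metrics import roc_auc_score, average_precision_score

# y_true holds 0/1 labels; y_score holds predicted probabilities of the attack class.
y_true = [0, 0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8, 0.7]

auc = roc_auc_score(y_true, y_score)              # area under the ROC curve
auprc = average_precision_score(y_true, y_score)  # area under the precision-recall curve
print(round(auc, 5), round(auprc, 5))
```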

Parameters and cross validation

Before training our classifiers, we used either default or tuned hyperparameter values to control the learning process. Hyperparameter tuning was performed with RandomizedSearchCV (Footnote 4), a module of Scikit-learn. We list the tuned parameters in Appendix C.
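
A sketch of this tuning step is shown below; the classifier, search space, scoring metric, and number of iterations are illustrative and do not reproduce the ranges listed in the appendix.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Illustrative search space only; the actual tuned ranges are in the appendix.
param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(3, 20),
}

search = RandomizedSearchCV(
    RandomForestClassifier(),
    param_distributions=param_distributions,
    n_iter=10,
    scoring="roc_auc",
    cv=5,
    random_state=42,
)

# Toy imbalanced data standing in for a training split.
X_train, y_train = make_classification(n_samples=200, weights=[0.9], random_state=42)
search.fit(X_train, y_train)
print(search.best_params_)
```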

During training, we implemented ten iterations of stratified five-fold cross-validation, which means 50 performance scores are obtained per classifier for each metric. In k-fold cross-validation, every instance is placed in a validation set once and in a training set k-1 times; for five-fold cross-validation, every instance is therefore in a validation set once and in a training set four times. The stratified component of cross-validation aims to ensure that each class is equally represented across each fold [61]. We randomly shuffle instances before cross-validation, and certain algorithms, such as LR, may yield different results when the order of instances changes [62]. One way of addressing the undesirable effect of this randomness is to perform several iterations, as we have done [63]. Note that each AUC or AUPRC value shown in the classification performance tables is an average of the 50 performance scores.
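
The evaluation loop can be sketched with Scikit-learn's RepeatedStratifiedKFold, which yields the 50 train/validation splits described above; the toy data and classifier are illustrative, and the per-fold RUS step is omitted for brevity.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import RepeatedStratifiedKFold

X, y = make_classification(n_samples=500, weights=[0.85], random_state=0)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)  # 50 folds total
auc_scores, auprc_scores = [], []

for train_idx, test_idx in cv.split(X, y):
    model = RandomForestClassifier(random_state=0).fit(X[train_idx], y[train_idx])
    proba = model.predict_proba(X[test_idx])[:, 1]     # attack-class probabilities
    auc_scores.append(roc_auc_score(y[test_idx], proba))
    auprc_scores.append(average_precision_score(y[test_idx], proba))

# Each value reported in the results tables is the mean of 50 such scores.
print(np.mean(auc_scores), np.std(auc_scores))
print(np.mean(auprc_scores), np.std(auprc_scores))
```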

High-level methodology overview

For each of the attack types (information theft, data exfiltration, and keylogging), we divided our experiments into those that did not use ensemble FSTs and those that used them. For the tasks that did not involve ensemble FSTs, we first performed experimentation where we did not use hyperparameter tuning or data sampling. Second, we performed experimentation where we applied data sampling. Third, we performed experimentation where we applied both data sampling and hyperparameter tuning. For the tasks that involved ensemble FSTs, feature selection was a necessary component for all three experimentation steps.

Results and discussion

In this section, we present results for classification performance, with and without the application of FSTs, and also present statistical analyses. Results are compartmentalized by attack type (information theft, data exfiltration, and keylogging).

Experiment names indicate the classifier used, whether we apply RUS (and in what ratio), whether we apply hyperparameter tuning, and which ensemble FST, if any, we apply. For example, the experiment name “MLP Tuned RUS 1:3 6 Agree” indicates we use the MLP classifier, with hyperparameters tuned by RandomizedSearchCV, with RUS to a 1:3 ratio, and the 6 Agree FST. If there is no mention of RUS or a ratio in the experiment name, we did not apply RUS.

Information theft

Information theft without ensemble feature selection

Table 2 contains results for experiments with default hyperparameters and no RUS. For AUC, the highest score of 0.99703 was obtained by CatBoost. For AUPRC, the highest score of 0.99984 was obtained by CatBoost.

Table 3 contains results for experiments with default hyperparameters after RUS is applied. For AUC, the highest score of 0.99842 was obtained by CatBoost with a balanced class ratio of 1:1. For AUPRC, the highest score of 0.99983 was obtained by CatBoost with a balanced class ratio of 1:1.

Table 4 contains results for experiments after RUS and hyperparameter tuning are applied. For AUC, CatBoost with a balanced class ratio of 1:1 obtained the highest score of 0.99803. For AUPRC, CatBoost with a minority-to-majority class ratio of 1:3 obtained the highest score of 0.99980.

Table 2 Classification results for the information theft dataset: no sampling, default hyperparameter values; means and standard deviations of AUC and AUPRC (10 iterations of 5-fold cross-validation)
Table 3 Classification results for the information theft dataset: with sampling, default hyperparameters; means and standard deviations of AUC and AUPRC (10 iterations of 5-fold cross-validation)
Table 4 Classification results for the information theft dataset: with sampling, tuned hyperparameters; means and standard deviations of AUC and AUPRC (10 iterations of 5-fold cross-validation)

Information theft with ensemble feature selection

Due to the number of experiments performed, Tables 5 and 6 only report the best performance by classifier, over all combinations of RUS, hyperparameter tuning, and ensemble FSTs. Table 5 shows AUC scores obtained. CatBoost, with a class ratio of 1:3 and 5 Agree FST, yielded the highest score in this table of 0.99838. Table 6 shows AUPRC scores obtained. CatBoost with a 7 Agree FST and no RUS produced the highest score in this table of 0.99986.

Table 5 Maximum AUC by classifier for the information theft attack type; mean and standard deviations of AUC, (10 iterations of 5-fold cross-validation)
Table 6 Maximum AUPRC by classifier for the information theft attack type; mean and standard deviations of AUPRC, (10 iterations of 5-fold cross-validation)

Information theft statistical analysis

To understand the statistical significance of the classification performance scores, we run analysis of variance (ANOVA) tests. ANOVA establishes whether there is a significant difference between group means [64]. A 99% (\(\alpha\) = 0.01) confidence level is used for our ANOVA tests. The results are shown in Tables 7 and 8, where Df is the degrees of freedom, Sum Sq is the sum of squares, Mean Sq is the mean sum of squares, F value is the F-statistic, and Pr(>F) is the p-value.

The first ANOVA test we run for Information Theft evaluates the impact of factors on performance in terms of AUC. As shown in Table 7, the p-value associated with every factor is practically 0. Therefore, we conclude all factors have a significant impact on performance in terms of AUC.

Table 7 ANOVA for classifier, RUS, hyperparameters and FST as factors of performance in terms of AUC

The second ANOVA test we run for information theft evaluates the impact of factors on performance in terms of AUPRC. As shown in Table 8, the p-value associated with every factor is practically 0. Therefore, we conclude all factors have a significant impact on performance in terms of AUPRC.

Table 8 ANOVA for classifier, RUS, hyperparameters and FST as factors of performance in terms of AUPRC

Since all factors significantly impact performance for both AUC and AUPRC, Tukey’s Honestly Significant Difference (HSD) tests [65] are performed to determine which groups are significantly different from each other. Letter groups assigned via the Tukey method indicate similarity or significant differences in performance results within a factor.
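
As an illustration, the two tests can be reproduced in Python with statsmodels, assuming the cross-validation scores are gathered in a long-format table with one column per factor; the toy values below are not our experimental results.

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Toy long-format results table: one row per cross-validation score.
results = pd.DataFrame({
    "auc": [0.95, 0.96, 0.99, 0.98, 0.97, 0.96, 0.99, 0.98],
    "classifier": ["LR", "LR", "CatBoost", "CatBoost"] * 2,
    "fst": ["4 Agree", "6 Agree"] * 4,
})

# ANOVA with classifier and FST as factors (the paper uses alpha = 0.01).
model = ols("auc ~ C(classifier) + C(fst)", data=results).fit()
print(anova_lm(model))

# Tukey's HSD groups the levels of a factor by significant differences in means.
print(pairwise_tukeyhsd(results["auc"], results["classifier"], alpha=0.01))
```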

With regard to AUC, we first apply the HSD test to the group classifier factor, with the results shown in Table 9. In this table, the highest ranked classifiers are the GBDT classifiers, followed by RF. Next, we cover the HSD test for the RUS factor. The results of this test, as shown in Table 10, reveal that undersampling to either the 1:1 or 1:3 minority-to-majority class ratio yields similar performance, which is better than not using undersampling. For the hyperparameter tuning factor, the results of the HSD test provided in Table 11 indicate that hyperparameter tuning is generally better than the default hyperparameter values. Table 12 shows the HSD test results for the FST factor. Here, we find the only FST that has significantly worse performance than the others is the 7 Agree. Otherwise, using all features or any other FST yields similar results. Since the 6 Agree FST yields the smallest number of features in group ‘a’, we recommend using the features that the 6 Agree FST selects. Classifiers usually train faster on datasets with comparatively fewer features.

Table 9 HSD test groupings after ANOVA of AUC for the Classifier factor
Table 10 HSD test groupings after ANOVA of AUC for the RUS factor
Table 11 HSD test groupings after ANOVA of AUC for the hyperparameters factor
Table 12 HSD test groupings after ANOVA of AUC for the FST factor

In terms of AUPRC, the HSD test results for the group classifier factor, as shown in Table 13, indicate that the GBDT classifiers yield the best performance. However, unlike the case for performance in terms of AUC, the performance of RF, which is not a GBDT classifier, is not significantly lower than that of the GBDT classifiers. The HSD test results for the RUS factor are provided in Table 14. It turns out that not applying RUS is the best choice. As shown in Table 15, the HSD test results for the hyperparameter tuning factor indicate that default hyperparameter values are the better choice. Finally, Table 16 provides HSD test results for the FST factor. These results are similar to those we see for performance in terms of AUC. Only the 7 Agree FST yields lower performance than the other levels of FST. Therefore, the 6 Agree FST is preferred, since it has fewer features than the 4 Agree and 5 Agree FSTs.

Table 13 HSD test groupings after ANOVA of AUPRC for the Classifier factor
Table 14 HSD test groupings after ANOVA of AUPRC for the RUS factor
Table 15 HSD test groupings after ANOVA of AUPRC for the hyperparameters factor
Table 16 HSD test groupings after ANOVA of AUPRC for the FST factor

Information theft conclusion

The GBDT classifiers, and in some cases RF, yield similar performance. HSD test results indicate their performance is better than that of the other classifiers. While the ANOVA tests reveal that hyperparameter tuning is a significant factor, we see that hyperparameter tuning yields better performance in terms of AUC, but worse performance in terms of AUPRC. Therefore, we conclude that hyperparameter tuning is only necessary for optimizing performance in terms of AUC. For the case of performance in terms of AUC, undersampling has a positive impact on performance. However, for performance in terms of AUPRC, we see that the best strategy for RUS is to not use it. We find there is no impact on performance in terms of AUC or AUPRC when we apply the 4, 5, or 6 Agree FSTs, since the HSD test ranks these in the best performing group, along with the technique where we use all features. Furthermore, these techniques all outperform the 7 Agree FST. Since using fewer features yields performance equivalent to using all features, we conclude that we should use the 6 Agree FST in future work. The 6 Agree FST yields the smallest number of features, and model training is usually faster with fewer features.

Data exfiltration

Data exfiltration without ensemble feature selection

Table 17 contains results for experiments with default hyperparameters and no RUS. For AUC, the highest score of 0.98126 was obtained by CatBoost. For AUPRC, the highest score of 0.99446 was obtained by CatBoost.

Table 18 contains results for experiments with default hyperparameters after RUS is applied. For AUC, the highest score of 0.99011 was obtained by RF with a balanced class ratio of 1:1. For AUPRC, the highest score of 0.99148 was obtained by CatBoost with a minority-to-majority class ratio of 1:9.

Table 19 contains results for experiments after RUS and hyperparameter tuning are applied. For AUC, CatBoost with a minority-to-majority class ratio of 1:9 obtained the highest score of 0.99281. For AUPRC, CatBoost with a minority-to-majority class ratio of 1:9 obtained the highest score of 0.99292.

Table 17 Classification results for the data exfiltration dataset: no sampling, default hyperparameter values; means and standard deviations of AUC and AUPRC (10 iterations of 5-fold cross-validation)
Table 18 Classification results for the data exfiltration dataset: with sampling, default hyperparameters; means and standard deviations of AUC and AUPRC (10 iterations of 5-fold cross-validation)
Table 19 Classification results for the data exfiltration dataset: with sampling, tuned hyperparameters; means and standard deviations of AUC and AUPRC (10 iterations of 5-fold cross-validation)

Data exfiltration with ensemble feature selection

Due to the number of experiments performed, Tables 20 and 21 only report the best performance by classifier, over all combinations of RUS, hyperparameter tuning, and ensemble FSTs. Table 20 shows AUC scores obtained. CatBoost, with a minority-to-majority class ratio of 1:9 and 4 Agree FST, yielded the highest score in this table of 0.99076. Table 21 shows AUPRC scores obtained. CatBoost with a 5 Agree FST and no RUS produced the highest score in this table of 0.99448.

Table 20 Maximum AUC by classifier for the data exfiltration attack type; mean and standard deviations of AUC, (10 iterations of 5-fold cross-validation)
Table 21 Maximum AUPRC by classifier for the data exfiltration attack type; mean and standard deviations of AUPRC, (10 iterations of 5-fold cross-validation)

Data exfiltration statistical analysis

As done previously for the Information Theft statistical analysis, we perform ANOVA tests for the Data Exfiltration dataset. The results are shown in Tables 22 and 23, where Df is the degrees of freedom, Sum Sq is the sum of squares, Mean Sq is the mean sum of squares, F value is the F-statistic, and Pr(>F) is the p-value.

The first ANOVA test we run for Data Exfiltration evaluates the impact of factors on performance in terms of AUC. As shown in Table 22, the p-value associated with every factor is practically 0. Therefore, we conclude all factors have a significant impact on performance in terms of AUC.

Table 22 ANOVA for classifier, RUS, hyperparameters and FST as factors of performance in terms of AUC

The second ANOVA test we run for data exfiltration evaluates the impact of factors on performance in terms of AUPRC. As shown in Table 23, the p-value associated with every factor is practically 0. Therefore, we conclude all factors have a significant impact on performance in terms of AUPRC.

Table 23 ANOVA for classifier, RUS, hyperparameters and FST as factors of performance in terms of AUPRC

With regard to AUC, we begin with the HSD test applied to the group classifier factor, with the results shown in Table 24. In this table, the GBDT classifiers and RF show the best performance. However, CatBoost is ranked above all others. Next, we address the HSD test for the RUS factor. The results of this test, as shown in Table 25, indicate that the 1:1 or 1:3 class ratios yield the best performance. For the hyperparameter tuning factor, the results of the HSD test provided in Table 26 indicate that hyperparameter tuning is a better alternative than the default hyperparameter values. Table 27 shows the HSD test results for the FST factor. Here, we find the 7 Agree FST yields better performance than any other FST. This is an ideal result, since the less complex model with fewer features outperforms the models that have more features.

Table 24 HSD test groupings after ANOVA of AUC for the classifier factor
Table 25 HSD test groupings after ANOVA of AUC for the RUS factor
Table 26 HSD test groupings after ANOVA of AUC for the hyperparameters factor
Table 27 HSD test groupings after ANOVA of AUC for the FST factor

In terms of AUPRC, the HSD test results for the group classifier factor, as shown in Table 28, indicate that the GBDT classifiers and RF yield the best performance. The HSD test results for the RUS factor are provided in Table 29. It turns out that not applying RUS is the best choice. As shown in Table 30, the HSD test results for the hyperparameter tuning factor indicate that hyperparameter tuning is a better alternative than the default hyperparameter values. Finally, Table 31 provides HSD test results for the FST factor. These results indicate that the 6 Agree technique performs similarly to using all features; since it yields fewer features than the 4 Agree and 5 Agree techniques, we prefer the 6 Agree.

Table 28 HSD test groupings after ANOVA of AUPRC for the classifier factor
Table 29 HSD test groupings after ANOVA of AUPRC for the RUS factor
Table 30 HSD test groupings after ANOVA of AUPRC for the hyperparameters factor
Table 31 HSD test groupings after ANOVA of AUPRC for the FST factor

Data exfiltration conclusion

Similar to the results for classifying the Information Theft attack type, the GBDT classifiers yield the best performance, along with RF. For classifying the data exfiltration attack type, we find that results for AUC and AUPRC with hyperparameter tuning are higher than those obtained with default hyperparameter values. As in the case of classifying the Information Theft dataset, there are mixed results for the application of RUS. For performance in terms of AUC, applying RUS to the data before training improves performance. However, for performance in terms of AUPRC, not using RUS yields better results. For classifying data exfiltration attack type data, if AUC is the more important metric, then the best FST for classification performance is the 7 Agree FST. However, if AUPRC is the more important metric, then the 6 Agree FST yields a dataset with the fewest features while yielding performance similar to using all features.

Keylogging

Keylogging without ensemble feature selection

Table 32 contains results for experiments with default hyperparameters and no RUS. For AUC, the highest score of 0.99643 was obtained by Light GBM. For AUPRC, the highest score of 0.99987 was obtained by CatBoost.

Table 33 contains results for experiments with default hyperparameters after RUS is applied. For AUC, the highest score of 0.99796 was obtained by CatBoost with a balanced class ratio of 1:1. For AUPRC, the highest score of 0.99981 was obtained by CatBoost with a balanced class ratio of 1:1.

Table 34 contains results for experiments after RUS and hyperparameter tuning are applied. For AUC, CatBoost with a balanced class ratio of 1:1 obtained the highest score of 0.99749. For AUPRC, CatBoost with a balanced class ratio of 1:1 also obtained the highest score of 0.99971.

Table 32 Classification results for the keylogging dataset: no sampling, default hyperparameter values; means and standard deviations of AUC and AUPRC (10 iterations of 5-fold cross-validation)
Table 33 Classification results for the keylogging dataset: with sampling, default hyperparameters; means and standard deviations of AUC and AUPRC (10 iterations of 5-fold cross-validation)
Table 34 Classification results for the keylogging dataset: with sampling, tuned hyperparameters; means and standard deviations of AUC and AUPRC (10 iterations of 5-fold cross-validation)

Keylogging with ensemble feature selection

Due to the number of experiments performed, Tables 35 and 36 only report the best performance by classifier, over all combinations of RUS, hyperparameter tuning, and ensemble FSTs. Table 35 shows AUC scores obtained. CatBoost with a 6 Agree FST and a balanced class ratio of 1:1 yielded the highest score in this table of 0.99807. Table 36 shows AUPRC scores obtained. CatBoost with a 6 Agree FST and no RUS produced the highest score in this table of 0.99988.

Table 35 Maximum AUC by classifier for the keylogging attack type; mean and standard deviations of AUC, (10 iterations of 5-fold cross-validation)
Table 36 Maximum AUPRC by classifier for the keylogging attack type; mean and standard deviations of AUPRC, (10 iterations of 5-fold cross-validation)

Keylogging statistical analysis

As done previously for the Information Theft and data exfiltration datasets, we perform ANOVA tests for the Keylogging dataset. The results are shown in Tables 37 and 38, where Df is the degrees of freedom, Sum Sq is the sum of squares, Mean Sq is the mean sum of squares, F value is the F-statistic, and Pr(>F) is the p-value.

The first ANOVA test we run for keylogging evaluates the impact of factors on performance in terms of AUC. As shown in Table 37, the p-value associated with every factor is practically 0. Therefore, we conclude all factors have a significant impact on performance in terms of AUC.

Table 37 ANOVA for classifier, RUS, hyperparameters and FST as factors of performance in terms of AUC

The second ANOVA test we run for keylogging evaluates the impact of factors on performance in terms of AUPRC. As shown in Table 38, the p-value associated with every factor, except hyperparameter tuning (0.8137), is practically 0. Therefore, we conclude that only hyperparameter tuning does not have a significant impact on performance in terms of AUPRC.

Table 38 ANOVA for classifier, RUS, hyperparameters and FST as factors of performance in terms of AUPRC

With regard to AUC, we begin with the HSD test applied to the group classifier factor, with the results shown in Table 39. In this table, the GBDT classifiers and RF show the best performance. Next, we cover the HSD test for the RUS factor. The results of this test, as shown in Table 40, indicate that the 1:1 or 1:3 class ratios yield the best performance. For the hyperparameter tuning factor, the results of the HSD test provided in Table 41 indicate that hyperparameter tuning is a better alternative than the default hyperparameter values. Table 42 shows the HSD test results for the FST factor. Here, we see that using the dataset with a larger number of features yields the best result.

Table 39 HSD test groupings after ANOVA of AUC for the classifier factor
Table 40 HSD test groupings after ANOVA of AUC for the RUS factor
Table 41 HSD test groupings after ANOVA of AUC for the hyperparameters factor
Table 42 HSD test groupings after ANOVA of AUC for the FST factor

In terms of AUPRC, the HSD test results for the group classifier factor, as shown in Table 43, indicate that the GBDT classifiers and RF yield the best performance. The HSD test results for the RUS factor are provided in Table 44. It turns out that not doing RUS is the best choice. Finally, Table 45 provides HSD test results for the FST factor. We prefer the 6 Agree FST since it yields performance similar to using all features, and it has the fewest features.

Table 43 HSD test groupings after ANOVA of AUPRC for the classifier factor
Table 44 HSD test groupings after ANOVA of AUPRC for the RUS factor
Table 45 HSD test groupings after ANOVA of AUPRC for the FST factor

Keylogging conclusion

As with other attack types, we see the GBDT classifiers yield the best performance. The HSD tests for the influence of hyperparameter tuning on results in terms of AUC show that hyperparameter tuning yields better performance. However, for performance in terms of AUPRC, hyperparameter tuning is not significant. Similar to what we observe with results for classifying data exfiltration attack types, we see that if performance in terms of AUC is most important, then data sampling yields better results. However, for results in terms of AUPRC, we find that not applying data sampling gives better results. For performance in terms of AUPRC, we obtain similar results when using the 4 Agree, 5 Agree, 6 Agree FSTs or using all features, since all are in the HSD group ‘a’. However, for performance in terms of AUC, the 4 Agree and 5 Agree FSTs are in category ‘ab’, while All Features is in category ‘a’. This indicates that for AUC, the ensemble FSTs slightly impact performance.

Conclusion

The Bot-IoT dataset is geared toward the training of classifiers for the identification of malicious traffic in IoT networks. In this study, we examine the effect of ensemble FSTs on classification performance for information theft attack types. An ensemble feature selection approach is usually more efficient than its individual FSTs.

To accomplish our research goal, we investigate three datasets: one composed of normal and information theft attack instances, another composed of normal and data exfiltration attack instances, and a third composed of normal and keylogging attack instances. In general, we observed that our ensemble FSTs do not affect classification performance scores. However, our technique is useful because feature reduction lessens the computational burden and provides clarity.

Future work will assess other classifiers trained on the identical datasets used in our study. There is also an opportunity to evaluate classifier performance, with respect to information theft detection, on other IoT intrusion detection datasets.

Availability of data and materials

Not applicable.

Notes

  1. https://developer.nvidia.com/cudnn.

  2. https://contrib.scikit-learn.org/category_encoders/catboost.html.

  3. https://scikit-learn.org/stable/.

  4. https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html.

  5. https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html.

  6. https://xgboost.readthedocs.io/en/latest/parameter.html.

  7. https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html.

  8. https://lightgbm.readthedocs.io/en/latest/Parameters.html.

Abbreviations

ANOVA:

Analysis of variance

ARM:

Association rule mining

ARP:

Address resolution protocol

AUC:

Area under the receiver operating characteristic curve

AUPRC:

Area under the precision-recall curve

CNN:

Convolutional neural network

CSV:

Comma-separated values

cuDNNLSTM:

CUDA deep neural network LSTM

CV:

Cross-validation

DDoS:

Distributed denial-of-service

DNN:

Deep neural network

DNN-GRU:

Deep neural network-gated recurrent unit

DoS:

Denial-of-service

DT:

Decision tree

ENN:

Edited nearest neighbor

FAU:

Florida Atlantic University

FN:

False negative

FNR:

False negative rate

FP:

False positive

FPR:

False positive rate

FST:

Feature selection technique

GBDT:

Gradient-boosted decision tree

GM:

Geometric mean

HSD:

Honestly significant difference

HTTP:

Hypertext transfer protocol

ICA:

Independent component analysis

ICMP:

Internet control message protocol

IP:

Internet protocol

IoT:

Internet of Things

k-NN:

k-Nearest neighbor

LR:

Logistic regression

LSTM:

Long short-term memory

LSTM-GRU:

Long short-term memory-gated recurrent unit

MLP:

Multi-layer perceptron

MSE:

Mean square error

NB:

Naive Bayes

NL:

Noise level

NSF:

National Science Foundation

OS:

Operating system

PC:

Principal component

PCAP:

Packet capture

PCA:

Principal component analysis

PSO:

Particle swarm optimization

RF:

Random forest

RNN:

Recurrent neural network

ROC:

Receiver operating characteristic

RUS:

Random undersampling

SMOTE:

Synthetic minority oversampling technique

SVM:

Support vector machine

TCP:

Transmission control protocol

TN:

True negative

TNR:

True negative rate

TP:

True positive

TPR:

True positive rate

ULB:

Université Libre de Bruxelles

UDP:

User datagram protocol

UNSW:

University of New South Wales

References

  1. Leevy JL, Khoshgoftaar TM, Peterson JM. Mitigating class imbalance for iot network intrusion detection: a survey. In: 2021 IEEE seventh international conference on big data computing service and applications (BigDataService). IEEE; 2021. 143–148.

  2. Koroniotis N, Moustafa N, Sitnikova E, Turnbull B. Towards the development of realistic botnet dataset in the internet of things for network forensic analytics: Bot-iot dataset. Future Gener Comput Syst. 2019;100:779–96.

  3. Argus: Argus. https://openargus.org/.

  4. Fu Y, Husain B, Brooks RR. Analysis of botnet counter-counter-measures. In: Proceedings of the 10th annual cyber and information security research conference, 2015;1–4.

  5. Ullah F, Edwards M, Ramdhany R, Chitchyan R, Babar MA, Rashid A. Data exfiltration: A review of external attack vectors and countermeasures. Journal of Network and Computer Applications. 2018;101:18–54.

  6. Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing high-class imbalance in big data. J Big Data. 2018;5(1):42.

  7. Hancock JT, Khoshgoftaar TM. Catboost for big data: an interdisciplinary review. J Big Data. 2020;7(1):1–45.

  8. Leevy JL, Hancock J, Khoshgoftaar TM, Seliya N. Iot reconnaissance attack classification with random undersampling and ensemble feature selection. In: 2021 IEEE 7th international conference on collaboration and internet computing (CIC). IEEE; 2021.

  9. Hancock J, Khoshgoftaar TM. Medicare fraud detection using catboost. In: 2020 IEEE 21st international conference on information reuse and integration for data science (IRI). IEEE; 2020. 97–103.

  10. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.

  11. Zuech R, Hancock J, Khoshgoftaar TM. Investigating rarity in web attacks with ensemble learners. J Big Data. 2021;8(1):1–27.

  12. Rymarczyk T, Kozłowski E, Kłosowski G, Niderla K. Logistic regression for machine learning in process tomography. Sensors. 2019;19(15):3400.

  13. Saritas MM, Yasar A. Performance analysis of ann and naive bayes classification algorithm for data classification. Int J Intell Syst Appl Eng. 2019;7(2):88–91.

  14. Rynkiewicz J. Asymptotic statistics for multilayer perceptron with relu hidden units. Neurocomputing. 2019;342:16–23.

  15. Wang H, Khoshgoftaar TM, Napolitano A. A comparative study of ensemble feature selection techniques for software defect prediction. In: 2010 Ninth international conference on machine learning and applications. IEEE; 2010. 135–140.

  16. Najafabadi MM, Khoshgoftaar TM, Seliya N. Evaluating feature selection methods for network intrusion detection with kyoto data. Int J Reliabil Qual Saf Eng. 2016;23(01):1650001.

  17. VMware: What is ESXi?: Bare Metal Hypervisor: Esx. https://www.vmware.com/products/esxi-and-esx.html.

  18. Ostinato: Ostinato Traffic Generator for Network Engineers. https://ostinato.org/.

  19. Foundation TO. Node-RED: Low-code programming for event-driven applications. https://nodered.org/.

  20. OffSec: Kali Docs: Kali Linux documentation. https://www.kali.org/.

  21. Canonical: enterprise open source and Linux. https://ubuntu.com/.

  22. MQTT.org: MQTT—the standard for IoT messaging. https://mqtt.org/.

  23. Foundation E. Eclipse mosquitto. https://mosquitto.org/.

  24. Canonical: Ubuntu Phone Documentation. https://phone.docs.ubuntu.com/en/devices/.

  25. Rapid7: Download metasploitable—intentionally vulnerable machine. https://information.rapid7.com/download-metasploitable-2017.html.

  26. Metasploit R. Penetration testing, software, pen testing security. https://www.metasploit.com/.

  27. pfSense: learn about the pfSense Project. https://www.pfsense.org/.

  28. Tcpdump: TCPDUMP/LIBPCAP public repository. https://www.tcpdump.org/.

  29. Koroniotis N, Moustafa N, Sitnikova E. A new network forensic framework based on deep learning for internet of things networks: a particle deep framework. Future Gener Comput Syst. 2020;110:91–106.

  30. Amaizu GC, Nwakanma CI, Lee J-M, Kim D-S. Investigating network intrusion detection datasets using machine learning. In: 2020 International conference on information and communication technology convergence (ICTC). IEEE; 2020.1325–1328.

  31. Malik AJ, Khan FA. A hybrid technique using binary particle swarm optimization and decision tree pruning for network intrusion detection. Cluster Comput. 2018;21(1):667–80.

  32. De Cock M, Dowsley R, Nascimento AC, Railsback D, Shen J, Todoki A. High performance logistic regression for privacy-preserving genome analysis. BMC Med Genomics. 2021;14(1):1–18.

  33. Ceddia G, Martino LN, Parodi A, Secchi P, Campaner S, Masseroli M. Association rule mining to identify transcription factor interactions in genomic regions. Bioinformatics. 2020;36(4):1007–13.

  34. Ahmad I, Basheri M, Iqbal MJ, Rahim A. Performance comparison of support vector machine, random forest, and extreme learning machine for intrusion detection. IEEE Access. 2018;6:33789–95.

  35. Ferrag MA, Maglaras L, Moschoyiannis S, Janicke H. Deep learning for cyber security intrusion detection: approaches, datasets, and comparative study. J Inf Secur Appl. 2020;50:102419.

  36. Lin P, Ye K, Xu C-Z. Dynamic network anomaly detection system by using deep learning techniques. In: International conference on cloud computing. Springer; 2019. 161–176.

  37. Kaur G, Lashkari AH, Rahali A. Intrusion traffic detection and characterization using deep image learning. In: 2020 IEEE Intl Conf on Dependable, Autonomic and Secure Computing, Intl Conf on Pervasive Intelligence and Computing, Intl Conf on Cloud and Big Data Computing, Intl Conf on Cyber Science and Technology Congress (DASC/PiCom/CBDCom/CyberSciTech). IEEE; 2020. 55–62.

  38. Liaqat S, Akhunzada A, Shaikh FS, Giannetsos A, Jan MA. Sdn orchestration to combat evolving cyber threats in internet of medical things (iomt). Comput Commun. 2020;160:697–705.

  39. Nakayama S, Arai S. DNN-LSTM-CRF model for automatic audio chord recognition. In: Proceedings of the international conference on pattern recognition and artificial intelligence; 2018. 82–88.

  40. Santos MS, Soares JP, Abreu PH, Araujo H, Santos J. Cross-validation for imbalanced datasets: avoiding overoptimistic and overfitting approaches [research frontier]. IEEE Comput Intell Mag. 2018;13(4):59–76.

  41. Mulyanto M, Faisal M, Prakosa SW, Leu J-S. Effectiveness of focal loss for minority classification in network intrusion detection systems. Symmetry. 2021;13(1):4.

  42. Nemoto K, Hamaguchi R, Imaizumi T, Hikosaka S. Classification of rare building change using CNN with multi-class focal loss. In: IGARSS 2018-2018 IEEE international geoscience and remote sensing symposium. IEEE; 2018. 4663–4666.

  43. Ho Y, Wookey S. The real-world-weight cross-entropy loss function: modeling the costs of mislabeling. IEEE Access. 2019;8:4806–13.

  44. Dhanabal L, Shantharajah S. A study on NSL-KDD dataset for intrusion detection system based on classification algorithms. Int J Adv Res Comput Commun Eng. 2015;4(6):446–52.

  45. Shamsudin H, Yusof UK, Jayalakshmi A, Khalid MNA. Combining oversampling and undersampling techniques for imbalanced classification: a comparative study using credit card fraudulent transaction dataset. In: 2020 IEEE 16th international conference on control & automation (ICCA). IEEE; 2020. 803–808.

  46. Ge M, Fu X, Syed N, Baig Z, Teo G, Robles-Kelly A. Deep learning-based intrusion detection for IoT networks. In: 2019 IEEE 24th Pacific Rim international symposium on dependable computing (PRDC). IEEE; 2019. 256–25609.

  47. Varsamopoulos S, Criger B, Bertels K. Decoding small surface codes with feedforward neural networks. Quantum Sci Technol. 2017;3(1):015004.

  48. Soe YN, Santosa PI, Hartanto R. DDoS attack detection based on simple ANN with SMOTE for IoT environment. In: 2019 Fourth international conference on informatics and computing (ICIC). IEEE; 2019. 1–5.

  49. Peterson JM, Leevy JL, Khoshgoftaar TM. A review and analysis of the Bot-IoT dataset. In: 2021 IEEE international conference on service-oriented system engineering. IEEE; 2021. 10–17.

  50. Zuech R, Hancock J, Khoshgoftaar TM. Detecting web attacks using random undersampling and ensemble learners. J Big Data. 2021;8(1):1–20.

  51. Naghiloo M, Alonso J, Romito A, Lutz E, Murch K. Information gain and loss for a quantum Maxwell's demon. Phys Rev Lett. 2018;121(3):030604.

  52. Dong R-H, Yan H-H, Zhang Q-Y. An intrusion detection model for wireless sensor network based on information gain ratio and bagging algorithm. Int J Netw Secur. 2020;22(2):218–30.

  53. Leevy JL, Khoshgoftaar TM. A survey and analysis of intrusion detection models based on CSE-CIC-IDS2018 big data. J Big Data. 2020;7(1):1–19.

  54. Leevy JL, Hancock J, Zuech R, Khoshgoftaar TM. Detecting cybersecurity attacks across different network features and learners. J Big Data. 2021;8(1):1–29.

  55. Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A. Mining data with rare events: a case study. In: 19th IEEE international conference on tools with artificial intelligence (ICTAI 2007), vol. 2. IEEE; 2007. 132–139.

  56. Hancock JT, Khoshgoftaar TM. Gradient boosted decision tree algorithms for medicare fraud detection. SN Comput Sci. 2021;2(4):1–12.

  57. Gupta A, Nagarajan V, Ravi R. Approximation algorithms for optimal decision trees and adaptive TSP problems. Math Oper Res. 2017;42(3):876–96.

  58. González S, García S, Del Ser J, Rokach L, Herrera F. A practical tutorial on bagging and boosting based ensembles for machine learning: algorithms, software tools, performance study, practical perspectives and opportunities. Inf Fusion. 2020;64:205–37.

  59. Lobo JM, Jiménez-Valverde A, Real R. AUC: a misleading measure of the performance of predictive distribution models. Glob Ecol Biogeogr. 2008;17(2):145–51.

  60. Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432.

  61. Kohavi R. A study of cross-validation and bootstrap for accuracy estimation and model selection. In: Proceedings of the 14th international joint conference on artificial intelligence-Volume 2. Morgan Kaufmann Publishers Inc.; 1995. 1137–1143.

  62. Suzuki S, Yamashita T, Sakama T, Arita T, Yagi N, Otsuka T, Semba H, Kano H, Matsuno S, Kato Y, et al. Comparison of risk models for mortality and cardiovascular events between machine learning and conventional logistic regression analysis. PLoS One. 2019;14(9):e0221911.

  63. Van Hulse J, Khoshgoftaar TM, Napolitano A. An empirical comparison of repetitive undersampling techniques. In: 2009 IEEE international conference on information reuse and integration. IEEE; 2009. 29–34.

  64. Iversen GR, Wildt AR, Norpoth H, Norpoth HP. Analysis of variance. Sage; 1987.

  65. Tukey JW. Comparing individual means in the analysis of variance. Biometrics. 1949;5(2):99–114.

Acknowledgements

We would like to thank the reviewers in the Data Mining and Machine Learning Laboratory at Florida Atlantic University.

Funding

Not applicable.

Author information

Authors and Affiliations

Authors

Contributions

JLL searched for relevant papers and drafted the manuscript. All authors provided feedback to JLL and helped shape the work. JLL, JH, and JMP prepared the manuscript. TMK introduced this topic to JLL and helped to complete and finalize the work. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Joffrey L. Leevy.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A

In this section, we provide a list of features and their definitions for the full processed Bot-IoT dataset (Table 46).

Table 46 Features and descriptions

Appendix B

Here, we present the feature rankings used by our ensemble FSTs and the feature sets they produce. First, for each dataset, we report the 20 highest ranked features from each ranking technique (Tables 47, 48, 51, 52, 55, and 56). Throughout this case study, the reader may notice that a supervised feature ranking technique such as CatBoost or XGBoost may fail to yield 20 features. This is because, in some instances, a classifier does not assign importance to every feature in a dataset. These rankings are followed by a report of which features appear in each of the 4-Agree, 5-Agree, 6-Agree, and 7-Agree datasets (Tables 49, 50, 53, 54, 57, and 58).
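
As a point of reference only, the following is a minimal Python sketch (not the authors' code) of how an n-Agree ensemble FST can be assembled, under the assumption that a feature is retained when at least n of the seven ranking techniques place it among their top 20 features; the helper function n_agree, the ranking dictionary, and the listed feature names are illustrative.

    from typing import Dict, List

    def n_agree(ranked_lists: Dict[str, List[str]], n: int, top_k: int = 20) -> List[str]:
        """Return features that appear in the top_k list of at least n ranking techniques."""
        counts: Dict[str, int] = {}
        for technique, ranking in ranked_lists.items():
            for feature in ranking[:top_k]:
                counts[feature] = counts.get(feature, 0) + 1
        return sorted(feature for feature, count in counts.items() if count >= n)

    # Hypothetical top-ranked features from two of the seven ranking techniques:
    rankings = {
        "xgboost": ["seq", "stddev", "min", "mean", "drate"],
        "catboost": ["seq", "min", "N_IN_Conn_P_SrcIP", "mean"],
        # ... rankings from the remaining five techniques would be added here ...
    }
    print(n_agree(rankings, n=2))  # features that at least two techniques agree on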

Feature selection for information theft

Table 47 Features by supervised ranking feature importance
Table 48 Features by filter-based feature importance
Table 49 Results of the 4-Agree and 5-Agree FSTs
Table 50 Results of the 6-Agree and 7-Agree FSTs

Feature selection for data exfiltration

Table 51 Features by supervised ranking feature importance
Table 52 Features by filter-based feature importance
Table 53 Results of the 4-Agree and 5-Agree FSTs
Table 54 Results of the 6-Agree and 7-Agree FSTs

Feature selection for keylogging

Table 55 Features by supervised ranking feature importance
Table 56 Features by filter-based feature importance
Table 57 Results of the 4-Agree and 5-Agree FSTs
Table 58 Results of the 6-Agree and 7-Agree FSTs

Appendix C

In this section, we report tuned hyperparameter values. Due to the number of experiments, we do not report hyperparameter values for every experiment; instead, we report the values for the classifiers that yield the best performance, in terms of AUC or AUPRC, over the levels of data sampling we employ. Since RandomizedSearchCV uses stochastic techniques to discover hyperparameters, it may find different settings for hyperparameter values over 10 iterations of 5-fold cross-validation. Therefore, we report the mode (most frequently occurring) value of each hyperparameter that RandomizedSearchCV discovers.
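
For illustration, the following is a minimal sketch (not the authors' exact code) of this tuning procedure with Scikit-learn's RandomizedSearchCV, using a DT classifier. The search space, the number of sampled settings per run, the scoring metric, and the synthetic data standing in for an undersampled Bot-IoT subset are all assumptions made for the example.

    from collections import Counter

    from sklearn.datasets import make_classification
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic, imbalanced stand-in for one of the prepared Bot-IoT datasets.
    X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)

    # Hypothetical DT search space; the study's actual grids may differ.
    param_distributions = {
        "criterion": ["gini", "entropy"],
        "max_depth": [4, 8, 16, None],
        "min_samples_split": [2, 5, 10, 20],
    }

    best_params_per_run = []
    for run in range(10):  # 10 iterations of the randomized search
        search = RandomizedSearchCV(
            DecisionTreeClassifier(random_state=run),
            param_distributions,
            n_iter=20,          # number of sampled hyperparameter settings (assumed)
            cv=5,               # 5-fold cross-validation
            scoring="roc_auc",  # AUC, one of the study's two metrics
            random_state=run,
        )
        search.fit(X, y)
        best_params_per_run.append(search.best_params_)

    # Report the mode (most frequently occurring) value of each hyperparameter.
    for name in param_distributions:
        values = [params[name] for params in best_params_per_run]
        print(name, Counter(values).most_common(1)[0][0])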

If RandomizedSearchCV does not discover hyperparameter values that differ from a classifier's default values, we do not report a table of hyperparameters for that classifier. For the default hyperparameter values of the MLP, NB, DT, LR, and RF classifiers, please see the Scikit-learn classifier documentation (Footnote 5); for XGBoost, the XGBoost documentation (Footnote 6); for CatBoost, the CatBoost documentation (Footnote 7); and for Light GBM, the Light GBM documentation (Footnote 8).

Information theft hyperparameters

Information theft without ensemble feature selection

Here, we report hyperparameter values for the classifiers that yield the results reported in Tables 2, 3, and 4 (Tables 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70).

Table 59 Modes of DT tuned hyperparameter values for experiments with the information theft dataset
Table 60 Modes of DT tuned hyperparameter values for experiments with the information theft dataset
Table 61 Modes of XGBoost tuned hyperparameter values for experiments with the information theft dataset
Table 62 Modes of XGBoost tuned hyperparameter values for experiments with the information theft dataset
Table 63 Modes of LR tuned hyperparameter values for experiments with the information theft dataset
Table 64 Modes of LR tuned hyperparameter values for experiments with the information theft dataset
Table 65 Modes of MLP tuned hyperparameter values for experiments with the information theft dataset
Table 66 Modes of MLP tuned hyperparameter values for experiments with the information theft dataset
Table 67 Modes of RF tuned hyperparameter values for experiments with the information theft dataset
Table 68 Modes of RF tuned hyperparameter values for experiments with the information theft dataset
Table 69 Modes of XGBoost tuned hyperparameter values for experiments with the information theft dataset
Table 70 Modes of XGBoost tuned hyperparameter values for experiments with the information theft dataset

Information theft with ensemble feature selection

Here, we report hyperparameter values for the classifiers that yield the results reported in Tables 5 and 6, after the application of ensemble FSTs (Tables 71, 72, 73, 74, 75).

Table 71 Modes of DT tuned hyperparameter values for experiments with the information theft dataset
Table 72 Modes of LR tuned hyperparameter values for experiments with the information theft dataset
Table 73 Modes of LR tuned hyperparameter values for experiments with the information theft dataset
Table 74 Modes of MLP tuned hyperparameter values for experiments with the information theft dataset
Table 75 Modes of RF tuned hyperparameter values for experiments with the information theft dataset

Data exfiltration hyperparameters

Data exfiltration without ensemble feature selection

Here, we report hyperparameter values for the classifiers that yield the results reported in Tables 17, 18, and 19 (Tables 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87).

Table 76 Modes of DT tuned hyperparameter values for experiments with the data exfiltration dataset
Table 77 Modes of DT tuned hyperparameter values for experiments with the data exfiltration dataset
Table 78 Modes of XGBoost tuned hyperparameter values for experiments with the data exfiltration dataset
Table 79 Modes of XGBoost tuned hyperparameter values for experiments with the data exfiltration dataset
Table 80 Modes of LR tuned hyperparameter values for experiments with the data exfiltration dataset
Table 81 Modes of LR tuned hyperparameter values for experiments with the data exfiltration dataset
Table 82 Modes of MLP tuned hyperparameter values for experiments with the data exfiltration dataset
Table 83 Modes of MLP tuned hyperparameter values for experiments with the data exfiltration dataset
Table 84 Modes of RF tuned hyperparameter values for experiments with the data exfiltration dataset
Table 85 Modes of RF tuned hyperparameter values for experiments with the data exfiltration dataset
Table 86 Modes of XGBoost tuned hyperparameter values for experiments with the data exfiltration dataset
Table 87 Modes of XGBoost tuned hyperparameter values for experiments with the data exfiltration dataset

Data exfiltration with ensemble feature selection

Here, we report hyperparameter values for the classifiers that yield the results reported in Tables 20 and 21, after the application of ensemble FSTs (Tables 88, 89, 90, 91, 92).

Table 88 Modes of DT tuned hyperparameter values for experiments with the data exfiltration dataset
Table 89 Modes of XGBoost tuned hyperparameter values for experiments with the data exfiltration dataset
Table 90 Modes of LR tuned hyperparameter values for experiments with the data exfiltration dataset
Table 91 Modes of MLP tuned hyperparameter values for experiments with the data exfiltration dataset
Table 92 Modes of MLP tuned hyperparameter values for experiments with the data exfiltration dataset

Keylogging hyperparameters

Keylogging without ensemble feature selection

Here, we report hyperparameter values for the classifiers that yield the results reported in Tables 32, 33, and 34 (Tables 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104).

Table 93 Modes of DT tuned hyperparameter values for experiments with the keylogging dataset
Table 94 Modes of DT tuned hyperparameter values for experiments with the keylogging dataset
Table 95 Modes of XGBoost tuned hyperparameter values for experiments with the keylogging dataset
Table 96 Modes of XGBoost tuned hyperparameter values for experiments with the keylogging dataset
Table 97 Modes of LR tuned hyperparameter values for experiments with the keylogging dataset
Table 98 Modes of LR tuned hyperparameter values for experiments with the keylogging dataset
Table 99 Modes of MLP tuned hyperparameter values for experiments with the keylogging dataset
Table 100 Modes of MLP tuned hyperparameter values for experiments with the keylogging dataset
Table 101 Modes of RF tuned hyperparameter values for experiments with the keylogging dataset
Table 102 Modes of RF tuned hyperparameter values for experiments with the keylogging dataset
Table 103 Modes of XGBoost tuned hyperparameter values for experiments with the keylogging dataset
Table 104 Modes of XGBoost tuned hyperparameter values for experiments with the keylogging dataset

Keylogging with ensemble feature selection

Here, we report hyperparameter values for the classifiers that yield the results reported in Tables 35 and 36, after the application of ensemble FSTs (Tables 105, 106, 107, 108, 109).

Table 105 Modes of DT tuned hyperparameter values for experiments with the keylogging dataset
Table 106 Modes of XGBoost tuned hyperparameter values for experiments with the keylogging dataset
Table 107 Modes of LR tuned hyperparameter values for experiments with the keylogging dataset
Table 108 Modes of MLP tuned hyperparameter values for experiments with the keylogging dataset
Table 109 Modes of MLP tuned hyperparameter values for experiments with the keylogging dataset

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Leevy, J.L., Hancock, J., Khoshgoftaar, T.M. et al. IoT information theft prediction using ensemble feature selection. J Big Data 9, 6 (2022). https://doi.org/10.1186/s40537-021-00558-z

Keywords