Cybersecurity data science: an overview from machine learning perspective

Sarker, Iqbal H.; Kayes, A. S. M.; Badsha, Shahriar; Alqahtani, Hamed; Watters, Paul; Ng, Alex

doi:10.1186/s40537-020-00318-5

Journal of Big Data

Table 2 A summary of cybersecurity datasets highlighting diverse attack-types and machine learning-based usage in different cyber applications


Dataset	Description
DARPA	Intrusion detection dataset that includes LLDOS 1.0 and LLDOS 2.0.2 attack scenario data. The traffic and attacks contained in DARPA were collected by MIT Lincoln Laboratory for evaluating network intrusion detection systems [44, 49]
KDD’99 Cup	The most widely used dataset for evaluating anomaly detection methods. It contains 41 features, and its attacks are categorized into four major classes: denial of service (DoS), remote-to-local (R2L), user-to-root (U2R), and probing [50]. The KDD’99 Cup dataset can be used to evaluate ML-based attack detection models
NSL-KDD	A refined version of the KDD’99 Cup dataset in which redundant records are eliminated. As a result, ML classification-based security models that use NSL-KDD are not biased towards more frequent records [51]
CAIDA	The CAIDA’07 and CAIDA’08 datasets contain DDoS attack traffic and normal traffic traces [52, 53]. The CAIDA DDoS dataset can therefore be used to evaluate ML-based DDoS attack detection models and to infer Internet denial-of-service activity
ISOT’10	A combination of malicious and non-malicious traffic created by the Information Security and Object Technology (ISOT) research group at the University of Victoria [54, 55]. ISOT datasets can be used to evaluate ML-based classification models
ISCX’12	The dataset contains 19 features, and 19.11% of its traffic belongs to DDoS attacks. ISCX’12 was produced at the Canadian Institute for Cybersecurity [56, 57] and can be used to evaluate the effectiveness of ML-based network intrusion detection models
CTU-13	A labeled malware dataset including botnet, normal, and background traffic, captured at the Czech Technical University (CTU), Czech Republic [58]. CTU-13 can be used for data-driven malware analysis using ML techniques and to evaluate malware detection systems
UNSW-NB15	A dataset created at the University of New South Wales in 2015, with 49 features and nine different types of attacks, including DoS [59]. UNSW-NB15 can be used for evaluating ML-based anomaly detection systems in cyber applications
CIC-IDS2018, CIC-IDS2017	The datasets include different attack scenarios, namely brute-force, Heartbleed, botnet, HTTP DoS, DDoS, web attacks, and insider attacks, collected by the Canadian Institute for Cybersecurity [60]. The datasets can be used for evaluating ML-based intrusion detection systems, including their handling of zero-day attacks
CIC-DDoS2019	A dataset containing DDoS attacks, collected by the Canadian Institute for Cybersecurity [61]. CIC-DDoS2019 can be used for network traffic behavioral analytics to detect DDoS attacks using ML techniques
MAWI	A collection of traffic data from Japanese network research institutions and academic institutions, used to detect and evaluate DDoS intrusions using ML techniques [62]
ADFA IDS	An intrusion dataset with different versions, named ADFA-LD and ADFA-WD, issued by the Australian Defence Force Academy (ADFA) [63]. They are designed for evaluating host-based IDS
CERT	The dataset includes users’ activity logs and was created for the purpose of validating insider-threat detection systems [64, 65]. It can be used for ML-based analysis of user behavioral activities
Email	Email datasets are difficult to obtain because of privacy concerns. Some common corpora of emails include EnronSpam [66], SpamAssassin [67], and LingSpam [68]
DGA	The Alexa Top Sites dataset is generally used as a source of benign domain names [69]. The malicious domain names are obtained from OSINT [70] and DGArchive [71]. The resulting DGA dataset can be used for experiments in ML-based automatic classification of DGA domains or botnet detection [72]
Malware	Several malware datasets, such as Genome project [73], Virus Share [74], VirusTotal [75], Comodo [76], Contagio [77], DREBIN [78], and Microsoft [79], contain malicious files. These datasets can be used for data-driven malware analysis using ML techniques and to evaluate malware detection systems
Bot-IoT	A dataset that incorporates legitimate and simulated IoT network traffic, along with different attacks, for network forensic analytics in the area of Internet of Things [80]. Bot-IoT can be used to evaluate the reliability of different statistical and machine learning methods for forensic purposes
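
The KDD’99 Cup and NSL-KDD rows above describe two ideas that a short sketch may make more concrete: grouping individual attack labels into the four major classes, and the effect of eliminating redundant records on class frequencies. The following Python example is only an illustration under stated assumptions. The records in toy_records are invented for this sketch (they are not taken from either dataset), the helper names deduplicate and class_counts are hypothetical, and the label-to-class mapping covers only the commonly listed KDD’99 training labels. NSL-KDD itself was built with a more careful procedure than plain duplicate removal.

```python
from collections import Counter

# Mapping from KDD'99 training labels to the four major attack classes.
# Labels not listed here are reported as "unknown" by class_counts().
ATTACK_CLASS = {
    "normal": "normal",
    # Denial of service
    "back": "DoS", "land": "DoS", "neptune": "DoS",
    "pod": "DoS", "smurf": "DoS", "teardrop": "DoS",
    # Remote-to-local
    "ftp_write": "R2L", "guess_passwd": "R2L", "imap": "R2L",
    "multihop": "R2L", "phf": "R2L", "spy": "R2L",
    "warezclient": "R2L", "warezmaster": "R2L",
    # U2R (privilege escalation on the local host)
    "buffer_overflow": "U2R", "loadmodule": "U2R",
    "perl": "U2R", "rootkit": "U2R",
    # Probing
    "ipsweep": "Probe", "nmap": "Probe",
    "portsweep": "Probe", "satan": "Probe",
}


def deduplicate(records):
    """Drop exact duplicate records, keeping the first occurrence of each."""
    seen = set()
    unique = []
    for rec in records:
        if rec not in seen:
            seen.add(rec)
            unique.append(rec)
    return unique


def class_counts(records):
    """Count records per major class; the label is the last field of a record."""
    return Counter(ATTACK_CLASS.get(rec[-1], "unknown") for rec in records)


# Invented toy records: (duration, protocol, service, label).
# Real KDD'99 records carry 41 features plus the label.
toy_records = [
    (0, "icmp", "ecr_i", "smurf"),
    (0, "icmp", "ecr_i", "smurf"),
    (0, "icmp", "ecr_i", "smurf"),
    (0, "tcp", "private", "neptune"),
    (0, "tcp", "private", "neptune"),
    (2, "tcp", "http", "normal"),
    (0, "tcp", "telnet", "guess_passwd"),
    (98, "tcp", "telnet", "buffer_overflow"),
    (0, "icmp", "eco_i", "ipsweep"),
]

unique_records = deduplicate(toy_records)

print("before:", dict(class_counts(toy_records)))
print("after: ", dict(class_counts(unique_records)))
```

In this toy input the DoS class shrinks from five records to two after duplicate removal while the other classes are unchanged, which is the kind of frequency skew the NSL-KDD row refers to.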
