From: Cybersecurity data science: an overview from machine learning perspective
Dataset | Description |
---|---|
DARPA | Intrusion detection dataset that includes LLDOS 1.0 and LLDOS 2.0.2 attack scenario data. The traffic and attacks contained in DARPA were collected by MIT Lincoln Laboratory for evaluating network intrusion detection systems [44, 49] |
KDD’99 Cup | The most widely used dataset for evaluating anomaly detection methods. It contains 41 features, and its attacks are categorized into four major classes: denial of service (DoS), remote-to-local (R2L), user-to-root (U2R), and probing [50]. The KDD’99 Cup dataset can be used to evaluate ML-based attack detection models |
NSL-KDD | A refined version of the KDD’99 Cup dataset in which redundant records are eliminated. An ML classification-based security model that uses NSL-KDD will therefore not be biased towards more frequent records [51] |
CAIDA | The CAIDA’07 and CAIDA’08 datasets contain DDoS attack traffic and normal traffic traces [52, 53]. The CAIDA DDoS datasets can thus be used to evaluate ML-based DDoS attack detection models and to infer Internet denial-of-service activity |
ISOT’10 | A combination of malicious and non-malicious data traffic created by the Information Security and Object Technology (ISOT) research lab at the University of Victoria [54, 55]. The ISOT datasets can be used to evaluate ML-based classification models |
ISCX’12 | The dataset contains 19 features, and 19.11% of its traffic belongs to DDoS attacks. ISCX’12 was produced at the Canadian Institute for Cybersecurity [56, 57] and can be used to evaluate the effectiveness of ML-based network intrusion detection models |
CTU-13 | A labeled malware dataset including botnet, normal, and background traffic, captured at the Czech Technical University (CTU), Czech Republic [58]. CTU-13 can be used for data-driven malware analysis using ML techniques and to evaluate malware detection systems |
UNSW-NB15 | The dataset, created at the University of New South Wales in 2015, has 49 features and nine different types of attacks, including DoS [59]. UNSW-NB15 can be used for evaluating ML-based anomaly detection systems in cyber applications |
CIC-IDS2017, CIC-IDS2018 | The datasets include different attack scenarios, namely brute-force, Heartbleed, botnet, HTTP DoS, DDoS, web attacks, and insider attacks, collected by the Canadian Institute for Cybersecurity [60]. The datasets can be used for evaluating ML-based intrusion detection systems, including detection of zero-day attacks |
CIC-DDoS2019 | A dataset containing DDoS attacks, collected by the Canadian Institute for Cybersecurity [61]. CIC-DDoS2019 can be used for network traffic behavioral analytics to detect DDoS attacks using ML techniques |
MAWI | A collection of network traffic data from Japanese network research institutions and academic institutions, used to detect and evaluate DDoS intrusions using ML techniques [62] |
ADFA IDS | An intrusion dataset with two versions, ADFA-LD and ADFA-WD, issued by the Australian Defence Force Academy (ADFA) [63]. They are designed for evaluating host-based IDS |
CERT | The dataset includes users’ activity logs that were created for the purpose of validating insider-threat detection systems [64, 65]. It can be used for ML-based analysis of user behavioral activities |
Email | Email datasets are difficult to obtain because of privacy concerns. Some common email corpora include EnronSpam [66], SpamAssassin [67], and LingSpam [68] |
DGA | The Alexa Top Sites dataset is generally used as a source of benign domain names [69]. The malicious domain names are obtained from OSINT [70] and DGArchive [71]. The DGA dataset can be used for experiments in ML-based automatic DGA domain classification or botnet detection [72] |
Malware | Several malware datasets, such as Genome project [73], Virus Share [74], VirusTotal [75], Comodo [76], Contagio [77], DREBIN [78], and Microsoft [79], contain malicious files. These datasets can be used for data-driven malware analysis using ML techniques and to evaluate malware detection systems |
Bot-IoT | A dataset that incorporates legitimate and simulated IoT network traffic, along with different attacks, for network forensic analytics in the area of Internet of Things [80]. Bot-IoT can be used to evaluate the reliability of different statistical and machine learning methods for forensic purposes |
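
The NSL-KDD row notes that eliminating redundant records keeps a classifier from being biased towards more frequent records. A minimal sketch of that one idea follows, using only the Python standard library. The records, field layout, and helper names (`deduplicate`, `label_counts`) are hypothetical and invented for illustration; they are not taken from KDD’99 or NSL-KDD, and the sketch covers only exact-duplicate removal, not any other preprocessing those datasets may involve.

```python
from collections import Counter

# Toy records in the form (duration, protocol, service, label).
# All values are invented for illustration only.
records = [
    (0, "tcp", "http", "normal"),
    (0, "tcp", "http", "normal"),
    (0, "icmp", "ecr_i", "dos"),
    (0, "icmp", "ecr_i", "dos"),
    (0, "icmp", "ecr_i", "dos"),
    (0, "icmp", "ecr_i", "dos"),
    (2, "tcp", "ftp", "r2l"),
]


def deduplicate(rows):
    """Drop exact duplicate rows while preserving first-seen order."""
    # dict keys are unique and keep insertion order (Python 3.7+).
    return list(dict.fromkeys(rows))


def label_counts(rows):
    """Count how many rows carry each class label (last field)."""
    return Counter(row[-1] for row in rows)


unique_records = deduplicate(records)

# Before: repeated records inflate the apparent frequency of some classes.
print("with duplicates:   ", dict(label_counts(records)))
# After: each distinct record contributes once.
print("without duplicates:", dict(label_counts(unique_records)))
```

In this toy data the repeated `dos` rows dominate the label counts before deduplication, which is the kind of skew that can pull a frequency-sensitive learner towards the repeated records.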