Apply machine learning techniques to detect malicious network traffic in cloud computing

Alshammari, Amirah; Aldribi, Abdulaziz

doi:10.1186/s40537-021-00475-1

Research
Open access
Published: 14 June 2021

Apply machine learning techniques to detect malicious network traffic in cloud computing

Journal of Big Data volume 8, Article number: 90 (2021) Cite this article

22k Accesses
33 Citations
1 Altmetric
Metrics details

Abstract

Computer networks target several kinds of attacks every hour and day; they evolved to make significant risks. They pass new attacks and trends; these attacks target every open port available on the network. Several tools are designed for this purpose, such as mapping networks and vulnerabilities scanning. Recently, machine learning (ML) is a widespread technique offered to feed the Intrusion Detection System (IDS) to detect malicious network traffic. The core of ML models’ detection efficiency relies on the dataset’s quality to train the model. This research proposes a detection framework with an ML model for feeding IDS to detect network traffic anomalies. This detection model uses a dataset constructed from malicious and normal traffic. This research’s significant challenges are the extracted features used to train the ML model about various attacks to distinguish whether it is an anomaly or regular traffic. The dataset ISOT-CID network traffic part uses for the training ML model. We added some significant column features, and we approved that feature supports the ML model in the training phase. The ISOT-CID dataset traffic part contains two types of features, the first extracted from network traffic flow, and the others computed in specific interval time. We also presented a novel column feature added to the dataset and approved that it increases the detection quality. This feature is depending on the rambling packet payload length in the traffic flow. Our presented results and experiment produced by this research are significant and encourage other researchers and us to expand the work as future work.

Introduction

The last two years have seen some of the most shared and stark cybersecurity attacks regularly recorded toward networks in different industries. Security specialists expect another record-breaking year of network breaches and data security risks; companies must make themselves aware of the latest threats in circulation to ensure their security countermeasures are up to par. Ninth attacks type are the most significant frequency in the security First report in Garg et al. [9]

In network attacks, the attacker must know active addresses, network topology, and available services. Network scanners can identify open ports on a system, whether TCP or UDP ports, where shared services are related to specific ports, and an attacker could send packets to every port [6]. TCP fingerprinting abilities of how systems react to unauthorized packet formats different vendors TCP/IP stacks answer differently to unauthorized packets. So, the attacker can determine OS by sending numerous combinations of illegal packet options, initiating a connection with an RST packet, or combining other odd and illegal TCP code bits. The attacker could know if a machine is running, whether Linux, Windows, or any other operating system. This information helps to refine the attack and search for weaknesses in specific services and systems to access [25]. For example, DoS attack, the attacker remains network buffers or memory resources in over-busy. They send massive traffic to a system on the network overcoming its capability to respond to legitimate users. The attacker does this by flood systems with ICMP and UDP packets. The most popular packet flood attack takes benefit of the weakness in TCP’s three-way handshake. Exhaust a server’s ability by leaving half-open connections, so it consumes bandwidth; an attacker would require a more significant relationship than the victim to cut all service [5]. The network must protected from such attacks; robust IDS should deploy before network routers from the company side.

Recently, ML techniques were used to train IDS to capture malicious network traffic. The main idea of IDS based on ML analysis is finding patterns and building an IDS based on the dataset. The IDS can detect adequately. We need to have a real network traffic dataset and proper feature selection to learned enough. Therefore, we aim to propose a detection framework with an ML model to detect malicious traffic rely on a dataset consisting of network traffic attributes to feed IDS, as illustrated in Fig. 1. The dataset called ISOT-CID was created by Aldribi et al. [2] and described in detail in the methodology. The presented model is prepared, constructed, fitted, and evaluated by python language using Sklearn, Numpy, Matplotlib, and Pandas. Our attractive model should construct and fit in memory, so it listens to the extracted feature from network traffic to predict anomalies in real-time. The contribution of our study consists of five things:

1.
Extracting network features (Calculated): T-IN, T-OUT, APL, PV, TBP, and novel Rambling can help IDS better detect. These six features added to the dataset are significant to produce a qualitative dataset applicable to the train machine learning model for anomaly detection.
2.
Propose a lightweight ML model so it can feed IDS in real-time.
3.
Evaluating how calculated features would provide the best classification accuracy using the cross-validation method and split validation.
4.
Our model is applicable to be placed on a local network or before the internet router from the company side.
5.
Detect whether anomaly or normal traffic.

The remainder of this paper organizes as follows. “Related work” section presents related works, and similar studies are listed. “Detection framework (Our Approach)” section illustrates our framework as a complete solution for detection anomaly, including the machine learning model trained by dataset constructed from network row traffic data. The methodology and experimental results are illustrated in “Methods” and “Results and analysis” sections, respectively. Finally, the discussion and the conclusion are presented in “Discussion” and “Conclusions and future work” sections, respectively.

Related work

As anomaly detection is most inserting as a researcher issue, there are many explorations and examination efforts in this field. Briefly, we write about significant of them as related works categorized about the kind of proposed solution.

Supervised learning

Parul and Gurjwar [23] used the Decision Tree algorithms classifier to train the IDS in a layered approach. The result of this approach gave a good result in a layered approach used for each layer. They used the Random forest algorithm and gave good results for every layer but have limited U2R attach, which presents a very low-rate classification. The author argues to modify the random forest to improve the result of the U2R layer. The proposed system used the KDDcup99 dataset, which has significant enhancement on the new release of the dataset call NSL_KDD.

Peng et al. [24] presented an IDS based on the decision tree classifier algorithm. The authors compared the result of the work by multi-methods were not only 10% of the dataset; the entire dataset was tested. The experiment results showed that the proposed IDS system was effective. However, when comparing the detection time for each method, the decision tree’s time was not the best in the case of guaranteed accuracy. The authors argue that the proposed IDS system can be used in fog computing environments over big data. The proposed system was not tested as a real-time application. The system also used the old version KDD cup 99, a new, recent version with significant development.

The presented paper for Anton et al. [3] shown that some ML anomaly detection algorithms such as SVM and Random Forest achieved well in detecting network traffic anomalies in business networks, where both of them are classifier techniques. The dataset needed for training these models delivered by simulators [14]. The trouble lies in producing sound, actual data that matches the business environment where an anomaly detection model can be applied. There are many opportunities for the allowance of the proposed methods. Data from various resources can be collected, composed, and utilized to increase performance. The overview of context information into the anomaly detection process was capable and encouraged the increase of accuracy.

Additionally, the engagement of trickery technologies as devices for anomaly detection could improve the vision of anomaly behavior. One of the essential dominant requirements is capturing data by attacks exact to business applications in general. The analysis achieved in this work only employs network-based features which, in the same form, residence in home and office devices. The only main diversion was the timing pattern that is strongly interrelated to attacks.

Manna and Alkasassbeh [15] presented a recent approach that used ML, such as decision tree J48, random forest, and REP tree. The proposed technique used SNMP-MIB data for the trained IDS system to detect DOS attack anomalies that may affect the network. The classifiers and attributes were applied to the IP group. The results showed that applying the REP tree algorithm classifier donated the highest performance to all IP set times. The average performance of these three classifiers was accurate enough to be an IDS System. However, it has a limitation that the dataset is extensive and needs more challenges to be used in real-time.

Unsupervised learning

Jianliang et al. [11] proposed applying the K-means clustering algorithm used as ML in intrusion detection. K-means was used for intrusion detection to detect anomalies traffic and divide ample data space efficiently, but it has many drawbacks in cluster dependence. They constructed the intrusion detection model using the k-Medoid clustering algorithm with positive modifications. The algorithm stated selecting initial K-Medoid and verified it to be better than K-means for intrusion detection of an anomaly. The proposed approach has exciting advantages over the existing algorithm, which mostly overwhelms the drawbacks of dependency on primary centroids, dependency on the number of clusters, and unrelated clusters. The proposed algorithm is needed to investigate the detection rate for the root attack and real-time environment.

Qiu et al. [26] presented GAD as a group anomaly detection scheme to pinpoint the subgroup of samples and a subgroup of features that together identify an anomalous cluster. The system was applied in network intrusion detection to detect Botnet and peer-to-peer flow clusters. The approach intended to capture and exploit statistical dependencies that might remain among the measured features. The experiments of the model on real-world network traffic data showed the advantage of the proposed system.

A novel Network Data Mining approach was proposed by Kumari et al. [12]. Their approach uses the K-means clustering technique to feature datasets that are extracted from flow instances. Training data divided into clusters of periods of anomalies and regular flow. While the data mining process was moderately complex, the resulting centroids of clusters are used to detect anomalies in new live observing data with a small number of distance calculations. This approach allows arranging the detection method for accessible real-time detection as part of the IDS system. Applying the clustering technique separately for different services identified by their transport protocol and port number enhances detection accuracy. The presented approach conducted an experiment using generated and actual flow. As the author said, this approach needs several improvements, such as comparing clustering results with different K to determine the optimal number of clusters, considering other features such as the average flow duration, and considering different distance metrics.

Nikiforov [20] used a Cluster-based technique to detect anomalies for Virtual Machines within both production and testing LAB environments with reasonable confidence. Some improvements need to be made to have even welled results in testing environments. This model does not consider the time of day and day of week dependability of the VM load. For example, the night is usually a busy time since many auto-tests were running during the night in the testing infrastructure. Some tests were being run at the same time every day. Based on this, the following improvements in the model might be made. Analyze a detected outlier based on the same time as it was detected but for several days before. Check if this is a case when a load is scheduled and planned. Divide the metrics used for analysis into business days vs. weekends since the load might differ.

Cloud-based techniques

Mobilio et al. [17] presented Cloud-based anomaly detection as a service that used the as-a-service paradigm exploited in cloud systems to announce the anomaly detection logic’s control. They also proposed early results with lightweight detectors displaying a promising solution to better control anomaly detection logic. They also discussed how to apply the as-a-service paradigm to the anomaly detection logic and achieving anomaly detection as-a-service. They also proposed an architecture that supports the as-a-service paradigm and can work jointly with any observing system that stores data in time-series databases. The early experimentation of as-a-service with the Clearwater cloud system obtained results demonstrating how the as-a-service paradigm can effectively handle the anomaly detection logic. This approach is fascinating, which integrates new technology of as-a-service in anomaly detection in real-time.

Moustafa et al. [18] proposed a Collaborative Anomaly Detection Framework named CADF for handling big data in cloud computing systems. They provided the technical functions and the way of deployment of this framework for these environments. The proposed approach comprises three modules: capturing and logging network data, preprocessing these data, and a new Decision Engine using a Gaussian Mixture Model [10] and lower–upper Interquartile Range threshold [16] for detecting attacks. The UNSW-NB15 dataset was used for evaluating the new Decision Engine to assess its reliability while deploying the model in real cloud computing systems, and it compared with three ADS techniques. The architecture for deploying this mode as Software as a Service (SaaS) was produced to be installed easily in cloud computing systems.

An ensemble-based multi-filter feature selection method is proposed by Osanaiye et al. [22]. This method achieves an optimum selection by integrating the output of four filter methods. The proposed approach is deployed in cloud computing and used for detecting DDOS attacks. An extensive experimental evaluation of the proposed method was accomplished using the intrusion detection benchmark dataset, NSL-KDD, and decision tree classifier. The obtained result shows that the proposed method decreases the number of features to 13 instead of 41 efficiently. Besides, it has a high detection rate and classification accuracy when compared to other classification techniques.

Barbhuiya et al. [4] presented Real-time ADS named RADS.RADS addresses detecting the anomaly using a single-class classification model and a window-based time series analysis. They evaluated the performance of RADS by running lab-based and real-world experiments. The lab-based experiments were performed in an OpenStack-based Cloud data center, which hosts two representatives, Cloud Applications Graph Analytics and Media Streaming, collected from the CloudSuite workload collection. In contrast, the real-world experiments carried out on the real-world workload traces collected from a Cloud data center named Bitbrains. The evaluation results demonstrated that RADS could achieve 90–95% accuracy with a low false-positive rate of 0–3% while detecting DDoS and crypto-mining attacks in real-time. The result showed that RADS experiences fewer false positives while using the proposed window-based time series analysis than entropy-based analysis. They evaluated the performance of RADS in conducting the training and the testing in real-time in a lab-based Cloud data center while hosting varying 2 to 10 of VMs. The evaluation results suggest that RADS can be used as a lightweight tool to consume minimal hosting node CPU and processing time in a Cloud data center.

Zhang [28] presented Multi-view learning techniques for detecting the cloud computing platform’s anomaly by implementing the extensible ML model. They worked on a gap formulated as the pair classification in real- time, which is trained by improving the ELM model’s multiple features.

The presented technique automatically fuses multiple features from different sub-systems and attains the improved classification solution by reducing the training mistakes. Sum ranked anomalies are identified by the relation between samples and the classification boundary, and weighting samples ranked retrain the classification model. The proposed model deals with different challenges in detecting an anomaly, such as imbalance spreading, high dimensional features, and others, efficiently via Multi-view learning and feed regulating.

Deep learning techniques

Fernandez and Xu [8] presented a case study using a Deep learning network to detect anomalies. The author said that he achieved excellent results in supervised network intrusion detection. They also showed that using only the first three octets of IP addresses can be efficient in handling the use of dynamic IP addresses, representing the strangeness of DNN in the attendance of DHCP. This approach showed that autoencoders could be used to detect anomalies wherever they trained on expected flows.

Kwon [13] proposed Recurrent Neural Network RNN and Deep Neural Network DNN with ML techniques related to anomaly detection in the network. They also conducted local experiments showing the feasibility of the DNN approach to network flow traffic analysis. This survey also investigated DNN models’ effectiveness in network flow traffic analysis by introducing the conducting experiments with their FCN model. This approach shows encouraging results with enhancement accuracy to detect anomalies compared to the conventional techniques of ML such as SVM, random forest, and Ad boosting.

Garg et al. [9] presented a hybrid data processing model for detection anomaly in the network that influences Grey Wolf optimization and Convolution Neural Network CNN. Improvements in the GWO and CNN training approaches improved with exploration and initial population capture capabilities and restored failure functionality. These extended alternatives are mentioned as Improved-GWO and Improved CNN. The proposed model runs in two stages for detection anomaly in the network. At the first stage, improved GWO was utilized for feature selection to attain an ideal trade-off among two objectives to reduce the frailer rate and minimize the feature set. In the second stage, improved CNN was utilized for the classification of network anomalies. The author said that the proposed model’s efficiency is evaluated with a benchmark (DARPA’98 and KDD’99) and artificial datasets. They showed the results obtained, which validate that the proposed cloud-based anomaly detection model was superior to the other related works utilized for anomaly detection in the network, accuracy, detection rate, false-positive rate, and F-score. The proposed model shows an overall enhancement of 8.25%, 4.08%, 3.62% in detection rate, false positives, and accuracy, respectively, related to standard GWO with CNN.

Feature extraction

Umer et al. [27] Proposed a flow-based IDS which gets IPFIX/Net Flow records treated as input. Each flows record can have several attributes. Some of these attributes are tacked to the classification model for the decision, while others are used in computational. The significant attributes such as originating IP address destination port play an essential part in the detection judgment’s proposed approach. They conducted feature selection to select related attributes required for increasing the performance of the decision. They conducted a preparing process for flow records to convert them into a specific format to be acceptable to anomaly detection algorithms.

Nisioti et al. [21] presented a survey of the unsupervised model for the IDS system. This model’s features are extracted from different evidence sources as network traffic, logs from different devices and host machines, etc. Unsupervised techniques proposed to consider as more flexible to the additional features extracted from different sources evidence and do not need regular training back. They also proposed and compared feature selection methods for IDS. This survey finds and uses the optimum feature subset for each class to decrease the computational complexity and time.

Münz et al. [19] presented a detection model for the anomaly in network traffic using a clustering algorithm, which is K-Means for input. The proposed detection model takes captured hypervisor packets and composes them into a stream of packet flows related to operating system time. The model consists of two phases of feature extraction based on the packet’s header as a primary feature vector computed for each unique packet. The second phase extracts a separated feature vector to every packet flow related to the primary feature vectors attendant with the packets included in the flow.

Aldribi et al. [2] introduced a hypervisor-based cloud for IDS that includes a novel feature extraction approach depending on the activities of user instances and their related behaviors into the hypervisor. The proposed model intended to detect anomalous behavior into the cloud by tracing statistical variations using a grouping of the gradient descent algorithms and E-Div. The new dataset was introduced as an intrusion detection dataset gathered in a cloud environment available and publicly for researchers. The dataset involves multistage attack scenarios that permit developing and evaluate threat environments relying on cloud computing. They conducted an experimental evaluation using the Riemann rolling feature extraction scheme and produce promising results. The dataset carried the number of communications over encrypted channels, for instance, using protocols like SSH.

Detection framework (our approach)

As shown in Fig. 1, the network traffic dataset consists of flow network traffic attributes described in Aldribi et al. [2] with no label. The proposed dataset extracted from network traffic in different period and contains frame time, source MAC, destination MAC, source IP, source port, destination IP, source port, IP length, IP header length, TCP header length, frame length, offset, TCP segment, TCP acknowledgment, in frequency number, and out frequency number. These attributes of network flow can specify packets, whether anomaly or normal. The formulas shown in Fig. 2 can calculate the in-frequency number, and, similarly, the out-frequency number. Other features that are vital and added to the ISOT-CID dataset are. APL is the average payload packet length for a time interval, PV is the variance of payload packet length for a time interval, and TBP means the average time between packets in the time interval [29].

The main significant thing in our research that we added the novel feature. We believe this novel feature gives support for the ML model in the training process. This feature is called rambling.

Most machine learning models are learning from the diversions of instance values. The closer values can support the classification process more accurately. Depending on our knowledge network flow traffic have many different packet sizes through the various type of contents. The network protocols have limited packet size related to industrial Corporations such as Xerox Ethernet V2, intel, etc. Most of them ranged from (64 to 1518) bytes. Suppose we capture a group of packets that have the same destination IP address in a time interval. Let payload of the packet in specific time T is Vi and Xi is the mean of these V (0,1, 2, …. n) the rambling feature (R) calculate for each instance flow for the interval (t, dt) as the following.

$$R = |Xi - Vi|in\left( {t,dt} \right)$$

(1)

This new feature (Rambling feature) can reduce each flow packet size difference, supporting the machine learning algorithm's classification process.

The dataset is labeled related to specific normal IPs, including in the data instances, and used in the ML classification model. The classification model, as presented in Fig. 1 which is trained by an updated ISOT-CID dataset able to classify the new feature extracted from the network data flow, whether normal or anomaly, in real-time. Figure 3 summarizes the whole process.

Methods

The methodology of our work illustrated in Fig. 4. It consists of three stages. Stage 1 concerns the dataset preparation, and stage 2 builds the detection model. The last stage will consist of the evaluation stage, which ensures our approach accuracy for anomaly detection.

Dataset preparation stage

Understanding dataset

Cloud computing networks facing security threats, same as the traditional computing networks with some other differences [1]. According to several protocols, services, and technologies such as virtual structures, these additional security threats related to the cloud infrastructure have data formatting levels. With such an environment providing protection should consider all data traffic in both insider and outsider. The remaining challenge of completing this job is building an ML model that trains IDS to capture these various data abstraction anomalies. Furthermore, the extracting features from these several data places need related tools to pass the gathered row data to the trained ML model. The extracting tools should be gathering recent instances of data from several resources in real-time.

ISOT-CID [1] dataset was presented as an exciting job contains several data collections about data transmission behavior and buffer data format. The presented dataset has enough properties and data attributes to train IDS for robust and comprehensive protection. The data collections of ISOT-CID consist of system call properties, network traffic memory dump, events log, and resource utilization. The ISOT-CID cloud intrusion detection dataset contains terabytes of data, including regular traffic, activities, and multiple attack scenarios. The data gathered in several periods in the cloud in a natural environment. This dataset’s content is considered essential for the business industry for developing a realistic intrusion detection model for cloud computing. The ISOT dataset collects various data goatherd from cloud environment and collected from different cloud layers, involved guest hosts, networks, and hypervisors, and encompasses data with various data formats and several data resources such as memory, CPU, system, and network traffic. It includes various attack scenarios such as a denial-of-service masquerade attack, stealth attacks, attacks data from inside and outside the cloud, and anomalous user behavior. ISOT-CID aims to represent a real dataset in the cloud for scientist’s researchers so that they can develop, evaluate and make a comparison of their works. It intends to help various and comprehensive IDS systems development and evaluation.

Furthermore, ISOT-CID is fundamentally raw data and has not been converted, altered, or manipulated. It is prepared and structured for securing the cloud community. In this research, we consider only the network traffic part, as described in the Ph.D. thesis of Aldribi et al. [1].

In this research, we are working on only the network traffic part. The dataset attributes describe in Tables 1 and 2.

Table 1 The feature extracted from network traffic

Apply machine learning techniques to detect malicious network traffic in cloud computing

Abstract

Introduction

Related work

Supervised learning

Unsupervised learning

Cloud-based techniques

Deep learning techniques

Feature extraction

Detection framework (our approach)

Methods

Dataset preparation stage

Understanding dataset

Preprocessing the dataset

Label the dataset

Building detection model stage

Selecting ML technique

Extracting features

Trigger the model and passing features

Evaluating stage

Cross-validation

Split -validation

Results and analysis

Cross-validation evaluation

Evaluating ANN

Evaluating DTREE

Evaluating K-nearest Neighbor (KNN classifier)

Evaluating support vector machine

Evaluating Random Forest

Evaluating Naive Bayes

Split-validation evaluation

Evaluating ANN

Evaluating DTREE

Evaluating KNN

Evaluating SVM

Evaluating Random Forest

Evaluating Naive Bayes

Discussion

Cross-validation result

Split-validation result

Conclusions and future work

Availability of data and materials

References

Acknowledgements

Funding

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Additional information

Publisher's Note

Rights and permissions

About this article

Cite this article

Share this article

Keywords