Model fusion of deep neural networks for anomaly detection

Network Anomaly Detection (NAD) is still an open and challenging task that aims to detect anomalous network traffic for security purposes. Network traffic data are usually large-scale and imbalanced, and they often have noisy labels. This paper addresses these challenges using ZYELL's million-scale, highly imbalanced dataset. We propose to train deep neural networks with class weight optimization to learn complex patterns from the rare anomalies observed in the traffic data. This paper proposes a novel model fusion that combines two deep neural networks: a binary normal/attack classifier and a multi-attacks classifier. The proposed solution can detect various network attacks such as Distributed Denial of Service (DDoS), IP probing, PORT probing, and Network Mapper (NMAP) probing. Experiments conducted on ZYELL's real-world dataset show promising performance. The proposed approach outperformed the baseline model in terms of average macro Fβ score and false alarm rate by 17% and 5.3%, respectively.


Previous work
Statistical algorithms for NAD track network behavior using probabilistic models of anomalies. Anomalous attacks are associated with abnormal changes in the data flowing through a network. Generally, these exceptional changes are detected via hard threshold modelling techniques. The major drawback of hard threshold modelling in statistical approaches is the high number of false alarms it generates [15]. Consequently, statistical approaches have aimed to develop methods that help to reduce false alarms.
Real time NAD has been developed using wavelets combined with sketches [16]. This method was a router level analysis that extracts NetFlow traces by converting traces into ASCII files. Then, the sketches used hash functions to aggregate traffic flows in the sketch tables. Next, the produced time series were used to discover discontinuities by wavelet transform.
Correlational Paraconsistent Machine (CPM) is another method that has been developed for NAD. CPM used two methods including non-classical paraconsistent logic (PL), and unsupervised traffic characterization [17]. For example, a study has developed NAD based on both auto regressive integrated moving average (ARIMA), and ant colony optimization for digital signature (ACODS) [18] to generate two distinct network profiles that can identify normal network traffic data.
Classification methods have been widely used for NAD problem [19]. For example, Naive Bayesian classifier (NBC) has been used to detect Distributed Denial of Service (DDoS) attacks, selective forwarding, and black holes [20]. The developed NBC system was utilized to monitor packets moving between nodes to check the behavior of data for abnormality detection. The classifier calculates the probability of the samples that belong to a class based on normal distribution probability approach. Support Vector Machine (SVM) is another known classifier that was used to find patterns and provide autonomous recognition of normal data traffic in NAD problem [21]. The main problems include training SVM classifier with imbalance in class distribution and outlier sensitivity of decision boundary. To address previous problems, two modifications of the unsupervised one-class SVM have been proposed including eta one-class SVMs, and Robust one-class SVMs [22]. Least square support vector machine (LS-SVM) was found as a modification of standard SVM classifier that has been used for intrusion detection [23]. LS-SVM is more sensitive to noise and anomalies compared to a standard SVM.
An ensemble approach for NAD has been demonstrated using AdaBoost algorithm, that combines multiple classifiers including decision tree, K-nearest neighbor (k-NN), naive Bayes, SVM, and multilayer perceptron (MLP) [24]. The AdaBoost algorithm was used to initialize data distribution, classifiers training, error evaluation, and weights assignment to each of the classifiers. Then, weighted voting approach has been used to combine the classifiers prediction of outliers.
Delayed Long Short-Term Memory (dLSTM) was a recent deep learning method that has been used for NAD problem with time-series data [10]. A predictive model was developed based on normal training data, then anomalies were detected from observed data using prediction error. The study proposed to develop multiple LSTM predictive models and produce multiple prediction values. Then, the model with the predictive value that is closest to the measured value was selected. Their developed model can delay the timing of prediction until the associated measured value is acquired.
The NAD problem is usually associated with extreme class imbalance. However, recent studies that consider the data imbalance problem in NAD solutions are still limited [25]. Several techniques, grouped into algorithmic-level and data-level approaches, have been proposed in the imbalanced class domain and found to improve model performance. Algorithmic-level methods handle data imbalance by modifying existing algorithms [26], for example through hyperparameter optimization for imbalanced data classification [27]. Data-level, or resampling, methods are standard approaches that handle class imbalance by adding more data (creating synthetic samples) to the original dataset to generate a new balanced dataset [28]. However, learning from imbalanced data is still an open challenge.
A recent study proposed introducing additional attributes to seven different imbalanced datasets for NAD [25]. The attributes include an outlier score and four types of samples (borderline, safe, rare, and outlier) to gain additional information, enrich the characteristics of the imbalanced data, and improve classification performance.
Cascade two-stage deep learning model based on a deep stacked auto-encoder was demonstrated for network intrusion detection [29]. It includes two stages with two hidden layers in each. The deep learning model was trained in unsupervised manner on unlabelled traffic and was fine-tuned using labelled traffic data. The first stage was used to classify the normal and abnormal traffic. On the other hand, the second stage detects normal with other types of attacks.
A convolutional neural network-based payload classification (CNN) and a recurrent neural network-based payload classification (RNN) were used separately for attack detection [30]. Additionally, ID hybrid convolutional recurrent neural network (CRNN) was used to predict malicious attacks in the network [31]. The CNN and the RNN capture local and temporal features, respectively. Convolution Neural Network (CNN) was utilized to extract the accurate representation of data that were classified by Long Short-Term Memory (LSTM) Model [32].
In summary, deep learning methods for attack detection are divided into several categories [33].
This paper is organized as follows: the "Dataset overview" through "Correlation" Sections describe the datasets, data splitting, and data pre-processing. The proposed approach of model fusion is described in "The proposed model fusion" Section. In "Results and discussion" Section, the experimental setup, performance metrics, and results evaluation are discussed. Finally, "Conclusion and future work" Section summarizes the outcome and significance of this work and the open doors for future enhancement.

Dataset overview
The dataset used in this work is a million-scale dataset of real-world network traffic. It was released by ZYELL group and National Chiao Tung University for a network anomaly detection challenge [13,14]. The data is a time series of network traffic records captured by ZYELL's firewall. Each network traffic record is a network connection session and is labelled as either normal or a specific type of attack. The proportion of anomalies is about 1% during the network connection session [13]. The dataset is stored as a collection of csv files, with 981 MB (3 csv files) for training and 1.28 GB (4 csv files) for testing. The training dataset contains traffic logs from 3 dates with a total of 9,241,463 labelled samples. The three files are [13]:
A. The first file, '1210_firewall.csv', contains 3,265,630 traffic logs.
B. The second file, '1203_filewall.csv', contains 2,809,865 traffic logs.
C. The third file, '1216_filewall.csv', contains 3,165,968 traffic logs.
On the other hand, the separate testing dataset contains traffic logs from 4 dates with a total of 13,290,530 samples that were given in the challenge without labels. The four files are [13]:
A. The first file, '0123_firewall.csv', contains 3,601,186 traffic logs.
B. The second file, '0124_filewall.csv', contains 2,050,710 traffic logs.
C. The third file, '0125_filewall.csv', contains 2,120,819 traffic logs.
D. The fourth file, '0126_filewall.csv', contains 5,517,815 traffic logs.
Table 1 shows examples of the ZYELL data. Each traffic record has one label column and 22 features, including connection duration (seconds), inbound/outbound traffic count (bytes), protocol ID, application name, number of unique source and destination IP addresses in the last T seconds, and others [13].
Here, T, T′, and N are secret numbers selected by the challenge organizers [13].
This dataset targets two main categories of attacks: denial of service (DoS) and probing. The training dataset is distributed unevenly into five categories as shown in Table 3. These categories include normal traffic and four types of attacks: DDoS smurf, probing Nmap, probing port sweep, and probing IP sweep [13].
Distributed Denial of Service (DDoS) is the most common type of attack; it tries to stop traffic flowing from and to the target system. The attack comes from different sources and floods the network with an abnormal amount of traffic, so the target system shuts down to protect itself [13]. As a result, normal traffic cannot flow to the network. For example, attackers may send a huge number of requests to a target network as online orders during a predefined time interval, which prevents customers from paying for online purchases. Smurf DDoS is a type of DDoS that occurs at  packets. A probe, on the other hand, is a type of attack that tries to steal important information such as personal and banking information [13]. There are three types of probe attacks: (1) IP sweep probing consists of ICMP echo requests (pings) sent by an attacker to several destination addresses. When the target replies, the attacker can see the target's IP address from the reply [34]. (2) Port probing, or scanning, is a series of messages sent by an attacker to break into a computer through a port number and gain unauthorized access to sensitive information [34]. IP addresses and ports in a network [13,34].

Dataset splitting protocol
Usually, in machine learning experiments, the data should be divided into three separate, non-overlapping sets, so that no sample appears in more than one set. The three sets are training, validation, and testing, as shown in Table 4. The split between training and testing followed the common 80/20 rule: 80% of the dataset for training and 20% for testing. The new training set was then divided again into training and validation sets by the same 80/20 rule. The division was done as follows:
(1) Divide the data randomly without replicating any sample in any set.
(2) Be sure that all five categories are available in each set (training, validation, and testing).
The training set was used to train the DNN and find the model's weights. The validation set was used to optimize the network's architecture and fine-tune hyperparameters. Finally, the testing set was used to evaluate the model and calculate the evaluation metrics. Of the 22 features, five are not important for classification: time, source IP, destination IP, source port, and destination port. These features were removed, leaving 17 features in each record.
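The two-step 80/20 split with every category present in every set can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a per-class (stratified) random split as one way to satisfy rule (2), and the toy label counts and the helper name `stratified_split` are hypothetical.

```python
import numpy as np

def stratified_split(labels, holdout_fraction=0.2, seed=0):
    """Return (keep_idx, holdout_idx), splitting every class by holdout_fraction."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    keep_parts, holdout_parts = [], []
    for cls in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == cls))
        # Assumption: force at least one sample of each class onto each side.
        n_holdout = min(max(1, int(round(holdout_fraction * len(idx)))), len(idx) - 1)
        holdout_parts.append(idx[:n_holdout])
        keep_parts.append(idx[n_holdout:])
    return np.concatenate(keep_parts), np.concatenate(holdout_parts)

# Toy, heavily imbalanced labels: 0 = normal, 1..4 = four attack types.
labels = np.array([0] * 1000 + [1] * 20 + [2] * 10 + [3] * 10 + [4] * 5)

# Step 1: 80% train+validation, 20% testing.
trainval_idx, test_idx = stratified_split(labels, 0.2, seed=0)
# Step 2: split the 80% again into 80% training and 20% validation.
sub_train, sub_val = stratified_split(labels[trainval_idx], 0.2, seed=1)
train_idx, val_idx = trainval_idx[sub_train], trainval_idx[sub_val]

print(len(train_idx), len(val_idx), len(test_idx))
```

Splitting by index rather than copying rows keeps the three sets disjoint by construction, which matters for a dataset with millions of records.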

Normalization techniques
The feature vector x was scaled using a standard scaler, which removes the mean and scales to unit variance as follows:
$$z = \frac{x - u}{s}$$
where u is the mean and s is the standard deviation.
After scaling, the feature values were clipped to the range between −50 and 50 to avoid extreme outliers.
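The scaling and clipping steps can be sketched with NumPy as below. The sketch assumes that the mean and standard deviation are estimated on the training set and reused for other data, which the text does not state explicitly; the function names and the synthetic data are illustrative only.

```python
import numpy as np

def fit_scaler(x_train):
    """Estimate the per-feature mean u and standard deviation s."""
    u = x_train.mean(axis=0)
    s = x_train.std(axis=0)
    s = np.where(s == 0, 1.0, s)  # guard against constant columns
    return u, s

def scale_and_clip(x, u, s, limit=50.0):
    """Standardize with (x - u) / s, then clip to [-limit, limit]."""
    z = (x - u) / s
    return np.clip(z, -limit, limit)

rng = np.random.default_rng(0)
x_train = rng.normal(loc=10.0, scale=3.0, size=(1000, 4))  # synthetic features
u, s = fit_scaler(x_train)
z_train = scale_and_clip(x_train, u, s)

# A new record with one extreme value: it is clipped at the upper limit.
x_new = x_train[:1].copy()
x_new[0, 0] = 1e6
z_new = scale_and_clip(x_new, u, s)
print(z_new)
```

Clipping after scaling bounds the influence of a single extreme record on the network's inputs without removing the record itself.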

Converting categorical data to numerical representation
The application name column is categorical and has 45 unique string values as follows [13]: ['others', 'domain', 'https', 'snmp', 'icmp', 'http', 'microsoft-ds', 'ssdp', 'netbios-ssn', 'netbiosdgm', 'ssh', 'netbios-ns', 'ftp', 'syslog', 'igmp', 'h323', 'real-audio', 'pop3', 'telnet', 'smtp', 'rtsp', ...]. These unique values occur in the traffic records with very different frequencies. For example, https occurs 1,577,502 times, while sftp occurs only 8 times. Figure 1 shows the 45 unique values in the application name column with the frequency of each. The string values in the application name column were converted to numerical values for further processing. One-hot encoding a column with 45 unique values results in a sparse matrix with a large number of zeros. Therefore, to avoid memory problems, the column was not one-hot encoded; instead, its numerical values were rescaled and clipped.
The column of label is also categorical and has highly imbalanced five categories. Figure 2 illustrates the five unique values in label column with the frequency of each. The values in label column were encoded and converted to a binary form using one hot encoder.
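The two encodings described above can be sketched as follows. The text does not say how application names were mapped to numbers, so the integer index mapping here is an assumption, and the label strings are hypothetical placeholders rather than the dataset's actual label values.

```python
import numpy as np

def fit_vocabulary(values):
    """Map each distinct string to an integer index (sorted for determinism)."""
    return {name: i for i, name in enumerate(sorted(set(values)))}

def encode(values, vocab):
    """Replace each string by its integer index."""
    return np.array([vocab[v] for v in values], dtype=np.int64)

def one_hot(indices, num_classes):
    """Turn integer class indices into one-hot rows."""
    out = np.zeros((len(indices), num_classes), dtype=np.float32)
    out[np.arange(len(indices)), indices] = 1.0
    return out

# Application name: a single integer column instead of 45 one-hot columns.
apps = ["https", "http", "others", "https", "ssh"]
app_vocab = fit_vocabulary(apps)
app_ids = encode(apps, app_vocab)

# Label: one-hot encoded (hypothetical label strings).
labels = ["normal", "ddos_smurf", "normal", "probing_nmap", "normal"]
label_vocab = fit_vocabulary(labels)
y = one_hot(encode(labels, label_vocab), len(label_vocab))

print(app_ids)
print(y)
```

Keeping the application name as one integer column stores one value per record instead of 45, which is the memory saving the text refers to.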
A few numeric columns in this dataset, such as cnt_src, take discrete values with only a few tens of unique values. Figure 3 shows the values in the cnt_src column with the frequency of each. The value 1 occurs in this column more than 6 million times, while the values between 2 and 10 occur between 100 thousand and 800 thousand times each. In contrast, other values, such as those between 20 and 40, have a frequency of less than 200.

Correlation
In this section, the correlation between the features of the traffic samples is calculated. The correlation matrix between each pair of features is represented graphically as a color-coded heatmap, as shown in Fig. 4. Correlation values close to 1 or −1 indicate that a pair of features is strongly correlated. On the other hand, zero or near-zero values indicate that the pair of features is weakly correlated. The correlation matrix matters for feature selection, which is the main stage before classification: when two features are highly correlated, one of the two can be dropped.
Fig. 4 shows that the label (the output to be predicted) is not highly correlated with any input feature of the traffic. Additionally, there is only medium correlation (0.5-0.8) between a feature and its versions with the suffixes '_slow' and '_conn'. For example, cnt_src has medium correlation with cnt_src_slow and cnt_src_conn. Therefore, no feature was dropped, because none is highly correlated with the others.
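The correlation check can be sketched as below, using the Pearson correlation matrix and listing feature pairs whose absolute correlation exceeds a threshold. The threshold of 0.8 is an assumption taken from the upper end of the "medium" range quoted above, and the synthetic columns and their names are hypothetical.

```python
import numpy as np

def highly_correlated_pairs(x, names, threshold=0.8):
    """Return the correlation matrix and the pairs with |corr| > threshold."""
    corr = np.corrcoef(x, rowvar=False)  # columns are features
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return corr, pairs

rng = np.random.default_rng(0)
a = rng.normal(size=2000)
b = 0.6 * a + 0.8 * rng.normal(size=2000)  # medium correlation with a
c = rng.normal(size=2000)                  # unrelated to a
d = a + 0.01 * rng.normal(size=2000)       # near-duplicate of a

x = np.column_stack([a, b, c, d])
names = ["feat_a", "feat_b", "feat_c", "feat_d"]  # hypothetical names
corr, pairs = highly_correlated_pairs(x, names)
print(pairs)
```

In this toy data only the near-duplicate pair is flagged, so one of those two columns could be dropped. On the ZYELL features the text reports no pair above the high-correlation range, so the equivalent list would be empty.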

The proposed model fusion
In this section, the proposed approach of model fusion is described. The model fusion contains two deep neural networks. The binary model 1 includes feature pre-processing and DNN. The DNN was used as a binary classifier to detect any attack by classifying traffic data into two categories: normal and attack. To compose new attack traffic set, four types of attacks including DDOS smurf, IP probing, PORT probing, and NMAP probing were combined into one set as shown in Fig. 5. The two sets of new attack traffic and normal traffic were fed to the binary DNN.
The multi-class model 2 also includes feature pre-processing and a DNN. The DNN was used as a multi-class classifier to categorize attacks into four classes after removing normal traffic data, as shown in Fig. 5. Model 2 is run only if model 1 produces an attack category; when model 1 outputs normal traffic, model 2 is not run. The last dense layer has 2 classes in the normal/attack DNN and 4 classes in the multi-attacks DNN.
The proposed model fusion was compared with a baseline model. The baseline is a single deep neural network trained on data with five categories (normal traffic and four types of attack traffic) to classify the traffic data into 5 classes. The DNN in the baseline method has the same architecture as each of the two DNNs used in the proposed approach, as shown in Table 5.
The class weight optimization approach was used for model training. The loss function, which measures the performance of a classification model, aims to minimize the cross entropy between the predicted probability of a sample and the actual probability. In the proposed model, the loss function is a weighted average in which the weight of each sample is specified by a class weight. This method gives different weights to the majority and minority classes. It penalizes the misclassification of traffic as normal or attack in the first model and the misclassification among the various attack types in the second. The penalization is done by giving a higher weight to the minority class (attack traffic samples) and a lower weight to the majority class (normal traffic samples). The weight of each class is calculated as follows:
$$w_i = \frac{\text{total}}{n \times s_i}$$
where total is the total number of samples, s_i is the number of samples in class i, and n is the number of classes.
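The class weighting and the two-stage decision rule can be sketched together as below. This is a sketch under stated assumptions, not the authors' implementation: the weight formula is the common "balanced" form total / (n × s_i) implied by the symbol definitions above, the two models are replaced by toy stand-in functions that return class probabilities, and the output label convention (0 for normal, 1 to 4 for the attack types) is hypothetical.

```python
import numpy as np

def class_weights(labels, num_classes):
    """Weight of class i = total / (num_classes * count_i).

    Assumes every class has at least one sample (otherwise division by zero).
    """
    counts = np.bincount(labels, minlength=num_classes)
    return counts.sum() / (num_classes * counts)

def fused_predict(x, binary_model, attack_model):
    """Run model 1 on all records and model 2 only on records flagged as attacks.

    Returns labels in {0: normal, 1..4: attack types}.
    """
    is_attack = binary_model(x).argmax(axis=1) == 1
    labels = np.zeros(len(x), dtype=np.int64)
    if is_attack.any():
        labels[is_attack] = attack_model(x[is_attack]).argmax(axis=1) + 1
    return labels

# Class weights for a toy binary set: 90 normal records and 10 attack records.
w = class_weights(np.array([0] * 90 + [1] * 10), num_classes=2)

# Toy stand-ins for the two trained DNNs (hypothetical decision rules).
def toy_binary(x):
    p_attack = (x[:, 0] > 0).astype(float)
    return np.stack([1.0 - p_attack, p_attack], axis=1)

def toy_attack(x):
    return np.eye(4)[x[:, 1].astype(int)]

x = np.array([[-1.0, 0.0], [1.0, 0.0], [1.0, 3.0], [-1.0, 2.0]])
pred = fused_predict(x, toy_binary, toy_attack)
print(w, pred)
```

With these toy counts the minority class receives a weight nine times larger than the majority class, which is the penalization effect the paragraph describes. In the fusion rule, records judged normal by the first model never reach the second model.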

Results and discussion
The experiments were conducted in this work to compare the proposed model fusion with the baseline method. Two scenarios were carried out. The first one is to validate the proposed solution of model fusion by using validation and testing sets obtained from training set as shown in Table 4. On the other hand, the second scenario evaluated the proposed model fusion with an external testing set including new 4-date traffic logs.

Evaluation metrics
In this section, we discuss performance metrics that are suited to evaluating methods trained on imbalanced data. In the formulas below, TP, FP, TN, and FN denote true positives, false positives, true negatives, and false negatives, respectively. The evaluation metrics used in this paper are as follows:
a) Recall (Sensitivity) is the number of traffic samples correctly predicted as attacks over the number of all attack traffic samples.
$$\text{Recall} = \frac{TP}{TP + FN}$$
b) Precision is the number of traffic samples correctly predicted as attacks over the number of all traffic samples predicted as attacks, whether correctly or incorrectly.
$$\text{Precision} = \frac{TP}{TP + FP}$$
c) Fβ score is the weighted harmonic mean of precision and recall. The beta parameter determines the weight of recall in the combined score. If beta is smaller than 1, more weight is given to precision. If beta is greater than 1, more weight is given to recall.
$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \cdot \text{Recall}}{\beta^2 \cdot \text{Precision} + \text{Recall}}$$
d) Evaluation criteria of ZYELL's challenge [13], where max cost = max cost value × number of total entities. In the ZYELL NAD challenge, the values of beta and alpha were set by the challenge organizers as β = 2 and α = 0.3. The total cost is the cost value calculated from the cost matrix shown in Table 6.
e) False Alarm Rate (FAR) is the ratio between the number of normal traffic samples wrongly predicted as attacks and the total number of actual normal traffic samples. An optimal NAD system should produce a low FAR.
$$\text{FAR} = \frac{FP}{FP + TN}$$
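Metrics a), b), c), and e) can be computed from a confusion matrix as sketched below. The confusion matrix is made-up toy data, class 0 is assumed to be normal traffic, and macro averaging is taken as the unweighted mean of the per-class values. The challenge's evaluation criteria in d) are not reproduced because Table 6 and their formula are not available here.

```python
import numpy as np

def safe_div(num, den):
    """Element-wise division that returns 0 where the denominator is 0."""
    num = np.asarray(num, dtype=float)
    den = np.asarray(den, dtype=float)
    return np.divide(num, den, out=np.zeros_like(num), where=den != 0)

def metrics_from_confusion(cm, beta=2.0, normal_class=0):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    cm = np.asarray(cm, dtype=float)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp
    fn = cm.sum(axis=1) - tp
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    b2 = beta ** 2
    fbeta = safe_div((1 + b2) * precision * recall, b2 * precision + recall)
    # FAR: normal samples predicted as any attack / all actual normal samples.
    n_normal = cm[normal_class].sum()
    false_alarms = n_normal - cm[normal_class, normal_class]
    far = false_alarms / n_normal if n_normal > 0 else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "fbeta": fbeta,
        "macro_fbeta": float(fbeta.mean()),
        "far": float(far),
    }

# Toy 3-class confusion matrix: class 0 = normal, classes 1 and 2 = attacks.
cm = [[90, 5, 5],
      [2, 8, 0],
      [1, 0, 9]]
m = metrics_from_confusion(cm, beta=2.0)
print(m)
```

With β = 2 the score weights recall more heavily than precision, so a class with many missed attacks is penalized more than one with a similar number of false detections.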

Evaluation results
In this paper, we present the evaluation metrics of the baseline model, a single multi-class (5-class) DNN trained on ZYELL's dataset. The performance of the proposed solution, which consists of two separate DNNs trained and evaluated on ZYELL's dataset [13], was also evaluated. The comparison was done in terms of average macro precision, recall, F1 score, and Fβ score. Our proposed model fusion outperformed the baseline in terms of average macro precision by more than 11%, as shown in Table 7. Although the baseline produced 5% better recall than our method, the F1 score and Fβ score of the proposed method were better than those of the baseline by 14% and 17%, respectively. The evaluation in Table 7 was done using the testing set described in "Dataset splitting protocol" Section. Figures 7 and 8 illustrate the confusion matrices of the baseline and the proposed method, respectively.
Additionally, the proposed and baseline methods were compared in terms of false alarm rate, which is a significant performance indicator in NAD systems. The results in Table 7 show a remarkable improvement of 5.3% in the false alarm rate of the proposed approach compared to the baseline. This improvement was achieved without degrading the performance of detecting real attacks. As shown in Figs. 7 and 8, the proposed approach correctly identified 1,743,175 normal traffic records, compared to 1,647,285 identified by the baseline. In other words, about 100,000 additional normal traffic records were misclassified by the baseline and led to false alarms. The reason behind the big difference in false alarm rate is that, although both models were trained on the same data, the binary DNN in the proposed approach learned patterns that distinguish between normal and attack traffic only. The patterns learned by the baseline, on the other hand, aim to classify the traffic into 5 classes. This confirms the significance of pattern learning in NAD systems for improving detection performance.
The evaluation in Table 8 was done using external separated testing set mentioned in "Dataset overview" Section. The score of the proposed model fusion calculated by evaluation criteria given by Eq. (4) was 30.24% which outperformed the score of the baseline solution (19.18%) by 11%.
This big difference in evaluation criteria score was mainly caused by the value 2 in cost matrix shown in Table 6. This cost results from misclassification of normal traffic records as DDoS traffic. By comparing between the two confusion matrices, the proposed method misclassified only 868 records compared to 35,868 traffic records misclassified by the baseline.

Conclusion and future work
In this paper, a novel strategy of anomaly detection and classification was proposed for network security purposes. A model fusion was demonstrated that combines a binary normal/attack DNN, which detects the presence of any attack, with a multi-attacks DNN, which categorizes the attacks. Furthermore, this paper addressed the problem of million-scale and highly imbalanced traffic data. The proposed solution was trained, validated, and tested with ZYELL's real-world dataset, and the results were promising. Our solution outperformed the baseline solution in terms of Fβ score by 17%. Additionally, the proposed solution reduced the false alarm rate, a problem from which most NAD systems suffer, by 5.3%. False alarms reduce the reliability of a NAD system, so reducing the false alarm rate can make a NAD system more robust and reliable. Moreover, the low false alarm rate of the proposed solution did not degrade its ability to detect real attacks.
For future work, we aim to enhance the performance by using other types of deep learning models, such as a 1D convolutional neural network (CNN) to learn spatial features and long short-term memory (LSTM) to learn temporal features. In addition, unsupervised learning with an LSTM autoencoder is also a promising candidate solution for this million-scale dataset.

Table 8 The score of evaluation criteria for the proposed and baseline models using external testing dataset