Intrusion detection systems using long short-term memory (LSTM)

An intrusion detection system (IDS) is a device or software application that monitors a network for malicious activity or policy violations. It scans a network or a system for a harmful activity or security breaching. IDS protects networks (Network-based intrusion detection system NIDS) or hosts (Host-based intrusion detection system HIDS), and work by either looking for signatures of known attacks or deviations from normal activity. Deep learning algorithms proved their effectiveness in intrusion detection compared to other machine learning methods. In this paper, we implemented deep learning solutions for detecting attacks based on Long Short-Term Memory (LSTM). PCA (principal component analysis) and Mutual information (MI) are used as dimensionality reduction and feature selection techniques. Our approach was tested on a benchmark data set, KDD99, and the experimental outcomes show that models based on PCA achieve the best accuracy for training and testing, in both binary and multiclass classification.

extracted data are unimportant and noisy, which leads to the presence of irrelevant features that will deteriorate the performance of the classifier. Hence, dimensionality reduction algorithms, such as PCA (Principal component analysis) and Mutual information are crucial, to select informative data.
In this research, our contribution consists of implementing three models, namely: LSTM, LSTM-PCA, and LSTM-MI (Mutual Information). As shown in Fig. 1, after preprocessing the well-known dataset KDD99, we applied three different approaches, based on Long short-term memory classifier. LSTM without any dimensionality reduction (using all KDD99 features), applying Principal Component Analysis and employing Mutual Information. These approaches were tested on binary and multiclass classification, and the result showed that models based on PCA obtained the best results.

Related works
From the first intrusion detection system introduced by Denning [2], studies applied multiple intrusion detection algorithms. Since traditional classification algorithm need to manually extract features, deep learning techniques proved their effectiveness in this type of problem. Therefore, many researchers have focused their efforts on deep learning techniques to create powerful IDSs.
Anand [3] used an improved genetic K-means++ algorithm and IGKM, on a subset of the KDD99 dataset to create an intrusion detection system. IGKM gave a better accuracy than K-means++. Ma [4] used a Deep Neural Network (DNN) to classify attack type and Spectral Clustering as a feature extractor. The DNN network had a better accuracy than SVM (support-vector machine), BPNN (back propagation neural network) and RF (Random Forest). Nikolov [5], proposed a recurrent neural network classifier namely long-short term memory (LSTM), the authors tested the solution on the NGID-DS dataset for binary classification (Only HTTP Attack). Jayaprakash [6], introduced an anomaly based detection mechanism that implements Role based Access control (RBAC). A new data structure called Octraplet is used for storing the SQL queries. This system uses the Naive Bayes Classifier which is a supervised Machine Learning method for Detecting anomalous queries. Kang [7] proposed an effective IDS on for the in-vehicle network. The system is used to detect attacks in a controller area network (CAN) bus. The model loads the over pre-training belief (DBN) an enhancement in the detection accuracy. Chkirbene [8], proposed two models for intrusion detection and classification, Trust-based Intrusion Detection and Classification System (TIDCS) which reduces the number of features in the input data based on a new algorithm for feature selection and TIDCS-A which is a dynamic algorithm to compute the exact time for nodes cleansing states and restricts the exposure window of the nodes, this method is tested on NSLKDD and UNSW dataset. Erfani [9] combined a DBN (Deep belief Network) and SVM. The first layer of the proposed architecture used DBN to extract features, while the second SVM layer was trained on the features extracted by the DBN network. The results were powerful and fitting high-dimensional domains. Dong [10] proposed a deep learningbased intrusion detection model, AEAlexJNet, which uses Auto-Encoder AlexNet neural network. The experimental results of the intrusion detection data set KDD99 show that the accuracy of the AE-AlexNet model is 94.32%. Dutt [11] imitates the adaptive immune system by taking into consideration the activation of the T-cells and the B-cells. It captures relevant features from header and payload portions for effective detection of intrusion. Experiments have been conducted on both the real-time network traffic and the standard datasets KDD99 and UNSW-NB15 for intrusion detection. The SMAD model yields is 96.04% true positive rate and around 97% true positive rate using realtime traffic and standard data sets. Hanselmann [12] used a neural network architecture for detecting intrusions on the controller area network (CAN), and introduced an unsupervised learning approach called CANet, evaluated on real and synthetic CAN data. A comparison with previous machine learning based methods shows that CANet outperforms them by a significant margin. CNN (Convolutional Neural Network), most commonly applied to analyzing visual imagery, was also used in intrusion detection. Khan [13] introduced an improved CNN on KDD99 dataset. The results showed that CNN can greatly improve the accuracy. Another deep learning algorithm was used by Javaid [14] to construct an NIDS called Self-taught Learning (STL). Experimental results were very encouraging and revealed that this architecture is as good as the previous models. Vijayanand [15] proposed a wrapper-based approach using the modified whale optimization algorithm (WOA) and used a method in which the genetic algorithm operators were combined with the WOA. Using a support vector machine (SVM), then, the authors identified the types of intrusions based on the selected features. Sydney [16] proposes a Feed-Forward Deep Neural Network (FFDNN) wireless IDS system using a Wrapper Based Feature Extraction Unit (WFEU) and compare it to standard machine learning (ML) algorithms that include Random Forest (RF), Support Vector Machine (SVM), Naive Bayes (NB), Decision Tree (DT) and k-Nearest Neighbor (kNN). The experimental studies include binary and multiclass types of attacks. The solution was tested on UNSW-NB15 and the AWID dataset with 87.10% and 77.16% accuracy for the binary and multiclass classification. Shapoorifard [17] used an improved version of K-Nearest Neighbors (KNN) classifier. The author combines K-MEANS clustering and KNN classification and claims that engaging the farthest neighbor enhances the accuracy rate and detection rate and reduces false alarm rate. Tama [18] designed an anomaly-based IDS, with two-step classifier. The TSE-IDS model select features using a hybrid method and collect specific feature representations. The results on the NSL-KDD and UNSW-NB15 datasets improved the detection rates. Li [19] proposed an AE-IDS (Auto-Encoder Intrusion Detection System) based on random forest algorithm. This method constructs the training set with feature selection and feature grouping. After training, the model can predict the results with auto-encoder. Su [20] combines a bat model and BLSTM (Bidirectional Long Short-term memory) and attention mechanism. Attention mechanism is used to screen the network flow vector composed of packet vectors generated by the BLSTM model, which can obtain the key features for network traffic classification. To the best of our knowledge, nobody compared a linear (Mutual Information) and nonlinear (Principal Component Analysis) dimension reduction algorithm, on both Binary and multiclass classification using LSTM. Shen [21] proposed an ensemble method, combining the extreme learning machine (ELM) as a base classifier and a pruning method based on the Bat Algorithm (BA) as an optimizer, resulting an accuracy of 98.94%. Khan [22] conceived an intrusion detection system, based on convolutional neural network algorithm. The entire network consists of three hidden layers. Each hidden layer contains a convolutional layer and a pooling layer.

Long short-term memory LSTM
LSTM is an extension of RNN, introduced by Hochreiter and Schmidhuber [23] in 1997, designed to avoid the long-term dependency issue, unlike RNN, LSTM can remember data for long periods. In RNN architecture (Fig. 2), hidden layers have a simple structure (e.g. single tanh layer), while the LSTM architecture is more complex, It is constituted of 4 hidden layers (Fig. 3)  The principal component of LSTM is the cell state. To add or remove information from the cell state, the gates are used to protect it, using sigmoid function (one means allows the modification, while a value of zero means denies the modification.). We can identify three different gates ( Fig. 4): • Forget gate layer (Fig. 4a): Looks at the input data, and the data received from the previously hidden layer, then decides which information LSTM is going to delete from the cell state, using a sigmoid function (One means keeps it, 0 means delete it). It is calculated as: (Fig. 4b): Decides which information LSTM is going to store in the cell state. At first, input gate layer decides which information will be updated using a sigmoid function, then a Tanh layer proposes a new vector to add to the cell state. Then the LSTM update the cell state, by forgetting the information that we decided to forget, and updating it with the new vector values. It is calculated as: 4c): decides what will be our output by executing a sigmoid function that decides which part of the cell LSTM is going to output, the result is passed through a Tanh layer (value between − 1 and 1) to output only the information we decide to pass to the next neuron. It is calculated as

Feature selection
Having a large number of dimensions in the feature space can dramatically impact the performance of machine learning algorithms, especially in the real-time environment.

Fig. 4 LSTM layers
Therefore, many algorithms of dimensionality reduction and features selection are used on the original dataset to reduce the number of input features. This enables the machine learning algorithm to train faster, reduce the complexity of a model and make it easier to interpret.
In our research, we use two algorithms to reduce the dataset dimensionality, namely Principal Component Analysis (PCA) and Mutual Information (MI).

Principal component analysis (PCA)
Large datasets are increasingly common and are often difficult to interpret. Principal component analysis (PCA) is a technique for reducing the dimensionality of such datasets, by creating new uncorrelated variables that successively maximize variance. Using PCA, we can reduce the computational costs and the error of parameter estimation by reducing the number of dimensions of the feature space by extracting a subspace that describes the data best.
Technically, after standardizing the data, PCA extracts the eigenvectors and eigenvalues from the covariance matrix (CM): where x ′ is the mean vector x ′ = ( 1 n ) k=1 n (x i ) . and the covariance between two features: The eigenvalues are then arranged in descending order, and k eigenvectors corresponding to k eigenvalues are chosen, where k is the number of dimensions of the new feature subspace ( K < d ). Next, PCA builds the projection matrix W from the selected k eigenvectors. And finally, it transforms the original dataset X via W to obtain a k-dimensional feature subspace Y = X * W .

Mutual information (MI)
Unlike PCA, Mutual information calculates the statistical dependence between two variables. It measures how much information is communicated, on average, in one random variable about another and assigns a score for each feature. High MI score between two variables indicates a large reduction in uncertainty; low mutual information indicates a small reduction while zero mutual information score means the variables are independent. Therefore, the mutual information between two discrete variables X| AND Y| denoted I(X; Y), is defined by Cover and Thomas [24] as: Here P X (x) and P Y (y) are the marginals: P X (x) = y P XY (x, y).

Our approach
This section includes the dataset description, data preprocessing, calculation of the PCA variance, calculation of Mutual information scores, implementation and parameters of the proposed model.

The KDD99 dataset
Although KDD99 dataset [25] is more than 19 years old, it is still widely used in academic research. The raw training data were processed into about five million connection records. A connection is a sequence of TCP packets starting and ending at some well-defined times. The dataset contains 53 features, and falls into 4 attacks categories (DoS, Probe, U2R, and R2L) divided into 22 different attacks ( Table 1): • Denial of service attack (DoS): is an attack meant to shut down a machine or network, making it inaccessible to its intended users. DoS attacks typically function by overwhelming or flooding a targeted machine with requests until normal traffic is unable to be processed. • User to root attack (U2R): is an attack where hackers exploit some vulnerabilities to gain root access to the system, providing him unauthorized access to the local superuser. • Remote to local attack (R2L): occurs when the attacker finds vulnerable points in a computer or network security software to gain access to the machine or the system. The main goal of this attack is to explore or steal data illegally, introduce viruses or cause damage to the victim. Probing Attack: Also called Scanning/Discovery, which is the first step of an attack; probing is made to gather information on the targeted system. The network is scanned for known vulnerabilities in infrastructure software, as well as unknown vulnerabilities in the custom code developed for the specific target application.
To improve the quality of raw data, data preprocessing and filtering are required which increase data efficiency. Therefore, KD99 was imported as a Panda Data Frame, then, we collected statistical information about each attack type instance. As we can see (Table 2), the number of attack records is much bigger than normal records. This unbalanced data can lead our model to fail in the classification task. To overcome this problem, we used some sampling techniques that we describe in the next section.

Binary classification
Binary classification is the task of labeling output into two groups. In our case, our binary classifies should have the ability to decide whether a given record is an attack or not. To do that, we group label into two categories: Normal and Attack. Furthermore, to overcome unbalanced data issues, we used random sampling. Hence, we choose 100,000 records for the normal category and 100,148 records for attacks category (Table 3).

Multiclass classification
Unlike binary classification, which classifies instances into two classes, multiclass classification outputs into three or more classes. Due to the unbalanced data problem, we grouped attacks into three categories of attacks which are: Normal, DoS and all the  others in the R2L category. Then, we sampled them in 120,000, 100,000 and 100,000 records respectively (Table 4).

Principal component analysis (PCA)
As we mention above, PCA is a technique to reduce the dataset dimensionality. For that, we calculate the percentage of explained variances between the 53 features for binary classification (Fig. 5a) and multiclass classification (Fig. 5b).
As we can see, the first two components represent more than 71% of the data for binary classification, and 74% for multiclass classification. Therefore, if we add the third component, we get more than 81% for binary classification and 87% for multiclass classification. As a result, we tested our model using a reduced dataset of two and three components.

Mutual information (MI)
We indicated earlier that Mutual information calculates the statistical dependence between two variables. Thus, a score is assigned to each feature, showing how much the latter impacts the result. For that, we calculated the mutual information scores, for binary (Fig. 6a), and multiclass (Fig. 6b).
For binary classification, four features have the best mutual information scores, namely feature 21, feature 3, feature 2 and feature 4. While 10 features have relatively good scores, the other don't, which means they didn't impact the classification result.

Fig. 6 Mutual information calculus
Thus, we tested our model with 04 and 10 features scores. Concurrently, we tested with the same number of features for multiclass classification.

Implementation and evaluation metrics
We have implemented our model on Google Colab, using Keras library (Which is an opensource neural network library written in Python). In Google Colab, we split the preprocessed Dataset into 60% for training Data, 20% for validation and 20% for testing purposes.
On the other hand, our LSTM model (Fig. 7) was configured with the parameters shown in the Table 5.
To determine the performance of the proposed algorithm, a confusion matrix was plotted. Each row of the matrix represents the instances in a predicted class while each column represents the instances in an actual class. Therefore, confusion Table matrix allows

Experimental results
In this research, we gathered binary classification results and multi-classification(3class) results. In both groups of classification, we used PCA and Mutual information for dimensionality reduction, and we applied LSTM as a classification algorithm. Figure 8 summarize the performance values of each model, in term of training accuracy, testing accuracy, sensitivity (recall), precision and F1 scores.

For binary classification
In this experiment, we noticed that LSTM-PCA provided the best results, especially using 02 components. This may be due to that PCA had low noise sensitivity, remove correlated features and smaller dimensions help the classifier to learn easily. We can also notify that Mutual information is better with 4 features than 10 features and that LSTM performs well in the binary classification, but for multi-classification is much less efficient, it may be due that the data set is quite noisy. The confusion matrices of binary classification models are presented in Fig. 9.

For multiclass classification
Same as binary classification, we noticed that LSTM-PCA produced the best results, again, using only 02 dimensions.
The confusion matrices of the multiclass models are shown in Fig. 10.

Processing time
Another important metric is the time processing. Processing times tell us how long we can expect it will take us to process a request. In training 04 features models, LSTM-PCA takes 1397.74 seconds and 876.68 seconds in multiclass and binary classification respectively. While, in LSTM-MI models, 1348.88 seconds and 732.59 seconds have been noticed in multiclass and binary classification respectively.
On the other hand, training 10 features models takes 3,499.26 seconds, 2,209.40 seconds for LSTM-PCA and 3,166.27 seconds, 2,109.26 seconds for LSTM-MI, in multiclass and binary classification respectively.
Finally, using all the dataset features without any dimensionality reduction, takes 11,629.69 seconds and 18,802.28 seconds for multiclass and binary classification. We can remark that adding more feature increases the processing time. The training stage takes more than 15 times longer using all features compared to the models using two features. Allowing an optimizing efficiency in the training process, which saves resources.

Discussion
To improve the analysis of the experimental results, we compared the results with those of previous studies (Table 6).
Comparing multiple algorithm in such way is for reference purpose only. Intrusion detection systems vary in their classification results, as a consequence it is complex for a model to perform well in every environment. Also, as a result of data preprocessing approaches, and interpretation process, the conclusions may differ. However, it can be observed that our model (LSTM-PCA), using two features, had the best performance. It achieved the most superior accuracy and sensitivity rates.
In addition, the author of [22] used all the KDD99 features, causing a longer time for training and testing, which make this model not suitable for a realtime environment.
On the other hand, our model uses only two features. Therefore, it is the easiest to preprocess and requiring less time to train. Thus, it makes our model well suitable for largescale and high-dimensional domains. The global performance of our proposed architecture is more efficient than the other deep-learning models. This again proves our architecture has a better generalization performance and robustness than the compared models.

Conclusion and future works
In this paper, through the research of intrusion detection systems and neural networks, we presented a comparison, between different models, based on Long-Short Term Memory (LSTM). Principal Component Analysis and Mutual Information was used as dimensionality reduction algorithms, Then, we analyze the performance and the time processing of these approaches.  The LSTM model is capable to learn successfully the features extracted from the dataset in the training period. This capability allows the models to distinguish effectively the normal traffic from the network attacks.
To implement the proposed architectures, we used Keras library and TensorFlow on Google Colab platform. Principal Component Analysis and Mutual Information were applied as dimensionality reduction algorithms. The intrusion detection was based on binary and multiclass classification.
We noticed that architectures based on PCA, especially with 02 components had the best outcomes, for both binary and multiclass classification, with 99.44% and 99.39% respectively.
The accuracy and sensitivity are greater than the compared approaches, which also demonstrates the efficacy of the proposed method. Moreover, using only 02 features make our model easier to train, taking less time and resources than the others. However, in this paper, we tested with only one type of LSTM. In future works, we will investigate multiple variants of LSTM (like Peephole LSTM, Multiplicative LSTM and Weighted LSTM), in addition to other neural network algorithms and other feature selection algorithms.