Highly accurate and efficient two phase-intrusion detection system (TP-IDS) using distributed processing of HADOOP and machine learning techniques

Network security and data security are among the biggest concerns nowadays. Every organization plans its future business processes based on past and day-to-day transactional data. This data may include consumers' confidential data, which needs to be kept secure. Also, when network connections are established with external communication devices or entities, care should be taken to authenticate them and to block unwanted access. This requires distinguishing malicious connection nodes from normal connection nodes. For that, the incoming network traffic is monitored continuously to recognize malicious connection requests, called intrusions; a monitoring system of this type is called an intrusion detection system (IDS). An IDS helps to protect the network and data from insecure and malicious network connections. Many such systems exist in real-time scenarios, but they have critical performance issues in terms of accuracy and efficiency. These issues are addressed in this research work on IDS using machine learning techniques and HDFS. The TP-IDS is designed in two phases to increase accuracy. In phase I of TP-IDS, Support Vector Machine (SVM) and k Nearest Neighbor (kNN) are used. In phase II of TP-IDS, Decision Tree (DT) and Naïve Bayes (NB) are used; phase II is the validation phase of the system for increasing accuracy. Both phases use the Hadoop distributed file system as the underlying data storage and processing architecture, which allows parallel processing to increase the speed of the system and hence achieve efficiency in TP-IDS.

network and block the same, if the malicious one. To examine such incoming network connections, monitoring systems are established. Such monitoring systems accept the incoming network traffic and classify each connection request as either normal or malicious. If the request is malicious, the monitoring system blocks it; otherwise, it allows the connection. Such malicious connections are called intruders, and monitoring systems that detect intruders are called intrusion detection systems (IDS). In [1], an IDS is defined as follows: "An intrusion-detection system can be described at a very macroscopic level as a detector that processes information coming from the system that is to be protected". This detector uses three kinds of information: long-term information related to the technique used to detect intrusions (a knowledge base of attacks, for example), configuration information about the current state of the system, and audit information describing the events that are happening on the system [1].

Types of intrusion detection system
IDS can be categorized in two different ways: based on location and based on detection mechanism. The types of IDS based on location are network based IDS (NIDS) and host based IDS (HIDS). A NIDS is placed at the network boundary to detect unauthorized intruder connection requests, whereas a HIDS is placed on a standalone system to monitor its traffic for internal and external attacks [2]. The types of IDS based on detection mechanism are misuse detection systems and anomaly detection systems. Misuse detection matches known attack patterns against the input traffic, whereas anomaly detection defines a boundary for normal behavior and flags anything outside that boundary as malicious or anomalous behavior. Many attempts have been made, resulting in different IDS solutions of these types. Every such IDS has its advantages and limitations, depending on the methods implemented as part of its architecture [3]. Among the given categories of IDS, anomaly detection has proven to be a valuable approach for providing effective network security as well as network management [4]. Over the years, many different models have been developed for anomaly detection, using different methods and techniques to achieve accurate results.
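The difference between the two detection mechanisms can be illustrated with a minimal Python sketch. It is not taken from the surveyed systems: the signature tuples, the single numeric feature and the three-standard-deviation boundary are assumptions chosen only for illustration.

```python
from statistics import mean, stdev

# Hypothetical signature database of known attack patterns (illustrative values only).
KNOWN_ATTACK_SIGNATURES = {("tcp", 23, "S0"), ("icmp", 0, "flood")}


def misuse_detect(connection):
    """Misuse detection: flag a connection only if it matches a known attack pattern."""
    return connection in KNOWN_ATTACK_SIGNATURES


def build_normal_profile(normal_values, width=3.0):
    """Anomaly detection, training step: derive a boundary of normal behavior
    as mean +/- width * standard deviation of one numeric feature (an assumption)."""
    mu, sigma = mean(normal_values), stdev(normal_values)
    return (mu - width * sigma, mu + width * sigma)


def anomaly_detect(value, profile):
    """Anomaly detection, detection step: anything outside the boundary is flagged."""
    low, high = profile
    return not (low <= value <= high)


# Profile learned from attack-free traffic, e.g. bytes per connection.
profile = build_normal_profile([100, 110, 95, 105, 90])

print(misuse_detect(("tcp", 23, "S0")))   # known pattern, flagged
print(misuse_detect(("tcp", 80, "SF")))   # unknown pattern, not flagged
print(anomaly_detect(5000, profile))      # far outside the normal boundary, flagged
print(anomaly_detect(102, profile))       # inside the normal boundary, not flagged
```

The sketch also shows the trade-off discussed in the text: the signature lookup can never flag a pattern that is missing from its database, while the boundary check can flag unseen behavior but depends entirely on how well the normal profile was learned.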

Analogy of machine learning and intrusion detection system
Machine learning is a process that allows automated decision making systems to be developed from an available collection of data, called a data set. In machine learning, the larger the amount of data used for training the system, the higher the accuracy achieved, provided that data of good quality is available. Anomaly based systems work the same way and basically have three phases: parameterization, training and, finally, detection [5]. Every machine learning model has an important training phase, which ensures that the model learns to make automated decisions as a solution to the problem. By taking advantage of this similarity between anomaly detection systems and machine learning models, a solution can be constructed effectively by developing a model that combines different machine learning techniques. The time requirement of a general network IDS is much higher than the service time in a distributed processing environment [6]. Data of any size can be processed in a distributed environment by using the HADOOP distributed file system. Such distributed processing is widely used and is very effective for speeding up the execution of machine learning algorithms. Intrusion detection is a time critical operation: it should identify intrusions or malicious activities within an acceptable time frame to avoid loss of data confidentiality, data damage, or damage to organizational data assets. Many IDS have been proposed and implemented, but every system has had limited success with respect to detection accuracy and efficiency, i.e., time requirement. Among the techniques used for IDS implementation, machine learning has gained more attention than other approaches. Machine learning enables higher detection accuracy, both with respect to the model used and with respect to time. Hence, it is preferred by most researchers.

Machine learning techniques
Machine learning consists of different techniques, such as supervised, unsupervised and semi-supervised machine learning. Every machine learning technique has two phases: training and testing. In the training phase, the machine learns from the inputs and outputs of the problem, whereas in the testing phase the machine is given only input data and generates the output based on the knowledge extracted during training. The more data is passed in the training phase, the higher the accuracy obtained in the testing phase. In supervised machine learning techniques, the input data and its corresponding output are passed during the training phase, and the machine has to find the mapping function, or transformation function, between the input and the output. Classification is an example of a supervised machine learning technique. In unsupervised machine learning techniques, only the input data is passed; the machine has to extract the knowledge about the mapping function between input and output and also has to generate the output. Clustering is an example of an unsupervised machine learning technique. In semi-supervised machine learning, the input is passed together with the output where it is available, and only the input is passed otherwise; the machine has to extract the mapping function and also the output where it is not available. Supervised machine learning methods are very popular for developing machine learning solutions. Popular supervised classification techniques include SVM, decision tree, naïve Bayes, etc.
Unsupervised machine learning techniques are also very popular and have proven effective, especially when the output data values are not available. Popular unsupervised clustering techniques include kNN clustering, k Means clustering, DBSCAN, etc. Among the mentioned machine learning algorithms, supervised learning algorithms give good accuracy for known attacks but fail to detect unknown attacks, which is the major area of concern. Unsupervised learning algorithms, in contrast, help to detect unknown attacks, but the accuracy of their results should be verified with other techniques. It is not possible to trust the results generated by these methods alone. Classification techniques see the inputs and their respective outputs during the training phase and extract only the mapping function, because of which only known input types are classified correctly. Clustering techniques get only input values during the training phase and extract the mapping function as well as the unlabeled output of each input value. Machine learning is often most useful when these techniques are used together to form a model as a solution to the problem. A combination of different algorithms from different machine learning techniques can be used to develop effective solutions. Many solutions to different problems have been proposed and developed by researchers using such combinations of machine learning techniques. Every technique, in supervised and unsupervised learning, has its advantages and disadvantages. In spite of this, these techniques can give excellent results if combined properly. Among the classification techniques, Support Vector Machine (SVM), Decision Tree (DT) and Naïve Bayes (NB) are a few of the popular techniques that help to generate better performing models. Likewise, among the clustering techniques, kNN clustering, k Means, DBSCAN and hierarchical clustering are a few of the popular techniques that help to deal with problems having unknown and non-categorized data. Intrusion detection is one of those problems in which known attack patterns can be detected by using machine learning classification techniques. The major concern, however, arises when the attack pattern is unknown. Hence, a model is needed that can also detect unknown attack patterns, and with high accuracy. So, it is important to develop a model with a combination of such techniques, which can give better accuracy for both known and unknown attack input patterns.

Big data environment-HADOOP for IDS to achieve timeliness and fault tolerance
It is important to get the IDS results within a minimal time frame, and using a combination of different techniques might slow down the execution of the system. If this combination of techniques is executed on an underlying data processing system in a distributed environment, results can be generated faster. So, distributed and parallel data processing frameworks like HADOOP, which provides distributed data storage and data processing, can be used to develop machine learning models with acceptable execution rates, especially in time sensitive applications like intrusion detection systems. One of the important components of HADOOP is HDFS, i.e., the Hadoop Distributed File System, which is the distributed storage layer underlying the parallel processing. HDFS enables distributed data storage and processing, making HADOOP a fault tolerant big data environment. Hence, HADOOP is used in the implementation of the IDS design to obtain the inherent advantages of HADOOP, such as timeliness and fault tolerance.
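The idea of splitting the data into blocks, processing the blocks independently and merging the partial results can be sketched in plain Python as follows. This is only an illustration of the processing pattern and does not use any HADOOP or HDFS API; the function names, the block size and the byte-count rule inside `classify` are hypothetical. A thread pool is used merely to keep the sketch self-contained; it shows the structure, not the speed-up of a real cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_blocks(records, block_size):
    """Split the input into fixed-size blocks, loosely analogous to data blocks
    spread over the nodes of a cluster."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]


def classify(record):
    """Placeholder classifier (assumption): a large byte count is treated as suspicious."""
    return "malicious" if record["bytes"] > 10_000 else "normal"


def map_block(block):
    """Map step: classify every record of one block and count the labels."""
    return Counter(classify(record) for record in block)


def reduce_counts(partial_counts):
    """Reduce step: merge the per-block counts into one overall result."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def process(records, block_size=2, workers=4):
    """Run the map step on all blocks concurrently, then reduce the partial results."""
    blocks = split_into_blocks(records, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(map_block, blocks))
    return reduce_counts(partial_counts)


records = [{"bytes": b} for b in (120, 50_000, 800, 20_000, 300)]
print(process(records))
```

Because each block is processed independently, a failed block could in principle be reprocessed on its own, which is the intuition behind the fault tolerance attributed to the distributed environment.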

Literature survey
A survey of different existing IDS implementations was carried out before finalizing the solution. During the survey, the advantages and disadvantages of these systems were observed. The survey covers a few successfully implemented IDS solutions and is as follows:

Intrusion detection system using support vector machine (SVM)
In [7], Halimaa et al. have presented an intrusion detection system model using a machine learning approach. Two machine learning classifiers are used: SVM and naïve Bayes. It is concluded that SVM provides better accuracy than naïve Bayes. Feature selection is used to select useful and relevant features for model development. These accuracy results are valid only for known attack detection and cannot be carried over to unknown attack detection systems. Timeliness is also a major area not addressed in this work.
In [8], Yang et al. have presented work on the support vector machine (SVM) and its performance. It is observed from the experiments that SVM is a fast machine learning technique. The analysis is done using three indicators: sensitivity, specificity and time consumption.
In an SVM performance evaluation using different data sets and different kernels, such as linear, polynomial, sigmoid and RBF, the RBF kernel gives good performance [9]. The results observed are also encouraging as compared to other techniques. SVM provides better results for data sets of limited size, but its time complexity is higher for huge data sizes [10]. Also, SVM is not well suited to imbalanced data and data with unknown labels [10]. Hence, models cannot be developed with SVM as a standalone method.
In [11], Agarap et al. have designed an intrusion detection system using a neural network architecture with Gated Recurrent Unit (GRU) and Support Vector Machine (SVM). In the presented model, SVM is used at the output layer of the GRU-RNN; SVM is faster in terms of time complexity. Training and testing are done only with binary classification for SVM. Results should also be generated for multiclass classification, which is an important point. Also, classification using SVM gives better results for known attack traffic and should be verified with unknown attack traffic.
In [12], Jha et al. have designed an intrusion detection system using Support Vector Machine (SVM). The data set used for model training and testing is the NSL KDD dataset. With feature selection, the best features are selected based on k Means algorithm criteria. With the reduction in the number of features, the time required for execution of the SVM is also reduced. The disadvantage is that SVM is a classification technique, which requires prior knowledge of events and data. Hence, in intrusion detection systems, an SVM model is useful only for known attack detection and fails to detect unknown attack attempts.
SVM gives good performance when an effective feature selection algorithm is used. In [13], for a cloud intrusion detection system (CIDS), SVM is used to build a model with correlation based feature selection (COFS), which helped to achieve better results.
In [14], Teng presented the Collaborative and Adaptive Intrusion Detection Model (CAIDM), which uses the machine learning classification techniques SVM and decision tree. The KDDCUP99 dataset is used for training and testing of the CAIDM model, and it is concluded that this model, using SVM and DT together, generates better results than a single SVM for intrusion detection. An important point to note is that both methods in CAIDM are classification methods, which are useful for attacks known in advance and not for identifying new attack types. Also, timeliness is not considered for the model execution.

Intrusion detection system using k nearest neighbor (kNN)
In [15], Cover et al. have presented the nearest neighbor pattern classification technique known as k-NN classification. A data point is classified based on its distance to the labeled data points of the different classes: it is assigned to the class of the data points to which its distance is minimum. The work states that the error is minimal for 1-NN and that the same can be inferred for k-NN classification with minimized error. It is a fast pattern classification technique.
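The classification rule described above can be sketched in a few lines of Python. This is a generic illustration of k-NN with Euclidean distance and majority voting, not the implementation of any surveyed work; the two-dimensional feature vectors and the labels are made-up values.

```python
import math
from collections import Counter


def knn_classify(query, training_points, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    training_points is a list of (feature_vector, label) pairs.
    An odd k avoids ties in the two-class case.
    """
    by_distance = sorted(training_points, key=lambda point: math.dist(query, point[0]))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up, already scaled feature vectors for illustration.
training = [
    ((0.10, 0.20), "normal"),
    ((0.20, 0.10), "normal"),
    ((0.15, 0.25), "normal"),
    ((0.90, 0.80), "attack"),
    ((0.80, 0.90), "attack"),
    ((0.85, 0.95), "attack"),
]

print(knn_classify((0.12, 0.18), training, k=1))  # 1-NN: label of the single closest point
print(knn_classify((0.82, 0.88), training, k=3))  # 3-NN: majority label of three neighbors
```

With k = 1 the rule reduces to copying the label of the single closest point, which is the 1-NN case mentioned in the text.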
In [16], Imandoust et al. have proposed a study of kNN for the prediction of economic events. The same method is used for regression by assigning to the object the average of the property values of all its neighbors. Hence, kNN is used as an efficient method for the prediction of economic events. kNN is a fast algorithm and is very useful even in the absence of prior knowledge of the events.
In [17], Ali et al. have presented a detailed study of the performance of the k nearest neighbor algorithm on heterogeneous data sets. The kNN performance is evaluated using the Euclidean and Manhattan distance formulas. It is found that the Euclidean distance formula does not provide better kNN results. Also, for heterogeneous data sets, not much difference is observed in the performance of the kNN algorithm.
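For reference, the two distance formulas compared in that study can be written as a short Python sketch. The function names and the sample points are chosen here for illustration only.

```python
def euclidean(a, b):
    """Euclidean distance: square root of the sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def manhattan(a, b):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


p, q = (0, 0), (3, 4)
print(euclidean(p, q))  # 5.0
print(manhattan(p, q))  # 7
```

Either function can serve as the distance measure of a kNN classifier; the choice changes which neighbors count as nearest and therefore may change the predicted class.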
In [18], Benaddi et al. have proposed a robust intrusion detection system model using Principal Component Analysis (PCA), fuzzy clustering and kNN. The NSL KDD data set is used for preparing the IDS model. It is concluded that it is important to reduce the set of features in the data set to achieve the desired performance of the IDS. With kNN, the model was able to classify inputs of different attack types effectively. The problem is that the accuracy of kNN should be verified properly. Also, the time complexity of the system is an area of concern in the given IDS model.
In [19], Li et al. have presented an intrusion detection system based on binary classification and kNN. The model is divided into two steps. The first step uses binary classification to detect abnormal connections, and in the second step kNN is used for detecting unusual and new input types. It is concluded that the IDS model with kNN is more accurate than binary classification alone. However, the accuracy is not verified after step 2, as is required for kNN, and timeliness is not tested in the given results, which cannot be ignored in an IDS model.
In [20], Li et al. have presented an intrusion detection model for wireless sensor networks using kNN classification. The attack types targeted for detection are flooding attacks. The model is shown to be effective in detecting flooding attacks using kNN classification. The problem is that kNN alone is not effective when different or new attack types are to be detected.

Intrusion detection system using decision tree
In [21], Yihunie et al. have presented work on anomaly intrusion detection using machine learning techniques. Five different classification techniques are compared with each other as part of the model: SGD, logistic regression, random forest, SVM and a sequential model. The data set used for training and testing of the models is the NSL KDD dataset. The results show that random forest provides better accuracy than the other algorithms. It is important to note that this accuracy is valid only for the detection of known attacks. The work does not cover unknown attack detection, which is an important aspect to be considered. Also, the work does not discuss the speed of the system, the timeliness of detection or the fault tolerance of the IDS. Hence, this IDS needs improvement with respect to these properties, and the model should be enhanced with better techniques.
In [22], Chang et al. have presented network intrusion detection using the support vector machine and random forest. Random forest is used for feature selection to improve the accuracy and the time required for execution. Using random forest, 14 of the 41 features are selected, which gave better accuracy than using all 41 attributes. These results are for known types of input, and hence the accuracy cannot be generalized to new attack types.
In [23], Kumar et al. have presented an intrusion detection system using the decision tree. The results obtained show good figures. The decision tree uses preexisting knowledge for model building and gives better results for known input types. For unknown input types, it cannot perform well and should be used with other methods to achieve better results.

Intrusion detection system using naïve Bayes
In [24], Panda et al. have presented an intrusion detection system using the naïve Bayes technique. The results obtained by the model are better than those of neural network architectures. The model is built with two layers, and the distance between the information nodes is minimized. The results also show that the naïve Bayes approach gives good results in less time and at low cost. The drawback of the system is that it generates more false positives than other systems. Hence, it can be concluded that naïve Bayes should be used along with other techniques for better results and a reduction in false positives.
In [25], Sharmila et al. have presented an intrusion detection system using a PCA based naïve Bayes algorithm. The results show that better accuracy is achieved with the PCA based naïve Bayes algorithm than with the traditional naïve Bayes algorithm. This approach also helps to provide useful results even in the presence of missing values in the data sets. However, with increasing data size, the accuracy decreases and the system also slows down. Hence, naïve Bayes can be used with other techniques to achieve better results.

Intrusion detection system using artificial intelligence and deep learning techniques
In [26], Rassam et al. have proposed an IDS solution based on smart and generic rule construction. Smart rules are rules in which a single rule takes the place of 2-3 rules and can detect multiple attacks. Hence, they are also called generic rules. In the said work, the first step is data preprocessing, followed by smart and generic rule construction, after which the constructed rules are learned. The advantage of the system is that, because of the smart rules, a smaller number of rules helps to detect the maximum number of attacks, which results in lower power consumption. However, this system is proposed under the assumption that most attacks come from the internal systems of the network. So, it cannot be considered for detecting incoming attacks from outside, as the size of the network is large in that case.
In [27], Amudha et al. have presented an intrusion detection system using hybrid swarm intelligence. The system is composed of two intelligence algorithms, Particle Swarm Optimization (PSO) and Artificial Bee Colony (ABC). The data set used for the said work is KDDCUP 99. Initially, data preprocessing techniques are used for better results, and feature selection is done using SFSM (Single Feature Selection Method) and RFSM (Random Feature Selection Method). PSO and ABC are applied as intelligence algorithms and a hybrid model is developed, which gives 99.5% accuracy for the detection of known attacks. The problem lies in detecting unknown attacks, which is not tested or presented by the researchers in this article. Also, the timeliness of attack detection is not considered in the implementation of this model. Hence, the model cannot deal with unknown attack identification, and the time of detection is also a concern that needs to be addressed.
In [28], Bahlali et al. have carried out research on anomaly based intrusion detection using machine learning techniques. The work presents the implementation and comparison of three machine learning techniques, namely logistic regression, decision tree and random forest, along with ANN as a deep learning technique. The UNSW-NB15 dataset is used, which suffers from issues such as imbalanced classes. Still, the accuracy claimed in the results for these classifiers is good. Among the algorithms used, ANN is found to be the best for the accuracy of the IDS model. The problem with the IDS model is that performance in terms of timeliness and speed is not considered in the work. Also, the results cannot genuinely be used to compare the models, as the dataset is old and does not reflect new attacks.
In [29], Zamani has provided a detailed study of the design of intrusion detection systems using machine learning techniques. The machine learning techniques are divided into two parts: Artificial Intelligence and Computational Intelligence. These methods share many features. It is claimed in the study that an effective intrusion detection system can be designed using a machine learning approach, which allows the design of a system that is accurate, fault tolerant, efficient, etc.
In [30], Dr. Malliga et al. have presented a network intrusion detection system for IoT systems using machine learning and deep learning algorithms. Naïve Bayes is used as the machine learning technique, in comparison with ANN and RNN as deep learning techniques. It is concluded that the accuracy achieved using deep learning techniques like ANN and RNN is higher than the accuracy achieved using the machine learning approach of naïve Bayes. The time requirement is also mentioned. An important issue, however, is that deep learning requires a huge amount of data and has a higher training time complexity than machine learning techniques. Also, ANN and RNN are black box techniques: the basis for decision making in the model is not transparent. Finally, the model presented is specific to IoT and is a sequential model.

Survey studies in intrusion detection system using machine learning and artificial intelligence (deep learning)
In an intrusion detection system, the major areas of concern are the quality attributes, which cannot be compromised. The important measures are accuracy, performance, completeness, fault tolerance and timeliness [31]. These properties should be considered and built in while developing a better intrusion detection system.
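Some of these measures can be computed directly from labeled test results. The following Python sketch shows one common way to do this; the labels are made-up values, and treating the detection rate as an indicator of completeness is an assumption made here for illustration, not a definition taken from [31].

```python
def confusion_counts(y_true, y_pred, positive="attack"):
    """Count true/false positives and negatives, with `positive` as the attack class."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(y_true, y_pred):
        if predicted == positive and truth == positive:
            tp += 1
        elif predicted == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all connections that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def detection_rate(tp, fn):
    """Fraction of actual attacks that were detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def false_positive_rate(fp, tn):
    """Fraction of normal connections that were wrongly flagged as attacks."""
    return fp / (fp + tn) if (fp + tn) else 0.0


# Made-up ground truth and predictions for eight connections.
y_true = ["attack", "attack", "normal", "normal", "normal", "attack", "normal", "normal"]
y_pred = ["attack", "normal", "normal", "attack", "normal", "attack", "normal", "normal"]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, fp, tn, fn))      # 0.75
print(detection_rate(tp, fn))        # 2 of 3 attacks detected
print(false_positive_rate(fp, tn))   # 1 of 5 normal connections flagged
```

Reporting the false positive rate alongside accuracy matters because a model can reach high accuracy on imbalanced traffic while still raising many false alarms.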
In [32], Khan et al. have carried out a survey of the performance of different decision tree algorithms in data mining. The survey highlights the importance of decision tree algorithms. The performance of the ID3, C4.5 and CART decision tree algorithms is examined in the article. These algorithms are used to construct the decision tree in decision tree based classification problems. In this process, the speed of decision tree construction is the critical parameter. The performance of the algorithms in terms of speed is also presented in the said survey. Accordingly, C4.5 is the fastest of the given algorithms. This survey is useful for researchers intending to deal with classification problems using decision trees.
In [33], Khraisat et al. have presented a survey of intrusion detection systems. The paper outlines intrusion detection systems with their advantages and disadvantages, data sets, and different challenges in IDS model development. IDS for zero-day attack identification are reviewed in the survey. It is found that IDS give poor accuracy for the detection of new attacks. The survey has also examined the data sets and their effectiveness. The datasets used by different researchers for generating test results for IDS models do not contain new attacks. These datasets, e.g., DARPA and KDD99, were developed in 1999 and are very old for testing new IDS developed in recent years. Hence, the use of such old datasets leads to inaccurate claims about the effectiveness of the IDS, and the results cannot be considered genuinely accurate.
In [34], Saranya et al. have presented a survey on the performance analysis of machine learning techniques in intrusion detection systems. It summarizes various machine learning techniques and algorithms. It also explains intrusion detection systems in different application areas and their implementation using machine learning algorithms. The survey reports different IDS implementations using machine learning techniques like naïve Bayes, decision tree, SVM, kNN, k means, deep learning, ensemble learning, ANN, DBN, etc. The survey concludes that machine learning techniques are useful for developing accurate and effective intrusion detection system models as compared to other techniques.
In [35], Maniriho et al. have presented a survey of machine learning techniques for anomaly based intrusion detection systems. A generic IDS model using machine learning techniques is explained, followed by implementations of IDS using different machine learning techniques with their results. The results are generated using the WEKA tool. A comparison is given among the random forest, decision stump, naïve Bayes and SGD algorithms. The conclusion is that random forest generates the best results for intrusion detection among the techniques used in the review. It is important to note that the methods and data set used are suitable for known attack detection; unknown attack detection is important but not considered. Hence, the results cannot be generalized to new attack types that are not present in the data set. Also, this survey does not outline the time performance of the algorithms in IDS, which is one of the important properties to be considered in intrusion detection.
In [36], Haq et al. have presented a useful survey of machine learning techniques for intrusion detection systems. The study provides statistics on the number of IDS design attempts using a machine learning approach. A few IDS are designed using a standalone machine learning technique, a few using hybrid methods and a few using an ensemble approach. It also states that the dataset used plays an important role in generating results. Irrelevant and redundant features should be removed properly, and the best feature selection algorithm should be used to avoid slowing down the system. Finally, it concludes that hybrid implementation approaches give better results than approaches based on a standalone technique.
In [37], Labonne et al. have presented a survey study of intrusion detection systems using machine learning techniques. It summarizes the work that has been carried out over the years by different researchers. The study reveals that a better solution is possible with machine learning in intrusion detection. It is also concluded that the NSL KDD dataset is better than the KDD99 data set and is used by many researchers.
In [38], Mazini et al. have proposed anomaly based intrusion detection using a machine learning approach. The methods used for the design of this intrusion detection system are the Artificial Bee Colony (ABC) and AdaBoost algorithms. The approach is executed sequentially. The dataset used for testing the system is the NSL KDD dataset. The results obtained show better numbers than conventional machine learning methods. However, the feature selection used in this approach removes many features based on inappropriate data values of the features, and the importance of a feature is not the criterion. Also, only the classification of known attack samples is given, hence the accuracy cannot be generalized to unknown malicious behaviors. The approach is implemented in a simulation environment, whose assumptions may not hold in a real-time environment. The efficiency or time complexities are given for sequential execution, which is not suitable for achieving timeliness in current systems, as the volume of data and the number of connection requests per second are huge.
In [39], Amouri et al. have presented an intrusion detection system using machine learning techniques for the mobile Internet of Things. The model is designed using regression techniques, and feature selection is used for removing redundant and irrelevant features from the dataset. The results are generated in a simulation based environment. The detection accuracy observed is claimed to be good, but the major concern is the false positive rate (FPR): it could not be kept at a minimum and it varied to a concerning degree. This model is designed for IoT systems with a limited scope. Hence, it makes several assumptions within the specified scope that cannot be considered true in other application scenarios.
In [40], Aslam et al. have presented a hybrid approach for a network intrusion detection system using machine learning and a rule based system. Machine learning techniques like decision tree, Sequential Minimal Optimization and simple logistic are used in the hybrid approach together with the rule based system. The accuracy observed is for known attacks from the dataset and hence cannot be generalized to unknown attack identification. Also, the model does not consider important properties of the IDS such as speed, timeliness and fault tolerance.

Definitions and techniques
From the survey, it is found that using a single method in an intrusion detection system is not sufficient for achieving better accuracy in detecting attacks. So, we design the intrusion detection system model using a combination of different machine learning techniques and a distributed processing architecture, as follows: 1. Support vector machine (SVM); 2. K nearest neighbor (kNN); 3. Decision tree (DT); 4. Naïve Bayes (NB); and 5. HDFS.
1. Support vector machine: Support vector machine, abbreviated as SVM, is a supervised machine learning technique used for classification of data. In a support vector machine, a separating hyperplane is produced which maximizes the margin between the classes [41]. The boundary points of the classes that lie on this margin are called support vectors. The support vectors and the hyperplane help to classify the data points into separate classes. The support vectors and the hyperplane separation are shown in Fig. 1. The SVM kernel is the important part of the technique, as it drives the classification of the data values into output classes separated by the hyperplane. The kernel function maps the training data into a higher dimensional space, so that a decision surface which is nonlinear in the original space becomes linear in the transformed space.
In an intrusion detection system, the data can be classified into two different classes by the hyperplane. The first class is the set of points with normal behavior, i.e., normal connection requests from genuine nodes, which can be categorized as normal input traffic. The second class is the set of points with unwanted behavior, i.e., malicious connection requests from malicious nodes, which can be categorized as intruder nodes. SVM is therefore an effective machine learning technique for achieving both accuracy and timeliness in the classification.
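The two-class separation described above can be illustrated with a minimal sketch. It assumes a linear kernel trained by subgradient descent on the hinge loss, which is only one common way to fit an SVM and is not taken from the paper's R implementation. The function names, the two toy features (connection duration and failed logins) and the hyperparameters are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(points, labels, epochs=300, lr=0.01, reg=0.01):
    """Fit w and b for a linear SVM; labels are -1 (normal) or +1 (anomaly).

    Assumption: plain subgradient descent on the regularised hinge loss.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            if y * (dot(w, x) + b) < 1:
                # Point is inside the margin or misclassified: move the hyperplane.
                w = [wi + lr * (y * xi - reg * wi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # Point is safely classified: only apply the regularisation shrink.
                w = [wi - lr * reg * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Side of the hyperplane decides the class."""
    return "anomaly" if dot(w, x) + b >= 0 else "normal"

# Made-up records: (duration, failed_logins).
points = [(0.1, 0.0), (0.2, 0.0), (0.3, 1.0), (5.0, 4.0), (6.0, 5.0), (5.5, 6.0)]
labels = [-1, -1, -1, 1, 1, 1]
w, b = train_linear_svm(points, labels)
print(predict(w, b, (0.2, 0.5)), predict(w, b, (5.2, 4.5)))
```

A nonlinear kernel would replace the inner product `dot` with a kernel function; the sketch keeps the linear case for brevity.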
2. K nearest neighbor: k nearest neighbor is a classification technique used to classify data points based on the distance between them. The distance formulae that can be used include the Euclidean distance (most commonly used) [42] and the Manhattan distance. In many cases it is important to carry out data preprocessing before training the kNN model [42]. In the next step, the distance between the data point to be classified and each training data point is calculated. After finding the distance to all data points, the distance array is sorted and the first k distance values are selected. Finally, the data point is assigned to the class that has the most data points among these first k distance values, as these are the closest points, with minimum distance from the data point to be classified [42].
Fig. 1 Support vector machine
In an intrusion detection system, the behavior of every incoming input traffic record is matched against the available data points of normal behavior and malicious behavior, and the record is assigned to the class where the similarity is higher, i.e., where more points are closer to the new data point. The kNN algorithm is suitable for intrusion detection because of the following important advantages: a. Accuracy; b. Low time complexity/calculation time; c. Easy interpretation of the output; and d. Predictive power of kNN.
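The kNN steps above (compute distances, sort, take the first k, vote) can be sketched as follows. This is an illustrative Python sketch rather than the paper's R code; the function names and the toy records are hypothetical, and the Euclidean distance is used as the text suggests.

```python
import math
from collections import Counter

def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_classify(train, query, k=3):
    """Classify `query` by majority vote among its k nearest training points.

    `train` is a list of (feature_vector, label) pairs.
    """
    # Step 1: distance from the query to every training point.
    distances = [(euclidean(features, query), label) for features, label in train]
    # Step 2: sort by distance and keep the k closest.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Step 3: majority vote over the labels of those k neighbours.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up records: (duration, failed_logins) -> label.
train = [
    ((0.1, 0.0), "normal"),
    ((0.2, 0.0), "normal"),
    ((0.3, 1.0), "normal"),
    ((5.0, 4.0), "anomaly"),
    ((6.0, 5.0), "anomaly"),
    ((5.5, 6.0), "anomaly"),
]
print(knn_classify(train, (0.2, 0.5)))
print(knn_classify(train, (5.2, 4.5)))
```

Because distances are sensitive to feature scale, normalising the features first is the kind of preprocessing the text refers to; the sketch omits it.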

3. Decision tree: Decision tree is a classification technique which, unlike SVM, supports multiclass classification. Decision trees are also called classification trees. Classification trees are used to classify data points into the classes of the response variable [43]. Decision trees are built from tree structured nodes: the internal nodes are conditional attribute nodes, based on which the path to a class is selected, and the leaf nodes are the class nodes, i.e., the classification or categorical nodes for the data points. An example of a host based IDS decision tree is shown in Fig. 2 [44].
Herein, the decision tree is very helpful for the intrusion detection system, as the attack type can be identified from the categorical or leaf nodes of the decision tree. Whenever an unknown attack input is encountered, a new leaf node is created with its path of conditional attributes, so the decision tree is continuously updated for new attacks as well.
4. Naïve Bayes: Naïve Bayes is a classifier which provides better accuracy with lower computational and storage requirements for classification of data [45]. Naïve Bayes classification is useful when the overall probability is the product of the independent probabilities of the variables. In naïve Bayes, it is assumed that the features are independent of each other and that they contribute to the classification independently. Naïve Bayes is effective when large data sets have to be handled, and it is also easy to implement.
Figure 3 shows the probability expression for naïve Bayes classification. Two probabilities are required to calculate the final probability of an object being classified into a specific class: the prior probability and the likelihood. The prior probability, likelihood and posterior probability are shown in Fig. 3. The posterior probability is the probability of the object being classified into that class. In this technique, the posterior probability is calculated for every class for the given object, and the object is assigned to the class with the maximum posterior probability.
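The posterior calculation can be sketched for categorical features as below. This is a minimal illustration under stated assumptions: the add-one (Laplace) smoothing, the function names and the toy (protocol, flag) records are not from the paper.

```python
from collections import Counter, defaultdict

def train_nb(records):
    """Collect the counts needed for prediction.

    `records` is a list of (feature_tuple, label) pairs.
    """
    class_counts = Counter(label for _, label in records)
    value_counts = defaultdict(Counter)  # (label, feature index) -> value counts
    vocab = defaultdict(set)             # feature index -> distinct values seen
    for features, label in records:
        for i, value in enumerate(features):
            value_counts[(label, i)][value] += 1
            vocab[i].add(value)
    return class_counts, value_counts, vocab

def posterior(model, features):
    """Return P(class | features) for every class, assuming independent features."""
    class_counts, value_counts, vocab = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(features):
            # likelihood P(value | class), with add-one smoothing (an assumption)
            score *= (value_counts[(label, i)][value] + 1) / (count + len(vocab[i]))
        scores[label] = score
    evidence = sum(scores.values())  # normalising constant
    return {label: s / evidence for label, s in scores.items()}

# Made-up records: (protocol, flag) -> label.
records = [
    (("tcp", "SF"), "normal"),
    (("tcp", "SF"), "normal"),
    (("udp", "SF"), "normal"),
    (("tcp", "S0"), "anomaly"),
    (("tcp", "S0"), "anomaly"),
    (("icmp", "S0"), "anomaly"),
]
model = train_nb(records)
post = posterior(model, ("tcp", "S0"))
print(max(post, key=post.get))
```

The record is assigned to the class with the largest posterior, which matches the decision rule stated in the text.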
5. HDFS: HDFS, the Hadoop distributed file system, is a part of Hadoop. It is very popular for two important reasons: fault tolerance and its distributed, parallel data processing architecture [46]. HDFS consists of two important components, the name node and the data nodes. The name node acts as a master: it keeps the metadata of the data nodes and manages a large number of data nodes, typically thousands. Several such name nodes can be created in Hadoop. With this architecture, data can be divided into multiple blocks and stored at multiple data nodes. This distributed data storage enables parallel processing of the data, without any restriction on the size of the data being processed. Hadoop is a massively parallel data processing architecture. The simple HDFS architecture for TP-IDS is shown in Fig. 4 [46]. In an intrusion detection system, fault tolerance and parallel processing to reduce the execution time are very important features. They can be achieved easily with Hadoop [47], by using HDFS as the underlying data storage and processing architecture for all the machine learning techniques.
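The idea of dividing data into blocks and processing the blocks in parallel can be illustrated with a small local analogy. The sketch below does not use Hadoop or any HDFS API: Python worker threads merely stand in for data nodes, and the block size, the record layout and the per-block task are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(records, block_size):
    """Divide the record list into fixed-size blocks (the last may be shorter)."""
    return [records[i:i + block_size] for i in range(0, len(records), block_size)]

def count_suspicious(block):
    """Per-block task: count records with more than 3 failed logins (toy rule)."""
    return sum(1 for record in block if record["failed_logins"] > 3)

def process_in_parallel(records, block_size, workers=4):
    """Process every block with a pool of workers and merge the partial results."""
    blocks = split_into_blocks(records, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_suspicious, blocks))
    return sum(partial_counts)

# Made-up traffic records.
records = [{"failed_logins": n % 6} for n in range(20)]
print(len(split_into_blocks(records, 8)), process_in_parallel(records, 8))
```

In a real cluster the blocks live on different machines and are replicated for fault tolerance, which this single-process analogy does not model.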

Objectives
From the literature study, a few important issues are uncovered, for which a real time solution can be provided with different machine learning techniques. To develop the solution, the following objectives are defined and achieved:
1. To design an appropriate intrusion detection system architecture with the help of machine learning techniques.
2. To develop a two phase intrusion detection system for increasing the accuracy of intrusion detection.
3. To implement an intrusion detection model which can detect malicious activities in a timely manner.
4. To build a fault tolerant intrusion detection system by using machine learning techniques and HDFS.
The objectives are defined by considering the current security requirements of the organizational networks.The effective and efficient intrusion detection system implementation is the overall objective of this research work.

Solution methodology
The model for the anomaly intrusion detection system is presented in this section. The model is designed using machine learning techniques. From the survey, it is observed that an intrusion detection system model designed using any single machine learning technique does not provide all the desired features together. With a single technique in the IDS model, the disadvantages of that technique limit the performance potential of the system [48]. Hence, it is necessary to bring together multiple machine learning techniques which are complementary to each other, and to design an intrusion detection system model which provides all the required features as well as performance, without limiting its performance potential.
Fig. 4 HDFS architecture for TP-IDS
So, as per the research gap, four different machine learning techniques, namely Support Vector Machine (SVM), k Nearest Neighbor (kNN), Decision Tree (DT) and Naïve Bayes (NB), are found useful and important as parts of an accurate and efficient intrusion detection system, along with a massively parallel distributed data storage and processing system such as HDFS. The architecture of the system is shown in Fig. 5.
The architecture consists of two phases. The first phase identifies the type of input, i.e., whether it is a normal input or an anomaly input. The methods used in Phase I are SVM and kNN. SVM is well suited to known attack detection and has high accuracy rates. kNN behaves like a clustering technique: even for unknown or new input, it can provide fast and approximately accurate anomaly detection. Hence, even when SVM cannot identify an unknown input sample correctly, kNN gives an indication of the input type based on the similarity criterion, assigning the input to the class that contributes the most minimum distance points. If either or both of SVM and kNN detect the incoming input traffic as an anomaly, then access is blocked and the input traffic information is passed to Phase II of the TP-IDS. Otherwise, if both techniques identify the input traffic as a normal connection request, the connection request is accepted and access is given. In this case, Phase II is not executed.
In Phase II of the TP-IDS, the machine learning techniques used are decision tree and naïve Bayes. Phase II is used as the validation phase of TP-IDS. Here again, if either or both of decision tree and naïve Bayes detect the input traffic as an anomaly, the final output is anomaly, together with identification of the attack type, and network access is blocked. Otherwise, if both methods of Phase II identify the input traffic as normal input, the network connection request is accepted and access is given.
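The decision flow of the two phases can be summarised in a short sketch. The four classifiers are passed in as hypothetical callables that return the string "normal" or an attack label. Taking the attack type from the first Phase II classifier that reports one is an assumption, since the text does not say how a disagreement on the attack type is resolved.

```python
def tp_ids_decide(record, svm, knn, decision_tree, naive_bayes):
    """Return (decision, attack_type) for one traffic record.

    Each classifier is a callable returning "normal" or an attack label.
    """
    # Phase I: proceed to validation if either SVM or kNN flags an anomaly.
    phase1 = [svm(record), knn(record)]
    if all(label == "normal" for label in phase1):
        return ("allow", None)  # Phase II is not executed

    # Phase II (validation): either DT or NB flagging an anomaly confirms it.
    phase2 = [decision_tree(record), naive_bayes(record)]
    attack_labels = [label for label in phase2 if label != "normal"]
    if attack_labels:
        # Assumption: report the first attack label found, decision tree first.
        return ("block", attack_labels[0])
    return ("allow", None)

# Stub classifiers for illustration only.
says_normal = lambda record: "normal"
says_dos = lambda record: "dos"
says_probe = lambda record: "probe"

print(tp_ids_decide({}, says_normal, says_normal, says_dos, says_dos))
print(tp_ids_decide({}, says_normal, says_dos, says_normal, says_probe))
print(tp_ids_decide({}, says_dos, says_normal, says_normal, says_normal))
```

The third call shows the validation effect: a Phase I alarm that neither Phase II classifier confirms ends with the connection being allowed.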
The Phase I and Phase II techniques are executed on the underlying Hadoop distributed file system, which is distributed and therefore parallel, faster and fault tolerant. Also, with the Phase II validation, we increase the accuracy and decrease the False Positive Rate (FPR) as well as the False Negative Rate (FNR). By using HDFS, the two techniques of each phase are executed in parallel, with parallel input processing within each technique. This saves processing time and detection time in the TP-IDS model. This architecture helps to achieve the desired performance of the TP-IDS.

Dataset and data pre processing
The structure of the data set and the quality of its data are important for achieving accuracy with any machine learning model. With proper data preprocessing, machine learning techniques execute faster on the NSL KDD dataset than other techniques [49]. Machine learning has been reported as the solution for providing efficient results in minimum time [50]. A quality dataset is one in which noise is not present in the data values. Apart from this, it is also important that only relevant attributes or features are selected to train the model. When the irrelevant and redundant attributes are removed, the remaining set of relevant features helps to increase the accuracy of the model.
In the TP-IDS intrusion detection model, the data set used is the NSL KDD data set. Data sets prior to NSL KDD, like KDD99, are unbalanced, which affects the quality of the output during the testing phase of the machine learning model. Hence, NSL KDD is the most popular data set nowadays, and it is used by many researchers working in the area of intrusion detection systems. The NSL KDD data set consists of 43 attributes in total, one of which is the attack type. It is the 42nd feature of the data set and the output variable for TP-IDS.
One of the important steps in model training is data preprocessing. In the data preprocessing task, a feature selection method is used. Feature selection is important for removing irrelevant features from the dataset, as these features affect the accuracy of the machine learning model [51]. With feature selection, the accuracies of different machine learning models vary significantly [52], hence choosing the correct feature selection algorithm is important. In a big data environment, feature selection also produces results with much lower execution time [53]. There are many different feature selection techniques, like the Single feature selection technique (SFST), the Random feature selection technique (RFST), the Correlation based feature selection technique (CFST) and Principal Component Analysis, which is slower and somewhat inconsistent [54] as compared to CFST. Among these techniques, the Correlation based feature selection technique is found to be effective and useful in the application scenario of TP-IDS. Hence, Correlation based feature selection is used for removing the redundant and irrelevant features from the dataset. SVM with feature selection generates high accuracy [55]. In correlation based feature selection, a heuristic evaluation function is used to find the correlation [47]. The equation is as follows:
A feature is considered irrelevant when its correlation value is less than the threshold. When the value is more than the threshold, it is a relevant feature and can be considered in the model training.
With CFST, 29 features are selected as relevant, together with the attack type feature of NSL KDD, and these are used for training and testing the TP-IDS model. The selected features are shown in Fig. 6.
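The threshold rule for relevant features can be illustrated with a minimal sketch. It assumes the absolute Pearson correlation between each numeric feature and a 0/1 class label as the correlation value, which is a stand-in and not the heuristic function of [47]. The threshold, the function names and the toy columns are hypothetical. Correlation based methods commonly also account for redundancy between features, which this sketch does not attempt.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no information
    return cov / math.sqrt(var_x * var_y)

def select_features(columns, target, threshold):
    """Keep the features whose |correlation with the target| reaches the threshold."""
    return [name for name, values in columns.items()
            if abs(pearson(values, target)) >= threshold]

# Made-up data: 0 = normal, 1 = attack.
target = [0, 0, 0, 1, 1, 1]
columns = {
    "failed_logins": [0, 0, 1, 4, 5, 6],  # rises with the attack label
    "noise": [1, 2, 3, 1, 2, 3],          # unrelated to the label
    "constant": [7, 7, 7, 7, 7, 7],       # never changes
}
print(select_features(columns, target, threshold=0.5))
```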

Results and discussion
The TP-IDS model is implemented using R programming on top of the Hadoop distributed file system. R is a popular statistical programming language designed specifically for data analytics and for building machine learning models. The Hadoop distributed file system has helped to increase the speed of data processing by a large factor and has contributed to the excellent results.

Accuracy enhancement in TP-IDS
Table 1 compares the accuracy of the presented TP-IDS model with existing IDS systems that use different machine learning, deep learning and artificial neural network techniques. The accuracy is measured as the number of correct classifications of input traffic, into either an attack type or normal input traffic, divided by the total number of input traffic samples provided to the model. From the accuracy results shown in Table 1, it can be seen that the TP-IDS model, which combines machine learning techniques in two phases, gives better accuracy than the existing IDS approaches that use standalone traditional machine learning, deep learning or artificial neural network techniques. The accuracy is measured in percentage: TP-IDS has produced 99.93% accuracy for classification of the input network traffic into different attack types or normal traffic.
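The accuracy measure, together with the false positive and false negative rates mentioned earlier, can be computed as in the sketch below. It is simplified to two classes with "anomaly" as the positive class, whereas the paper also distinguishes attack types. The labels and predictions are made up for illustration and are not the paper's results.

```python
def evaluate(actual, predicted, positive="anomaly"):
    """Return accuracy, false positive rate and false negative rate."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1  # attack correctly detected
        elif a != positive and p == positive:
            fp += 1  # normal traffic wrongly blocked
        elif a != positive and p != positive:
            tn += 1  # normal traffic correctly allowed
        else:
            fn += 1  # attack missed
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return accuracy, fpr, fnr

# Made-up evaluation set: 6 normal and 4 anomalous records.
actual = ["normal"] * 6 + ["anomaly"] * 4
predicted = ["normal"] * 5 + ["anomaly"] + ["anomaly"] * 3 + ["normal"]
accuracy, fpr, fnr = evaluate(actual, predicted)
print(accuracy, fpr, fnr)
```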

Timeliness in TP-IDS
It is also observed that the output categorizations are generated quickly, as expected, because of the use of the Hadoop distributed file system. It enables parallel and faster data processing in the TP-IDS model. Fig. 7 shows the achieved timeliness of TP-IDS. In the graph, the execution time is shown in milliseconds with respect to the number of data records. It is seen that the underlying HDFS component has increased the execution speed of the IDS model to a significant extent.
The distributed data nodes process data in parallel for the TP-IDS architecture using the HDFS functionality. Due to this parallel processing, the machine learning techniques are executed within a very short time to generate the output of the TP-IDS. Fig. 7 shows the progressive execution time with respect to the number of input traffic samples passed to the system. The model is trained using all samples in the NSL KDD training data set, i.e., a total of 125,973 input records are passed for training the TP-IDS model. One of the reasons for such high accuracy is the size of the data input and the features passed to the model. The quality of the data has made it possible to build a highly accurate TP-IDS model. Also, SVM and Decision Tree are the most accurate machine learning algorithms for intrusion detection systems [56], and these algorithms help TP-IDS to increase the true positive rate to a significant level.

Conclusion and future work
This study shows that machine learning can give better results in building models for intrusion detection systems. When different machine learning techniques are combined so that they overcome each other's disadvantages, a better intrusion detection system such as TP-IDS can be designed. By using the two phase model, with the second phase as a validation phase, the false positive rate and the false negative rate are reduced considerably and the accuracy is increased.
Also, with the use of a distributed data processing architecture like Hadoop HDFS, the processing speed of the system is increased substantially, which helps to achieve timeliness in TP-IDS. An important requirement such as fault tolerance is also achieved with the help of this distributed architecture. In future, the methods can be replaced with methods from other categories of machine learning, such as unsupervised or semi supervised learning, and a new model with improved performance can be achieved.
This research is funded by the authors only.

Fig. 2 Sample host based IDS decision tree

Table 1 Accuracy comparison of existing IDS systems and our model (TP-IDS)