- Open Access
Intrusion detection model using machine learning algorithm on Big Data environment
© The Author(s) 2018
- Received: 2 June 2018
- Accepted: 14 September 2018
- Published: 24 September 2018
In recent years, the huge volume of data and its continuous growth have increased the importance of information security and data analysis systems for Big Data. An intrusion detection system (IDS) monitors and analyzes data to detect any intrusion in a system or network. The high volume, variety, and speed of data generated in networks make it very difficult for traditional techniques to analyze the data and detect attacks. Big Data techniques are therefore used in IDS to make the data analysis process accurate and efficient. This paper introduces the Spark-Chi-SVM model for intrusion detection. In this model, we use ChiSqSelector for feature selection and build an intrusion detection model using a support vector machine (SVM) classifier on the Apache Spark Big Data platform. We used the KDD99 dataset to train and test the model. In the experiment, we compare the Chi-SVM classifier with the Chi-Logistic Regression classifier. The results show that the Spark-Chi-SVM model achieves high performance, reduces the training time, and is efficient for Big Data.
- Intrusion detection
- Big Data
- Apache Spark
- Support vector machine (SVM)
Big Data refers to data that are difficult to store, manage, and analyze using traditional database and software techniques. Big Data is characterized by high volume, high velocity, and a variety of data, which call for new techniques to handle it. An intrusion detection system (IDS) is a hardware or software monitor that analyzes data to detect any attack against a system or a network. Traditional intrusion detection techniques make the system more complex and less efficient when dealing with Big Data, because the analysis process is complex and takes a long time. The long analysis time leaves the system exposed to harm for some period before any alert is raised [1, 2]. Therefore, using Big Data tools and techniques to analyze and store data in an intrusion detection system can reduce computation and training time.
An IDS can detect attacks with three methods: signature-based detection, anomaly-based detection, and hybrid-based detection. Signature-based detection is designed to detect known attacks by using the signatures of those attacks. It is an effective method for detecting known attacks whose signatures are preloaded in the IDS database, and it is therefore often considered much more accurate at identifying intrusion attempts of known attacks. However, new types of attack cannot be detected because their signatures are not present, so the signature databases must be updated frequently to remain effective. To overcome this problem, anomaly-based detection compares current user activities against predefined profiles to detect abnormal behaviors that might be intrusions. Anomaly-based detection is effective against unknown or zero-day attacks without any updates to the system. However, this method usually has high false positive rates [5, 6]. Hybrid-based detection combines two or more intrusion detection methods in order to overcome the disadvantages of any single method and obtain the advantages of the combined methods. Many studies have proposed machine learning algorithms for intrusion detection to reduce false positive rates and produce an accurate IDS. However, when dealing with Big Data, traditional machine learning techniques take a long time to learn and classify data. Using Big Data techniques together with machine learning for IDS can address challenges such as speed and computational time and help develop an accurate IDS. The objective of this paper is to introduce Spark Big Data techniques that deal with Big Data in IDS in order to reduce computation time and achieve effective classification. For this purpose, we propose an IDS classification method named Spark-Chi-SVM.
Firstly, a preprocessing method is used to convert the categorical data to numerical data, and then the dataset is standardized to improve classification efficiency. Secondly, the ChiSqSelector method is used to reduce the dimensionality of the dataset in order to further improve classification efficiency and reduce the computation time of the following step. Thirdly, SVM is used for data classification. More specifically, we use SVMWithSGD to solve the optimization problem. In addition, we compare the SVM classifier with the Logistic Regression classifier on the Apache Spark Big Data platform based on the area under the ROC curve (AUROC), the area under the precision-recall curve (AUPR), and time metrics. The KDDCUP99 dataset is used for the experiments in this study.
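The classification step can be illustrated with a minimal sketch in plain Python, assuming a linear SVM trained by stochastic subgradient descent on the L2-regularized hinge loss. This is not Spark's SVMWithSGD implementation; the function names, step size, regularization value, and toy data below are hypothetical and chosen only to show the idea.

```python
import random


def dot(w, x):
    """Inner product of two equal-length sequences."""
    return sum(wj * xj for wj, xj in zip(w, x))


def train_linear_svm_sgd(points, labels, epochs=100, step=0.1, reg=0.01, seed=0):
    """Fit weights w and bias b by SGD on the hinge loss; labels are +1 or -1."""
    rng = random.Random(seed)
    w = [0.0] * len(points[0])
    b = 0.0
    order = list(range(len(points)))
    for epoch in range(1, epochs + 1):
        lr = step / epoch ** 0.5  # decaying step size (an assumption)
        rng.shuffle(order)
        for i in order:
            x, y = points[i], labels[i]
            violated = y * (dot(w, x) + b) < 1.0  # point is inside the margin
            # Gradient of the L2 penalty shrinks the weights on every step.
            w = [wj * (1.0 - lr * reg) for wj in w]
            if violated:
                # Subgradient of the hinge loss pushes w toward y * x.
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(w, b, x):
    """Return +1 (attack) or -1 (normal) from the sign of the decision value."""
    return 1 if dot(w, x) + b >= 0.0 else -1


# Hypothetical, linearly separable toy data: +1 = attack, -1 = normal.
points = [(2.0, 2.5), (3.0, 3.0), (2.5, 3.5), (-2.0, -1.5), (-3.0, -2.0), (-2.5, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm_sgd(points, labels)
predictions = [predict(w, b, x) for x in points]
```

On this toy set the learned hyperplane separates the two classes; real traffic records would first go through the preprocessing and feature selection steps.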
The rest of this work is organized as follows. A review of relevant works is given in the “Related works” section. The “Methods” section introduces the proposed method and describes each of its steps. Results and experiment settings are presented in the “Result and discussion” section. Finally, the “Conclusion” section concludes our work and describes future work.
Many studies have addressed intrusion detection systems. With the emergence of Big Data, traditional techniques have become too complex to handle the data. Therefore, many researchers have turned to Big Data techniques to produce fast and accurate intrusion detection systems. In this section, we review some studies that used machine learning with Big Data techniques for intrusion detection. Ferhat et al. used a clustering machine learning technique. The authors used the k-Means method in the machine learning libraries on Spark to determine whether network traffic is an attack or normal. In their method, the KDD Cup 1999 dataset is used for training and testing. The authors did not use a feature selection technique to select the relevant features. Peng et al. proposed a clustering method for IDS based on Mini Batch K-means combined with principal component analysis (PCA). PCA is used to reduce the dimension of the processed dataset, and then the mini batch K-means++ method is used for data clustering. The full KDDCup1999 dataset was used to test the proposed model.
Peng et al. used a classification machine learning technique. The authors proposed an IDS based on a decision tree over Big Data in a fog environment. In this method, the researchers introduced a preprocessing algorithm to convert the strings in the given dataset to numerical values and then normalized the data to ensure the quality of the input data and improve detection efficiency. They used the decision tree method for IDS and compared it with the Naïve Bayesian and KNN methods. The experimental results on the KDDCUP99 dataset showed that the proposed method is effective and precise. Belouch et al. evaluated the performance of SVM, Naïve Bayes, Decision Tree, and Random Forest classification algorithms for IDS using Apache Spark. The overall performance was compared on the UNSW-NB15 dataset in terms of accuracy, training time, and prediction time. Manzoor and Morgan proposed a real-time intrusion detection system based on SVM using the Apache Storm framework. The authors used libSVM and C-SVM classification for intrusion detection. The proposed approach was trained and evaluated on the KDD 99 dataset. In addition, feature selection techniques have been used in many studies, and PCA has been applied in some proposed IDSs. For example, Vimalkumar and Radhika proposed a Big Data framework for intrusion detection in smart grids using various algorithms such as Neural Network, SVM, DT, Naïve Bayes, and Random Forest. In this approach, a correlation-based method is used for feature selection and PCA is used for dimensionality reduction. The approach aimed to minimize attack prediction time and to increase classification accuracy. It used the Synchrophasor dataset for training and evaluation, and its results were compared using accuracy rate, FPR, recall, and specificity.
Dahiya and Srivastava proposed a framework for fast and accurate intrusion detection using Spark. The framework used the Canonical Correlation Analysis (CCA) and Linear Discriminant Analysis (LDA) algorithms for feature reduction, together with seven classification algorithms (Naïve Bayes, REP TREE, Random Tree, Random Forest, Random Committee, Bagging, and Randomizable Filtered). Two sets of the UNSW-NB 15 dataset were used to evaluate the performance of all classifiers. The experiments found that the combination of LDA and the random tree algorithm was the most effective and fastest. The results showed AUROC = 99.1 for dataset1 and 97.4 for dataset2. Our model obtained AUROC = 99.55. Therefore, our model is more effective and faster. Hongbing Wang et al. proposed a parallel principal component analysis (PCA) combined with a parallel support vector machine (SVM) algorithm based on the Spark platform (SP-PCA-SVM). PCA is used for data analysis and feature extraction for dimensionality reduction based on Bagging. The proposed approach used KDD99 for training and evaluation.
Table: Comparison of related works, by Big Data tool and classification methods. Classification methods listed:
- SVM, Naïve Bayes, Decision Tree and Random Forest
- Neural Network, SVM, DT, Naïve Bayes and Random Forest
- Naïve Bayes, REP TREE, Random Tree, Random Forest, Random Committee, Bagging and Randomizable Filtered
- Parallel Naïve Bayes
Researchers are still seeking an effective way to detect intrusions with high performance, high speed, and a low false positive alarm rate. The main objective of this paper is to improve the performance and speed of intrusion detection within a Big Data environment. In this method, we used Apache Spark Big Data tools because Spark can be up to 100 times faster than Hadoop MapReduce for in-memory processing. Feature selection accounts for a considerable amount of computation time, and this time can be reduced when SVM is used on the KDD datasets. Therefore, we used the SVM algorithm with Chi-squared feature selection and compared it with the Logistic Regression classifier based on the area under the ROC curve (AUROC), the area under the precision-recall curve (AUPR), and time metrics.
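To make the Chi-squared feature selection step concrete, the following is a minimal sketch in plain Python. It assumes categorical feature values and ranks features by the Pearson chi-squared statistic of independence between each feature and the class label, keeping the top-ranked ones. It is not Spark's ChiSqSelector; the function names and the toy data are hypothetical, and ranking by the raw statistic is a simplification.

```python
from collections import Counter


def chi_squared(feature_values, labels):
    """Pearson chi-squared statistic between one categorical feature and the label."""
    n = len(labels)
    observed = Counter(zip(feature_values, labels))
    feature_totals = Counter(feature_values)
    label_totals = Counter(labels)
    stat = 0.0
    for f, f_count in feature_totals.items():
        for c, c_count in label_totals.items():
            expected = f_count * c_count / n  # count expected under independence
            stat += (observed[(f, c)] - expected) ** 2 / expected
    return stat


def select_top_features(rows, labels, num_top_features):
    """Return the indices of the highest-scoring features and all scores."""
    num_features = len(rows[0])
    scores = [chi_squared([r[j] for r in rows], labels) for j in range(num_features)]
    ranked = sorted(range(num_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:num_top_features]), scores


# Hypothetical toy data: feature 0 matches the label exactly,
# feature 1 is independent of it, feature 2 is weakly related.
rows = [
    (1, 0, 1), (1, 1, 1), (1, 0, 1), (1, 1, 0),
    (0, 0, 1), (0, 1, 0), (0, 0, 0), (0, 1, 0),
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
selected, scores = select_top_features(rows, labels, num_top_features=2)
```

Here the uninformative feature 1 is dropped, which is the effect the selection step is meant to have on redundant attributes.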
Spark Chi SVM proposed model
Load dataset and export it into Resilient Distributed Datasets (RDD) and DataFrame in Apache Spark.
Train Spark-Chi-SVM with the training dataset.
Test and evaluate the model with the KDD dataset.
KDD99 dataset attributes
Large-scale datasets usually contain noisy, redundant, and heterogeneous data, which present critical challenges to knowledge discovery and data modeling. Intrusion detection algorithms generally accept only one or a few types of raw input data; the SVM algorithm, for example, deals with numerical data only. Hence, we prepare the data and convert the categorical data in the dataset to numerical data.
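One common way to perform such a conversion is to replace each distinct category with an integer index. The sketch below, in plain Python, assumes this index-based encoding; the source does not state which mapping was used, so the scheme, the function names, and the sample records are hypothetical.

```python
def build_index(values):
    """Map each distinct category to a float index, in order of first appearance."""
    index = {}
    for v in values:
        if v not in index:
            index[v] = float(len(index))
    return index


def encode_records(records, categorical_columns):
    """Replace categorical fields by their index and parse the rest as floats."""
    indexes = {c: build_index(r[c] for r in records) for c in categorical_columns}
    encoded = []
    for r in records:
        encoded.append(
            [indexes[c][v] if c in indexes else float(v) for c, v in enumerate(r)]
        )
    return encoded, indexes


# Hypothetical records with three categorical columns (positions 1, 2, 3),
# laid out like the leading fields of a KDD-style connection record.
records = [
    ("0", "tcp", "http", "SF", "181", "5450"),
    ("0", "udp", "domain_u", "SF", "105", "146"),
    ("0", "tcp", "smtp", "SF", "1684", "363"),
]
encoded, indexes = encode_records(records, categorical_columns=[1, 2, 3])
```

After this step every record is a purely numerical vector, which is the form an SVM classifier requires.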
The result of standardization
The dataset record
The record before standardization
res1:org.apache.spark.mllib.regression.LabeledPoint = (1.0,[0.0,181.0,5450.0,0.0,0.0,0.0,0.0,0.0,1.0, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,8.0,8.0, 0.0,0.0,0.0,0.0,1.0,0.0,0.0,9.0,9.0,1.0,0.0,0.11, 0.0,0.0,0.0,0.0,0.0])
The record after standardization
res2:org.apache.spark.mllib.regression.LabeledPoint = (1.0,[0.0,1.8315794844034117E-4,0.16495156759878019, 0.0,0.0,0.0,0.0,0.0,2.814168444874875, 0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0, 0.03753270996838475,0.03247770581832668,0.0,0.0,0.0,0.0, 2.576061480099788,0.0,0.0,0.13900605646702138, 0.0848732827397667,2.434387313317322,0.0, 0.22854329046843286,0.0,0.0,0.0,0.0,0.0])
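In the records above, zero-valued features remain zero after standardization, which is consistent with scaling each feature by its standard deviation without subtracting the mean. The following plain-Python sketch illustrates that kind of scaling under this assumption; it is not the code used in the experiment, and the function names and sample rows are hypothetical.

```python
import statistics


def fit_std(rows):
    """Compute the sample standard deviation of every feature column."""
    columns = list(zip(*rows))
    return [statistics.stdev(col) for col in columns]


def scale_to_unit_std(row, stds):
    """Divide each feature by its standard deviation, with no mean centering."""
    # Constant columns (std == 0) are mapped to 0.0 to avoid division by zero.
    return [x / s if s > 0.0 else 0.0 for x, s in zip(row, stds)]


# Hypothetical rows: a constant column followed by two byte-count columns.
rows = [
    [0.0, 181.0, 5450.0],
    [0.0, 239.0, 486.0],
    [0.0, 235.0, 1337.0],
]
stds = fit_std(rows)
scaled = [scale_to_unit_std(r, stds) for r in rows]
```

Because no mean is subtracted, sparse records stay sparse, while features with very different ranges are brought to a comparable scale.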
AUROC result based on numTopFeatures
AUROC result (%)
Results of Spark-Chi-SVM model
Training and prediction time results
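The AUROC metric used to report these results can be read as the probability that a randomly chosen attack record receives a higher score than a randomly chosen normal record. The plain-Python sketch below computes it directly from that definition by comparing all positive and negative pairs; the function name, scores, and labels are hypothetical and are not results from the experiment.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise comparison; labels are 1 or 0."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0  # positive correctly ranked above negative
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Hypothetical classifier scores: 1 = attack, 0 = normal.
example_scores = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
example_labels = [1, 1, 1, 0, 0, 0]
example_auroc = auroc(example_scores, example_labels)
perfect_auroc = auroc([0.9, 0.7, 0.3, 0.1], [1, 1, 0, 0])
```

The pairwise form is quadratic in the number of records, so it only suits small samples; it is shown here because it makes the meaning of the metric explicit.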
In this paper, we introduced the Spark-Chi-SVM model for intrusion detection, which can deal with Big Data. The proposed model uses the Spark Big Data platform, which can process and analyze data at high speed. Big Data has high dimensionality, which makes the classification process more complex and time-consuming. Therefore, in the proposed model, we used ChiSqSelector to select the relevant features and SVMWithSGD to classify data as normal or attack. The results of the experiment showed that the model has high performance and speed. In future work, the model can be extended to a multi-class model that can detect the types of attack.
SMO took on the main role: performed the literature review, implemented the proposed model, conducted the experiments, and wrote the manuscript. FB-A took on a supervisory role and oversaw the completion of the work. NTA reviewed the manuscript language and helped edit the manuscript. AA-H helped edit the manuscript. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Availability of data and materials
All data used in this study are publicly available and accessible in the cited sources. For details of the KDD dataset used in the experiment, see: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Consent for publication
The authors consent to publication.
Ethics approval and consent to participate
The authors confirm ethics approval and consent to participate.
The authors declare that they have no funding.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
- Tchakoucht TA, Ezziyyani M. Building a fast intrusion detection system for high-speed-networks: probe and DoS attacks detection. Procedia Comput Sci. 2018;127:521–30.
- Zuech R, Khoshgoftaar TM, Wald R. Intrusion detection and big heterogeneous data: a survey. J Big Data. 2015;2:3.
- Sahasrabuddhe A, et al. Survey on intrusion detection system using data mining techniques. Int Res J Eng Technol. 2017;4(5):1780–4.
- Dali L, et al. A survey of intrusion detection system. In: 2nd world symposium on web applications and networking (WSWAN). Piscataway: IEEE; 2015. p. 1–6.
- Scarfone K, Mell P. Guide to intrusion detection and prevention systems (IDPS). NIST Spec Publ. 2007;2007(800):94.
- Debar H. An introduction to intrusion-detection systems. In: Proceedings of Connect, 2000. 2000.
- Ferhat K, Sevcan A. Big Data: controlling fraud by using machine learning libraries on Spark. Int J Appl Math Electron Comput. 2018;6(1):1–5.
- Peng K, Leung VC, Huang Q. Clustering approach based on mini batch Kmeans for intrusion detection system over Big Data. IEEE Access. 2018.
- Peng K, et al. Intrusion detection system based on decision tree over Big Data in fog environment. Wireless Commun Mob Comput. 2018. https://doi.org/10.1155/2018/4680867.
- Belouch M, El Hadaj S, Idhammad M. Performance evaluation of intrusion detection based on machine learning using Apache Spark. Procedia Comput Sci. 2018;127:1–6.
- Manzoor MA, Morgan Y. Real-time support vector machine based network intrusion detection system using Apache Storm. In: IEEE 7th annual information technology, electronics and mobile communication conference (IEMCON), 2016. Piscataway: IEEE; 2016. p. 1–5.
- Vimalkumar K, Radhika N. A big data framework for intrusion detection in smart grids using Apache Spark. In: International conference on advances in computing, communications and informatics (ICACCI), 2017. Piscataway: IEEE; 2017. p. 198–204.
- Dahiya P, Srivastava DK. Network intrusion detection in big dataset using Spark. Procedia Comput Sci. 2018;132:253–62.
- Wang H, Xiao Y, Long Y. Research of intrusion detection algorithm based on parallel SVM on Spark. In: 7th IEEE international conference on electronics information and emergency communication (ICEIEC), 2017. Piscataway: IEEE; 2017. p. 153–156.
- Natesan P, et al. Hadoop based parallel binary bat algorithm for network intrusion detection. Int J Parallel Program. 2017;45(5):1194–213.
- Akbar S, Rao TS, Hussain MA. A hybrid scheme based on Big Data analytics using intrusion detection system. Indian J Sci Technol. 2016. https://doi.org/10.17485/ijst/2016/v9i33/97037.
- Zaharia M, et al. Apache Spark: a unified engine for big data processing. Commun ACM. 2016;59(11):56–65.
- Chambers B, Zaharia M. Spark: the definitive guide. Sebastopol: O'Reilly Media, Inc.; 2017.
- Kato K, Klyuev V. Development of a network intrusion detection system using Apache Hadoop and Spark. In: IEEE conference on dependable and secure computing, 2017. Piscataway: IEEE; 2017. p. 416–423.
- Deng Z, et al. Efficient kNN classification algorithm for big data. Neurocomputing. 2016;195:143–8.
- Sung AH, Mukkamala S. The feature selection and intrusion detection problems. In: ASIAN. Berlin: Springer; 2004. p. 468–482.
- Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20(3):273–97.
- Cherkassky V, Ma Y. Practical selection of SVM parameters and noise estimation for SVM regression. Neural Netw. 2004;17(1):113–26. https://doi.org/10.1016/S0893-6080(03)00169-2.
- Karamizadeh S, et al. Advantage and drawback of support vector machine functionality. In: International conference on computer, communications, and control technology (I4CT), 2014. Piscataway: IEEE; 2014. p. 63–65.
- Enache A-C, Sgârciu V. Enhanced intrusion detection system based on bat algorithm-support vector machine. In: 11th international conference on security and cryptography (SECRYPT), 2014. Piscataway: IEEE; 2014. p. 1–6.
- Bhavsar H, Ganatra A. A comparative study of training algorithms for supervised machine learning. Int J Soft Comput Eng (IJSCE). 2012;2(4):2231–307.
- Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit. 1997;30(7):1145–59.
- Gupta GP, Kulariya M. A framework for fast and efficient cyber security network intrusion detection using Apache Spark. Procedia Comput Sci. 2016;93:824–31.
- Kulariya M, et al. Performance analysis of network intrusion detection schemes using Apache Spark. In: International conference on communication and signal processing (ICCSP), 2016. Piscataway: IEEE; 2016. p. 1973–1977.