Analysis of Bayesian optimization algorithms for big data classification based on Map Reduce framework

The process of big data handling refers to the efficient management of storage and processing of a very large volume of data. The data in a structured and unstructured format require a specific approach for overall handling. The classifiers analyzed in this paper are correlative naïve Bayes classifier (CNB), Cuckoo Grey wolf CNB (CGCNB), Fuzzy CNB (FCNB), and Holoentropy CNB (HCNB). These classifiers are based on the Bayesian principle and work accordingly. The CNB is developed by extending the standard naïve Bayes classifier with applied correlation among the attributes to become a dependent hypothesis. The cuckoo search and grey wolf optimization algorithms are integrated with the CNB classifier, and significant performance improvement is achieved. The resulting classifier is called a cuckoo grey wolf correlative naïve Bayes classifier (CGCNB). Also, the performance of the FCNB and HCNB classifiers are analyzed with CNB and CGCNB by considering accuracy, sensitivity, specificity, memory, and execution time.

processing is because data arrives in a continuous pattern from the internet sources [10,11]. Data analytics is considered one of the challenging research problems, and data mining and machine learning techniques are used to perform analytics. The inherent complexities associated with the data are its large size, and non-uniform makes the limitations on the current data mining software tools and techniques [12]. The standard data mining and machine learning methods are not directly dealing with extensive size data effectively [13,14]. The analysis and knowledge extraction process related to big data can be further improved.
The two major categories, namely clustering, and classification are included in data mining schemes. The big data classification process is primarily [15,16] influenced by the various classifiers, such as Naïve Bayes [17], Support Vector Machine [18], and Extreme Learning Machine [19]. The data classification approach provided by ELM [10] algorithm is based on multiclassification rather than binary classification [20]. The observed fact about big data processing is the increased computational complexity because of the high volume [21]. The analysis of big data in supervised classification is based on the learning algorithms, and after that, it finds the appropriate classes for the datasets [22].
The analytics include the extraction of useful insights from the extensive data, and one of the major tasks associated with it is classification [23,24]. The deep study suggests the redesign of typical classification algorithms be adopted to classify very large amounts of data. The techniques are required to classify large amounts of data for different applications. The redesign of a classification technique for large amounts of data needs to be taken care of after that; the applications that use these algorithms can limit the growth of the extensive size of data [25].
The analysis and organization techniques applicable to traditional systems are insufficient in addressing big data-related issues and challenges [26,27]. There are several algorithms developed for performing big data analysis tasks. The compatibility of MapReduce in handling big data processing makes it most suitable for analysis tasks [28].
The problem of processing large-size data is solved using the Map-Reduce scheme, which contains Mapper and Reducer tasks in parallel on datasets [29].
The Map-Reduce principle is derived from the divide and conquer strategy of problem-solving, in which sub-data samples are created and handled independently by splitting the data samples [30,31]. The Map level divides the source data by producing dissimilar pairs of the key value. The efforts obtained from the key value in the map function are integrated by reducing functions at the Reduce level [32].
The number of smart data analytics approaches such as image processing, multitemporal processing, and automatic classification is increasingly in demand due to big data processing [33]. The data mining approaches integrated with the recent growing technologies are used for performing big data analytics with reduced limitations and drawbacks [34]. The MapReduce technique and distributed file system introduced by Google provides a suitable environment for processing large-scale datasets over a group of systems. The MapReduce framework is used to perform big data mining using several processors efficiently [12]. Generally, Hadoop, one of the systems, provides a parallel programming environment [35] used to execute the MapReduce framework.
The classification algorithms are employed to solve data mining issues in big data because of the fact that it is the gathering of data from different sources. The classification task is based on building a classifier model for resultant target class prediction of data items included in the dataset [36]. The emphasis on data classification motivates the adaptive use of techniques for classification in big data analysis. The classification techniques, including Bayes networks, genetic algorithms, genetic programming, and decision trees [37], are adopted to perform the classification of large data.
It is found in the literature that big data classification is performed using machine learning methods [38,39], optimization algorithms [32,40], Decision Tree [41], primarily along with some other approaches. The inclusion of fuzzy theory with correlative naïve Bayes classifier creates a model named the Fuzzy Correlative Naive Bayes Classifier (FCNB). The FCNB is developed for big data classification because it is implemented using the MapReduce framework. Further enhancement of the CNB classifier is done using the holoentropy function. The resultant model developed for big data classification using the MapReduce and named as Holoentropy based Correlative Naive Bayes Classifier (HCNB).
The main contribution of this paper is the analysis of big data classification techniques based on the Map-Reduce model using the classifiers, such as Correlative naïve Bayes classifier (CNB), Cuckoo Grey wolf CNB (CGCNB), Fuzzy CNB (FCNB), and Holoentropy CNB. (HCNB). The performance of the classifiers is evaluated based on accuracy, sensitivity, specificity, memory, and time. At first, CNB is compared with the existing naïve Bayes classifier. After that, further analysis shows the significant performance improvement of CGCNB in comparisons with NB and CNB. The other two developed models for big data classification named FCNB and HCNB are compared with Naïve Bayes [24], Correlative Naive Bayes (CNB) [20], Cuckoo Grey Wolf based CNB (CGCNB), and Fuzzy Naïve Bayes classifier (FNB) [24]. The classifiers are implemented in the JAVA programming language. The localization dataset and cover type dataset are taken from the UCI machine learning repository for experimentation. This paper is organized into different sections: "Literature review" section discusses the Literature review covering the number of techniques developed using the MapReduce framework to perform classification on big data. "Descriptions of Bayesian classification methods" section describes the developed Bayesian classifiers, namely Correlative Naïve Bayes, Cuckoo Search Grey Wolves Correlative naïve Bayes, Fuzzy Correlative Naïve Bayes, and Holoentropy based correlative naïve Bayes classifiers. The developed classifiers' analysis is presented with comparisons based on the obtained results of accuracy, sensitivity, specificity, memory, and time is shown in "Results and discussion" section. Finally, the conclusion and future scope for improvements of the classifiers discussed in "Conclusion" section

Literature review
In this section, various algorithms for performing big data classification by different researchers and their importance in terms of advantages and disadvantages are presented.
Shen and Kai Gao [42] created a technique to be adopted for big data classification, and it works in the internet environment. The performance of this approach was stable, and also it can reduce the computation complexity. The approach has lower complexity than the traditional segmentation approaches, but one of the demerits of this approach is its strongest feature independence.
Simone Scardapane et al. [43] designed an algorithm for big data classification known as Echo State Networks based on utilizing the multipliers optimization procedure. The method performed the local exchanges among the nearby elements by utilizing the multiplier optimization procedure without connecting nodes. In addition to this, the communicating nodes did not require to use of training patterns. The synthetic datasets were used in experimentation and showed improvements in computation time and generalization accuracy metrics. In this method, pilot symbols are not required during communication. In contrast, the weights depend on error values is one of the drawbacks of this approach.
Jemal H. Abawajy et al. [44] implemented large, iterative multitier ensemble (LIME) classifiers and analyzed their suitability for big data classification. This classifier integrated varieties of meta-classifiers at lower levels and an iterative system developed at the higher level. The experimentation results showed that the developed classifier has improved performance compared with existing classifiers for big data. The computational cost was a major drawback of the developed ensemble classifier.
Xin et al. [45] integrated Map-Reduce framework and extreme learning machine to develop a classifier to work in a distributed environment. In the initial stage of this classifier, learning weights are computed through the multiplication of data matrices. Further stage proposed elastic ELM with Map Reduce technique. This approach's positive side was the rapid training speed and negligible human intervention because of its capability of efficient learning using dynamic processing massive training dataset.
Victoria López et al. [13] proposed an algorithm based on fuzzy rule-based classification named Chi-FRBCS-BigDataCS to deal with big imbalanced data. This algorithm dealt with the uncertainty in big data, with the strength of learning in the minimal class. This algorithm performs the fuzzy logic processes and MapReduce concept for design processes using the cost-sensitive learning approach. The performance of this algorithm is significantly improved in terms of classification accuracy and computation time.
Reshma C. Bhagat and Sachin S. Patil [46] performed a classification of imbalanced data and created a multi-classifier. At its first step, they utilized the techniques for conversion of the original datasets in the small group containing binary classes. In the next step, balanced classes were formed from the imbalanced binary class using the SMOTE algorithm. The classification was performed in the final step by utilization of the Random Forest classifier. The oversampling issues in this algorithm were handled using the MapReduce concept so that the developed method acquired the capability to handle large datasets. The fixing of dataset dependent oversampling rates was one of the problems with this algorithm.
Triguero et al. [47] used the evolutionary under-sampling method and developed a parallel model for handling big data and its related issues. Processes involved in the classification task were distributed to various cluster computing nodes due to the MapReduce technique for accomplishing classification work effectively. In addition, they also introduced a windowing approach for handling imbalanced class data, and it also increases the under-sampling speed. The advantage of this model was the scalability achievement during experimental studies. The abrupt change in computation time based on the imbalance ratio was the considerable drawback of this method.
Alessio Bechini et al. [48] developed a classifier by combining Map-Reduce concept with association rules and then named it MapReduce-based Associative Classifier (MRAC). This classifier utilized distributed classification method with the Map-Reduce model of programming and association rules. The method uses the FP-Growth algorithm and mined Classification Association Rules (CARs) and handled a distributed rule pruning classifying unlabeled patterns. The advantage of this method was related to speed up and scalability, but it failed when performed experiments on large datasets.
Shichao Zhang et al. [49] utilized the combination of clustering and classification approaches of K-Nearest Neighbors (KNN) and k-means clustering. The k-means clustering was used for dividing the entire dataset into various parts, and kNN classification was performed in each part of the cluster. The results of the experiments performed on medical images and big data showed between efficiency and accuracy. This approach had a drawback of selecting the value of k because of the degradation of classification accuracy when the value of k was increased.
The nature-inspired meta-heuristic algorithm developed by Seyedali Mirjalili et al. [50] which was based on the social and hunting behavior of grey wolves. The algorithm was adopted in computer science for the optimization process and named Grey Wolves Optimization algorithm (GWO). The grey wolves' leadership hierarchy and hunting technique were observed, and accordingly, the GWO algorithm was framed. The GWO algorithm has superior performance improvements when compared with other algorithms. The algorithms compared with GWO were Evolution Strategy (ES), Evolutionary Programming (EP), Differential Evolution (DE), Gravitational Search Algorithm (GSA), and Particle Swarm Optimization (PSO).
Cuckoo search is another meta-heuristic algorithm developed by Xin-She Yang and Suash Deb [51] to provide an approach towards the solution of optimization problems. The algorithm combines the cuckoo species' brood parasitic behavior [52] with the Levy flight behavior of some birds and fruit flies. The experimental results of this algorithm showed improved performance as compared with the genetic algorithm and particle swarm optimization techniques.
Simon Fong et al. [53] developed a mining method from streaming data while avoiding the computational challenges. The developed algorithm was based on lightweight feature selection approaches and used to avoid high dimensionality of the accepted data mining. This method was suitable for real-time applications and showed effectiveness in terms of high accuracy, low latency, and robustness for solving big data problems. S.Md. Mujeeb et al. [54] developed Big data classification based on MapReduce Framework for effective data management using the E-Bat algorithm. The performance is analyzed based on accuracy and Total Positive rate (TPR). The data management is better but, the accuracy obtained is not high. Shichao Zhang et al. [55] developed big data classification using kNN (k nearest neighbors). It is well known easy learning algorithm for real time applications and efficient algorithm. However, the selection of k value must be very small to get better accuracy. Such selection is not possible for all data sets. William et al. [56] developed multi-class imbalanced big data classification on Spark. In this, the clustering-based data partitioning is done to eliminate the big data classification problems.
The Random Forest and Naive Bayes classification method is used. They obtained better predictive power for the distributed environment. The major drawback of this method is its time complexity. Selvi et al. [57] developed Optimal Feature Selection for big data classification. The Firefly with Lion-Assisted Model is employed for the feature selection. The model obtains the effective classification, but it is not applicable for the noisy data. Mujeeb et al. [58] developed deep learning for big data classification. The Adaptive Exponential Bat algorithm is devised for training the classifier. They obtained better accuracy, TPR, and TNR. The security constraints are the major drawback of this method.
The major challenges in the existing techniques are computational complexity, time, cost, oversampling, and speed. These drawbacks were overcome by using the Bayesian classifiers. The CNB classifier is well suited for imbalanced datasets. By using the optimization algorithms, better convergence is obtained with improved accuracy with low computational complexity.

Naïve Bayes classifier (NB)
The different classifiers based on the Bayesian theorem are known as Bayesian classifiers; for example, NB is one of the bayesian classifiers. The Bayesian classification is based on the posterior probability calculation by assumed prior probabilities and the probability of different data object under given assumptions.
The NB classifier is based on the fact that each attribute of the object to be classified is independent of each other. The NB classifier is based on the approach where the probability of each category is calculated, and the object belongs to the category with the largest probability associated.

Correlative Naive Bayes classifier (CNB)
One of the highly utilized classifiers is NB classifier, and the typical classifier is adopted with the Map-Reduce framework and used for big data classification [59]. At the initial phase of the training process, the input data are arranged in different groups based on the number of classes.
where, µ Q×m is the mean to be calculated, σ Q×m is specified as variance, R Q×1 denotes correlation function, and it is illustrated in vector form. Testing data result is represented using the following equation: The Eq. (2) indicates that the highest posterior value is only selected as a consequential class.

Cuckoo Grey Wolf optimization with correlative Naïve Bayes classifier (CGCNB)
The integrated CGWO and CNB classifier with MapReduce framework is named as CGCNB-MRP [60]. CGWO is the integration of the cuckoo search algorithm and grey wolf Optimization (GWO). The block diagram of the developed model for big data classification is depicted in Fig. 1.

Fuzzy correlative Naive Bayes classifier (FCNB)
One more extension of CNB classifier is the inclusion of fuzzy theory and it is named as FCNB [61]. FCNB classifier is shown in the following equation: The term µ s q represents membership degree of symbol sth in the qth element of the training model. The whole incidence of sth symbol in qth element is represented by the term m s q and d is utilized for representing the data sample in the attribute. Each data sample is classified into the number of classes let it be called K. It is represented as follows: Here the whole incidence of Kth class in ground truth information is represented by |m k | . The outcome of FCNB classifier is represented as below: In the developed FCNB classifier's testing stage, the testing sample is classified into the appropriate classes by using the posterior probability of naïve Bayes, fuzzy membership degree, and a correlation factor among attributes. The output of FCNB is expressed as below equation: The term here is represented as P(g k | X) denoted as a posterior probability by using test data X for given class g k . The expression C k signifies correlation for class K.

Holoentropy using correlative naïve bayes classifier for a big data classification (HCNB)
The new classification technique called HCNB [62] is introduced by combining the existing CNB classifier with the holoentropy function. The handling is based on the holoentropy estimation for each attribute using the following formula: Here, F represents the weight function, and T(i b ) is the entropy. The formulae for the weight function and entropy is described in the following equations.
Here, M(i b ) is the unique value of the attribute vector i b . The training phase of the HCNB based on the training data samples produces the result in the following vector form: Here, µ a×s and σ 2 a×s represent the computed mean value and computed variance value between the attributes a and s, respectively. Also, the correlation is represented by C ax1, and the holoentropy function is represented using H a×s . The individual class is selected by estimating posterior probability independently during a testing phase, which the below expression can represent: where, y b illustrates yth data of bth element, and k v indicates vth class number. The block diagram of the HCNB classifier is depicted in Fig. 2.

Results and discussion
This section presents the classifiers' evaluation results, and a comparative detail analysis is provided with the existing methods. The system requirements and implementation details are provided in the experiment setup.

Experimental setup
The system configuration used for performing the experiment contains Windows 10 OS running on Intel processor CPU 2.16 GHz with 2 GB RAM capacity. The methods included in the developed classifiers are implemented in JAVA programming language. The parameters used for the experimentation are maximum iteration-5, population size-6, and mapper size-5. .

Dataset description
The dataset utilized for the experimentation purpose is localization dataset and cover type dataset.
The localization dataset is taken from the UCI machine learning repository for experimentation [63]. The recorded activities of five people wearing tags, ankle left, ankle right, belt, and chest are collected in terms of the localization dataset. The dataset contains a total of 164,860 instances that include 8 attributes. Each instance in the dataset forms localization data for tags, and attributes are used to recognize them.
The cover type dataset is taken from the UCI machine learning repository for experimentation [64]. The dataset contains total of 581,012 instances with 54 attributes.

Metrics used for performance evaluation
Five metrics, such as accuracy, sensitivity, specificity, memory, and time, are used to evaluate the performance of the classifiers. The degree of veracity is measured using an accuracy metric defined as the proportion of true results. The sensitivity and specificity are referred to as the proportion of correctly identified true positives and true negatives.
where, T N is True Negative, T P is True Positive, F N is False Negative, and F P is False Positive.

Comparative analysis of Bayesian models
A comparative analysis is done to evaluate the developed classifiers with the existing models based on sensitivity, specificity, and accuracy. In this paper, the first model C.N.B. is compared with the existing naïve Bayes classifier. After that, further analysis shows the significant performance improvement of CGCNB in comparisons with NB and CNB. The other two developed models for big data classification named FCNB and HCNB are compared with Naïve Bayes [24], Correlative Naive Bayes (CNB) [20], Cuckoo Grey Wolf based CNB (CGCNB), and Fuzzy Naïve Bayes classifier (FNB) [24].
The naïve Bayes classifier incorporating the fuzzy theory is named the Fuzzy Naïve Bayes classifier (FNB); the comparison of FCNB classifier shows enhanced performance achievement and adopted to perform the big data classification.
The holoentropy extension of the CNB classifier (HCNB) is created and compared with the existing models with similar conditions and parameters and observed improvements when its performance is assessed on the localization dataset and the cover type dataset. The comparative study of the models is done for training sample data taken from 75 to 90% for the number of number mappers as M = 5. The number of mappers represents the number of desktops used for simulation.

Performance evaluation of different Bayesian classifiers
The developed classifiers CNB and CGCNB are evaluated based on accuracy, sensitivity, specificity, memory, and time analysis on the localization dataset. The performance evaluation is presented in this section. The performance evaluation process is carried out with a mapper size of 5 and on training data. Table 1 shows the analysis of NB, CNB, GWO + CNB, CGCNB, FCNB, and HCNB classifiers based on localization data.

Analysis based on training percentage
Consider Table 1, the accuracy of the NB classifier with 75% of training data is 76.3% while increasing the training percentage the accuracy of the classifier goes on increasing, similarly for all the matrices like sensitivity, specificity, memory and time the improved performance is achieved with an increase in training percentage. This improved performance with the increase in training percentage is achieved in all the classifiers.

Analysis based on mappers:
In Table 1, when considering the HCNB classifier for mapper size 2 with a training percentage of 75%, the performance matrices like accuracy, sensitivity, specificity, memory, and the execution time are 85, 94, 89.2, 166, and 1.99, respectively. Similarly, for mapper size 5 with a training percentage of 75%, the performance matrices like accuracy, sensitivity, specificity, memory, and the execution time are 85.3, 94.2, 89.3, Table 1 Analysis of NB, CNB, GWO + CNB, CGCNB, FCNB, and HCNB based on localization data Training data (%)  122, and 1.81. From the above analysis, we can interpret that the execution time and memory decrease with the increase in the mapper size. In this proposed work, the mapper size depicts the number of desktops used.

Performance evaluation of different Bayesian classifiers
The performance evaluation using the cover type dataset is presented in this section. The developed classifiers are evaluated based on accuracy, sensitivity, specificity, memory, and time. The performance evaluation process is carried out by varying the number of mappers and the training data. The number of mappers used is 5, representing the number of desktops used for simulation of big data analysis, and the data sample for training varies from 75 to 90%. Table 2 shows the analysis of NB, CNB, GWO + CNB, CGCNB, FCNB, and HCNB classifiers based on cover type data.

Analysis based on training percentage
Consider Table 1, the sensitivity of the CNB classifier with 75% of training data is 75.8% while increasing the training percentage the sensitivity of the classifier goes on increasing, similarly for all the matrices like accuracy, specificity, memory and time the improved performance is achieved with an increase in training percentage. This improved performance with the increase in training percentage is achieved in all the classifiers.

Analysis based on mappers
In

Comparative discussion
Tables 1 and 2 for all the classifiers, the increase in training percentage increases the system's overall performance in terms of accuracy, sensitivity, specificity, memory, and execution time. Likewise, while increasing the mapper size, the memory requirement and the execution time decrease. For the localization dataset, the FCNB classifier has improved performance accuracy, sensitivity, and specificity compared to other methods. Similarly, for the cover type dataset, the HCNB classifier has enhanced performance accuracy, sensitivity, and specificity compared to other techniques. For both the datasets, CNB has improved performance compared to NB, because the highest posterior value is only selected as a consequential class. GWO + CNB is better than both NB and CNB because the GWO algorithm is used to train the CNB classifier. Similarly, CGCNB has improved performance compared to NB, CNB, and GWO + CNB. In CGCNB, the Cuckoo search algorithm is incorporated with GWO; hence the better result is obtained. Finally, both FCNB and HCNB have better performance. HCNB is well suited for big data classification.

Conclusion
This paper focused on big data classification based on different functions incorporated with Map-Reduce framework. The basic model is CNB classifier and later it is integrated with optimization algorithms, like cuckoo search and grey wolf optimization. The adoption of fuzzy theory with CNB classifier with membership degree of attributes included in the dataset provides performance achievements comparatively with CNB and CGCNB classifiers. The models, such as FCNB, HCNB, and CGCNB classifier, demonstrate the enhanced performance in localization and covertype databases from simulation outcomes. In the future, the performance of the classifiers will be analyzed using log loss and training loss.