Analysis of Bayesian optimization algorithms for big data classification based on Map Reduce framework

Abstract

Big data handling refers to the efficient management of the storage and processing of very large volumes of data. Data in structured and unstructured formats require specific approaches for overall handling. The classifiers analyzed in this paper are the correlative naïve Bayes classifier (CNB), Cuckoo Grey wolf CNB (CGCNB), Fuzzy CNB (FCNB), and Holoentropy CNB (HCNB). These classifiers are based on the Bayesian principle and work accordingly. CNB extends the standard naïve Bayes classifier by applying correlation among the attributes, replacing the independence assumption with a dependence hypothesis. The cuckoo search and grey wolf optimization algorithms are integrated with the CNB classifier, achieving a significant performance improvement; the resulting classifier is called the cuckoo grey wolf correlative naïve Bayes classifier (CGCNB). Also, the performance of the FCNB and HCNB classifiers is analyzed against CNB and CGCNB in terms of accuracy, sensitivity, specificity, memory, and execution time.

Introduction

Data sizes have increased because of technological developments in the data computing field. Big data refers to datasets too large to be managed and handled by classical database systems [1]. Big data handling is a major challenge in every application due to rapid data gathering and storage, networking, and other related techniques [2, 3]. Big data classification is certainly needed because a huge amount of data, approximately 2.5 quintillion bytes, is created every day [4]. Big data refers to the collection and processing of a huge amount of data [5], and its definition rests on facts including its size and the inability of certain systems to process it due to the high demand for primary storage and computing time [6, 7]. The extensive volume of data processed in a year leads to the frequent use of the term big data in data analytics, and every organization therefore requires data management options to handle big data effectively [8, 9].

Big data is characterized by factors such as the volume, variety, and veracity associated with it. The difficulty of big data processing is heightened because data arrives in a continuous stream from internet sources [10, 11]. Data analytics is considered one of the challenging research problems, and data mining and machine learning techniques are used to perform it. The inherent complexities of the data, namely its large size and non-uniform structure, expose the limitations of current data mining software tools and techniques [12]. Standard data mining and machine learning methods do not deal effectively with extremely large data [13, 14], so the analysis and knowledge extraction processes for big data can be further improved.

Data mining schemes include two major categories, namely clustering and classification. The big data classification process is primarily influenced by various classifiers [15, 16], such as naïve Bayes [17], Support Vector Machine [18], and Extreme Learning Machine (ELM) [19]. The data classification approach provided by the ELM algorithm [10] is based on multiclass rather than binary classification [20]. An observed fact about big data processing is the increased computational complexity caused by the high volume [21]. In supervised classification, the analysis of big data is based on learning algorithms that then find the appropriate classes for the datasets [22].

Analytics includes the extraction of useful insights from extensive data, and one of the major tasks associated with it is classification [23, 24]. Deep study suggests that typical classification algorithms must be redesigned to classify very large amounts of data, and such techniques are required for different applications. The redesign of classification techniques for large amounts of data must be addressed so that the applications using these algorithms can keep up with the extensive growth of data [25].

The analysis and organization techniques applicable to traditional systems are insufficient for addressing big data-related issues and challenges [26, 27]. Several algorithms have been developed for performing big data analysis tasks, and the compatibility of MapReduce with big data processing makes it the most suitable framework for such tasks [28].

The problem of processing large data is addressed by the Map-Reduce scheme, which runs Mapper and Reducer tasks in parallel on datasets [29].

The Map-Reduce principle is derived from the divide-and-conquer strategy of problem solving: the data samples are split into sub-samples that are created and handled independently [30, 31]. The map phase divides the source data and produces intermediate key-value pairs; the reduce phase then integrates the values obtained for each key [32].
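
To make the split-and-aggregate flow concrete, the sketch below shows a minimal Hadoop-style map and reduce pair in Java (the paper's implementation language). It is illustrative only, not the authors' code: the mapper emits a (class label, 1) pair per training record, and the reducer sums the per-class counts that a Bayesian classifier needs for its prior probabilities.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: each input line is one training record; the class label is assumed
// to be the last comma-separated field.
public class ClassCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text classLabel = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split(",");
        classLabel.set(fields[fields.length - 1]);  // class label in last column
        context.write(classLabel, ONE);             // emit (label, 1)
    }
}

// Reducer: sums the counts grouped by class label, yielding the frequencies
// behind the prior probabilities P(C_q).
class ClassCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        context.write(key, new IntWritable(sum));
    }
}
```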

Smart data analytics approaches such as image processing, multi-temporal processing, and automatic classification are increasingly in demand due to big data processing [33]. Data mining approaches integrated with recent, growing technologies are used to perform big data analytics with fewer limitations and drawbacks [34]. The MapReduce technique and distributed file system introduced by Google provide a suitable environment for processing large-scale datasets over a group of systems, and the MapReduce framework is used to perform big data mining efficiently using several processors [12]. Hadoop is one such system; it provides a parallel programming environment [35] for executing the MapReduce framework.

Classification algorithms are employed to solve data mining issues in big data because big data is gathered from different sources. The classification task builds a classifier model to predict the target class of the data items included in the dataset [36]. The emphasis on data classification motivates the adaptive use of classification techniques in big data analysis. Classification techniques including Bayes networks, genetic algorithms, genetic programming, and decision trees [37] are adopted to classify large data.

The literature shows that big data classification is performed primarily using machine learning methods [38, 39], optimization algorithms [32, 40], and decision trees [41], along with some other approaches. Including fuzzy theory in the correlative naïve Bayes classifier creates a model named the Fuzzy Correlative Naive Bayes Classifier (FCNB), which is developed for big data classification and implemented using the MapReduce framework. A further enhancement of the CNB classifier uses the holoentropy function; the resulting model, developed for big data classification using MapReduce, is named the Holoentropy-based Correlative Naive Bayes Classifier (HCNB).

The main contribution of this paper is the analysis of big data classification techniques based on the Map-Reduce model using the correlative naïve Bayes classifier (CNB), Cuckoo Grey wolf CNB (CGCNB), Fuzzy CNB (FCNB), and Holoentropy CNB (HCNB). The performance of the classifiers is evaluated based on accuracy, sensitivity, specificity, memory, and time. At first, CNB is compared with the existing naïve Bayes classifier. Further analysis then shows the significant performance improvement of CGCNB in comparison with NB and CNB. The other two models developed for big data classification, FCNB and HCNB, are compared with naïve Bayes [24], Correlative Naive Bayes (CNB) [20], Cuckoo Grey Wolf based CNB (CGCNB), and the Fuzzy Naïve Bayes classifier (FNB) [24]. The classifiers are implemented in the JAVA programming language. The localization dataset and cover type dataset, taken from the UCI machine learning repository, are used for experimentation.

This paper is organized into the following sections: “Literature review” section covers a number of techniques developed using the MapReduce framework to perform classification on big data. “Descriptions of Bayesian classification methods” section describes the developed Bayesian classifiers, namely the Correlative Naïve Bayes, Cuckoo Search Grey Wolf Correlative Naïve Bayes, Fuzzy Correlative Naïve Bayes, and Holoentropy-based Correlative Naïve Bayes classifiers. The analysis of the developed classifiers, with comparisons based on the obtained results for accuracy, sensitivity, specificity, memory, and time, is presented in “Results and discussion” section. Finally, the conclusion and future scope for improvement of the classifiers are discussed in “Conclusion” section.

Literature review

In this section, various algorithms for performing big data classification by different researchers and their importance in terms of advantages and disadvantages are presented.

Shen Gao and Kai Gao [42] created a technique for big data classification that works in the internet environment. The performance of this approach was stable, and it can also reduce computational complexity. The approach has lower complexity than traditional segmentation approaches, but one of its demerits is its strong feature-independence assumption.

Simone Scardapane et al. [43] designed an algorithm for big data classification based on Echo State Networks that utilizes a multipliers optimization procedure. The method performed local exchanges among nearby elements using the multiplier optimization procedure without requiring connected nodes, and the communicating nodes did not need access to the training patterns. Experiments on synthetic datasets showed improvements in computation time and generalization accuracy. In this method, pilot symbols are not required during communication; however, the dependence of the weights on error values is one of the drawbacks of this approach.

Jemal H. Abawajy et al. [44] implemented large iterative multitier ensemble (LIME) classifiers and analyzed their suitability for big data classification. This classifier integrated varieties of meta-classifiers at the lower levels with an iterative system developed at the higher level. The experimental results showed that the developed classifier improved performance compared with existing classifiers for big data, but its computational cost was a major drawback.

Xin et al. [45] integrated the Map-Reduce framework and the extreme learning machine to develop a classifier that works in a distributed environment. In the initial stage of this classifier, learning weights are computed through the multiplication of data matrices; a further stage proposed an elastic ELM with the Map-Reduce technique. The positive side of this approach was its rapid training speed and negligible human intervention, owing to its capability for efficient learning by dynamically processing massive training datasets.

Victoria López et al. [13] proposed a fuzzy rule-based classification algorithm named Chi-FRBCS-BigDataCS to deal with big imbalanced data. The algorithm handled the uncertainty in big data, with the strength of learning from the minority class. It combines fuzzy logic processes with the MapReduce concept in its design, using a cost-sensitive learning approach. The performance of this algorithm improved significantly in terms of classification accuracy and computation time.

Reshma C. Bhagat and Sachin S. Patil [46] performed classification of imbalanced data and created a multi-classifier. In the first step, they converted the original datasets into small groups containing binary classes. In the next step, balanced classes were formed from the imbalanced binary classes using the SMOTE algorithm. Classification was performed in the final step using the Random Forest classifier. The oversampling issues in this algorithm were handled using the MapReduce concept, so the developed method acquired the capability to handle large datasets. The fixing of dataset-dependent oversampling rates was one of the problems with this algorithm.

Triguero et al. [47] used the evolutionary under-sampling method and developed a parallel model for handling big data and its related issues. The processes involved in the classification task were distributed to various cluster computing nodes via the MapReduce technique to accomplish the classification work effectively. In addition, they introduced a windowing approach for handling imbalanced class data, which also increases the under-sampling speed. The advantage of this model was the scalability achieved in the experimental studies; the abrupt change in computation time with the imbalance ratio was its considerable drawback.

Alessio Bechini et al. [48] developed a classifier combining the Map-Reduce concept with association rules, named the MapReduce-based Associative Classifier (MRAC). This classifier utilized a distributed classification method with the Map-Reduce programming model and association rules. The method uses the FP-Growth algorithm, mines Classification Association Rules (CARs), and handles distributed rule pruning for classifying unlabeled patterns. The advantages of this method were speed-up and scalability, but it failed in experiments on very large datasets.

Shichao Zhang et al. [49] utilized a combination of the clustering and classification approaches of k-nearest neighbors (kNN) and k-means. The k-means clustering was used to divide the entire dataset into various parts, and kNN classification was performed within each cluster. Experiments performed on medical images and big data showed a trade-off between efficiency and accuracy. This approach had the drawback of selecting the value of k, because classification accuracy degraded when the value of k was increased.

Seyedali Mirjalili et al. [50] developed a nature-inspired meta-heuristic algorithm based on the social and hunting behavior of grey wolves. The algorithm was adopted in computer science for optimization and named the Grey Wolf Optimization algorithm (GWO). The grey wolves' leadership hierarchy and hunting technique were observed, and the GWO algorithm was framed accordingly. GWO showed superior performance when compared with other algorithms, namely Evolution Strategy (ES), Evolutionary Programming (EP), Differential Evolution (DE), the Gravitational Search Algorithm (GSA), and Particle Swarm Optimization (PSO).

Cuckoo search is another meta-heuristic algorithm, developed by Xin-She Yang and Suash Deb [51], that provides an approach to solving optimization problems. The algorithm combines the brood-parasitic behavior of cuckoo species [52] with the Levy flight behavior of some birds and fruit flies. Experimental results for this algorithm showed improved performance compared with genetic algorithms and particle swarm optimization techniques.

Simon Fong et al. [53] developed a method for mining streaming data while avoiding the computational challenges. The developed algorithm was based on lightweight feature selection approaches, used to avoid the high dimensionality of the incoming data. This method was suitable for real-time applications and showed effectiveness in terms of high accuracy, low latency, and robustness for solving big data problems.

S. Md. Mujeeb et al. [54] developed big data classification based on the MapReduce framework for effective data management using the E-Bat algorithm. The performance was analyzed based on accuracy and true positive rate (TPR); the data management is better, but the accuracy obtained is not high.

Shichao Zhang et al. [55] developed big data classification using kNN (k-nearest neighbors), a well-known, simple, and efficient learning algorithm for real-time applications. However, the value of k must be kept very small to obtain better accuracy, and such a selection is not possible for all datasets.

William et al. [56] developed multi-class imbalanced big data classification on Spark, in which clustering-based data partitioning is done to mitigate the big data classification problems. Random Forest and naive Bayes classification methods are used. They obtained better predictive power in the distributed environment, but the major drawback of this method is its time complexity.

Selvi et al. [57] developed optimal feature selection for big data classification, employing the Firefly with Lion-Assisted Model for feature selection. The model achieves effective classification, but it is not applicable to noisy data.

Mujeeb et al. [58] developed deep learning for big data classification, devising the Adaptive Exponential Bat algorithm for training the classifier. They obtained better accuracy, TPR, and TNR, but the security constraints are the major drawback of this method.

The major challenges in the existing techniques are computational complexity, time, cost, oversampling, and speed. These drawbacks are overcome by using Bayesian classifiers. The CNB classifier is well suited for imbalanced datasets, and by using optimization algorithms, better convergence is obtained with improved accuracy and low computational complexity.

Descriptions of Bayesian classification methods

Naïve Bayes classifier (NB)

Classifiers based on the Bayesian theorem are known as Bayesian classifiers; NB is one example. Bayesian classification computes the posterior probability from assumed prior probabilities and the probability of the data object under the given assumptions.

The NB classifier assumes that the attributes of the object to be classified are independent of each other. The probability of each category is calculated, and the object is assigned to the category with the largest associated probability.
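
As a minimal sketch of this decision rule (not the authors' implementation), the Java method below scores each class by its log-prior plus the summed log-likelihoods of the attribute values, assuming the per-class probability tables have already been estimated; the map names and the smoothing floor are illustrative.

```java
import java.util.Map;

public class NaiveBayes {
    // prior.get(c) = P(C = c); likelihood.get(c)[i].get(v) = P(attribute i = v | c)
    public static String classify(String[] x,
                                  Map<String, Double> prior,
                                  Map<String, Map<String, Double>[]> likelihood) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String c : prior.keySet()) {
            double score = Math.log(prior.get(c));        // log P(C)
            Map<String, Double>[] tables = likelihood.get(c);
            for (int i = 0; i < x.length; i++) {
                // A small floor stands in for smoothing of unseen symbols.
                double p = tables[i].getOrDefault(x[i], 1e-9);
                score += Math.log(p);                     // + log P(x_i | C)
            }
            if (score > bestScore) { bestScore = score; best = c; }
        }
        return best;  // category with the largest posterior score
    }
}
```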

Correlative Naive Bayes classifier (CNB)

The NB classifier is one of the most highly utilized classifiers; here, the typical classifier is adapted to the Map-Reduce framework and used for big data classification [59]. At the initial phase of the training process, the input data are arranged into groups based on the number of classes, and training produces a model of the form:

$$\mathrm{CNB}_{Q\times m}=\left\{\mu_{Q\times m},\ \sigma_{Q\times m},\ R_{Q\times 1}\right\}$$
(1)

where \({\mu }_{Q\times m}\) is the computed mean, \({\sigma }_{Q\times m}\) is the variance, and \({R}_{Q\times 1}\) denotes the correlation function, illustrated in vector form. The result for testing data is represented using the following equation:

$$C = \arg\max_{q=1,\ldots,Q}\left[P(C_{q})\times P(X\mid C_{q})\right]\times R^{q}$$
(2)

Equation (2) indicates that only the class with the highest correlation-weighted posterior value is selected as the consequential class.
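
A minimal Java sketch of the decision rule in Eq. (2) follows; it assumes Gaussian likelihoods built from the stored per-class mean and variance of Eq. (1), works in log space to avoid underflow, and uses illustrative names throughout.

```java
public class CorrelativeNaiveBayes {
    // prior[q] = P(C_q); mean[q][m], var[q][m] from the training model; r[q] = R^q.
    public static int classify(double[] x, double[] prior,
                               double[][] mean, double[][] var, double[] r) {
        int best = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int q = 0; q < prior.length; q++) {
            double logPost = Math.log(prior[q]);
            for (int m = 0; m < x.length; m++) {
                double v = Math.max(var[q][m], 1e-9);   // guard zero variance
                double diff = x[m] - mean[q][m];
                // log of the Gaussian density N(x_m; mu_{q,m}, sigma^2_{q,m})
                logPost += -0.5 * Math.log(2 * Math.PI * v) - diff * diff / (2 * v);
            }
            // weight the posterior by the correlation factor R^q (log space)
            double score = logPost + Math.log(Math.max(r[q], 1e-9));
            if (score > bestScore) { bestScore = score; best = q; }
        }
        return best;  // index of the consequential class
    }
}
```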

Cuckoo Grey Wolf optimization with correlative Naïve Bayes classifier (CGCNB)

The CNB classifier integrated with CGWO under the MapReduce framework is named CGCNB-MRP [60]. CGWO is the integration of the cuckoo search algorithm and grey wolf optimization (GWO). The block diagram of the developed model for big data classification is depicted in Fig. 1.

Fig. 1 Control and data flow of the CGCNB-MRP classifier
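
The paper does not reproduce the CGWO update equations, so as background the sketch below shows one position update of the standard GWO from [50], where each candidate solution moves toward the three best wolves (alpha, beta, delta); CGCNB's CGWO additionally blends in cuckoo search Levy flights, which are omitted here. The fitness driving the leader selection is assumed to be, for example, the classification accuracy of a CNB model parameterized by the position vector.

```java
import java.util.Random;

public class GreyWolfStep {
    private static final Random RNG = new Random();

    // One update of wolf x toward leaders alpha, beta, delta; the control
    // parameter a decreases linearly from 2 to 0 over the iterations.
    public static double[] update(double[] x, double[] alpha, double[] beta,
                                  double[] delta, double a) {
        double[] next = new double[x.length];
        for (int d = 0; d < x.length; d++) {
            double x1 = pull(alpha[d], x[d], a);
            double x2 = pull(beta[d], x[d], a);
            double x3 = pull(delta[d], x[d], a);
            next[d] = (x1 + x2 + x3) / 3.0;   // average of the three pulls
        }
        return next;
    }

    private static double pull(double leader, double xd, double a) {
        double A = 2 * a * RNG.nextDouble() - a;   // balances exploration/exploitation
        double C = 2 * RNG.nextDouble();
        double dist = Math.abs(C * leader - xd);   // D = |C * X_leader - X|
        return leader - A * dist;                  // X' = X_leader - A * D
    }
}
```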

Fuzzy correlative Naive Bayes classifier (FCNB)

Another extension of the CNB classifier incorporates fuzzy theory and is named FCNB [61]. The fuzzy membership degree used by the FCNB classifier is given by the following equation:

$${\mu }_{q}^{s}= \frac{\left|{m}_{q}^{s}\right|}{d}$$
(3)

The term \({\mu }_{q}^{s}\) represents the membership degree of the sth symbol in the qth attribute of the training model. The total occurrence of the sth symbol in the qth attribute is represented by \(\left|{m}_{q}^{s}\right|\), and d denotes the number of data samples for the attribute. Each data sample is classified into one of K classes, with class membership represented as follows:

$$\mu_{c}^{k} = \frac{{\left| {m^{k} } \right|}}{d}.$$
(4)

Here the total occurrence of the kth class in the ground-truth information is represented by \(|{m}^{k}|\). The outcome of the FCNB classifier is represented as below:

$$\mathrm{FCNB}=\{{\mu }_{q}^{k},{ \mu }_{c}^{k}, C\}.$$
(5)

In the testing stage of the developed FCNB classifier, the testing sample is classified into the appropriate class by using the posterior probability of naïve Bayes, the fuzzy membership degree, and a correlation factor among attributes. The output of FCNB is expressed in the equation below:

$$G=\arg\max_{k=1,\ldots,K} P\left({g}_{k}|X\right)\times {C}^{k}.$$
(6)

Here \(P({g}_{k}|X)\) denotes the posterior probability of class \({g}_{k}\) given the test data X, and \({C}^{k}\) signifies the correlation factor for class k.
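
A minimal sketch of the membership computation in Eq. (3), with illustrative names: for one attribute column of the training data, each symbol's membership degree is its frequency divided by the number of samples d.

```java
import java.util.HashMap;
import java.util.Map;

public class FuzzyMembership {
    // Returns mu_q^s = |m_q^s| / d for every symbol s of one attribute column q.
    public static Map<String, Double> memberships(String[] attributeColumn) {
        Map<String, Integer> counts = new HashMap<>();
        for (String symbol : attributeColumn) {
            counts.merge(symbol, 1, Integer::sum);          // |m_q^s| per symbol
        }
        Map<String, Double> mu = new HashMap<>();
        int d = attributeColumn.length;                     // number of data samples
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            mu.put(e.getKey(), e.getValue() / (double) d);  // mu = |m| / d
        }
        return mu;
    }
}
```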

Holoentropy using correlative naïve bayes classifier for a big data classification (HCNB)

The new classification technique called HCNB [62] is introduced by combining the existing CNB classifier with the holoentropy function. Attribute handling is based on estimating the holoentropy of each attribute using the following formula:

$${H}_{v}^{b}= F \times T({i}_{b}).$$
(7)

Here, F represents the weight function and \(T({i}_{b})\) is the entropy. The formulae for the weight function and the entropy are given in the following equations.

$$\mathrm{F}=2\left[1- \frac{1}{1+\mathrm{exp}(-\mathrm{T}({\mathrm{i}}_{\mathrm{b}}))}\right].$$
(8)
$$T({i}_{b}) = -\sum_{j=1}^{M({i}_{b})} P_{j} \log P_{j}.$$
(9)

Here, \(M({i}_{b})\) is the number of unique values of the attribute vector \({i}_{b}\), and \(P_{j}\) is the probability of the jth unique value. The training phase of HCNB, based on the training data samples, produces the result in the following vector form:

$$\mathrm{HCNB}_{a\times s}=\left\{\mu_{a\times s},\ \sigma^{2}_{a\times s},\ C_{a\times 1},\ H_{a\times s}\right\}.$$
(10)

Here, \(\mu_{a\times s}\) and \(\sigma^{2}_{a\times s}\) represent the computed mean and variance values for attributes a and s, respectively. Also, the correlation is represented by \(C_{a\times 1}\), and the holoentropy function by \(H_{a\times s}\). An individual class is selected by estimating its posterior probability independently during the testing phase, represented by the expression below:

$$P\left(Y|{k}_{v}\right)= \prod_{b=1}^{s}P\left(Y={y}_{b}|{k}_{v}\right)$$
(11)

where \({y}_{b}\) denotes the bth attribute value of the test sample Y, and \({k}_{v}\) indicates the vth class. The block diagram of the HCNB classifier is depicted in Fig. 2.

Fig. 2 Block diagram of the HCNB classifier
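
A minimal sketch of the holoentropy computation of Eqs. (7)-(9) for a single attribute vector, with illustrative names: T is the Shannon entropy over the attribute's unique values, F is the weight function, and the holoentropy is their product.

```java
import java.util.HashMap;
import java.util.Map;

public class Holoentropy {
    public static double holoentropy(String[] attribute) {
        Map<String, Integer> counts = new HashMap<>();
        for (String v : attribute) counts.merge(v, 1, Integer::sum);

        double t = 0.0;                                // entropy T(i_b), Eq. (9)
        for (int c : counts.values()) {
            double p = c / (double) attribute.length;  // P_j of the jth unique value
            t -= p * Math.log(p);
        }
        double f = 2.0 * (1.0 - 1.0 / (1.0 + Math.exp(-t)));  // weight F, Eq. (8)
        return f * t;                                  // H = F * T(i_b), Eq. (7)
    }
}
```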

Results and discussion

This section presents the classifiers' evaluation results, with a detailed comparative analysis against the existing methods. The system requirements and implementation details are provided in the experimental setup.

Experimental setup

The system configuration used for the experiments is Windows 10 running on an Intel CPU at 2.16 GHz with 2 GB RAM. The developed classifiers are implemented in the JAVA programming language. The parameters used for the experimentation are a maximum of 5 iterations, a population size of 6, and a mapper size of 5.

Dataset description

Two datasets are utilized for experimentation: the localization dataset and the cover type dataset.

The localization dataset is taken from the UCI machine learning repository [63]. It records the activities of five people, each wearing four tags: ankle left, ankle right, belt, and chest. The dataset contains a total of 164,860 instances with 8 attributes. Each instance forms localization data for the tags, and the attributes are used to recognize them.

The cover type dataset is taken from the UCI machine learning repository [64]. The dataset contains a total of 581,012 instances with 54 attributes.

Metrics used for performance evaluation

Five metrics, namely accuracy, sensitivity, specificity, memory, and time, are used to evaluate the performance of the classifiers. The degree of veracity is measured using the accuracy metric, defined as the proportion of true results. Sensitivity and specificity refer to the proportions of correctly identified true positives and true negatives, respectively.

$$Acc = \frac{{T_{N} + T_{P} }}{{T_{N} + T_{P} + F_{N} + F_{P} }}$$
(A)
$$Sens = \frac{{T_{P} }}{{T_{P} + F_{N} }}$$
(B)
$$Spec = \frac{{T_{N} }}{{T_{N} + F_{P} }}$$
(C)

where \(T_{N}\) is the number of true negatives, \(T_{P}\) true positives, \(F_{N}\) false negatives, and \(F_{P}\) false positives.
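
For completeness, a small helper computing the three metrics directly from confusion-matrix counts (names are illustrative):

```java
public class Metrics {
    public static double accuracy(long tp, long tn, long fp, long fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);   // Eq. (A)
    }
    public static double sensitivity(long tp, long fn) {
        return (double) tp / (tp + fn);                    // Eq. (B)
    }
    public static double specificity(long tn, long fp) {
        return (double) tn / (tn + fp);                    // Eq. (C)
    }
}
```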

Comparative analysis of Bayesian models

A comparative analysis is done to evaluate the developed classifiers with the existing models based on sensitivity, specificity, and accuracy.

In this paper, the first model, CNB, is compared with the existing naïve Bayes classifier. Further analysis then shows the significant performance improvement of CGCNB in comparison with NB and CNB. The other two models developed for big data classification, FCNB and HCNB, are compared with naïve Bayes [24], Correlative Naive Bayes (CNB) [20], Cuckoo Grey Wolf based CNB (CGCNB), and the Fuzzy Naïve Bayes classifier (FNB) [24].

The naïve Bayes classifier incorporating fuzzy theory is named the Fuzzy Naïve Bayes classifier (FNB); the comparison shows that the FCNB classifier achieves enhanced performance and is well suited to big data classification.

The holoentropy extension of the CNB classifier (HCNB) is created and compared with the existing models under similar conditions and parameters; improvements are observed when its performance is assessed on the localization dataset and the cover type dataset. The comparative study of the models is done for training samples varying from 75 to 90% with the number of mappers M = 5. The number of mappers represents the number of desktops used for simulation.

Analysis using localization dataset

Performance evaluation of different Bayesian classifiers

The developed classifiers are evaluated on the localization dataset based on accuracy, sensitivity, specificity, memory, and time. The performance evaluation process is carried out with a mapper size of 5 and varying amounts of training data. Table 1 shows the analysis of the NB, CNB, GWO + CNB, CGCNB, FCNB, and HCNB classifiers based on the localization data.

Table 1 Analysis of NB, CNB, GWO + CNB, CGCNB, FCNB, and HCNB based on localization data

Analysis based on training percentage

Consider Table 1: the accuracy of the NB classifier with 75% training data is 76.3%, and as the training percentage increases, the accuracy of the classifier keeps increasing. Similarly, for all the metrics, namely sensitivity, specificity, memory, and time, improved performance is achieved with an increase in training percentage. This improvement with increasing training percentage is achieved for all the classifiers.

Analysis based on mappers

In Table 1, for the HCNB classifier with mapper size 2 and a training percentage of 75%, the performance metrics of accuracy, sensitivity, specificity, memory, and execution time are 85, 94, 89.2, 166, and 1.99, respectively. Similarly, for mapper size 5 with a training percentage of 75%, the corresponding values are 85.3, 94.2, 89.3, 122, and 1.81. From this analysis, we can interpret that execution time and memory decrease as the mapper size increases. In this work, the mapper size depicts the number of desktops used.

Analysis using cover type Dataset

Performance evaluation of different Bayesian classifiers

The performance evaluation using the cover type dataset is presented in this section. The developed classifiers are evaluated based on accuracy, sensitivity, specificity, memory, and time. The performance evaluation process is carried out by varying the number of mappers and the training data. The number of mappers used is 5, representing the number of desktops used for simulation of big data analysis, and the data sample for training varies from 75 to 90%. Table 2 shows the analysis of NB, CNB, GWO + CNB, CGCNB, FCNB, and HCNB classifiers based on cover type data.

Table 2 Analysis of NB, CNB, GWO + CNB, CGCNB, FCNB, and HCNB based on cover type data

Analysis based on training percentage

Consider Table 2: the sensitivity of the CNB classifier with 75% training data is 75.8%, and as the training percentage increases, the sensitivity of the classifier keeps increasing. Similarly, for all the metrics, namely accuracy, specificity, memory, and time, improved performance is achieved with an increase in training percentage. This improvement with increasing training percentage is achieved for all the classifiers.

Analysis based on mappers

In Table 2, for the FCNB classifier with mapper size 3 and a training percentage of 90%, the performance metrics of accuracy, sensitivity, specificity, memory, and execution time are 79.8, 82.9, 73.7, 26.6, and 18.4, respectively. Similarly, for mapper size 5 with a training percentage of 90%, the corresponding values are 80.1, 83.4, 73.9, 25.5, and 16.8. From this analysis, we can interpret that execution time and memory decrease as the mapper size increases. In this work, the mapper size depicts the number of desktops used.

Comparative discussion

As Tables 1 and 2 show for all the classifiers, increasing the training percentage improves the system's overall performance in terms of accuracy, sensitivity, specificity, memory, and execution time. Likewise, increasing the mapper size decreases the memory requirement and execution time. For the localization dataset, the FCNB classifier improves on the accuracy, sensitivity, and specificity of the other methods; similarly, for the cover type dataset, the HCNB classifier shows enhanced accuracy, sensitivity, and specificity compared to the other techniques.

For both datasets, CNB has improved performance compared to NB because only the class with the highest correlation-weighted posterior value is selected as the consequential class. GWO + CNB is better than both NB and CNB because the GWO algorithm is used to train the CNB classifier. Similarly, CGCNB improves on NB, CNB, and GWO + CNB; in CGCNB the cuckoo search algorithm is incorporated with GWO, hence better results are obtained. Finally, both FCNB and HCNB perform better still, and HCNB is well suited for big data classification.

Conclusion

This paper focused on big data classification based on different functions incorporated with the Map-Reduce framework. The basic model is the CNB classifier, which is later integrated with optimization algorithms such as cuckoo search and grey wolf optimization. The adoption of fuzzy theory in the CNB classifier, with membership degrees for the attributes included in the dataset, provides performance improvements relative to the CNB and CGCNB classifiers. The models, namely the FCNB, HCNB, and CGCNB classifiers, demonstrate enhanced performance on the localization and cover type datasets in the simulation outcomes. In the future, the performance of the classifiers will be analyzed using log loss and training loss.

Availability of data and materials

The data used for the analysis are taken from the localization dataset, available at https://archive.ics.uci.edu/ml/datasets/Localization+Data+for+Person+Activity, and the cover type dataset, available at https://archive.ics.uci.edu/ml/datasets/Covertype.

References

  1. Benabderrahmane S, Mellouli N, Lamolle M, Paroubek P. Smart4Job: a big data framework for intelligent job offers broadcasting using time series forecasting and semantic classification. Big Data Research. 2017;7:16–30.

  2. Thanekar SA, Subrahmanyam K, Bagwan AB. Big data and MapReduce challenges, opportunities and trends. Int J Electr Comput Eng. 2016; 6(6): 2911–2919. https://doi.org/10.11591/ijece.v6i6.10555.

  3. Raghav RS, Amudhavel J, Dhavachelvan P. A survey on tools used in big data platform. Adv Appl Math Sci. 2017;17(1):213–29.

  4. Wu X, Zhu X, Wu GQ, Ding W. Data mining with big data. IEEE Trans Knowl Data Eng. 2014;26(1):97–107.

  5. Marx V. The big challenges of big data. Nature. 2013;498(7453):255–60.

  6. Minelli M, Chambers M, Dhiraj A. Big Data, big analytics: emerging business intelligence and analytic trends for today’s businesses. 1st ed. New York: Wiley Publishing; 2013.

  7. Pole G, Gera P. A recent study of emerging tools and technologies boosting big data analytics. 2016. https://doi.org/10.1007/978-981-10-0419-3_4.

  8. Lin W, Wu Z, Lin L, Wen A, Li J. An Ensemble Random Forest Algorithm for Insurance Big Data Analysis. IEEE Access. 2017;5:16568–75.

  9. Patil SS, Sonavane SP. Enriched Over_Sampling Techniques for Improving Classification of Imbalanced Big Data. In: Proceedings of IEEE Third International Conference on Big Data Computing Service and Applications (Big Data Service), San Francisco, CA, pp. 1–10, 2017.

  10. Chen J, Chen H, Wan X, Zheng G. MR-ELM: a MapReduce-based framework for large-scale ELM training in big data era. Neural Comput Appl. 2016;27(1):101–10.

  11. Radha K, Thirumala Rao B. Research issues and challenges of big data. Int J Contr Theory Appl. 2016;9(17):8437–44.

  12. Tsai C-F, Lin W-C, Ke S-W. Big data mining with parallel computing: a comparison of distributed and MapReduce methodologies. J Syst Softw. 2016;122:83–92.

  13. López V, del Río S, Benítez JM, Herrera F. Cost-sensitive linguistic fuzzy rule based classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets Syst. 2015;258:5–38.

  14. Sucharita V, Jyothi S, Rao PV. Comparison of machine learning algorithms for classification of penaeid prawn species. In: Paper presented at the Proceedings of the 10th INDIACom; 2016 3rd International Conference on Computing for Sustainable Global Development, INDIACom 2016, 1610–1613.

  15. Haque A, Parker B, Khan L, Thuraisingham B. Evolving Big Data Stream Classification with MapReduce. In: Proceedings of IEEE 7th International Conference on Cloud Computing, Anchorage, AK, pp. 570–577, 2014.

  16. Hegazy O, Safwat S, Bakry ME. A mapreduce fuzzy techniques of big data classification. In: Proceedings of the SAI Computing Conference (SAI), London, pp. 118–128, 2016.

  17. Santafe G, Lozano JA, Larranaga P. Bayesian Model averaging of naive bayes for clustering. IEEE Trans Syst Man Cybern. 2006;36(5):1149–61.

  18. Huang X, Shi L, Suykens JK. Support vector machine classifier with pinball loss. In: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, no. 5, pp. 984–997, 2014.

  19. Huang GB, Zhu QY, Siew CK. Extreme learning machine: theory and applications. Neurocomputing. 2006;70(1):489–501.

  20. Duan M, Li K, Liao X, Li K. A parallel multiclassification algorithm for big data using an extreme learning machine. IEEE Trans Neural Netw Learn Syst. 2017;29(6):2337–51.

  21. Arnaiz-González Á, González-Rogel A, Díez-Pastor JF, López-Nozal C. MR-DIS: democratic instance selection for big data by MapReduce. Progr Artif Intell. 2017;6(3):211–9.

  22. Potharaju SP, Sreedevi M. Distributed feature selection (DFS) strategy for microarray gene expression data to improve the classification performance. Clin Epidemiol Global Health. 2019;7(2):171–6.

  23. Segatori A, Marcelloni F, Pedrycz W. On distributed fuzzy decision trees for big data. IEEE Trans Fuzzy Syst. 2018;26(1):174–92.

  24. Bechini A, Marcelloni F, Segatori A. A MapReduce solution for associative classification of big data. Inf Sci. 2016;332:33–55.

  25. Manekar AK, Pradeepini G. Cloud based big data analytics a review. In: Paper presented at the Proceedings - 2015 International Conference on Computational Intelligence and Communication Networks, CICN 2015, 2016; 785–788. https://doi.org/10.1109/CICN.2015.160.

  26. Hu H, Wen Y, Chua TS, Li X. Toward scalable systems for big data analytics: a technology tutorial. IEEE Access. 2014;2:652–87.

  27. Bechini A, Marcelloni F, Segatori A. A MapReduce solution for associative classification of big data. Inform Sci. 2016;332:33–55.

  28. Priyadarshini A, Agarwal S. A Map-Reduce based support vector machine for big data classification. Int J Database Theory Appl. 2015;8(5):77–98.

  29. Bhukya R, Gyani J. Fuzzy associative classification algorithm based on MapReduce framework. In: Proceedings of the international conference on applied and theoretical computing and communication technology (iCATccT), Davangere, pp. 357–360, 2015.

  30. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. ACM Commun. 2008;51(1):107–13.

  31. Elkano M, Galar M, Sanz J, Bustince H. CHI-BD: A Fuzzy Rule-Based Classification System for Big Data classification problems. Fuzzy Sets Syst. 2018;348:75–101.

  32. Polepally V, Chatrapati KS. Dragonfly optimization and constraint measure-based load balancing in cloud computing. Cluster Comput. 2019;22(1):1099–111.

  33. Cavallaro G, Riedel M, Richerzhagen M, Benediktsson JA, Plaza A. On understanding big data impacts in remotely sensed image classification using support vector machine methods. IEEE J Select Top Appl Earth Observ Remote Sens. 2015;8(10):4634–46.

  34. Triguero I, Peralta D, Bacardit J, García S, Herrera F. MRPR: A MapReduce solution for prototype reduction in big data classification. Neurocomputing. 2015;150:331–45.

  35. Dean J, Ghemawat S. Map reduce: a flexible data processing tool. Commun ACM. 2010;53(1):72–7.

  36. Kamal MS, Parvin S, Ashour AS, Shi F, Dey N. De-Bruijn graph with MapReduce framework towards metagenomic data classification. Int J Inform Technol. 2017;9(1):59–75.

  37. Arnaiz-González Á, González-Rogel A, Díez-Pastor JF, López-Nozal C. MR-DIS: democratic instance selection for big data by MapReduce. Progr Artif Intell. 2017;6(3):211–9.

  38. Ratre A. Taylor series based compressive approach and Firefly support vector neural network for tracking and anomaly detection in crowded videos. J Eng Res. 2019;20(7):4.

  39. Arul VH. An approach for speech enhancement using deep convolutional neural network. Multimedia Res. 2019;2(1):37–44.

  40. More NS, Ingle RB. Energy-aware VM migration using dragonfly–crow optimization and support vector regression model in Cloud. Int J Model Simul Sci Comput. 2018;9(06):1850050.

  41. Daga BS, Bhute AN. Predicting recurrence pattern in breast cancer Using Decision Tree. 2009.

  42. Gao S, Gao K. Modelling on Classification and Retrieval Strategy in Map-Reduce Based IR System. In: proceedings of 2014 International Conference on Modelling, Identification and Control, Melbourne, Australia, December 3–5, 2014.

  43. Scardapane S, Wang D, Panella M. A decentralized training algorithm for echo state networks in distributed big data applications. Neural Networks. 2016;1(78):65–74.

  44. Abawajy JH, Kelarev A, Chowdhury M. Large iterative multitier ensemble classifiers for security of bigdata. IEEE Trans Emerg Top Comput. 2014;2(3):352–63.

  45. Xin J, Wang Z, Luxuan Qu, Wang G. Elastic extreme learning machine for big data classification. Neurocomputing. 2015;149:464–71.

  46. Bhagat RC, Patil SS. Enhanced SMOTE Algorithm for Classification of Imbalanced Big-Data usingRandom Forest. In: Proceedings of IEEE International on Advance Computing Conference (IACC), pp. 403–408, 2015.

  47. Triguero I, Galar M, Vluymans S, Cornelis C, Bustince H, Herrera F, Saeys Y. Evolutionary undersampling for imbalanced big data classification. Evol Comput. 2009;17(3):275–306.

  48. Bechini A, Marcelloni F, Segatori A. A MapReduce solution for associative classification of big data. Inf Sci. 2016;332:33–55.

  49. Deng Z, Zhu X, Cheng D, Zong M, Zhang S. Efficient kNN classification algorithm for big data. Neurocomputing. 2016;195:143–8.

  50. Mirjalili S, Mirjalili SM, Lewis A. Grey Wolf Optimizer. Adv Eng Softw. 2014;69:46–61.

  51. Yang XS, Deb S. Cuckoo search via Levy flights. In: Proceedings of World Congress on Nature & Biologically Inspired Computing (NaBIC 2009), Coimbatore, IEEE Publications, USA, pp. 210–214. 2009.

  52. Thirugnanasambandam K, Prakash S, Subramanian V, et al. Reinforced cuckoo search algorithm-based multimodal optimization. Appl Intell. 2019;49:2059–83. https://doi.org/10.1007/s10489-018-1355-3.

  53. Fong S, Wong R, Vasilakos AV. Accelerated PSO swarm search feature selection for data stream mining big data. IEEE Trans Serv Comput. 2016;9(1):33–45.

  54. Mujeeb SM, Sam RP, Madhavi K. Adaptive hybrid optimization enabled stack autoencoder-based MapReduce framework for big data classification. In: 2020 International Conference on Emerging Trends in Information Technology and Engineering (ic-ETITE) 2020, pp. 1–5.

  55. Deng Z, Zhu X, Cheng D, Zong M, Zhang S. Efficient kNN classification algorithm for big data. Neurocomputing. 2016;26(195):143–8.

  56. Sleeman WC IV, Krawczyk B. Multi-class imbalanced big data classification on Spark. Knowl Based Syst. 2020.

  57. Selvi RS, Valarmathi ML. Optimal feature selection for big data classification: firefly with lion-assisted model. Big data. 2020;8(2):125–46.

  58. Mujeeb SM, Sam RP, Madhavi K. Adaptive Exponential Bat algorithm and deep learning for big data classification. Sādhanā. 2021;46(1):1–5.

  59. Banchhor C, Srinivasu N. CNB-MRF: adapting correlative naive Bayes classifier and MapReduce framework for big data classification. Int Rev Comput Softw. 2016. https://doi.org/10.15866/irecos.v11i11.10116.

  60. Banchhor C, Srinivasu N. Integrating Cuckoo search-Grey wolf optimization and Correlative Naive Bayes classifier with Map Reduce model for big data classification. Data Knowl Eng. 2020. https://doi.org/10.1016/j.datak.2019.101788.

  61. Banchhor C, Srinivasu N. FCNB: fuzzy correlative naive Bayes classifier with MapReduce framework for big data classification. J Intell Syst. 2018. https://doi.org/10.1515/jisys-2018-0020.

  62. Banchhor C, Srinivasu N. Holoentropy based correlative naive Bayes classifier and MapReduce model for classifying the big data. Evol Intell. 2019. https://doi.org/10.1007/s12065-019-00276-9.

  63. Localization dataset. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Localization+Data+for+Person+Activity.

  64. Covertype dataset. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets/Covertype. 2020.

Acknowledgements

I would like to express my very great appreciation to Dr. N. Srinivasu for his valuable and constructive suggestions during the planning and development of this research work. His willingness to give his time so generously has been very much appreciated. Also, I wish to thank my parents for their support and encouragement throughout my study.

Funding

None.

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed to the design and implementation of the research, to the analysis of the results and to the writing of the manuscript. Both authors read and approved the final manuscript.

Corresponding author

Correspondence to Chitrakant Banchhor.

Ethics declarations

Ethics approval and consent to participate

This study did not involve any animal or human subjects.

Consent for publication

None.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Banchhor, C., Srinivasu, N. Analysis of Bayesian optimization algorithms for big data classification based on Map Reduce framework. J Big Data 8, 81 (2021). https://doi.org/10.1186/s40537-021-00464-4
