
Online Feature Selection (OFS) with Accelerated Bat Algorithm (ABA) and Ensemble Incremental Deep Multiple Layer Perceptron (EIDMLP) for big data streams

Abstract

Feature selection is mainly used to reduce the processing load of data mining models. To shorten the time needed for processing voluminous data, parallel processing is carried out with the MapReduce (MR) technique. However, with the existing algorithms, the performance of the classifiers needs substantial improvement. The MR method recommended in this research work performs feature selection in parallel, which improves performance. To enhance the efficacy of the classifier, this research work proposes an innovative Online Feature Selection (OFS)–Accelerated Bat Algorithm (ABA) and a framework for applications that stream features in advance with no prior knowledge of the feature space. The concrete OFS-ABA method is proposed to select significant and non-redundant features within the MapReduce (MR) framework. Finally, an Ensemble Incremental Deep Multiple Layer Perceptron (EIDMLP) classifier is applied to classify the dataset samples; the outputs of homogeneous IDMLP classifiers are combined by the EIDMLP classifier. The proposed feature selection method, together with the classifier, is evaluated extensively on three datasets of high dimensionality. In this research work, the MR-OFS-ABA method has shown better performance than the existing feature selection methods, namely PSO, APSO, and ASAMO (Accelerated Simulated Annealing and Mutation Operator). The result of the EIDMLP classifier is compared with other existing classifiers such as Naïve Bayes (NB), Hoeffding tree (HT), and Fuzzy Minimal Consistent Class Subset Coverage (FMCCSC)-KNN (K Nearest Neighbour). The methodology is applied to three datasets, and the results are compared with four classifiers and three state-of-the-art feature selection algorithms. The outcome of this research work shows improved accuracy and reduced processing time.

Introduction

Big data refers to any collection of data so large and complex that conventional database management systems and data processing tools cannot process it. Big data is characterized by the 5 V’s [1, 2], namely “Volume”, “Variety”, “Velocity”, “Variability”, and “Veracity” (Fig. 1).

Fig. 1 Big data V’s

In light of these challenges, it is acknowledged that conventional data mining methods are neither appropriate for data streams of diverse characteristics nor able to achieve analytical efficiency, since they rely on periodic analytics whereas big data requires real-time analytics. Moreover, the induction model has to be re-run and rebuilt every time recent data is added. To ensure scalability, the MapReduce framework [3,4,5] is implemented to parallelize the classification algorithm. Presently, many Machine Learning (ML) techniques are designed for big data analytics [6]. For problems involving big data sets of varied type and nature, deep learning (DL) techniques have been formulated for improved classification performance.

DL algorithms are beneficial because they handle multifaceted, complex, and unstructured data through greedy layer-wise learning [7, 8]. DL has made significant contributions to ML applications, notably speech recognition systems [9], computer vision [10], and NLP [11], and has been used proficiently to address pressing issues in big data analytics.

Feature selection (FS) is a crucial step in any classification application, and it becomes a complex task when the number of data features is huge. For several years, many research works have focused on FS methods [12, 13]. FS essentially involves the removal of extraneous and redundant features, thereby creating a prediction model with higher efficiency, interpretability, and speed. FS has been applied to many applications involving high-dimensional data. Although comprehensive methods are available for FS, handling rapid big data streams that require instantaneous processing remains an open challenge.

The current literature features many FS methods, but they operate in batch mode rather than online. In offline mode, all features are available for training before the FS process. In real-time applications, however, features may arrive online, and accumulating all training examples becomes expensive. OFS [14] was therefore introduced for selecting features using an online learning approach. To extract significant insights from big data sets during online learning, the relevant features must be identified efficiently, so that accurate prediction models can be built in real time. OFS over data streams is closely associated with data stream mining [15].

When data becomes unmanageably large, parallel processing is employed to reduce the time complexity. In this research effort, a scalable and efficient OFS method using the parallel Accelerated Bat Algorithm (ABA) technique is proposed to select features from the data set online. In addition, the proposed Ensemble Incremental Deep Multiple Layer Perceptron (EIDMLP) classifier is used for large-scale data. To work with large-scale datasets, the MapReduce distributed programming model is used, which divides the dataset into smaller portions. The scalability of OFS-ABA over extremely high-dimensional and big datasets is proven through an empirical study, which also demonstrates that the algorithm performs better than other state-of-the-art FS methods.

This research work is structured as follows. The “Literature review” section outlines the related work in the field of feature selection and classification, and the motivation behind this research work. The “Proposed methodology” section describes the proposed method step by step, from pre-processing to classification. The “Experimental work” section describes the datasets used and the evaluation metrics, and presents the outcomes as tables and figures. The final section, “Conclusion”, summarizes the entire work.

Literature review

A feature selection (FS) technique for big data analytics is expected to select significant features with reduced time complexity and enhanced accuracy. Recent progress in OFS with MapReduce has brought a major revolution in this domain. In recent years, bio-inspired algorithms have been used for various problems in big data analytics [16].

Hoi et al. [14] designed an effective algorithm that addresses this problem, giving a theoretical analysis and assessing the performance of OFS empirically on benchmark datasets. The application of OFS was demonstrated on real-time problems, where it scales significantly better than other FS algorithms, and the outcomes validate the efficacy of the proposed techniques for extensive and varied large-scale applications.

Peralta et al. [17] proposed a MapReduce approach to derive a subset of features from large data sets. The FS method was assessed with classifiers such as support vector machine (SVM), Naïve Bayes (NB), and logistic regression (LR). The evaluations showed that the Spark-implemented framework was beneficial for performing evolutionary FS on large data sets with enhanced classification precision and runtime. Tsamardinos et al. [18] proposed the Parallel Forward–Backward with Pruning (PFBP) algorithm for FS on huge datasets. The experimental study demonstrated increased scalability in the number of features with speedup.

Tan et al. [19] presented a novel FS algorithm for big datasets. The algorithm was based on convex semi-infinite programming (SIP) with a multiple kernel learning (MKL) sub-problem solved by an adaptive accelerated proximal gradient technique, where each base kernel is associated with a set of features. The results showed improved training efficiency on larger data with ultra-large sample sizes.

De la Iglesia [20] surveyed diverse Evolutionary Computation (EC) methods for FS in classification problems. The strength of EC is its ability to efficiently search large solution spaces. The assessment and implementation revealed the competency of these algorithms and pointed to new research directions in FS problems. Nazar and Senthilkumar [21] contributed an efficient, scalable OFS that used the Sparse Gradient (SGr) for the online selection of features. In this approach, based on a threshold value, the feature weights were proportionally decremented, which set irrelevant feature weights to zero. The experimental results demonstrated up to 15% higher accuracy compared to other methods.

Hu et al. [22] elaborated on online FS for streaming data, presenting a comprehensive review of existing OFS methods, comparing them with one another, and discussing the open issues in FS. Yu et al. [23] built the Scalable and Precise Online Feature Selection (SAOLA) model, an online FS approach built on pairwise comparison techniques and extended to online group FS. The SAOLA algorithms were scalable on high-dimensional data sets and exhibited superior performance compared to other prevailing algorithms.

Reviews of feature selection methods for handling data streams appear in many recent works [24,25,26,27,28,29]. Fong et al. [24] proposed a novel, lightweight Accelerated Particle Swarm Optimization (APSO) feature selection algorithm for big data streams. The APSO algorithm is based on swarm intelligence, and the reported results show that it performed well in terms of accuracy, time complexity, and related measures. Five benchmark datasets were used in the experiments.

Said and Alimi [25] crafted a Multi-Objective Automated Negotiation based Online Feature Selection (MOANOFS) system. The results demonstrated that MOANOFS can be successfully applied to diverse domains and accomplish high accuracy in real-time applications. Lin et al. [26] proposed an improved cat swarm optimization (ICSO) algorithm for big data classification, applied it to FS in text classification problems in big data analytics, and compared ICSO with CSO. Its disadvantage is that it is pertinent only to text classification problems.

Gu et al. [27] proposed the competitive swarm optimizer, a variant of the PSO algorithm that overcomes the shortcomings of conventional PSO when handling large-scale datasets at lower computational cost. The algorithm performs FS to select minimal subsets, followed by classification. Future work extends to exploring multi-objective, meta-heuristic FS algorithms to handle huge dimensionality with enhanced accuracy.

Manoj et al. [28] proposed the ACO–ANN algorithm for FS in big data analytics for text classification. The challenge of this approach is applying it to other types of data such as images and video. The work emphasized the use of population-based hybrid algorithms for FS problems. Devi et al. [29] proposed the Multi-Objective Firefly and Simulated Annealing method for online feature selection, where a KSVM classifier was used for classification. This scheme had the limitation of using only one classifier, and its performance was not compared with other classifiers.

For classification problems, DL techniques are considered efficient [30, 31]. Wan et al. [30] proposed a deep multilayer perceptron classifier for Parkinson’s disease behaviour analysis, which demonstrated enhanced accuracy compared with other algorithms. Young et al. [31] outlined a DL-based ensemble approach for prediction in big data analytics; this work highlighted the issues of conventional mining and demonstrated the elevated performance of deep neural networks.

From the prevailing literature, it can be deduced that bio-inspired algorithms combined with the MapReduce approach prove to be effective and competent for feature selection (FS) in the field of big data analytics. It is also evident that DMLP is used for classification problems.

Proposed methodology

The MapReduce model is applied to the big datasets, which are divided into smaller partitions. In the proposed work, an efficient, scalable Online Feature Selection (OFS) approach using the Accelerated Bat Algorithm (ABA) technique is employed for OFS. In this approach, based on threshold values, the feature weights are proportionally decremented, and the Clustering Coefficients of Variation (CCV) fitness sets the weights of uninformative features to zero. This work proposes an Ensemble Incremental Deep Multiple Layer Perceptron (EIDMLP) classifier for large-scale data. We have also analyzed the impact of the penalty and kernel parameters on the performance of the EIDMLP classifier. The scalability of OFS-ABA over extremely high-dimensional and big datasets is proven through an empirical study, which also illustrates that the algorithm performs better than other known FS methods. The proposed model is shown in Fig. 2.

Fig. 2 OFS-ABA and EIDMLP algorithm model

Preprocessing

Normalization is commonly used to maintain the balance of significance among attributes when they are on diverse scales. Datasets with attributes of diverse ranges are preprocessed using the min–max normalization method. In this process, all values are transformed onto the same scale between 0 and 1, so that attributes with a low range of values still receive appropriate importance.

It is the method of scaling the given dataset within the specified range of values between 0 and 1. From the Eq. (1), the normalized feature is derived.

$$n^{\prime} = \frac{{n - min_{ds} }}{{max_{ds} - min_{ds} }}\left( {nf_{max} - nf_{min} } \right) + nf_{min}$$
(1)

where n is the current value of the feature and n′ is its normalized value. minds and maxds are the minimum and maximum values of the given dataset. nfmax and nfmin define the normalization range, here 1 and 0 respectively.
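For illustration, a minimal sketch of this min–max normalization in Python (NumPy), with variable names mirroring Eq. (1); the guard against a zero feature range is an added assumption not stated in the text.

```python
import numpy as np

def min_max_normalize(X, nf_min=0.0, nf_max=1.0):
    """Scale every feature of X into [nf_min, nf_max] as in Eq. (1)."""
    X = np.asarray(X, dtype=float)
    min_ds = X.min(axis=0)                                  # per-feature minimum
    max_ds = X.max(axis=0)                                  # per-feature maximum
    rng = np.where(max_ds > min_ds, max_ds - min_ds, 1.0)   # avoid division by zero
    return (X - min_ds) / rng * (nf_max - nf_min) + nf_min

# Example: three instances with two features on very different scales
X = np.array([[10.0, 0.002], [20.0, 0.004], [30.0, 0.010]])
print(min_max_normalize(X))
```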

OFS

OFS [14] is related to streaming features. The formulation of OFS [21] is as follows. Let Ds = [Ds1, Ds2,…,Dsn]T \(\in\) Rn×d be the given dataset with feature set Fs = [fs1, fs2,…, fsd]T \(\in\) Rd, and let Cl = [cl1, cl2,…,clm]T \(\in\) Rm denote the class label vector. The number of features d is unknown a priori; the best feature subset of size s is selected such that s < d. Accuracy is achieved by selecting only the most relevant feature subset for classification. For every Dsi, a feature weight vector \({\text{we}}_{\text{i}} \in {\text{R}}^{\text{d}}\) is learned which classifies the instance. After classification, \(we_{n}\) is updated to \(we_{n + 1}\). Because the number of streaming features is unknown a priori, this setting is handled well by OFS. OFS acquires dataset instances one at a time. For every instance, a weight vector is learned and the class label of the instance is predicted using the function sign \(\left( {we_{n} \cdot Ds_{n} } \right)\). The predicted class is then compared with the target class. When the method misclassifies, the weight vector is updated using the following stochastic gradient rule given in Eq. (2):

$$we_{n + 1} = we_{n} - \alpha C^{\prime}\left( {\left\langle {we_{n} ,Ds_{n} } \right\rangle , y_{n} } \right)$$
(2)

In Eq. (2), \(C^{\prime}\left( {\left\langle {we_{n} ,Ds_{n} } \right\rangle , y_{n} } \right)\) denotes the derivative of the cost function and α denotes the learning rate. The features selected through this procedure are subsequently classified by the EIDMLP classifier.
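For illustration, a minimal sketch of this online update, assuming a linear model whose misclassification update follows Eq. (2) with a hinge-style gradient, plus a truncation step that keeps only the s largest-magnitude weights; the loss choice and the value of s are assumptions, since the text specifies only the general rule.

```python
import numpy as np

def ofs_step(we, ds, y, alpha=0.01, s=10):
    """One online feature-selection step: predict, update on mistake (Eq. 2),
    then keep only the s largest-magnitude weights."""
    y_hat = np.sign(we @ ds)                    # predicted class label in {-1, +1}
    if y_hat != y:                              # update only on misclassification
        we = we + alpha * y * ds                # hinge-style gradient: C' = -y * ds
    keep = np.argsort(np.abs(we))[-s:]          # indices of the s strongest weights
    mask = np.zeros_like(we, dtype=bool)
    mask[keep] = True
    return np.where(mask, we, 0.0)              # zero out the remaining feature weights

# toy stream: 100 instances, 50 features, labels in {-1, +1}
rng = np.random.default_rng(0)
we = np.zeros(50)
for _ in range(100):
    ds, y = rng.normal(size=50), rng.choice([-1, 1])
    we = ofs_step(we, ds, y)
print("selected features:", np.flatnonzero(we))
```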

MapReduce

In the MapReduce model, the given dataset \(\left( {Ds} \right)\) is split into a number of smaller sets that are distributed across the network [32], and the feature selection algorithm is applied to every partition in parallel. The examples are distributed equally and processed in parallel so as to preserve class balance. In MR, each partition \(\left( {Ds_{i} } \right)\) is mapped to the corresponding \(map_{i}\) task. During the mapping phase, OFS (in this case, based on ABA) is applied to \(Ds_{i}\).

ABA is applied to each partition, and the output of each map function is represented as \(fe_{i} = \left( {fe_{i1} , \ldots ,fe_{iD} } \right)\), where the number of selected features is denoted by ‘D’. The reduce phase combines the features selected from each partition, obtaining a vector ‘sf’ given in Eq. (3), where sfj denotes the jth feature.

This is the outcome of the complete OFS process, which is used for further ML process:

$$sf = \left\{ {sf_{1} , \ldots ,sf_{D} } \right\},\quad sf_{j} = \frac{1}{n}\mathop \sum \limits_{i = 1}^{n} fe_{ij}$$
(3)

where n is the number of map tasks in the MapReduce job. Generally, the reduce phase is carried out by a distinct process, thus reducing the execution time in MR [33]. The entire execution is done within a single MR process, which eliminates additional disk accesses.
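For illustration, a minimal sketch of the reduce-side aggregation of Eq. (3), assuming each map task returns a 0/1 selection vector over the same D candidate features; the threshold used to turn the averaged scores back into a final subset is an added assumption.

```python
import numpy as np

def reduce_selected_features(map_outputs, threshold=0.5):
    """Average the per-partition selection vectors fe_i (Eq. 3) and keep
    features whose average selection frequency reaches the threshold."""
    fe = np.asarray(map_outputs, dtype=float)   # shape: (n_tasks, D)
    sf = fe.mean(axis=0)                        # sf_j = (1/n) * sum_i fe_ij
    return sf, np.flatnonzero(sf >= threshold)  # averaged scores and final subset

# three map tasks voting over five candidate features
map_outputs = [[1, 0, 1, 1, 0],
               [1, 0, 0, 1, 0],
               [1, 1, 1, 1, 0]]
sf, selected = reduce_selected_features(map_outputs)
print(sf, selected)   # averaged votes -> features {0, 2, 3}
```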

Accelerated Bat Algorithm (ABA)

The Accelerated Bat Algorithm (ABA) is based on the echolocation behaviour of bats, which here gather information about the streaming features. Microbats use echolocation, a characteristic exploited to find optimal streaming features for classification. The process is as follows [34, 35]:

  1. Bats discover food and prey using echolocation.

  2. Every bat has a velocity vei and a feature position fpi, with a fixed minimum frequency freqmin, a varying wavelength \(\lambda\) and loudness A0. The pulse emission rate varies as \(er \in \left[ {0, 1} \right]\), and the wavelength is adjusted accordingly.

  3. The loudness varies from A0 to Amin.

freq takes values in a range [freqmin, freqmax] that corresponds to wavelengths \([\lambda_{min} , \lambda_{max}\)]. At time step t, the update rules for the feature position fpi and velocity vei in a higher-dimensional population are given by Eqs. (4) to (6) [36].

$$freq_{i} = freq_{min} + \left( {freq_{max} - freq_{min} } \right)\beta$$
(4)
$$ve_{i}^{t} = ve_{i}^{t - 1} + \left( {fp_{i}^{t} - fp_{*} } \right)freq_{i}$$
(5)
$$fp_{i}^{t} = fp_{i}^{t - 1} + ve_{i}^{t}$$
(6)

where \(\beta \in \left[ {0, 1} \right]\) is a random vector drawn from a uniform distribution.
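To make Eqs. (4) to (6) concrete, a minimal sketch of one ABA update step over a population of candidate positions is given below; the population size, frequency range, and the stand-in for the global best fp_* are illustrative assumptions, and fitness evaluation plus the random-walk and loudness updates described next are omitted.

```python
import numpy as np

def aba_step(fp, ve, fp_best, freq_min=0.0, freq_max=2.0, rng=None):
    """One bat update: draw a frequency (Eq. 4), move velocities toward the
    global best (Eq. 5), and advance positions (Eq. 6)."""
    rng = rng or np.random.default_rng()
    beta = rng.uniform(0.0, 1.0, size=(fp.shape[0], 1))        # random vector in [0, 1]
    freq = freq_min + (freq_max - freq_min) * beta             # Eq. (4)
    ve = ve + (fp - fp_best) * freq                            # Eq. (5)
    fp = fp + ve                                               # Eq. (6)
    return fp, ve

# population of 5 bats in a 10-dimensional feature space
rng = np.random.default_rng(1)
fp = rng.uniform(size=(5, 10))
ve = np.zeros((5, 10))
fp_best = fp[0]                   # stand-in for the current global best position fp_*
fp, ve = aba_step(fp, ve, fp_best, rng=rng)
```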

Here \(fp_{*}\) is the current global best solution, which is updated at every iteration by comparison with the current positions of the ‘n’ features, each moving with velocity \(\lambda_{i} freq_{i}\). A new candidate is generated around the current best feature using a random walk:

$$fp_{new} = fp_{old} + {\epsilon} {\rm A}^{t}$$
(7)

where \({\epsilon} {\in}\) [− 1, 1] is a random number and \(A^{t} = \left\langle {A_{i}^{t} } \right\rangle\) is the average loudness. Both the loudness Ai and the pulse emission rate \(er_{i}\) are adapted accordingly; the loudness and pulse emission rate are inversely proportional. To set the starting positions for ABA, a feature ranker function and a mutation operator are applied to improve the accuracy of the classifier. The initial position of the BA is perturbed using the Gaussian mutation operator. Let \(f_{i} \in \left[ {a_{i} , b_{i} } \right]\) be a real variable. The truncated Gaussian mutation operator then changes \(f_{i}\) to a neighbouring value using the following probability distribution [37]:

$$p\left( {f^{\prime}_{i} ;f_{i} ,\sigma_{i} } \right) = \left\{ {\begin{array}{*{20}l} {\frac{{\frac{1}{{\sigma_{i} }}\phi \left( {\frac{{f^{\prime}_{i} - f_{i} }}{{\sigma_{i} }}} \right)}}{{\Phi \left( {\frac{{b_{i} - f_{i} }}{{\sigma_{i} }}} \right) - \Phi \left( {\frac{{a_{i} - f_{i} }}{{\sigma_{i} }}} \right)}}\quad if\;a_{i} \le f^{\prime}_{i} \le b_{i} } \\ {0\qquad \qquad \qquad \quad otherwise} \\ \end{array} } \right.$$
(8)

where \(\phi \left( z \right) = \frac{1}{{\sqrt {2\pi } }} \exp \left( { - \frac{1}{2}z^{2} } \right)\) is the probability density function of the standard normal distribution and Φ(·) is its cumulative distribution function.

This mutation operator has a mutation strength parameter σi for every feature, which should be related to the bounds ai and bi; \(\sigma = \sigma_{i} /\left( {b_{i} - a_{i} } \right)\) is used as a fixed non-dimensionalized parameter for all m features. To implement this concept, Eqs. (9) to (11) are used to compute the offspring \(f_{i}^{\prime }\):

$$f^{\prime}_{i} = f_{i} + \sqrt 2 \sigma \left( {b_{i} {-} a_{i} } \right){\text{erf}}^{ - 1} \left( {u_{i}^{\prime } } \right)$$
(9)
$$u^{\prime}_{i} = \left\{ {\begin{array}{*{20}c} {2u_{L} \left( {1 - 2u_{i} } \right) , \quad if\;u_{i} \le 0.5} \\ {2u_{R} \left( {2u_{i} - 1} \right),\quad otherwise} \\ \end{array} } \right.$$
(10)
$${\text{erf}}^{ - 1} \left( {u^{\prime}_{i} } \right) \approx sign\left( {u^{\prime}_{i} } \right)\left( {\sqrt {\left( {\frac{2}{\pi \alpha } + \frac{{{ \ln }\left( {1 - u_{i}^{\prime 2} } \right)}}{2}} \right)^{2} - \frac{{{ \ln }\left( {1 - u_{i}^{\prime 2} } \right)}}{2}} - \left( {\frac{2}{\pi \alpha } + \frac{{\ln \left( {1 - u_{i}^{\prime 2} } \right)}}{2}} \right)} \right)^{1/2}$$
(11)

where \(\alpha = \frac{{8\left( {\pi - 3} \right)}}{{3\pi \left( {4 - \pi } \right)}}\) ≈ 0.140012 and sign \(\left( {u^{\prime}_{i} } \right)\) is − 1 if \(u^{\prime}_{i}\) < 0 and is + 1 if \(u^{\prime}_{i}\) ≥ 0. Also, uL and uR are calculated as follows

$$u_{L} = 0.5\left( {{\text{erf}}\left( {\frac{{a_{i} - f_{i} }}{{\sqrt 2 \left( {b_{i} - a_{i} } \right)\sigma }}} \right) + 1 } \right)$$
(12)
$$u_{R} = 0.5\left( {{\text{erf}}\left( {\frac{{b_{i} - f_{i} }}{{\sqrt 2 \left( {b_{i} - a_{i} } \right)\sigma }}} \right) + 1 } \right)$$
(13)

Thus, the Gaussian mutation procedure for mutating i-th feature variable fi is as follows:

  • Step 1: Create a random number \(u_{i} \in \left[ {0, 1} \right]\).

  • Step 2: Use Eq. (9) to create offspring \(f^{\prime}_{i}\) from parent \(f_{i}\)
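As a concrete illustration of this mutation step, a minimal Python sketch is given below, assuming SciPy is available. Rather than transcribing the inverse-error-function approximation of Eqs. (9) to (11), it samples the truncated Gaussian distribution of Eq. (8) directly via scipy.stats.truncnorm, which is an equivalent formulation; the bounds and strength values in the example call are illustrative.

```python
import numpy as np
from scipy.stats import truncnorm

def gaussian_mutate(f, a, b, sigma=0.1, random_state=None):
    """Mutate feature value f inside [a, b] by sampling the truncated Gaussian
    of Eq. (8); sigma is the non-dimensionalized strength, so the actual
    standard deviation is sigma * (b - a)."""
    scale = sigma * (b - a)
    lo, hi = (a - f) / scale, (b - f) / scale   # standardized truncation bounds
    return truncnorm.rvs(lo, hi, loc=f, scale=scale, random_state=random_state)

print(gaussian_mutate(0.4, a=0.0, b=1.0, random_state=7))
```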

Clustering Coefficients of Variation (CCV) is used as the fitness function [38] to select the optimal features while balancing the classes and avoiding overfitting. This function is mainly applied when building an accurate prediction model. The higher the coefficient of variation, the more strongly the feature is considered.

Let Ds be a dataset with n instances and m features. An instance \(\left( {f_{1} ,f_{2} , \ldots, f_{m} } \right)\) is divided into groups by class, where \(c \in C\) and C is the set of prediction target classes. For each \(f_{a} ,a \in \left[ {1..m} \right]\), \(f_{a} \in \left\{ {f_{a}^{1} ,f_{a}^{2} , \ldots, f_{a}^{C} } \right\}\).

$$v_{a} = \mathop \sum \limits_{c = 1}^{C} \frac{{\sqrt {\frac{1}{n}\left[ {\sum\nolimits_{i = 1}^{n} {\left( {f_{a,i}^{c} - \overline{{f_{a}^{c} }} } \right)^{2} } } \right]} }}{{\overline{{f_{a}^{c} }} }}$$
(14)

\(v_{a}\) is the sum of the coefficients of variation over all classes \(c \in \left[ {1..C} \right]\) for that particular ath feature.
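For illustration, a minimal sketch of the CCV score of Eq. (14) for a single feature, assuming the feature values are grouped by their class labels; the guard against a zero class mean is an added assumption.

```python
import numpy as np

def ccv_score(feature_values, labels):
    """Sum of per-class coefficients of variation (Eq. 14) for one feature."""
    feature_values = np.asarray(feature_values, dtype=float)
    labels = np.asarray(labels)
    score = 0.0
    for c in np.unique(labels):
        vals = feature_values[labels == c]
        mean = vals.mean()
        if mean == 0:                       # guard against division by zero
            continue
        score += vals.std() / mean          # np.std uses 1/n, matching Eq. (14)
    return score

x = [1.0, 1.2, 0.9, 5.0, 5.5, 4.8]
y = [0, 0, 0, 1, 1, 1]
print(ccv_score(x, y))
```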

Ensemble Incremental Deep Multiple Layer Perceptron (EIDMLP)

The Deep Multiple Layer Perceptron (DMLP) is employed to improve the classification results on big data streams. An ensemble method combines the outputs of individual homogeneous classifiers applied to the given big datasets; the output of each ensemble member is combined using the majority voting rule [39]. DMLP is a feedforward artificial neural network that maps input vectors to output vectors. It is a connected graph with several layers, namely the input, hidden, and output layers, with full connectivity between consecutive layers, and it allows one or more hidden layers. Except for the input layer nodes, every neuron is associated with a nonlinear activation function. DMLP, an extension of the single-layer perceptron, overcomes the weakness that a single-layer perceptron cannot separate nonlinear data; in this work three such classifiers form the ensemble, and the network can learn nonlinear decision boundaries. The architecture is shown in Fig. 3.

Fig. 3 a DMLP, b EIDMLP

In DMLP [30], five to ten hidden layers are implemented, in contrast to the conventional two-layer MLP. Generally, sigmoid and tanh activation functions show elevated performance in small to medium sized networks. By hard limiting the input of undesirable hidden nodes to zero, the activation function allows them to obtain sparse representations. A shortcut is a connection that spans multiple layers; in DMLPs, shortcuts are generally avoided, so all nodes of one layer are connected only to the subsequent layer.
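To make the layer structure concrete, the following minimal sketch runs a forward pass through a fully connected DMLP with tanh hidden activations and no shortcut connections; the layer sizes, weight initialization, and linear output layer are illustrative assumptions rather than the exact configuration used in the paper.

```python
import numpy as np

def dmlp_forward(x, weights, biases):
    """Forward pass through a fully connected DMLP: tanh on hidden layers,
    no shortcut connections (each layer feeds only the next one)."""
    ac = x
    for we, b in zip(weights[:-1], biases[:-1]):
        net = ac @ we + b                        # net input net_i of the layer
        ac = np.tanh(net)                        # nonlinear activation ac_i
    return ac @ weights[-1] + biases[-1]         # linear output layer (class scores)

# illustrative network: 20 inputs, five hidden layers of 32 units, 2 classes
rng = np.random.default_rng(0)
sizes = [20, 32, 32, 32, 32, 32, 2]
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(dmlp_forward(rng.normal(size=20), weights, biases))
```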

For each node i, the following node sets are defined:

  • Succ(i): the set of nodes j for which a connection i \(\to\) j exists.

  • Pre(i): the set of nodes j for which a connection j \(\to\) i exists.

For every connection between layer i and layer j, a weight weji is assigned. All hidden and output nodes have a net input neti and an activation output aci. In mining streaming data, data instances are generated continuously over time, and the issue arises of updating the model each time without reloading the entire batch. This frequent updating becomes especially critical when the data examples are massive, so the model should be updated incrementally. To solve this, an incremental DMLP classification model is proposed. The method is also called an any-time algorithm, since the big dataset samples are read only once without the need to store or reload them. A tree is built by the induction method, which selects the attribute to split on by maintaining the statistics that record the counts of every attribute value. To assess the frequency of attribute value atij of attribute ati corresponding to class yk, the Hoeffding Bound (HB) [38] is calculated using Eq. (15).

$$HB = \sqrt {\frac{{R^{2} \ln \left( {\frac{1}{\delta }} \right)}}{2n}}$$
(15)

where R is the range of the class distribution and n is the number of instances observed for a class. At any particular time, the attribute value with the top value of H(·) is \(a{\text{t}}_{\text{ia}}\) = argmaxj H(atij).

Similarly, the second top value is atib = argmaxj≠a H(atij). The two top values are updated incrementally as new data arrives, and the difference ΔH(ati) = H(atia) − H(atib) is calculated for each attribute ati, i \(\in\) I. Using the Hoeffding bound of Eq. (15), a confidence interval around the true mean rtrue is computed from the n instances observed so far, to confirm the relation of attribute value atij to class yk. These confidence intervals are maintained incrementally as the only statistics for each attribute ati, with \(\bar{r}\) − HB ≤ rtrue < \(\bar{r}\) + HB, where \(\bar{r} = (1/n)\sum r_{i}\). When this inequality and rtrue < 1 hold over the observed samples, the tested \({\text{at}}_{\text{i}}\) is the statistically best candidate, with good accuracy, over the portion of the data stream observed so far.
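For illustration, a minimal sketch of the Hoeffding-bound test of Eq. (15), assuming the heuristic scores H(atia) and H(atib) are already available as plain numbers; the range R = 1 and the δ value are illustrative choices.

```python
import numpy as np

def hoeffding_split_ok(h_best, h_second, n, value_range=1.0, delta=1e-6):
    """Return whether the gap between the two best attribute scores exceeds
    the Hoeffding bound (Eq. 15), so the best attribute can be fixed."""
    hb = np.sqrt((value_range ** 2) * np.log(1.0 / delta) / (2.0 * n))
    return (h_best - h_second) > hb, hb

ok, hb = hoeffding_split_ok(h_best=0.42, h_second=0.30, n=500)
print(ok, round(hb, 4))
```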

The outputs of the IDMLP classifiers are combined using majority voting. Let Tr = {tr1, …, trN} be the set of N examples and CL be the set of Q output classes. Let S = {A1, A2, …, AM} be the set of M classifiers used for voting. For every example tri, each classifier makes a prediction, so each example has M predicted classes. The final class assigned to an example is the class predicted by the majority of the classifiers, as explained below. Let cll \(\in\) CL denote the class of an example ‘tr’ predicted by classifier Al, and define a counting function Fk as:

$$F_{k} \left( {cl_{l} } \right) = \left\{ {\begin{array}{*{20}c} {1\quad cl_{l} = cl_{k} } \\ {0\quad cl_{l} \ne cl_{k} } \\ \end{array} } \right.$$
(16)

where \(cl_{l}\) and \(cl_{k}\) are classes of CL. The total number of votes for class \(cl_{k}\) is defined via the majority vote function (\(mv_{Mk}\)):

$$mv_{Mk} = Tc_{k} = \mathop \sum \limits_{l = 1}^{M} F_{k} \left( {cl_{l} } \right)$$
(17)

The class assigned to example tr is the one that gains the majority of the votes:

$$S\left( {tr} \right) = {\text{argmax}}_{{k \in \left\{ {1, \ldots ,Q} \right\}}} Tc_{k}$$
(18)

Two strategies are used when more than one class ties with the same number of votes: in the first, the class is chosen arbitrarily (Simple Majority Vote, SMV), whereas in the second, an Influence Majority Vote (IMV) chooses the class given by the ensemble’s best classifier.
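For illustration, a minimal sketch of the majority vote of Eqs. (16) to (18) with both tie-breaking strategies; the index of the ensemble’s best classifier used by IMV is assumed to be known (for instance from validation accuracy).

```python
from collections import Counter
import random

def majority_vote(predictions, best_clf_idx=0, strategy="IMV"):
    """Combine the per-classifier predictions for one example.
    predictions: list of class labels, one per classifier (cl_l in Eqs. 16-17)."""
    counts = Counter(predictions)                       # Tc_k for every class
    top = max(counts.values())
    winners = [cl for cl, c in counts.items() if c == top]
    if len(winners) == 1:
        return winners[0]                               # Eq. (18): unique argmax
    if strategy == "SMV":                               # tie: pick arbitrarily
        return random.choice(winners)
    return predictions[best_clf_idx]                    # tie: follow the best classifier (IMV)

print(majority_vote(["A", "B", "A"]))                   # -> "A"
print(majority_vote(["A", "B", "C"], best_clf_idx=1))   # tie -> prediction of classifier 1
```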


Experimental work

Dataset

The proposed methodology is implemented with benchmark datasets from the “UCI repository”. The characteristics of the big datasets used in the experiments are shown in Fig. 4. The Arcene dataset consists of 10,000 features and 900 instances, merged from mass spectrometry datasets. Based on the numerous feature characteristics, the task is to distinguish cancer patients from healthy individuals.

Fig. 4 Datasets characteristics

The Dorothea dataset consists of various molecular properties of drug compounds; each molecular feature marks a compound as an active or inactive candidate for drug formation. The classification task is to identify whether a molecule is of binding nature or not. Identifying the binding property further supports the design of new drug compounds with additional properties such as absorption and duration of action.

The Gisette dataset, used for handwritten digit recognition, has 13,500 instances and 5000 features. The challenge is to distinguish the digits four and nine. Distractor features were added to the dataset to make feature selection necessary.

Performance evaluation

The performance results are measured in terms of accuracy, precision, recall, F-measure, and processing time.

Sensitivity, also known as the true positive rate (TPR) or recall, measures the proportion of positives that are correctly identified as positive.

$$TPR = Recall = \frac{TP}{P} = \frac{TP}{{\left( {TP + FN} \right)}}$$
(19)

From the confusion matrix (Fig. 5), precision is interpreted as follows:

$$Precision = \frac{TP}{{\left( {TP + FP} \right)}}$$
(20)
Fig. 5 Confusion matrix

F-measure is the harmonic mean of precision and recall, given in Eq. (21).

$$F - measure = \frac{2*P*R}{{\left( {P + R} \right)}}$$
(21)

Accuracy (Eq. 22) quantifies how well the classifiers perform. It is the ratio of correctly predicted samples to the total number of tested samples.

$$Accuracy = \frac{{\left( {TN + TP} \right)}}{{\left( {TP + TN + FN + FP} \right)}}$$
(22)
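For illustration, a minimal sketch computing the metrics of Eqs. (19) to (22) from binary confusion-matrix counts; the counts in the example call are made up.

```python
def classification_metrics(tp, tn, fp, fn):
    """Recall (Eq. 19), precision (Eq. 20), F-measure (Eq. 21), accuracy (Eq. 22)."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f_measure = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"recall": recall, "precision": precision,
            "f_measure": f_measure, "accuracy": accuracy}

print(classification_metrics(tp=90, tn=85, fp=10, fn=15))
```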

Results and discussion

The results of the experiment are discussed in this section. The experiment is carried out in the MATLAB environment using the Parallel Computing Toolbox, on a system with a 1 TB HDD and 16 GB of RAM. Precision, recall, F-measure, and accuracy are the metrics used to assess the performance of this research work. The MR-OFS-ABA method has shown enhanced performance over the existing feature selection methods, namely PSO, APSO, and ASAMO (Accelerated Simulated Annealing and Mutation Operator) [37, 40]. The result of the EIDMLP classifier is compared with other existing classifiers such as Naïve Bayes (NB), Hoeffding Tree (HT), and Fuzzy Minimal Consistent Class Subset Coverage (FMCCSC)-KNN (K Nearest Neighbour). The methodology is applied to three datasets, and the results are compared with four classifiers and three state-of-the-art feature selection algorithms. All datasets are preprocessed first; feature selection is then carried out, followed by classification, as discussed in the “Proposed methodology” section. Tables 1, 2, and 3 consolidate the performance on the Dorothea, Arcene, and Gisette datasets.

Table 1 Classification results of Dorothea
Table 2 Classification results of Arcene
Table 3 Classification results of Gisette

Figure 6 depicts the accuracy comparison of MR-OFS-ABA with the EIDMLP classifier. The accuracy of Dorothea classification of active drug compounds is measured as 98.6%, 97.37%, 96.8%, and 96.62%. The execution time is also substantially reduced in the MR approach; from Fig. 7, the execution times are 0.056, 0.068, 0.18, and 1.25.

Fig. 6 Accuracy comparison

Fig. 7 Processing time comparison

Figure 8 depicts the accuracy comparison of MR-OFS-ABA with the EIDMLP classifier. The accuracy on the Arcene dataset, identifying whether a patient is affected by cancer or not, is measured as 99%. The execution times, shown in Fig. 9, are 0.053, 0.062, 0.13, and 1.68.

Fig. 8 Accuracy comparison

Fig. 9 Processing time comparison

The performance metrics for the Gisette dataset are shown in Table 3.

Figure 10 depicts the accuracy comparison of MR-OFS-ABA with the EIDMLP classifier. The accuracy on Gisette, distinguishing the digits four and nine, is measured as 98.6%, 98%, 96.7%, and 96.3%. The execution times, presented in Fig. 11, are 0.44, 0.05, 0.19, and 4.72.

Fig. 10 Accuracy comparison

Fig. 11 Processing time comparison

A receiver operating characteristic (ROC) curve is a graphical representation of classification model performance, plotting FPR on the x-axis and TPR on the y-axis. From Fig. 12, the MR-OFS-ABA-EIDMLP curve lies higher, indicating better performance of the proposed model.

Fig. 12 ROC curves: a Dorothea, b Arcene, c Gisette

Conclusion

This paper proposes an innovative feature selection mechanism, the OFS-Accelerated Bat Algorithm (ABA), to choose the most important features from online streaming features. The proposed OFS-ABA algorithm employs the MapReduce (MR) paradigm in a streaming manner to improve the runtime over the features. Finally, an Ensemble Incremental Deep Multiple Layer Perceptron (EIDMLP) classifier is proposed to classify the dataset samples. The methodology is applied to three datasets, and the results are compared with four classifiers and three state-of-the-art feature selection algorithms. In this research work, the MR-OFS-ABA method has shown improved performance over the existing feature selection methods, namely PSO, APSO, and ASAMO (Accelerated Simulated Annealing and Mutation Operator). The outcome of the EIDMLP classifier is compared with other prevailing classifiers such as Naïve Bayes (NB), Hoeffding tree (HT), and Fuzzy Minimal Consistent Class Subset Coverage (FMCCSC)-KNN (K Nearest Neighbour). The results of this work show improved accuracy and reduced processing time. In big data analytics, it is challenging to address all the characteristics of big data; the proposed model handles Volume, Variety, and Velocity in a proficient way, but these characteristics may evolve into new dimensions in the near future. The research challenge ahead is to develop feature selection models for those upcoming challenges and complexities.

Availability of data and materials

Datasets used in this work are available in UCI Machine Repository.

Abbreviations

OFS:

Online Feature Selection

ABA:

Accelerated Bat Algorithm

EIDMLP:

Ensemble Incremental Deep Multiple Layer Perceptron

MR:

MapReduce

ML:

machine learning

DL:

deep learning

FS:

feature selection

NLP:

natural language processing

SVM:

support vector machine

NB:

Naïve Bayes

LR:

logistic regression

PFBP:

parallel, forward–backward with pruning

SIP:

semi-infinite programming

MKL:

multiple kernel learning

EC:

evolutionary computation

SAOLA:

Scalable and Precise Online Feature Selection

MOANOFS:

Multi-Objective Automated Negotiation based Online Feature Selection

DMLP:

Deep Multiple Layer Perceptron

HT:

Hoeffding tree

References

  1. AlNuaimi N, et al. Streaming feature selection algorithms for big data: a survey. Appl Comput Inform. 2019. https://doi.org/10.1016/j.aci.2019.01.001.

    Article  Google Scholar 

  2. Oussous Ahmed, et al. Big data technologies: a survey. J King Saud Univ Comput Inf Sci. 2018;30(4):431–48.

    Google Scholar 

  3. Dean J, Ghemawat S. MapReduce: simplified data processing on large clusters. Commun ACM. 2008;51(1):107–13.

    Article  Google Scholar 

  4. Chu CT, Kim SK, Lin YA, Yu Y, Bradski G, Olukotun K, Ng AY. Map-reduce for machine learning on multicore. In: Advances in neural information processing systems. p. 281–288; 2007.

  5. Dean J, Ghemawat S. MapReduce: a flexible data processing tool. Commun ACM. 2010;53(1):72–7.

    Article  Google Scholar 

  6. Athmaja S, Hanumanthappa M, Kavitha V. A survey of machine learning algorithms for big data analytics. In: International conference on innovations in information, embedded and communication systems (ICIIECS). p 1–4; 2017.

  7. Hinton GE, Osindero S, Teh Y-W. A fast learning algorithm for deep belief nets. Neural Comput. 2006;18(7):1527–54.

    Article  MathSciNet  Google Scholar 

  8. Bengio Y, Lamblin P, Popovici D, Larochelle H. Greedy layer-wise training of deep networks. In: Advances in neural information processing systems. p. 153–160; 2007.

  9. Dahl G, Ranzato M, Mohamed A-R, Hinton GE. Phone recognition with the mean-covariance restricted Boltzmann machine. In: Advances in neural information processing systems. Curran Associates, Inc; p. 469–77; 2010.

  10. Krizhevsky A, Sutskever I, Hinton G. Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, vol. 25. Curran Associates, Inc; p. 1106–1114; 2012.

  11. Mikolov T, Deoras A, Kombrink S, Burget L, Cernocký J. Empirical evaluation and combination of advanced language modeling techniques. In: INTERSPEECH. ISCA; p. 605–8; 2011.

  12. Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23(19):2507–17.

    Article  Google Scholar 

  13. Liu H, Yu L. Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng. 2005;17(4):491–502.

    Article  Google Scholar 

  14. Hoi SC, Wang J, Zhao P, Jin R. Online feature selection for mining big data. In: Proceedings of the 1st international workshop on big data, streams and heterogeneous source mining: algorithms, systems, programming models and applications. p. 93–100; 2012.

  15. Stefanowski J, Cuzzocrea A, Slezak D. Processing and mining complex data streams. Inf Sci. 2014;285:63–5.

    Article  Google Scholar 

  16. Gill SS, Rajkumar B. Bio-inspired algorithms for big data analytics: a survey, taxonomy, and open challenges. In: Big data analytics for intelligent healthcare management. Academic Press; p. 1–17; 2019.

  17. Peralta D, del Río S, Ramírez-Gallego S, Triguero I, Benitez JM, Herrera F. Evolutionary feature selection for big data classification: a MapReduce approach. Math Prob Eng. 2015;2015:246139.

    Article  Google Scholar 

  18. Tsamardinos I, Borboudakis G, Katsogridakis P, Pratikakis P, Christophides V. A greedy feature selection algorithm for Big Data of high dimensionality. Mach Learn. 2019;108(2):149–202.

    Article  MathSciNet  Google Scholar 

  19. Tan M, Tsang IW, Wang L. Towards ultrahigh dimensional feature selection for big data. J Mach Learn Res. 2014;15:1371–429.

    MathSciNet  MATH  Google Scholar 

  20. de La Iglesia B. Evolutionary computation for feature selection in classification problems. Wiley Interdiscip Rev Data Min Knowl Discov. 2013;3:381–407.

    Article  Google Scholar 

  21. Nazar NB, Senthilkumar R. An online approach for feature selection for classification in big data. Turk J Electr Eng Comput Sci. 2017;25(1):163–71.

    Article  Google Scholar 

  22. Hu X, Zhou P, Li P, Wang J, Wu X. A survey on online feature selection with streaming features. Front Comput Sci. 2018;12(3):479–93.

    Article  Google Scholar 

  23. Yu K, Wu X, Ding W, Pei J. Scalable and accurate online feature selection for big data. ACM Trans Knowl Discov Data (TKDD). 2016;11(2):16.

    Google Scholar 

  24. Fong S, Wong R, Vasilakos A. Accelerated PSO swarm search feature selection for data stream mining big data. IEEE Trans Serv Comput. 2016;1:1–1.

    Article  Google Scholar 

  25. Said FB, Alimi AM. MOANOFS: multi-objective automated negotiation based online feature selection system for big data classification. arXiv preprint arXiv:1810.04903; 2018.

  26. Lin KC, Zhang KY, Huang YH, Hung JC, Yen N. Feature selection based on an improved cat swarm optimization algorithm for big data classification. J Supercomput. 2016;72(8):3210–21.

    Article  Google Scholar 

  27. Gu Shenkai, Cheng Ran, Jin Yaochu. Feature selection for high-dimensional classification using a competitive swarm optimizer. Soft Comput. 2018;22(3):811–22.

    Article  Google Scholar 

  28. Manoj RJ, Praveena MA, Vijayakumar K. An ACO–ANN based feature selection algorithm for big data. Cluster Comput. 2019;22:3953–60.

    Article  Google Scholar 

  29. Devi SG, Sabrigiriraj M. A hybrid multi-objective firefly and simulated annealing based algorithm for big data classification. Concurr Comput Pract Exp. 2019;31(14):e4985.

    Article  Google Scholar 

  30. Wan S, Liang Y, Zhang Y, Guizani M. Deep multi-layer perceptron classifier for behavior analysis to estimate Parkinson’s disease severity using smartphones. IEEE Access. 2018;6:36825–33.

    Article  Google Scholar 

  31. Young S, Tamer A, Ayse B. Deep super learner: a deep ensemble for classification problems. In: Canadian conference on artificial intelligence. Springer, Cham; 2018.

  32. Triguero I, Peralta D, Bacardit J, García S, Herrera F. MRPR: a MapReduce solution for prototype reduction in big data classification. Neurocomputing. 2015;150:331–45.

    Article  Google Scholar 

  33. Chu CT, Kim SK, Lin YA et al. Map-reduce for machine learning on multicore. In: Advances in neural information processing systems. p. 281–288; 2007.

  34. Yang XS. A new metaheuristic bat-inspired algorithm. In: Nature inspired cooperative strategies for optimization (NICSO 2010). Berlin: Springer; p. 65–74; 2010.

  35. Yang XS, Hossein Gandomi A. Bat algorithm: a novel approach for global engineering optimization. Eng Comput. 2012;29(5):464–83.

    Article  Google Scholar 

  36. Akhtar S, Ahmad AR, Abdel-Rahman EM. A metaheuristic bat-inspired algorithm for full body human pose estimation. In: Ninth conference on computer and robot vision. p. 369–75; 2012.

  37. Renuka Devi D, Sasikala S. Accelerated simulated annealing and mutation operator feature selection method for big data. Int J Recent Technol Eng. 2019;8:910–6.

    Google Scholar 

  38. Fong S, Wong R, Vasilakos AV. Accelerated PSO swarm search feature selection for data stream mining big data. IEEE Trans Serv Comput. 2016;9(1):33–45.

    Google Scholar 

  39. Bouziane H, Messabih B, Chouarfia A. Profiles and majority voting-based ensemble method for protein secondary structure prediction. Evol Bioinform. 2011;7:EBO-S7931.

    Article  Google Scholar 

  40. Sasikala S, Renuka Devi D. A review of traditional and swarm search based feature selection algorithms for handling data stream classification. In: Third international conference on sensing, signal processing and security (ICSSS), New York: IEEE; 2017.


Acknowledgements

Not applicable.

Funding

No external funding.

Author information


Contributions

Both authors read and approved the final manuscript.

Corresponding author

Correspondence to D. Renuka Devi.

Ethics declarations

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.


About this article


Cite this article

Renuka Devi, D., Sasikala, S. Online Feature Selection (OFS) with Accelerated Bat Algorithm (ABA) and Ensemble Incremental Deep Multiple Layer Perceptron (EIDMLP) for big data streams. J Big Data 6, 103 (2019). https://doi.org/10.1186/s40537-019-0267-3

