Optimal instance subset selection from big data using genetic algorithm and open source framework

Data is accumulating at an incredible rate, and the era of big data has arrived. Big data brings great challenges to traditional machine learning algorithms: it is difficult to complete learning tasks in big data scenarios on a stand-alone machine. Data reduction is an effective way to solve this problem. Data reduction includes attribute reduction and instance reduction. In this study, we focus on instance reduction, also called instance selection, and view instance selection as an optimal instance subset selection problem. Inspired by the ideas of cross validation and divide and conquer, we defined a novel criterion, called combined information entropy with respect to a set of classifiers, to measure the importance of an instance subset. The criterion uses multiple independent classifiers trained on different subsets to measure the optimality of an instance subset. Based on the criterion, we proposed an approach that uses a genetic algorithm and open source frameworks to select the optimal instance subset from big data. The proposed algorithm is implemented on two open source big data platforms, Hadoop and Spark. Experiments on four artificial data sets demonstrate the feasibility of the proposed algorithm and visualize the distribution of the selected instances, and experiments on four real data sets, in which the algorithm is compared with three closely related methods on test accuracy and compression ratio, demonstrate its effectiveness. Furthermore, the two implementations on Hadoop and Spark are also experimentally compared. The experimental results show that the proposed algorithm outperforms the three compared methods in both test accuracy and compression ratio.

minimal consistent subset of the training set. However, the consistent subset found by CNN may not be the smallest. To deal with this drawback, Gates [2] proposed reduced nearest neighbor (RNN). Based on the relative significance of the instances in training set, Dasarathy [3] proposed another algorithm to find the minimal consistent subset of the training set. In addition, CNN is sensitive to noise. To this end, Wilson and Martinez [4] proposed edited nearest neighbor (ENN). Based on CNN, the researchers also proposed some other improved algorithms. For instance, Brighton and Mellish [5] introduced the concepts of reachable set and coverage set into CNN and proposed an iterative case filtering algorithm. Angiulli [6] introduced the idea of Voronoi partition into CNN and proposed fast CNN. Li and Maguire [7] proposed a critical pattern selection algorithm by considering local geometrical and statistical information, which selects both border and edge instances from training set. Hernandez-Leal et al. [8] introduced an instance ranking per class using borders, and proposed an instance selection algorithm using the ranking information. Cavalcanti and Soares [9] also proposed a ranking based instance selection algorithm. Different from [8], the algorithm calculates a score for each instance given its relations with other instances in the training set, and then selects instances according to the score.
The algorithms mentioned above are all k-nearest neighbor based methods; researchers have also proposed instance selection methods based on other learning algorithms. Liu et al. [10] proposed an efficient self-adaption instance selection algorithm that reconstructs the training set for support vector machines from the viewpoint of geometry. Aslani and Seipel [11] adopted locality-sensitive hashing to develop an instance selection method that rests on rapidly finding similar and redundant training instances and excluding them from the training set. Based on a clustering technique, Chen et al. [12] proposed an instance selection algorithm for speeding up support vector machines. Based on ant colony optimization (ACO), Akinyelu et al. [13] proposed an instance selection algorithm for SVM speed optimization, in which the ACO algorithm performs two primary functions: boundary detection and boundary instance selection. Shao et al. [14] combined instance and feature selection and proposed a uniform sparse primal and dual LSSVM model. Furthermore, Du et al. [15] found that if feature selection and instance selection are addressed separately, irrelevant features may mislead the process of instance selection. To deal with this problem, they proposed a unified framework that selects instances and features simultaneously. Liaw [16] proposed a framework for cooperative evolutionary learning and instance selection in an adaptive manner, together with an effective evaluation of the representativeness of promising solutions for evolutionary instance selection. Chen et al. [17] proposed a sample selection method based on the genetic algorithm (GA). Arnaiz-González et al. [18] proposed a new technique for instance selection and noise filtering for regression; the technique applies instance selection for classification after discretizing the output value, and demonstrates good robustness to noise. In addition, Arnaiz-González et al. [19] also proposed a fusion of instance selection algorithms for regression tasks to improve the selection performance. Malhat et al. [20] proposed a sample selection algorithm based on global probability density and correlation functions. Czarnowski [21] proposed a sample selection algorithm based on cluster analysis.
Most of the above instance selection algorithms are only applicable to small and medium data sets, since the entire data set needs to be loaded into memory when selecting instances. These algorithms may therefore become infeasible for big data, where the size of the training set far exceeds the memory capacity of the computer. Only a few researchers have studied instance selection for big data. Based on the locality-sensitive hashing technique, Arnaiz-González et al. [22] proposed an instance selection algorithm for large data sets with linear computational time complexity. In addition, they extended the instance selection algorithm based on democratic distance to the big data environment and implemented the algorithm with MapReduce [23]. In the framework of instance reduction, Triguero et al. [24] proposed a MapReduce based instance reduction method, which can conduct instance reduction on big data and thus makes the classification of big data possible. Based on the k-nearest neighbor graph, Mall et al. [25] proposed a method to select representative subsets from a big data set so that learning from big data becomes feasible. Based on random mutation hill climbing and MapReduce, Si et al. [26] proposed a big data instance selection algorithm.
Inspired by the ideas of cross validation and divide and conquer, we proposed an approach to select the optimal instance subset from big data using a genetic algorithm and open source frameworks. The main contributions of this paper are threefold.
1. We defined a novel criterion which combines information entropies with respect to multiple classifiers to measure the importance of an instance subset. Because multiple independent classifiers are used to evaluate the optimality of an instance subset, the evaluation is consistent with human cognition and is more reasonable.
2. Based on the criterion, and inspired by the ideas of cross validation and divide and conquer, we proposed an approach to select the optimal instance subset from big data using a genetic algorithm and open source frameworks.
3. We implemented the proposed algorithm on two open source big data frameworks, Hadoop and Spark. Experiments on four artificial data sets demonstrate the feasibility of the proposed algorithm and visualize the distribution of the selected instances. Experiments on four real data sets, in which the algorithm is compared with three closely related methods on test accuracy and compression ratio, demonstrate its effectiveness.
The rest of this paper is organized as follows. In "Extreme learning machine" section, we present the preliminaries used in this paper. In "The proposed algorithm" section, we describe the details of the proposed approach. In "Experimental results and analysis" section, experiments are conducted to demonstrate the feasibility and effectiveness of the proposed approach. Finally, we conclude our work in "Conclusions" section.

Extreme learning machine
In this section, we briefly review the extreme learning machine (ELM) [27], which is used in the proposed method. ELM is an algorithm for training a randomly weighted network, namely a single hidden layer feedforward neural network (SLFN) with a special architecture. The particularity of the architecture is that the activation function of all nodes in the input layer is the linear function $y = x$, that of all nodes in the output layer is also the linear function $y = x$, and that of all nodes in the hidden layer is the sigmoid function $y = \frac{1}{1+e^{-x}}$. In addition, all nodes of the hidden layer have biases, while the nodes of the output layer have none. The weights $w_i$ between the input layer and the hidden layer and the biases $b_i$ of the hidden nodes are randomly assigned, where $1 \le i \le m$ and $m$ is the number of nodes in the hidden layer. The weights $\beta_i$ between the hidden layer and the output layer are obtained by solving the following optimization problem:
$$\min_{\beta} \|H\beta - Y\| \qquad (1)$$
where $Y$ is the expected output matrix and $H$ is the output matrix of the hidden layer.
Given a training set $S = \{(x_i, y_i) \mid 1 \le i \le n\}$, the smallest norm least-squares solution of Eq. (1) can be obtained by Eq. (2):
$$\hat{\beta} = H^{\dagger} Y \qquad (2)$$
where $H^{\dagger}$ is the Moore-Penrose generalized inverse of $H$, and $H^{\dagger} = (H^{T}H)^{-1}H^{T}$ when $H^{T}H$ is invertible.
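The training procedure reviewed in this section can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: the function names, the uniform range of the random weights, and the toy two-class data are assumptions made here, and `np.linalg.pinv` stands in for the Moore-Penrose generalized inverse of Eq. (2).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_elm(X, Y, n_hidden, rng):
    """Train an SLFN with ELM: random hidden layer, least-squares output weights."""
    n_features = X.shape[1]
    # Assumption: weights and biases drawn uniformly from [-1, 1].
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)           # hidden-layer output matrix
    beta = np.linalg.pinv(H) @ Y     # minimum-norm least-squares solution
    return W, b, beta


def predict_elm(X, W, b, beta):
    """Linear output layer without bias, as in the architecture described above."""
    return sigmoid(X @ W + b) @ beta


# Toy data (hypothetical): two classes separated by the line x0 + x1 = 0.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)
Y = np.eye(2)[labels]                # one-hot expected output matrix
W, b, beta = train_elm(X, Y, n_hidden=20, rng=rng)
outputs = predict_elm(X, W, b, beta)
train_accuracy = float(np.mean(outputs.argmax(axis=1) == labels))
```

Only `beta` is fitted; the hidden layer is never updated, which is the reason ELM trains quickly.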

The proposed algorithm
In this section, we present the details of the proposed algorithm. An overview of the proposed method is shown in Fig. 1. First, the big data set is partitioned into k subsets, and k-1 subsets are used to train classifiers via MapReduce. Then, the trained classifiers are used to classify the instances in the remaining subset. Finally, a genetic algorithm is used to select from the remaining subset the informative instances that are more likely to be misclassified by these classifiers. This section is organized as follows: classifier training and the calculation of the information entropy of a subset are introduced in "Training classifiers and calculating information entropy of subset" section, optimal subset selection using the genetic algorithm is introduced in "Optimal subset selection via genetic algorithm" section, and the computational time complexity is analyzed in "Computational time complexity" section.
Fig. 1 The technical route of the ith round for selecting instances

Training classifiers and calculating information entropy of subset
For the convenience of description, we present the details of the proposed algorithm in the framework of Hadoop MapReduce. Since the algorithm is inspired by the idea of k-fold cross validation, it is a k-round iterative algorithm. Similar to k-fold cross validation, we first partition a big data set $D$ into $k$ subsets $D_1, D_2, \ldots, D_k$, and then use the genetic algorithm to select informative instances from one subset in each round. In the following, we present the details of the ith round, in which informative instances are selected from $D_i$ with the genetic algorithm. Let $R_i = D - D_i$. Obviously, $R_i$ is a big data set. In the ith round, $R_i$ is automatically partitioned into $m$ splits by the map mechanism of MapReduce, and the $m$ splits are deployed to $m$ map computing nodes, where $m$ is the number of map nodes in the big data computing platform. In addition, the subset $D_i$ is broadcast to the $m$ map nodes.
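The data flow of one round can be illustrated with a small in-memory sketch. It is only a stand-in for what Hadoop does with input splits and broadcast variables: the function names are hypothetical, and instances are represented by their indices.

```python
import numpy as np


def kfold_partition(n_instances, k, rng):
    """Shuffle instance indices and cut them into k nearly equal folds D_1..D_k."""
    return np.array_split(rng.permutation(n_instances), k)


def round_inputs(folds, i, m):
    """For round i, return D_i and the m splits R_i1..R_im of R_i = D - D_i."""
    d_i = folds[i]
    r_i = np.concatenate([f for j, f in enumerate(folds) if j != i])
    return d_i, np.array_split(r_i, m)


rng = np.random.default_rng(0)
folds = kfold_partition(n_instances=1000, k=5, rng=rng)
d_0, splits = round_inputs(folds, i=0, m=4)   # each split goes to one map node
```

In the sketch, every map node would receive one element of `splits` together with a full copy of `d_0`.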
On the $m$ map nodes, the following work is completed in parallel:

Training SLFN classifier by ELM
The $m$ SLFN classifiers are trained in parallel at the $m$ nodes on the local splits $R_{ij}$ ($1 \le j \le m$) by the ELM algorithm. We choose the ELM algorithm to train the SLFNs because it has a very fast learning speed and excellent generalization ability [27]. The $m$ SLFN classifiers, denoted by $SLFN_1, SLFN_2, \ldots, SLFN_m$, form a committee.

Classifying the instances in $D_i$ by the trained SLFNs
Given an instance $x \in D_i$, the classification result of $x$ by an SLFN on a map node can be transformed into a posterior probability distribution by Eq. (3).
In Eq. (3), $\omega_l$ stands for the lth class, and $L$ is the number of classes.
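As an illustration of this transformation, the sketch below maps raw SLFN outputs to class probabilities with a softmax. The softmax form is an assumption made here for illustration and may differ from Eq. (3); the function name and the example outputs are hypothetical.

```python
import numpy as np


def posterior_from_outputs(outputs):
    """Map raw SLFN outputs (n_instances x L) to rows of class probabilities.

    Assumption: a softmax is used; the paper's Eq. (3) may use another form.
    """
    shifted = outputs - outputs.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


raw = np.array([[2.0, 0.5, -1.0],    # confident in the first class
                [0.1, 0.1, 0.1]])    # completely undecided
posterior = posterior_from_outputs(raw)
```

The second row becomes the uniform distribution over the $L = 3$ classes, which is the least confident prediction a classifier can make.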

Calculating the information entropy of the instances in $D_i$
The information entropy of instance $x$ with respect to classifier $SLFN_j$ is defined by Eq. (4).
On a reduce node, the following work is completed:

Calculating the average information entropy of the instances in $D_i$ with respect to the committee
The average information entropy of instance $x$ in $D_i$ with respect to the committee is defined by Eq. (5).

Calculating the information entropy of a subset of $D_i$
Given a subset $Q \subseteq D_i$, the information entropy of $Q$ is defined by Eq. (6).
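The three entropy quantities referred to by Eqs. (4) to (6) can be sketched as follows. The exact formulas are not reproduced in this text, so the sketch rests on stated assumptions: Shannon entropy with base-2 logarithm per instance, an arithmetic mean over the $m$ committee members, and the mean of the averaged entropies over the instances of $Q$ as the subset entropy. All function names are hypothetical.

```python
import numpy as np


def instance_entropy(posterior):
    """Shannon entropy (base 2, assumed) of each row of an (n x L) posterior matrix."""
    p = np.clip(posterior, 1e-12, 1.0)   # avoid log(0)
    return -(p * np.log2(p)).sum(axis=1)


def committee_entropy(posteriors):
    """Average the per-instance entropy over the m committee members."""
    return np.mean([instance_entropy(p) for p in posteriors], axis=0)


def subset_entropy(avg_entropy, mask):
    """Entropy of subset Q, given as a 0/1 mask over D_i.

    Assumption: the mean of the committee entropies of the selected instances.
    """
    mask = np.asarray(mask, dtype=bool)
    return float(avg_entropy[mask].mean()) if mask.any() else 0.0


# Two committee members, three instances, two classes (toy values).
post_1 = np.array([[1.0, 0.0], [0.5, 0.5], [0.9, 0.1]])
post_2 = np.array([[1.0, 0.0], [0.5, 0.5], [0.5, 0.5]])
avg = committee_entropy([post_1, post_2])
```

In this toy case the first instance is classified with full confidence by both members and gets an entropy close to 0, while the second gets 1 bit, so a subset containing only the second instance scores highest.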

Selecting the optimal subset of $D_i$ by genetic algorithm
The details of selecting the optimal subset using the genetic algorithm with Eq. (6) are presented in the next section.

Optimal subset selection via genetic algorithm
In this section, we describe how the optimal instance subset is selected. The optimal instance subset consists of a number of informative instances. Hard instances, which are more likely to be misclassified by the classifiers, are believed to be more informative to the classifiers. The reason is that easy instances, which are more likely to be correctly classified, contribute little to the losses of the classifiers and therefore contribute little to their training. Misclassified instances contribute more to the losses; the classifiers learn most from them and in turn adjust the decision boundaries. A naive approach is to select hard instances using hard thresholds, e.g., selecting instances with scores above or below a threshold, or selecting the top-k instances. However, it is very hard to find a global hard threshold that works well for all subset partitions. Therefore, we propose to search for dynamic thresholds using a genetic algorithm (GA) [41][42][43]. GA uses a population search strategy to find the optimal solution of the problem at hand. A population consists of individuals, and an individual is a candidate solution, which is usually encoded as a character string. There are three key issues for instance selection using the genetic algorithm, which are addressed in turn below.
Regarding issue (1): let $P$ be a population and $Q \in P$ an individual, namely a subset of $D_i$. It is suitable to encode $Q$ as a binary string of 0s and 1s. If $|D_i| = n_i$ ($1 \le i \le k$), then the length of the binary string representing $Q$ is $n_i$, and each bit indicates whether the corresponding instance of $D_i$ belongs to $Q$.
Regarding issue (2): given an individual $Q \in P$, we use Eq. (6) as the fitness function to evaluate the goodness of $Q$. As discussed before, the goal is to select hard instances, which are more likely to be misclassified by the classifiers. Eq. (6) naturally measures how confident the classifiers are, or how likely each instance is to be misclassified. Larger entropy values indicate less confidence and more errors. To sum up, the larger $E(Q)$ is, the harder $Q$ is, and therefore the more important $Q$ is.
In the genetic algorithm, the genetic manipulations include selection, crossover and mutation. We use the roulette wheel method to select two parent individuals from a population according to their fitness (the better the fitness, the bigger the chance of being selected). The roulette wheel method includes four steps, the first of which is to calculate the selection probability. Let $|P| = N$. For each individual $Q_i$ in the population $P$, the selection probability of $Q_i$ is calculated by
$$p(Q_i) = \frac{E(Q_i)}{\sum_{j=1}^{N} E(Q_j)}$$
We use one-point crossover with crossover probability $p_c$ to cross over the parents and form new offspring, and with mutation probability $p_m$ we randomly select a position in an individual to mutate (i.e., change 1 to 0, or 0 to 1) and form a new individual. The pseudo-code of the algorithm for instance selection by genetic algorithm is given in Algorithm 1.
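The selection, crossover and mutation operators described above can be sketched as follows. This is a minimal sketch and not Algorithm 1 of the paper: the cumulative-probability form of the roulette wheel is the textbook procedure and is assumed here, the population size and the `patience` stopping parameter are hypothetical, and the toy fitness (mean entropy of the selected instances) is only an assumed stand-in for Eq. (6).

```python
import numpy as np


def roulette_select(fitness_values, rng):
    """Pick one index with probability proportional to (non-negative) fitness."""
    total = fitness_values.sum()
    if total <= 0:                       # degenerate case: fall back to uniform
        return int(rng.integers(len(fitness_values)))
    cumulative = np.cumsum(fitness_values / total)
    idx = int(np.searchsorted(cumulative, rng.random()))
    return min(idx, len(fitness_values) - 1)   # guard against rounding at the end


def one_point_crossover(a, b, p_c, rng):
    """With probability p_c, swap the tails of two parents at a random cut point."""
    if rng.random() < p_c and len(a) > 1:
        cut = int(rng.integers(1, len(a)))
        return (np.concatenate([a[:cut], b[cut:]]),
                np.concatenate([b[:cut], a[cut:]]))
    return a.copy(), b.copy()


def mutate(individual, p_m, rng):
    """With probability p_m, flip one randomly chosen bit."""
    child = individual.copy()
    if rng.random() < p_m:
        pos = int(rng.integers(len(child)))
        child[pos] = 1 - child[pos]
    return child


def ga_select(fitness, n_bits, pop_size, p_c, p_m, patience, rng):
    """Evolve bit strings until the best fitness stops increasing."""
    population = rng.integers(0, 2, size=(pop_size, n_bits))
    best, best_fit, stall = None, -np.inf, 0
    while stall < patience:
        scores = np.array([fitness(ind) for ind in population])
        if scores.max() > best_fit:
            best_fit = float(scores.max())
            best = population[scores.argmax()].copy()
            stall = 0
        else:
            stall += 1
        children = []
        while len(children) < pop_size:
            pa = population[roulette_select(scores, rng)]
            pb = population[roulette_select(scores, rng)]
            for child in one_point_crossover(pa, pb, p_c, rng):
                children.append(mutate(child, p_m, rng))
        population = np.array(children[:pop_size])
    return best, best_fit


# Toy run: six instances with made-up committee entropies.
entropy = np.array([0.0, 0.9, 0.1, 1.0, 0.05, 0.8])


def toy_fitness(ind):
    selected = ind == 1
    return float(entropy[selected].mean()) if selected.any() else 0.0


rng = np.random.default_rng(0)
best, best_fit = ga_select(toy_fitness, n_bits=6, pop_size=10,
                           p_c=0.8, p_m=0.1, patience=5, rng=rng)
```

The returned `best` is the bit string with the highest fitness seen in any generation; its 1-bits mark the instances of $D_i$ that would be kept.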

Computational time complexity
In this section, we analyze the computational time complexity of the proposed algorithm. At each map node, the main operations include (1)

Experimental results and analysis
We conducted two experiments to demonstrate the feasibility and effectiveness of the proposed approach respectively.

Experiment 1: demonstrating the feasibility of the proposed algorithm
In this section, we conduct experiments on four artificial data sets and visualize the selected instances to verify the feasibility of the proposed algorithm. The first and second artificial data sets have clear classification boundaries between different categories, while the third and fourth do not, i.e., there is overlap between the instances belonging to different classes.
The first artificial data set Circle is a two-dimensional data set consisting of 1000 data points belonging to two classes, 500 data points per class. The points of class 1 are uniformly distributed in a circle of radius 0.3 centered on (0.5, 0.5). The points of class 2 are uniformly distributed in a ring centered on (0.5, 0.5) with internal and external radii equal to 0.3 and 0.5, respectively. The distribution of the instances in the artificial data set Circle and the distribution of the instances selected from Circle by the proposed algorithm are shown on the left and right of Fig. 2, respectively.
Fig. 2 The distribution of the instances in the artificial data set Circle (left), and the distribution of the instances selected from the artificial data set Circle (right)
Fig. 3 The distribution of the instances in the artificial data set Square (left), and the distribution of the instances selected from the artificial data set Square (right)
The third artificial data set Gaussian1 is a two-dimensional data set with three classes that follow three Gaussian distributions; it contains 1500 data points, 500 data points per class. The mean vectors and covariance matrices of the three Gaussian distributions are given in Table 1. The distribution of the instances in the artificial data set Gaussian1 and the distribution of the instances selected from Gaussian1 by the proposed algorithm are shown on the left and right of Fig. 4, respectively.
Table 1 The mean vectors and covariance matrices of three Gaussian distributions in Gaussian1
Fig. 4 The distribution of the instances in the artificial data set Gaussian1 (left), and the distribution of the instances selected from the artificial data set Gaussian1 (right)
The fourth artificial data set Gaussian2 is a three-dimensional data set with three classes that follow three Gaussian distributions; it contains 1500 data points, 500 data points per class. The mean vectors and covariance matrices of the three Gaussian distributions are given in Table 2. The distribution of the instances in the artificial data set Gaussian2 and the distribution of the instances selected from Gaussian2 by the proposed algorithm are shown on the left and right of Fig. 5, respectively.
Table 2 The mean vectors and covariance matrices of three Gaussian distributions in Gaussian2
The visualized results in Figs. 2 and 3 show that the data points selected from the artificial data sets Circle and Square by the proposed algorithm are distributed near the classification boundary, while the results in Figs. 4 and 5 show that the data points selected from Gaussian1 and Gaussian2 are distributed in the overlapping areas of different classes. The experimental results are consistent with the instance selection criterion given in Eq. (6).
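For readers who want to reproduce a data set like Circle, the description given earlier in this section can be turned into the following sketch. The sampling scheme (uniform by area, via the square root of a uniform variate), the function name and the random seed are assumptions made here; the paper does not state how the points were generated.

```python
import numpy as np


def sample_annulus(n, r_inner, r_outer, center, rng):
    """Draw n points uniformly (by area) from the ring r_inner <= r <= r_outer."""
    # Sampling r^2 uniformly makes the point density uniform over the area.
    radius = np.sqrt(rng.uniform(r_inner**2, r_outer**2, size=n))
    angle = rng.uniform(0.0, 2.0 * np.pi, size=n)
    offsets = np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])
    return np.asarray(center) + offsets


rng = np.random.default_rng(0)
center = (0.5, 0.5)
class1 = sample_annulus(500, 0.0, 0.3, center, rng)   # disk of radius 0.3
class2 = sample_annulus(500, 0.3, 0.5, center, rng)   # ring between 0.3 and 0.5
X = np.vstack([class1, class2])
y = np.repeat([1, 2], 500)
```

Because the two classes meet exactly at radius 0.3, the classification boundary is clear, which matches the description of the first two artificial data sets.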
In addition, we compared the test accuracies of the classifiers trained on the four original artificial data sets with those of the classifiers trained on the selected instance subsets; the two test accuracies are denoted by Test accuracy1 and Test accuracy2, respectively. The classifier is a single hidden layer feedforward neural network trained by extreme learning machine. The results of the experimental comparison are listed in Table 3, where the bold values indicate the larger of Test accuracy1 and Test accuracy2.
From the experimental results listed in Table 3, we can see that the test accuracies of the classifiers trained on the selected instance subsets of Circle and Square are slightly lower than those of the classifiers trained on the two original data sets, while the test accuracies of the classifiers trained on the selected instance subsets of Gaussian1 and Gaussian2 are higher than those of the classifiers trained on the two original data sets. These results verify that the algorithm proposed in this paper is feasible. In the second experiment, we demonstrate the effectiveness of the proposed algorithm by comparing it with three closely related methods.

Experiment 2: demonstrating the effectiveness of the proposed algorithm
In this section, we demonstrate the effectiveness of the proposed method by comparing it with three closely related methods on test accuracy and compression ratio. The three compared methods are the MapReduce based condensed nearest neighbor method (denoted by MR-CNN), the Spark based condensed nearest neighbor method (denoted by S-CNN), and the locality sensitive hashing based instance selection method (denoted by LCS-IS) [22]. In the following, we first introduce the experimental environment and data sets, then the parameter settings, and finally present the comparisons with the three related methods. In addition, we compared the two implementations of the proposed algorithm on the big data platforms Hadoop and Spark in terms of test accuracy, compression ratio and running time, and visualized the experimental results.

The experimental environment and data sets
Experiment 2 is conducted on a big data platform with 8 computing nodes; the configuration of the computing nodes is given in Table 4. Note that the master node and the slave nodes of the platform have the same configuration. The data sets used in the experiments are four UCI data sets, whose basic information is given in Table 5.

The parameter settings
The parameter settings include the number of hidden layer nodes in the SLFN and the parameters of the genetic algorithm. First, an ablation study on the number of hidden layer nodes in the SLFN is conducted. More specifically, we train the SLFN on a subset randomly selected from the big data set, where the size of the subset is roughly equal to $n/m$, $n$ is the size of the big data set, and $m$ is the number of nodes in the big data platform. We train SLFNs with different numbers of hidden layer nodes and record the test accuracy of each. The results of the ablation study on the different data sets are given in Fig. 6. The number of hidden layer nodes chosen for the SLFN is listed in column 5 of Table 6. Based on our prior knowledge of the parameters of the genetic algorithm [45,46], the parameters of the genetic algorithm are set as listed in columns 2-4 of Table 6. The number of iterations is not set as a parameter; the termination condition of the genetic algorithm adapts to each data set. Specifically, the algorithm terminates when the value of the fitness function no longer increases.

The comparison with three related methods
We implemented the proposed approach on two open-source big data platforms, Hadoop and Spark; the two implementations are denoted by GA-MR-IS and GA-S-IS, respectively. We experimentally compared GA-MR-IS and GA-S-IS with three related methods on test accuracy and compression ratio; the experimental results are given in Tables 7 and 8, respectively. In Tables 7 and 8, the bold values indicate the maximum values among the five methods. It can be seen from the experimental results listed in Tables 7 and 8 that the proposed method is superior to the three compared methods in both test accuracy and compression ratio. We think the reasons include the following three aspects:
1. The proposed method selects the optimal instance subset, and the optimality is judged by an expert committee whose members are independent of the candidate instance set.
2. The proposed method can overcome the following three shortcomings of CNN, while MR-CNN and S-CNN cannot.
• CNN is especially sensitive to noise, because noisy instances will usually be misclassified by their neighbors, and thus will be retained.
• CNN is also sensitive to the order in which instances are presented to the algorithm when it decides whether or not to select them.
• There are also redundant instances in the subset selected by CNN.
3. In the proposed method, we defined a novel criterion to measure the importance of an instance subset; the measure integrates the wisdom of all the members of a classifier committee. In contrast, LCS-IS uses the generated hash function to measure the importance of instances, and this measure has high uncertainty.

The comparison of two implementations
We also conducted an experimental comparison between the two implementations (i.e., GA-MR-IS and GA-S-IS) in terms of test accuracy, compression ratio and running time, and visualized the experimental results, as shown in Figs. 7, 8 and 9. It can be seen from Figs. 7 and 8 that the two implementations differ little in test accuracy and compression ratio. This is easy to understand, since GA-MR-IS and GA-S-IS use the same mechanism for selecting optimal instance sets. However, Fig. 9 shows that the running times of GA-MR-IS and GA-S-IS are significantly different, because MapReduce and Spark use different mechanisms to process big data: MapReduce uses a batch processing mechanism, while Spark uses an in-memory computing mechanism. In Zhai and Huang [47], we analyze the computational time complexity of the two implementations in detail, which theoretically explains the significant difference.
Fig. 8 The comparison of the compression ratio of two implementations
Fig. 9 The comparison of the running times of two implementations