Big data decision tree for continuous-valued attributes based on unbalanced cut points

The decision tree is a widely used decision support model that can quickly mine effective decision rules from a dataset. The decision tree induction algorithm for continuous-valued attributes based on unbalanced cut points is efficient for mining decision rules; however, extending it to big data remains an unresolved problem. In this paper, two solutions are proposed to solve this problem: the first is based on partitioning instance subsets, whereas the second is based on partitioning attribute subsets. The crucial issue in both solutions is how to find the global optimal cut point from the set of local optimal cut points. For the first solution, we propose calculating the Gini index of the cut points across computing nodes and selecting the global optimal cut point through communication between these nodes. For the second solution, we propose dividing the big data into subsets by attribute subsets so that all cut points of an attribute reside on the same map node; the local optimal cut points can then be found on that map node, and the global optimal cut point is obtained by summarizing all local optimal cut points on the reduce node. Finally, the proposed solutions are implemented on two big data platforms, Hadoop and Spark, and compared with three related algorithms on four datasets. Experimental results show that the proposed algorithms not only effectively solve the scalability problem but also achieve the lowest running time, the fastest speed, and the highest efficiency while preserving classification performance.


Introduction
In the field of machine learning, big data refers to data that is too large to be processed by traditional machine learning algorithms. Big data therefore poses huge challenges to traditional machine learning algorithms [1]. It is of great theoretical and practical interest to study how to extend machine learning algorithms to handle big data. In previous studies, researchers have extended several machine learning algorithms to the big data environment. For instance, as early as 2006, Chu et al. [2] deeply and extensively studied the scalability of machine learning algorithms for big data scenarios and developed a broadly applicable parallel programming method by adapting the MapReduce paradigm; as a result, the proposed method can be easily applied to several learning algorithms, including locally weighted linear regression, k-means, logistic regression, naive Bayes, support vector machines, independent component analysis, principal component analysis, Gaussian discriminant analysis, the expectation-maximization algorithm, and the back-propagation algorithm. Moreover, He et al. [3] considered the parallel implementation of several classification algorithms based on MapReduce, implementing the parallel K-Nearest Neighbor (K-NN) algorithm, a parallel naive Bayesian network, and a parallel decision tree. Xu et al. [4] extended the k-means algorithm to big data environments with MapReduce, proposing the k-means++ algorithm, which can not only address the problem of optimally selecting the parameter K but also improve the scalability and efficiency of k-means. In addition, Duan et al. [5] proposed a parallel multi-classification algorithm for big data using the Extreme Learning Machine (ELM) and implemented it on the big data platform Spark. In the literature, there has been some work on the scalability of decision tree algorithms in the big data context. For instance, three parallel implementations of the C4.5 decision tree algorithm with MapReduce were proposed in Wang and Gao [6], Mu et al. [7], and Dai and Ji [8], respectively. Furthermore, Wang et al. [9] studied the optimization and parallelization of the C4.5 algorithm on the Spark platform and proposed a Spark-based parallel C4.5 decision tree algorithm. Yuan et al. [10] also investigated the optimization of the C4.5 decision tree algorithm, but unlike [9], on MapReduce rather than Spark. Chern et al. [11] proposed a decision tree credit assessment approach to solve the credit assessment problem in a big data environment. Wang et al. [12] investigated the extension of decision tree algorithms to big data environments from the perspective of data mining, taking RainForest Tree and the Bootstrapped Optimistic Algorithm for Tree construction as examples. Regarding big data machine learning, including big data decision tree learning, several researchers have presented in-depth and comprehensive reviews [13][14][15][16][17]. To sum up, although there have been some extensions of decision tree algorithms to big data environments, the problem of extending continuous-valued decision tree algorithms based on unbalanced cut points to big data environments is still unsolved. Therefore, in this article, we propose two solutions to solve this problem. This paper makes three significant contributions: 1.
Two solutions based on the divide-and-conquer strategy are proposed to extend the continuous-valued decision tree algorithm based on unbalanced cut points to big data scenarios. The first is based on partitioning instance subsets; that is, the big dataset is divided into several disjoint instance subsets. The second is based on partitioning attribute subsets; that is, the big dataset is divided into several disjoint attribute subsets. More intuitively, the first solution divides the big dataset into several disjoint subsets along the horizontal direction, while the second divides it along the vertical direction. 2. In the induction process of the continuous-valued big data decision tree based on unbalanced cut points, finding the global optimal cut point from multiple local optimal cut points is a key problem, and the second contribution of this paper is to solve it. There is only one global optimal cut point, which is relative to the big dataset, while there are multiple local optimal cut points, each relative to a subset of the big dataset; each local optimal cut point is independently calculated from the local subset on a computing node of the big data platform. 3. Extensive experiments are conducted on the two big data platforms, Hadoop and Spark, to verify the feasibility and effectiveness of the proposed method, including comparison experiments with three closely related algorithms on four big datasets and experiments on two artificial datasets to demonstrate the feasibility of the proposed algorithm.
The rest of this paper is organized as follows. In the "Related work" section, we briefly review the related work. In the "Two solutions" section, we describe the details of the proposed solutions. In the "Experimental results and analysis" section, experiments are conducted to demonstrate the feasibility and effectiveness of the proposed solutions. Finally, we conclude our work in the "Conclusions" section.

Related work
As one of the top ten classical algorithms in machine learning [18], the decision tree has been widely used in classification and regression problems due to its fast learning speed and high prediction accuracy. However, in the big data environment, the decision tree cannot be constructed entirely in memory because of the large amount of data to be processed, which would require a large amount of computation time even if it were feasible. Accordingly, it is meaningful to study how to extend the decision tree algorithm to a big data environment. The methods for extending the decision tree algorithm and its variants to big data environments fall mainly into two categories: distributed parallel crisp decision trees and distributed fuzzy decision trees.
The former is mainly based on the distributed extension of ID3, C4.5, CART, SPRINT, and other baseline algorithms so as to improve their accuracy and efficiency on big data; however, the computational complexity of the decision tree algorithm itself, its resident memory, and other characteristics still need to be considered and optimized before implementation. Based on MapReduce, two parallelized C4.5 algorithms were proposed in Wang and Gao [6] and Mu et al. [7]. Moreover, Genuer et al. [19] extended the random forest algorithm to big data classification, and building on their work, Juez-Gil et al. [20] proposed the rotation forest algorithm for big data classification. Furthermore, Shivaraju et al. [21] proposed a parallel decision tree based on attribute partition. In this approach, the whole dataset was first divided, according to certain rules, into a training set for building the decision tree and a test set for testing the model. Then the attributes of both datasets were partitioned, and a decision tree was generated on the training set according to each partition. According to the partition of the test set and the training tree, a test tree was generated, and weighted voting over the generated test trees delivered the final classification result. To solve the time-consuming calculation of the information gain ratio in the C4.5 algorithm, Yuan et al.
[22] proposed an improved C4.5 algorithm, which used the Maclaurin formula to calculate the information gain in order to improve computational efficiency. Comparing the improved C4.5 with the traditional C4.5 on the Hadoop platform shows that the improved algorithm has higher accuracy and efficiency. In addition, Desai and Chaudhary studied the scalability of the distributed decision tree and proposed two distributed decision tree algorithms: DDTv1 [23] and DDTv2 [24]. In more detail, DDTv1 is a distributed implementation comprising the Hadoop-based Distributed Decision Tree (DDT) and the Spark-based Spark Tree (ST). DDTv2 optimizes DDT and ST based on DDTv1 and makes a compromise between partitioning and accuracy; in other words, DDTv2 provides as much parallelism as possible without any loss of accuracy. Chen et al. [25] proposed a parallel random forest algorithm for big data, tested on Spark; to improve its generalization performance, the algorithm is optimized with a hybrid approach combining data-parallel and task-parallel optimization. Es-sabery et al. [26] proposed an improved ID3 algorithm and used it for the classification of Twitter big data. The main contributions of that work include three points: (1) an information gain feature selector was used to reduce the dimensionality of Twitter's high-dimensional data; (2) after dimension reduction, weighted information gain replaced information gain in the ID3 algorithm as the heuristic for inducing decision trees; (3) the improved ID3 algorithm was implemented with the big data programming framework MapReduce. Moreover, Jurczuk et al. [27] proposed a global induction of classification trees for large-scale data mining, using multiple GPUs to accelerate the algorithm. Abuzaid et al.
[28] introduced an optimized system for training deep decision trees at scale; it is based on partitioning the features of the data, along with a set of optimized data structures, to reduce the CPU occupancy and communication costs of training. Chen et al. [29] designed a distributed decision tree algorithm and implemented it on the big data platform Spark. In addition, En-nattouh et al. [30] applied the decision tree algorithm to select the optimal configuration and enhance parameter optimization in clustered platforms for big data: the number of tasks for each node was first calculated by analyzing the internal elements of YARN, and the decision tree was then used to find the optimal configuration. To reduce the time complexity of parsing a big data decision tree, Liu et al. [31] designed a novel data structure termed the collision-partition tree, which leads to a more balanced tree structure and thus reduces the computational time complexity. Jin et al. [32] introduced sampling schemes with and without replacement and designed an algorithm to improve the adaptation and generalization ability of classification rules in a big data environment. Online decision tree algorithms can tackle the problem of big data learning by training concurrently with incoming samples while providing inference results; based on this observation, Lin et al. [33] proposed a high-performance and scalable online decision tree learning algorithm. Weinberg et al. [34] proposed an algorithm that selects a representative decision tree from an ensemble of decision tree models for fast big data classification.
The distributed fuzzy decision tree extends the fuzzy decision tree to big data environments for handling large-scale data learning. Compared to the distributed crisp decision tree, research on the distributed fuzzy decision tree is relatively limited. Four representative works are Segatori et al. [15], Mu et al. [35], Wu et al. [36], and Fernandez-Basso et al. [37]. Segatori et al. [15] proposed a distributed fuzzy decision tree learning scheme based on the MapReduce programming model, which can generate both binary and multiway fuzzy decision trees from big data; the key idea of this learning scheme is to use fuzzy information entropy to discretize each continuous-valued attribute. Mu et al. [35] proposed a parallel fuzzy rule-base based decision tree; the main contribution of this work lies in developing a parallel fusing fuzzy rule-based classification system, where the parallelization was implemented via MapReduce and an ensemble was used to evaluate the obtained fuzzy rule base. In [36], a Hadoop-based high fuzzy utility pattern mining algorithm was developed to discover high fuzzy utility patterns from big datasets. Fernandez-Basso et al. [37] gave a Spark-based solution for discovering fuzzy association rules in big data.
As far as we know, the problem of how to extend the continuous-valued decision tree induction algorithm based on unbalanced cut points to a big data environment has not been solved; therefore, the main goal of this paper is to solve this problem.

Two solutions
The goal of this paper is to extend the continuous-valued decision tree induction algorithm based on unbalanced cut points to big data scenarios. This section first briefly reviews the baseline algorithm and then introduces the two solutions: one based on partitioning instance subsets, and the other based on partitioning attribute subsets.

The continuous-valued decision tree algorithm based on unbalanced cut points
The continuous-valued decision tree algorithm based on unbalanced cut points was proposed by Fayyad and Irani in 1992 [38]; the core concept of this algorithm is the unbalanced cut point. Given a continuous-valued decision table $D = (U, A \cup Y)$, its intuitive representation is shown in Table 1. The values in column $j$ of Table 1 (i.e., the values of attribute $a_j$) are sorted in ascending order; the sorted result is denoted by $x_{i_1 j}, x_{i_2 j}, \ldots, x_{i_n j}$. The mid-value $t_s$ of $x_{i_s j}$ and $x_{i_{s+1} j}$ is called a cut point of attribute $a_j$, where $1 \le s \le n-1$ and $1 \le j \le d$. Obviously, every $a_j \in A$ has $n-1$ cut points. If the instances $x_{i_s}$ and $x_{i_{s+1}}$ on both sides of the cut point $t_s$ belong to different classes, then $t_s$ is called an unbalanced cut point; otherwise, it is called a balanced cut point.
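As a concrete illustration of these definitions (this is our own sketch, not the paper's algorithm; the function name is hypothetical), the following snippet enumerates the cut points of one attribute and flags each one as unbalanced or balanced:

```python
def cut_points(values, labels):
    """Return (cut_point, is_unbalanced) pairs for one continuous attribute.

    The values are sorted together with their class labels; the mid-value of
    each pair of consecutive values is a cut point, and it is unbalanced
    exactly when the two adjacent instances belong to different classes.
    """
    pairs = sorted(zip(values, labels))          # sort by attribute value
    result = []
    for s in range(len(pairs) - 1):
        (v1, c1), (v2, c2) = pairs[s], pairs[s + 1]
        t = (v1 + v2) / 2                        # mid-value t_s
        result.append((t, c1 != c2))             # unbalanced iff classes differ
    return result
```

For four instances with values 1, 2, 3, 4 and classes 0, 0, 1, 1, this yields the three cut points 1.5, 2.5, 3.5, of which only 2.5 is unbalanced.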

Table 1 A decision table containing n instances
Given a continuous-valued decision table $D = (U, A \cup Y)$, suppose the instances in $U$ are classified into $k$ classes and the number of instances in class $i$ is $n_i$, $1 \le i \le k$. The Gini index of $U$ is defined as
$$\mathrm{Gini}(U) = 1 - \sum_{i=1}^{k} p_i^2, \quad (1)$$
where $p_i = n_i/n$. Similar to information entropy, the Gini index measures the uncertainty of the classes to which the instances belong.
Given a continuous-valued decision table $D = (U, A \cup Y)$, let $t_s$ be a cut point of attribute $a_j$ that partitions $U$ into two subsets $U_1$ and $U_2$. The Gini index of $t_s$ is defined by
$$\mathrm{Gini}(t_s) = \frac{|U_1|}{n}\mathrm{Gini}(U_1) + \frac{|U_2|}{n}\mathrm{Gini}(U_2). \quad (2)$$
It can be seen from (2) that the Gini index of the cut point $t_s$ is the weighted average of the Gini indexes of the two subsets $U_1$ and $U_2$; in other words, it measures the uncertainty of the classes of the two subsets partitioned by $t_s$. Obviously, the smaller the Gini index of $t_s$, the more important $t_s$ is. For attribute $a_j$, the optimal cut point $t_j^*$ is a cut point that satisfies
$$t_j^* = \arg\min_{t_s \in T_{a_j}} \mathrm{Gini}(t_s), \quad (3)$$
where $T_{a_j}$ is the set of all cut points of $a_j$. The global optimal cut point $t^*$ is the optimal cut point with respect to the attribute set $A$, i.e., a cut point that satisfies
$$t^* = \arg\min_{a_j \in A} \mathrm{Gini}(t_j^*). \quad (4)$$
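The Gini index of a set and of a cut point can be sketched as follows. This is a minimal illustration under the assumption that the cut-point Gini index is the size-weighted average of the two subsets' Gini indexes; the function names are ours:

```python
from collections import Counter

def gini(labels):
    """Gini index of a set of instances: 1 - sum_i p_i^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def cut_point_gini(values, labels, t):
    """Gini index of cut point t: size-weighted average over U1 (<= t) and U2 (> t)."""
    n = len(labels)
    left = [c for v, c in zip(values, labels) if v <= t]
    right = [c for v, c in zip(values, labels) if v > t]
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

For values 1, 2, 3, 4 with classes 0, 0, 1, 1, the cut point 2.5 separates the classes perfectly and attains Gini index 0, so it is the optimal cut point of this attribute.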
The attribute corresponding to the global optimal cut point $t^*$ is called the optimal extended attribute. Regarding the global optimal cut point, the following theorem holds [38].

Theorem 1 The global optimal cut point must be an unbalanced cut point.
This theorem suggests that when searching for the optimal cut point, only the Gini indexes of the unbalanced cut points need to be calculated; it is unnecessary to calculate the Gini indexes of the balanced cut points, which greatly reduces the computational complexity and improves the efficiency of the algorithm. For each attribute $a \in A$, let $T_a$ be the set of all cut points of $a$. The pseudo-code of the continuous-valued decision tree algorithm based on unbalanced cut points is given in Algorithm 1. When the continuous-valued decision table $D$ is a big dataset, Algorithm 1 becomes infeasible, so how can it be extended to big data scenarios? The general strategy of big data processing is divide and conquer: the big dataset is divided into several subsets, which are distributed to different computing nodes for parallel processing. A big dataset can be divided into subsets in both the horizontal and vertical directions. Partitioning in the horizontal direction divides the big dataset into instance subsets, while partitioning in the vertical direction divides it into attribute subsets. In this paper, based on horizontal and vertical partitioning, we give two solutions to extend the continuous-valued decision tree algorithm based on unbalanced cut points to big data scenarios.
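The single-machine split search at the heart of Algorithm 1 can be sketched in a few lines: per Theorem 1, balanced cut points are skipped, and the global optimal cut point is the minimum-Gini cut point over all attributes. This is our own minimal illustration (size-weighted Gini assumed), not the paper's pseudo-code:

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    """Return (attribute index, global optimal cut point, its Gini index),
    scanning only the unbalanced cut points of each attribute (Theorem 1)."""
    best = (None, None, float("inf"))
    n, d = len(X), len(X[0])
    for j in range(d):
        pairs = sorted((row[j], c) for row, c in zip(X, y))
        for s in range(n - 1):
            (v1, c1), (v2, c2) = pairs[s], pairs[s + 1]
            if c1 == c2:                         # balanced cut point: skip it
                continue
            t = (v1 + v2) / 2
            left = [c for v, c in pairs if v <= t]
            right = [c for v, c in pairs if v > t]
            g = len(left) / n * gini(left) + len(right) / n * gini(right)
            if g < best[2]:
                best = (j, t, g)
    return best
```

In a full decision tree induction, this search would be applied recursively to each resulting subset until a stopping criterion is met; the skip of balanced cut points is exactly what, statistically, roughly halves the number of Gini evaluations.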

The solution based on partitioning instance subsets
In this solution, the cut points of an attribute $a_j$ $(1 \le j \le d)$ fall into two categories: 1. Cut points within a subset (i.e., within a computing node), which correspond to the local data subset and can be viewed as local cut points; 2. Cut points between two subsets (i.e., between two computing nodes), which naturally arise from dividing the big dataset into several subsets.
If the number of computing nodes is $m$ (i.e., the number of subsets or partitions of the big dataset), then the number of this kind of cut points is $m-1$ (see Fig. 1). If we deliberately control the partition of the big dataset so that these $m-1$ cut points are all balanced cut points, then it is unnecessary to calculate their Gini indexes, and the processing difficulty is reduced. This solution must overcome the following two difficulties.
1. Calculating the Gini index of the local cut points on a node requires information about the subsets on other nodes, so how to calculate the Gini index of local cut points across nodes is the first difficult problem to be solved; 2. After the $d$ local optimal cut points corresponding to the $d$ attributes are found, the global optimal cut point must be selected from them. Finding the global optimal cut point is the second difficulty to be overcome.
The schematic diagram of the solution based on partitioning instance subsets is illustrated in Fig. 2.
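The $m-1$ between-subset cut points mentioned above can be illustrated concretely: if the instances are sorted on one attribute and split into $m$ contiguous subsets, each boundary between adjacent subsets induces one cut point, and whether it is balanced depends only on the classes of the two instances adjacent to the boundary. A hypothetical sketch (our own names, not the paper's algorithm):

```python
def boundary_cut_points(sorted_values, sorted_labels, m):
    """Split n sorted instances into m contiguous subsets and report, for each
    of the m-1 boundaries, the induced cut point and whether it is balanced."""
    n = len(sorted_values)
    size = n // m                                 # equal-sized partitions assumed
    boundaries = []
    for q in range(1, m):
        i = q * size                              # first index of subset q+1
        t = (sorted_values[i - 1] + sorted_values[i]) / 2
        balanced = sorted_labels[i - 1] == sorted_labels[i]
        boundaries.append((t, balanced))
    return boundaries
```

When a boundary falls between two instances of the same class, the induced cut point is balanced and its Gini index never needs to be computed, which is exactly the property the controlled partitioning aims for.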
The attribute corresponding to the global optimal cut point is the most important attribute, which is used as the extension attribute of the continuous-valued decision tree. In the following, we first take the cut point $t_1^{(2)}$ of attribute $a_j$ as an example to explain how to calculate the Gini index of the cut points within a subset. From Fig. 1, we can see that the cut point $t_1^{(2)}$ partitions the subset $D_2$ into two subsets, $\{x^{(2)}_{i_1 j}\}$ and $\{x^{(2)}_{i_2 j}, \ldots, x^{(2)}_{i_{n_2} j}\}$. On the other hand, from the viewpoint of the big dataset, the cut point $t_1^{(2)}$ partitions the big dataset $U$ into two subsets $U_1$ and $U_2$, where $U_1$ contains all instances of $U$ whose value of attribute $a_j$ is no greater than $t_1^{(2)}$ and $U_2$ contains the remaining instances. According to Eq. (2), after calculating the Gini indexes of $U_1$ and $U_2$, we can easily calculate the Gini index of the cut point $t_1^{(2)}$. To calculate the Gini indexes of $U_1$ and $U_2$, we only need to count the number of instances in $U_1$ and $U_2$ that belong to each class, which is easy to accomplish.
For the general case, let $t_p^{(q)}$ be the $p$th cut point of the $q$th subset $D_q$, $1 \le p \le n_q$, $1 \le q \le m$. The cut point $t_p^{(q)}$ partitions the subset $D_q$ into two subsets, and at the same time it partitions the big dataset $U$ into two subsets $U_1$ and $U_2$, where $U_1$ contains all instances of $U$ whose value of the attribute is no greater than $t_p^{(q)}$ and $U_2$ contains the remaining instances. Similarly, to calculate the Gini index of the cut point $t_p^{(q)}$, we only need to count the number of instances in $U_1$ and $U_2$ that belong to each class. It should be noted that in the map phase, the number of instances belonging to each class in the local subset is counted in parallel on each node; after the map phase is completed, the statistical results of all nodes are summarized in the reduce phase to obtain the number of instances belonging to each class in the big dataset. The design of the corresponding map and reduce functions is given in Algorithms 2 and 3, respectively.
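The map/reduce class counting described above can be simulated on a single machine: each "map" call counts, for one partition, the instances on each side of the cut point per class, and the "reduce" step sums these partial counts before the Gini index is computed. This is an illustrative sketch with our own function names, not the paper's Algorithms 2 and 3, and it assumes the size-weighted form of Eq. (2):

```python
from collections import Counter

def map_phase(partition, t, j):
    """On one node: per-class counts on each side of cut point t for attribute j."""
    left, right = Counter(), Counter()
    for row, label in partition:
        (left if row[j] <= t else right)[label] += 1
    return left, right

def reduce_phase(partials):
    """Sum the per-node counts to get the class counts of U1 and U2."""
    left, right = Counter(), Counter()
    for l, r in partials:
        left += l
        right += r
    return left, right

def gini_from_counts(counts):
    n = sum(counts.values())
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def cross_node_gini(partitions, t, j):
    """Gini index of cut point t computed across all instance partitions."""
    left, right = reduce_phase(map_phase(p, t, j) for p in partitions)
    nl, nr = sum(left.values()), sum(right.values())
    n = nl + nr
    return nl / n * gini_from_counts(left) + nr / n * gini_from_counts(right)
```

Only the small per-class count tables travel between nodes, never the instances themselves, which is what makes the cross-node Gini calculation cheap.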
In the framework of MapReduce, the corresponding big data decision tree algorithm based on partitioning instance subsets is denoted by BS-CDT-MR, and its pseudo-code is given in Algorithm 4.
In the framework of Spark, the pseudo-code of the corresponding algorithm denoted by BS-CDT-SP is given in Algorithm 5.

The solution based on partitioning attribute subsets
The partition based on attribute subsets uses attribute subsets to divide the big dataset into several data subsets; intuitively speaking, it divides the big dataset into several data subsets along the vertical direction. Each data subset corresponds to the result of a projection operation in a database system. Each data subset contains all instances, but each instance is represented only by the values of the attributes in the attribute subset; that is, the dimension of the instance vector is the cardinality of the attribute subset. For example, if an attribute subset contains three attributes, then each instance is represented by the values of those three attributes, i.e., a three-dimensional vector. This solution consists of partitioning the dataset with attribute subsets and deploying these subsets on the computing nodes so that all cut points of the same attribute are on one computing node. Therefore, counting the unbalanced cut points and calculating the Gini index are similar to the standalone setting and relatively easy. The difficulty of this solution is that the user needs to design the partition and implementation scheme. Hence, a Java class, MTextInputFormat, is developed; it extends the Hadoop class FileInputFormat, and the two functions createRecordReader() and getSplits() are overridden to split the big dataset by attribute subsets. The partition scheme is given in Fig. 3, and the schematic diagram of the solution based on partitioning attribute subsets is illustrated in Fig. 4.
Suppose the large dataset $U$ has $J$ attributes and the big data platform has $K$ computing nodes. Obviously, an instance of $U$ consists of $J$ attribute values and a class label. After partitioning the big dataset equally, each subset $A_i$ $(1 \le i \le K)$ has $\lceil J/K \rceil$ attribute columns and a class label column. Since the Hadoop function getSplits() does not support splitting a dataset by attribute subsets, it must be overridden. Specifically, if the first column is a class label, the $i$th datum of the first row is used as the starting position of the current split, and the $i$th datum of each subsequent row is read into the split until the end-flag delimiter, indicating that the partition of the big dataset corresponding to the attribute subset $A_i$ $(1 \le i \le K)$ is complete. Finally, the Hadoop function createRecordReader() is modified to use the column offset as the key and the contents of the attribute and class label columns as the value. The pseudo-code of the corresponding map and reduce functions is given in Algorithms 6 and 7, respectively. In the framework of Spark, the pseudo-code of the corresponding algorithm, denoted by BA-CDT-SP, is given in Algorithm 9.
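The overall flow of the attribute-subset solution can be simulated without Hadoop: each column plays the role of one map node's attribute, every cut point of that attribute is local to its node, and the reduce step simply takes the minimum over the local optima. A hypothetical single-machine sketch (our own names, not the paper's Algorithms 6, 7, or 9; size-weighted Gini assumed):

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def local_optimal(column, labels, j):
    """'Map' side: the optimal cut point of attribute j. Since the vertical
    partition keeps all cut points of an attribute on one node, no cross-node
    communication is needed here."""
    pairs = sorted(zip(column, labels))
    n = len(pairs)
    best = (float("inf"), None, j)               # (gini, cut point, attribute)
    for s in range(n - 1):
        if pairs[s][1] == pairs[s + 1][1]:       # balanced cut point: skip
            continue
        t = (pairs[s][0] + pairs[s + 1][0]) / 2
        left = [c for v, c in pairs if v <= t]
        right = [c for v, c in pairs if v > t]
        g = len(left) / n * gini(left) + len(right) / n * gini(right)
        if g < best[0]:
            best = (g, t, j)
    return best

def global_optimal(X, y):
    """'Reduce' side: summarize the local optima of all attribute columns."""
    columns = [[row[j] for row in X] for j in range(len(X[0]))]
    return min(local_optimal(col, y, j) for j, col in enumerate(columns))
```

Because each local optimum is a single (Gini, cut point, attribute) triple, the reduce step only compares $J$ small records, which mirrors why this solution avoids the cross-node Gini communication of the instance-subset solution.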

Experimental results and analysis
To demonstrate the effectiveness of the proposed solutions, we conducted experiments on a big data platform with 6 computing nodes using two open-source frameworks: MapReduce and Spark. The configuration of the big data platform is given in Table 2, and the configuration of the computing nodes is given in Table 3. It should be noted that in the big data platform, the configurations of the master node and the slave nodes are the same.
We compared the proposed algorithms with five methods on five datasets, including two artificial and three UCI datasets. The first artificial dataset, denoted by Gaussian1, is two-dimensional. The instances in this dataset are divided into two classes, both following a Gaussian distribution; the corresponding parameters are given in Table 4. The second artificial dataset, denoted by Gaussian2, is four-dimensional. The instances in this dataset fall into four classes, all following a Gaussian distribution; the corresponding parameters are given in Table 5. The basic information of the five datasets is given in Table 6. The five comparison methods are Parallel C4.5 [7], FRBDT (Parallel Fuzzy Rule-base Based Decision Tree) [35], IS-C4.5 (Improvement Strategy for C4.5) [6], Weka J48-MR in the Hadoop machine learning library, and MLlib DT-SP in the Spark machine learning library. The parallel C4.5 implementations based on MapReduce and Spark are denoted by Parallel C4.5-MR and Parallel C4.5-SP, respectively. The indexes for the experimental comparison are test accuracy and running time.
Since the test accuracy of the same algorithm implemented on different platforms does not change significantly, the experimental comparison results of the test accuracy are given in Table 7, while the running times of the algorithms implemented by MapReduce and Spark are given in Tables 8 and 9, respectively. From the experimental results displayed in Table 7, one can find the following: 1. There is no significant difference in test accuracy between the two algorithms based on instance subset partition (i.e., BS-CDT-MR and BS-CDT-SP) and the two algorithms based on attribute subset partition (i.e., BA-CDT-MR and BA-CDT-SP) on the five datasets. The test accuracy of the two algorithms based on attribute subset partition on the datasets Gaussian2 and SUSY is slightly higher than that of the two algorithms based on instance subset partition. This is mainly because all four algorithms use the Gini index as a heuristic, but the partition modes and implementation frameworks differ; 2. Compared to the other six algorithms (Parallel C4.5-MR, Parallel C4.5-SP, FRBDT, IS-C4.5, Weka J48-MR, and MLlib DT-SP), there is no significant difference in test accuracy on the five datasets.
Referring to the experimental results shown in Table 8, the two algorithms proposed in this paper have the shortest running time, the fastest speed, and the highest efficiency compared to the four algorithms based on MapReduce. The reason is that the proposed algorithms are based on unbalanced cut points: when calculating the heuristic, only the Gini indexes of the unbalanced cut points need to be computed, not those of the balanced cut points. From a statistical point of view, the computation time can thus be roughly halved. Moreover, from the experimental results in Table 9, one can draw a conclusion similar to that of Table 8: the two algorithms proposed in this paper have the shortest running time, the fastest speed, and the highest efficiency. We believe the reason is that the solution based on partitioning attribute subsets places all cut points of each attribute on the same computing node, so calculating the Gini index and the local optimal cut point requires no communication between computing nodes, while the solution based on partitioning instance subsets must communicate between computing nodes to calculate the Gini index and the local optimal cut point for each attribute.
To show more intuitively the difference in running efficiency when the two solutions are implemented with the two big data frameworks, MapReduce and Spark, we visualized the running time on the five datasets. Figure 5 presents a comparison of the running time of the solution based on partitioning attribute subsets under the two frameworks. It can be clearly seen from Fig. 5 that for the same partitioning method, the difference in running time between the MapReduce and Spark implementations is very significant; the result is similar for the solution based on partitioning instance subsets, see Fig. 6. We think the main reason is that Spark constructs a DAG (directed acyclic graph) when processing big data; compared with MapReduce, the number of shuffle operations can be reduced in most cases, thus saving a large amount of sorting time. Furthermore, from the experimental results, we found a strange phenomenon: among the five datasets used in the experiment, Covertype has the smallest size, yet all the algorithms spent the most time on this dataset. We believe that precisely because of its small size, MapReduce and Spark cannot fully exploit their advantages: when processing such a dataset, many small files are generated, and these small files produce a large number of I/O operations, thus consuming a lot of time.

Conclusions
In this paper, two solutions are proposed to extend the continuous-valued decision tree induction algorithm based on unbalanced cut points to big data scenarios. The first solution is based on instance subset partition along the horizontal direction; its technical difficulty is calculating the Gini index of the unbalanced cut points across nodes. The second solution is based on attribute subset partition along the vertical direction; its technical difficulty is that users need to design a big data partition scheme based on attribute subsets, under which the cross-node calculation of the Gini index of unbalanced cut points is not needed. This paper addresses both technical difficulties and proposes a continuous-valued big data decision tree induction algorithm based on unbalanced cut points. The proposed algorithm is implemented on two big data platforms, MapReduce and Spark, and an experimental comparison is conducted with five related algorithms in terms of test accuracy and running time.

Fig. 1
Fig. 1 Schematic diagram of cut points within subset and cut points between subsets

Fig. 2
Fig. 2 The schematic diagram of the solution based on partitioning instance subsets

Fig. 3 Fig. 4
Fig. 3 The design scheme of partitioning attribute subsets

Fig. 5 Fig. 6 Fig. 7
Fig. 5 A comparison of the running time of the solution based on partitioning attribute subsets under two frameworks

Fig. 8
Fig. 8 A comparison of the running time of the two solutions under the framework Spark

Table 2
The configuration of the big data platform

Table 3
The configuration of the nodes in the big data platform

Table 6
The basic information of the five datasets

Table 7
The experimental comparison results of test accuracy. The data in bold indicate the highest test accuracy of the different algorithms on the different datasets


Table 8
The experimental comparison results of running time with MapReduce. The data in bold indicate the shortest running times for the different algorithms implemented by MapReduce and Spark on the different datasets, respectively

Table 9
The experimental comparison results of running time with Spark. The data in bold indicate the shortest running times for the different algorithms implemented by MapReduce and Spark on the different datasets, respectively