Comparative study between incremental and ensemble learning on data streams: Case study
- Wenyu Zang^{1, 2},
- Peng Zhang^{2},
- Chuan Zhou^{2} and
- Li Guo^{2}
https://doi.org/10.1186/2196-1115-1-5
© Zang et al.; licensee Springer. 2014
Received: 13 December 2013
Accepted: 20 January 2014
Published: 24 June 2014
Abstract
With the unlimited growth of real-world data and the increasing demand for real-time processing, the immediate processing of big stream data has become an urgent problem. In stream data, hidden patterns commonly evolve over time (i.e., concept drift), and many dynamic learning strategies have been proposed to handle this, such as incremental learning and ensemble learning. To the best of our knowledge, no work has systematically compared these two methods. In this paper we conduct a comparative study of these two learning methods. We first introduce the notion of concept drift and propose how to measure it quantitatively. Then, we review the history of incremental learning and ensemble learning, introducing the milestones in their development. In the experiments, we comprehensively compare and analyze their performance with respect to accuracy and time efficiency under various concept drift scenarios. We conclude with several possible future research problems.
Background
We are now entering the era of big data. In government, business and industry, big data are generated rapidly and steadily, growing at a rate on the order of a million records per day. Moreover, these data are often temporally and spatially correlated. Typical examples include wireless sensor data, RFID data and Web traffic data. Such data arrive rapidly and without bound, forming a new class of data called “big stream data”.
The central issue in learning from big stream data is how to address the challenge of concept drift. Concept drift was first introduced by Widmer and Kubat [1], who noticed that the concept (the classification boundary or the clustering centers) changes continuously as time elapses. Based on the speed at which the concept changes, we divide concept drift into loose concept drift and rigorous concept drift [2]. In the former, the concepts in adjacent data chunks are sufficiently close to each other; in the latter, the genuine concepts in adjacent data chunks may change randomly and rapidly.
Incremental learning [3] and ensemble learning [4] are two fundamental approaches to learning from big stream data with concept drift. Incremental learning follows a machine learning paradigm in which learning takes place whenever new examples emerge, and the model adjusts what has been learned according to the new examples. Ensemble learning, in contrast, employs multiple base learners and combines their predictions. The fundamental principle of dynamic ensemble learning is to divide a large data stream into small data chunks and to train classifiers on each data chunk independently. The most prominent difference between incremental learning and traditional machine learning is that incremental learning does not assume that a sufficient training set is available before the learning process starts; instead, training examples appear over time. Moreover, the biggest difference between incremental learning and ensemble learning is that ensemble learning may discard outdated training data, whereas incremental learning may not.
Although both types of methods have their own strengths in data stream mining, comparisons between them are rare. Tsymbal [5] described several types of concept drift and the related work on handling it. However, that survey does not clearly categorize incremental and ensemble learning algorithms, and it reports no experiments on the different learning frameworks.
In this paper we conduct a comparative study of incremental learning and ensemble learning algorithms. We compare their performance in terms of both accuracy and efficiency, and we give some suggestions for choosing a suitable classifier.
This paper is organized as follows. In section Incremental learning, we review and summarize incremental learning algorithms. In section Ensemble learning, ensemble learning algorithms are reviewed and classified. In section Experiment results, incremental learning and ensemble learning algorithms are analyzed and compared under a unified standard. The discussion and conclusions are given in section Conclusion.
Incremental learning
Generally, the classification problem is defined as follows. A set of N training examples of the form (x, y) is given, where y is a discrete class label and x is a vector of d attributes (each of which may be symbolic or numeric). The goal is to produce from these examples a model y = f(x) that will predict the classes y of future examples x with high accuracy.
To solve this problem, traditional statistical analysis methods load all the training data into memory at once. However, storage capacity falls far short of what the explosive growth of today’s information requires. Moreover, traditional data mining algorithms have shown limitations when applied to time series. Incremental learning algorithms are an efficient way of addressing these problems.
According to the underlying learning method, incremental learning methods can be sorted into three categories: incremental decision trees, incremental Bayesian algorithms and incremental SVMs. According to the number of new instances added to the model at a time, they can be sorted into instance-by-instance learning and block-by-block learning.
Incremental decision tree
VFDT (very fast decision tree) [6] and CVFDT (concept-adapting very fast decision tree) [7] are two classical and influential incremental decision tree algorithms.
The VFDT algorithm was first proposed by Domingos and Hulten in 2000. Using Hoeffding bounds, the authors showed that a small sample of the available examples suffices when choosing the split attribute at any given node, and that the output is asymptotically nearly identical to that of a conventional learner.
Let G(X_{i}) be the heuristic measure used to choose test attributes. Let X_{a} be the attribute with the best heuristic measure and X_{b} the attribute with the second best. Let $\Delta\overline{G}=G\left({X}_{a}\right)-G\left({X}_{b}\right)$. Applying the Hoeffding bound to $\Delta\overline{G}$, if $\Delta\overline{G}>\epsilon$, we can confidently select X_{a} as the split attribute. VFDT is therefore a real-time system that is able to learn from large amounts of data within practical time and memory constraints.
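The split test can be illustrated with a minimal Python sketch. It uses the standard form of the Hoeffding bound, $\epsilon=\sqrt{R^{2}\ln(1/\delta)/(2n)}$, where R is the range of the heuristic measure, δ is the allowed probability of choosing the wrong attribute, and n is the number of examples seen at the node. The symbols R, δ and n, as well as the function names, are assumptions introduced for illustration and are not defined in the text above.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2 * n))."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(g_best: float, g_second: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split on the best attribute once its lead over the runner-up exceeds epsilon."""
    delta_g = g_best - g_second  # observed difference G(X_a) - G(X_b)
    return delta_g > hoeffding_bound(value_range, delta, n)


# Hypothetical numbers: the same observed lead is not trusted after
# 100 examples but is trusted after 1000, because epsilon shrinks with n.
few = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=100)
many = should_split(0.30, 0.15, value_range=1.0, delta=1e-7, n=1000)
```

Because ε decreases as n grows, a node simply waits for more examples when the two best attributes are too close to separate.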
However, when it comes to rigorous concept drift, VFDT has its limitations. To solve this problem, Hulten, Spencer and Domingos proposed the CVFDT (concept-adapting very fast decision tree) algorithm [7] in 2001, based on VFDT. In CVFDT, each internal node keeps a list of alternate sub-trees that are being considered as replacements for the sub-tree rooted at that node. CVFDT also supports a parameter that limits the total number of alternate trees being grown at any one time. Each node with a non-empty set of alternate sub-trees, l_{test}, enters a testing mode to determine whether it should be replaced by one of its alternate sub-trees. l_{test} collects the next m training examples that arrive and uses them to compare the accuracy of the sub-tree it roots with the accuracies of all of its alternate sub-trees. If the most accurate alternate sub-tree is more accurate than l_{test}, l_{test} is replaced by that alternate. CVFDT also prunes alternate sub-trees during the test phase. For each alternate sub-tree of l_{test}, $l_{\text{all}}^{i}$, CVFDT remembers the smallest accuracy difference ever achieved between the two, $\Delta\mathrm{min}\left(l_{\text{test}},l_{\text{all}}^{i}\right)$. CVFDT prunes any alternate whose current test-phase accuracy difference is at least $\Delta\mathrm{min}\left(l_{\text{test}},l_{\text{all}}^{i}\right)+1\%$. By maintaining alternate sub-trees in this way, CVFDT adapts to concept drift better than VFDT.
In summary, both algorithms are real-time methods for data-stream mining. CVFDT is faster than VFDT and also adapts better to concept drift, while VFDT uses less memory than CVFDT.
Incremental Bayesian algorithm
In the Bayesian algorithm, the prior probability P(θ|I_{0}) is a known quantity. In the incremental Bayesian algorithm, the prior probability becomes P(θ|S, I_{0}) once the incoming new training instances are taken into account. The question we are concerned with is how to update the prior probability incrementally.
First, we fix some notation. The sample space S is composed of the attribute space I and the class space C, denoted S = {s_{1}, s_{2}, …, s_{n}} = < I, C >. Each sample is s_{i} = {a_{1}, a_{2}, …, a_{m}, c_{j}}. The i-th attribute is denoted by A_{i}, with values {a_{ik}}, and the class attribute C takes l discrete values (c_{1}, c_{2}, …, c_{l}). The task of the classifier is to learn the attribute space I and the class space C and to find the mapping between them. For any sample s_{i} = {a_{1}, a_{2}, …, a_{m}} ∈ I, exactly one c_{i} in the class attribute set C = (c_{1}, c_{2}, …, c_{l}) corresponds to it. That is, for each instance x = (a_{1}, a_{2}, …, a_{m}) ∈ I there exists exactly one c_{i} such that P(c = c_{i}|x) ≥ P(c = c_{j}|x) (j = 1, 2, …, l).
Here A_{ik} is the k-th value of attribute A_{i}, |A_{i}| is the number of values of attribute A_{i}, and |D| is the number of training samples.
In summary, the Bayesian algorithm is inherently incremental. For incoming training instances with labels, it is easy to implement an incremental algorithm. For instances without labels, the sampling policy and various classification loss expressions are discussed in order to simplify and improve the classifier.
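For labeled instances, the incremental property amounts to keeping counts and updating them one instance at a time. The sketch below is one possible illustration for symbolic attributes. It assumes Laplace-smoothed estimates, (1 + count(c)) / (|C| + |D|) for the class prior and (1 + count(A_i = a_ik, c)) / (|A_i| + count(c)) for the conditionals, which is a common choice and not necessarily the exact estimator intended here; the class and method names are hypothetical.

```python
import math
from collections import defaultdict


class IncrementalNaiveBayes:
    """Naive Bayes over symbolic attributes, updated one labeled instance at a time."""

    def __init__(self, n_values):
        self.n_values = list(n_values)        # n_values[i] plays the role of |A_i|
        self.n_seen = 0                       # plays the role of |D|
        self.class_counts = defaultdict(int)  # label -> count
        self.value_counts = defaultdict(int)  # (attribute index, value, label) -> count

    def update(self, x, label):
        """Absorb one labeled instance by incrementing the relevant counts."""
        self.n_seen += 1
        self.class_counts[label] += 1
        for i, value in enumerate(x):
            self.value_counts[(i, value, label)] += 1

    def predict(self, x):
        """Return the label with the highest smoothed log-posterior, or None if untrained."""
        best_label, best_score = None, -math.inf
        n_classes = len(self.class_counts)
        for label, count in self.class_counts.items():
            score = math.log((1 + count) / (n_classes + self.n_seen))
            for i, value in enumerate(x):
                hits = self.value_counts.get((i, value, label), 0)
                score += math.log((1 + hits) / (self.n_values[i] + count))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical two-attribute stream; each attribute has two possible values.
nb = IncrementalNaiveBayes(n_values=[2, 2])
for x, y in [(("sunny", "hot"), "no"), (("sunny", "hot"), "no"),
             (("rainy", "cool"), "yes"), (("rainy", "hot"), "yes")]:
    nb.update(x, y)
```

No training instance needs to be stored: the counts are a sufficient summary, which is why memory use does not grow with the length of the stream.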
Incremental SVM
The two core concepts of the SVM algorithm are the mapping of input vectors into a high-dimensional feature space and structural risk minimization. SVMs have a useful property: classification on the support vector (SV) set is equivalent to classification on the whole training set. Based on this property, an incremental SVM [11–18] can be trained by preserving only the SVs at each step and adding them to the training set for the next step. Depending on the situation, there are different ways of selecting the training set at each step.
The problems discussed for incremental SVM algorithms are how to discard historical samples optimally and how to select new training instances in the successive learning procedure. However, some intrinsic difficulties remain. First, the support vectors depend heavily on the selected kernel function. Second, when concept drift happens, the previous support vectors may become useless.
Decision tree algorithms, Bayesian learning algorithms and SVM algorithms are three main families of algorithms in data mining. The problem discussed for incremental algorithms is how to use previous training results to accelerate the successive learning procedure. The incremental decision tree (Hoeffding tree or VFDT) uses a statistical result (the Hoeffding bound) to guarantee that we can learn from abundant data within practical time and memory constraints. The incremental Bayesian algorithm updates the prior probability dynamically according to the incoming instances. The incremental SVM is based on the classification equivalence of the SV set and the whole training set, so only the support vectors need to be added to the incoming training set in order to train a new model incrementally. Of these three algorithms, the incremental decision tree and the incremental Bayesian algorithm are based on empirical risk minimization, while the incremental SVM is based on structural risk minimization. The incremental decision tree and the incremental Bayesian algorithm are faster, and the incremental SVM has better generalization ability.
All of the algorithms above update a classifier dynamically using newly arriving data. On the one hand, we do not need to load all the data into memory at once. On the other hand, we can modify the classification model in real time according to the new training instances. Moreover, the classifier can adapt to concept drift by updating itself in real time on new data. However, incremental learning algorithms still have shortcomings and limitations. For example, they can only keep absorbing new data; they cannot remove old instances from the classification model. Because of these shortcomings, incremental algorithms are of little help under rigorous concept drift.
Ensemble learning
The fundamental principle of dynamic ensemble learning is to divide a large data stream into small data chunks and then to train classifiers on each data chunk independently. Finally, heuristic rules are developed to organize these classifiers into one super classifier.
This structure has many advantages. First, each data chunk is relatively small, so the cost of training a classifier on it is not high. Second, we save a well-trained classifier instead of all the instances in the data chunk, which costs much less memory. Third, the ensemble can adapt to various kinds of concept drift via different weighting policies. Dynamic ensemble learning models can therefore cope with both unboundedly growing amounts of data and concept drift in data-stream mining.
There are many heuristic algorithms for ensemble learning. According to the way the base classifiers are formed, they can be roughly divided into two classes: the horizontal ensemble framework and the vertical ensemble framework.
Horizontal ensemble framework
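The horizontal ensemble framework builds its base classifiers on different data chunks and combines them as $f(x)={\Sigma}_{i=1}^{N}{\alpha}_{i}{f}_{i}(x)$, where α_{i} is the weight assigned to the i-th data chunk, f_{i}(x) is the classifier trained on the i-th data chunk, and 1 to N index the selected data chunks.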
The weighting policy is the most important means of guaranteeing accuracy in ensemble learning. Street [19] proposed the SEA algorithm, which combines all the decision tree models using majority voting. In this algorithm ${\alpha}_{i}=\frac{1}{N}\left(i=1,2,\dots ,N\right)$. Kolter [20] proposed the Dynamic Weighted Majority (DWM) algorithm. Yeon [21] proved that majority voting is the optimal solution when there is no concept drift. In order to track concept drift, Wang [22] proposed an accuracy-weighted ensemble algorithm, in which each classifier is assigned a weight that decreases as its error on the most recent data chunk increases. In this algorithm ${\alpha}_{i}=-\left({\text{MSE}}_{i}-{\text{MSE}}_{r}\right)$, where ${\text{MSE}}_{i}=\frac{1}{\left|{S}_{n}\right|}{\Sigma}_{(x,c)\in {S}_{n}}{\left(1-{f}_{c}^{i}\left(x\right)\right)}^{2}$ is the mean square error of f_{i}(x), ${f}_{c}^{i}(x)$ is the probability that f_{i} assigns to class c for instance x, S_{n} is the training set, ${\text{MSE}}_{r}={\Sigma}_{c}p(c){\left(1-p(c)\right)}^{2}$ is the mean square error of a random classifier, and c ranges over the class labels of the instances. Tsymbal [5] proposed a dynamic integration of classifiers in which each base classifier is given a weight proportional to its local accuracy. Zhang [23] developed a kernel mean matching (KMM) method that minimizes the discrepancy between data chunks in the kernel space for smooth concept drift, and an optimal weighting of the classifiers trained on the most recent data chunk for abrupt concept drift. Yeon [21] proposed an ensemble model in the form of a weighted average with a ridge regression combiner.
In this algorithm, the angle between the estimated weights and the optimal weights is used to estimate concept drift. When concept drift is smooth, ${\alpha}_{i}=\frac{1}{N}\left(i=1,2,\dots ,N\right)$; otherwise $\alpha ={\text{arg\,min}}_{\alpha}{\Sigma}_{i=1}^{n}{\left({y}_{i}-{\Sigma}_{j=1}^{m}{\alpha}_{j}{f}_{j}\left({x}_{i}\right)\right)}^{2}+\lambda {\Sigma}_{j=1}^{m}{\alpha}_{j}^{2}$ subject to ${\Sigma}_{j=1}^{m}{\alpha}_{j}=1,{\alpha}_{j}>0$, where y_{i} is the label of the i-th instance, m is the number of classifiers and n is the number of instances. The penalty coefficient λ is used to track different levels of concept drift.
Instance selection, instance weighting and data-discarding policies have also been discussed. Fan [24] proposed a benefit-based greedy approach that can safely remove more than 90% of the base models while maintaining acceptable accuracy. Fan [25] proposed a simple, efficient and accurate cross-validation decision tree ensemble method that discards old data and combines the rest with new data to construct the optimal model for an evolving concept. Zhao [26] proposed a pruning method (PMEP) to obtain ensembles of a proper size. Lu [27] proposed a heuristic metric that considers the trade-off between accuracy and diversity to select the top p percent of ensemble members, depending on resource availability and tolerable waiting time. Kuncheva [28] proposed the concept of “forgetting” by ageing at a variable rate.
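The accuracy-based weighting α_i = MSE_r − MSE_i can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the reference implementation of [22]: each base classifier is modeled as a function returning a dictionary of class probabilities, classifiers whose weight is not positive are skipped in the vote, and all function names are hypothetical.

```python
from collections import Counter


def mse_random(labels):
    """MSE_r = sum over classes c of p(c) * (1 - p(c))^2, with p(c) estimated from the chunk."""
    n = len(labels)
    return sum((k / n) * (1 - k / n) ** 2 for k in Counter(labels).values())


def mse_classifier(predict_proba, chunk):
    """MSE_i = mean over (x, c) in the chunk of (1 - f_c(x))^2."""
    return sum((1 - predict_proba(x).get(c, 0.0)) ** 2 for x, c in chunk) / len(chunk)


def awe_weights(classifiers, chunk):
    """Weight each classifier by MSE_r - MSE_i on the most recent labeled chunk."""
    mse_r = mse_random([c for _, c in chunk])
    return [mse_r - mse_classifier(f, chunk) for f in classifiers]


def weighted_vote(classifiers, weights, x):
    """Combine class probabilities, ignoring classifiers no better than random (assumption)."""
    scores = Counter()
    for f, w in zip(classifiers, weights):
        if w <= 0:
            continue
        for label, p in f(x).items():
            scores[label] += w * p
    return scores.most_common(1)[0][0] if scores else None


# Hypothetical base classifiers: one matches the current concept, one guesses uniformly.
def good(x):
    return {"a": 1.0} if x[0] == 0 else {"b": 1.0}


def uninformed(x):
    return {"a": 0.5, "b": 0.5}


chunk = [((0,), "a"), ((1,), "b"), ((0,), "a"), ((1,), "b")]
weights = awe_weights([good, uninformed], chunk)
prediction = weighted_vote([good, uninformed], weights, (1,))
```

On this balanced two-class chunk MSE_r is 0.25, so the classifier that matches the current concept receives weight 0.25 and the uninformed one receives weight 0.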
Vertical ensemble framework
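The vertical ensemble framework trains different base classifiers on the same data chunk and combines them as $f(x)={\Sigma}_{i}{\beta}_{i}{f}_{in}(x)$, where β_{i} is the weight assigned to the i-th classifier and f_{in}(x) is the i-th classifier trained on the n-th data chunk.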
In the vertical ensemble framework, classifier diversity is a primary factor in guaranteeing accuracy. Zhang [29] proposed a semi-supervised ensemble method, UDEED, which works by maximizing the accuracies of the base learners on labeled data while maximizing the diversity among them on unlabeled data. Zhang [2] proposed an optimal weighting of classifiers for abrupt concept drift; in this algorithm the classifiers use different learning algorithms, e.g., decision tree, SVM and LR, and the prediction models are built on, and only on, the most recent data chunk. Minku [30] showed that low-diversity ensembles obtain low error under smooth concept drift, while high-diversity ensembles are better when abrupt concept drift happens.
The weighting policies of the horizontal framework are commonly used in the vertical framework as well, including majority voting, accuracy-based weighting and weighting through a regression algorithm.
In a word, the core idea of ensemble learning is to organize different weak classifiers into one strong classifier. The main method used in ensemble learning is divide-and-conquer: a large data stream is divided into small data chunks, and classifiers are trained on each chunk independently. The difficult problems most discussed in ensemble learning are as follows. First, which base classifier should be chosen? Second, how should the size of a data chunk be set? Third, how should weights be assigned to the different classifiers? Finally, how should previous data be discarded? Regarding the size of a data chunk, large chunks are more robust, while small chunks adapt better to concept drift. The weighting policy has a direct influence on accuracy.
Experiment results
The aim of the experiments is to compare incremental learning with ensemble learning algorithms. Among the incremental learning algorithms, incremental decision trees (including VFDT and CVFDT), the incremental Bayesian algorithm and the incremental SVM were evaluated experimentally. Among the ensemble learning algorithms, the horizontal and vertical ensemble frameworks were implemented. AWE was chosen to represent the horizontal ensemble framework. We compare the basic characteristics of all these algorithms on popular synthetic and real-life data sets.
All of the tested algorithms were implemented in Java as part of the MOA and Weka frameworks. We implemented the AWE algorithm ourselves and implemented the incremental SVM on top of LIBSVM, while all the other algorithms were already part of MOA or Weka. The experiments were run on a machine equipped with an AMD Athlon(tm) II X3 435 processor @ 2.89 GHz and 3.25 GB of RAM. To make the experiments more reliable, we ran every algorithm on each data stream 10 times (from different starting points) and calculated the mean and variance of the results. A t-test was used for significance testing. Classification accuracy was calculated using the data block evaluation method, which works similarly to the test-then-train paradigm. This method reads incoming examples without processing them until they form a data block of size d. Each new data block is first used to test the existing classifier and is then used to update the classifier.
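The data block evaluation procedure can be sketched in a few lines of Python. This is an illustration only, not the MOA evaluator: the model interface (`predict` and `update`), the toy majority-class learner and the decision to ignore a trailing incomplete block are assumptions made for the example.

```python
from collections import Counter


def block_evaluate(stream, model, block_size):
    """Test the current model on each full block, then train on that block."""
    accuracies, block = [], []
    for example in stream:
        block.append(example)  # buffer examples without processing them
        if len(block) == block_size:
            correct = sum(model.predict(x) == y for x, y in block)
            accuracies.append(correct / block_size)  # test first
            for x, y in block:
                model.update(x, y)                   # then update the classifier
            block = []
    return accuracies  # a trailing incomplete block is ignored (assumption)


class MajorityClass:
    """Toy incremental learner that always predicts the most frequent label seen so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def update(self, x, y):
        self.counts[y] += 1


# Hypothetical stream whose label switches from "a" to "b" halfway through.
stream = [(i, "a") for i in range(4)] + [(i, "b") for i in range(4)]
per_block_accuracy = block_evaluate(stream, MajorityClass(), block_size=2)
```

The per-block accuracies make the effect of a concept change visible: the toy learner scores 0 on the first block (nothing learned yet), 1 on the second, and 0 again after the label switches.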
Synthetic and real data streams in experiment
This part lists the five data streams used in the experiments: four synthetic data streams (Hyperplane 1, Hyperplane 2, Hyperplane 3 and KDDcup99) and one real data stream (the sensor data stream).
Hyperplane 1, Hyperplane 2 and Hyperplane 3 were generated by the Hyperplane generator in MOA. Each has 9 attributes and one label with 2 classes, and each contains 800,000 instances. The three synthetic data streams differ in their level of concept drift: Hyperplane 1 has no concept drift, Hyperplane 2 has a moderate level of concept drift and Hyperplane 3 has abrupt concept drift. The KDDcup99 stream was collected from the KDD CUP challenge in 1999, where the task is to build predictive models capable of distinguishing between intrusions and normal connections. Clearly, the instances in this stream do not flow in the same way as in a genuine data stream. Each instance has 41 attributes and one label with 23 classes. The sensor stream contains information (temperature, humidity, light and sensor voltage) collected from 54 sensors deployed in the Intel Berkeley Research Lab. The whole stream contains consecutive readings recorded over a 2-month period (1 reading every 1-3 minutes). The sensor ID is used as the class label, so the learning task is to correctly identify the sensor ID (1 out of 54 sensors) purely from the sensor data and the corresponding recording time. As the data stream flows over time, so do the concepts underlying the stream. For example, the lighting during working hours is generally stronger than at night, and the temperature of specific sensors (e.g., in the conference room) may regularly rise during meetings. This data stream therefore has 5 attributes and a label with 54 classes.
Comparative study
Incremental learning and ensemble learning are two major solutions to the problems of large-scale data and concept drift in big stream data mining. Incremental learning is a style of learning in which the learner updates its model of the environment whenever a significant new experience becomes available. Ensemble learning adopts a divide-and-conquer approach and organizes different base classifiers into one super classifier. Both can handle unboundedly growing amounts of data and time series, and both meet real-time demands. Besides these shared advantages, each approach has its own relative merits, which are discussed in detail below.
Comparative study on accuracy and efficiency
Comparative study on various concept drift
Incremental algorithms cannot adapt well to sudden concept drift. This is because almost all incremental algorithms update their model according to the incoming data stream but never discard historical knowledge. For example, in incremental Bayesian algorithms the prior probability is updated smoothly according to the incoming instances. In incremental SVM algorithms, the support vectors (SVs) are directly tied to the decision plane and the kernel function, so they are very sensitive to concept drift. Only CVFDT, an incremental decision tree algorithm, can handle a time-changing concept, by growing alternate sub-trees. However, it needs additional space to store the alternate paths, which decreases its efficiency dramatically.
Compared with incremental algorithms, ensemble learning algorithms are more flexible with respect to concept drift. First, the size of the data chunk can be set to fit different levels of concept drift: small data chunks for sudden concept drift and large data chunks for smooth concept drift. Second, different weights can be assigned to different base classifiers to accommodate various kinds of concept drift. Third, different policies for selecting and discarding base classifiers also help.
As a result, ensemble learning algorithms adapt much better to concept drift than incremental learning algorithms.
Generally speaking, incremental algorithms are faster and more robust to noise than ensemble algorithms, while ensemble algorithms are more flexible and adapt better to concept drift. Moreover, incremental algorithms are subject to more restrictions than ensemble algorithms: not every classification algorithm can be used for incremental learning, but almost every classification algorithm can be used in an ensemble.
Therefore, when there is no concept drift or the concept drift is smooth, an incremental algorithm is recommended. When large or abrupt concept drift exists, ensemble algorithms are recommended in order to guarantee accuracy. In addition, for a relatively simple data stream, or when a high level of real-time processing is demanded, incremental learning is the better choice, whereas for a data stream with a complicated or unknown distribution, ensemble learning is the better choice.
Experiments on incremental algorithms
In this part we will experiment on different incremental algorithms.
The accuracy of four kinds of incremental learning algorithms
Algorithm | Hyperplane 1 | Hyperplane 2 | Hyperplane 3 | Sensor | KDDcup99 |
---|---|---|---|---|---|
VFDT | 90.42_{±0.13} | 78.93_{±1.7} | 82.73_{±3.08} | 92.21_{±2.19} | 99.69_{±0.01} |
CVFDT | 90.44_{±0.14} | 80.22_{±1.55} | 84.51_{±2.68} | 92.22_{±2.12} | |
Incremental Bayesian | 93.8_{±0.001} | 73.54_{±2.89} | 81.65_{±0.023} | 93.29_{±0.22} | 98.50_{±0.002} |
Incremental SVM | 90.8_{±0.14} | 70.5_{±3.79} | 80.12_{±2.96} | 91.89_{±1.21} | 97.96_{±0.97} |
Experiments on ensemble algorithms
In this section, the horizontal ensemble framework is discussed first, followed by the vertical ensemble framework. Finally, we compare the two ensemble frameworks.
The mean of the accuracy on data stream Hyperplane1
Classifier number | Algorithm | Data-chunk size 500 | Data-chunk size 1000 | Data-chunk size 2000 |
---|---|---|---|---|
10 | SEA | 83.66_{±2.49} | 84.23_{±2.01} | 86.39_{±0.35} |
10 | AWE | 92.88_{±0.11} | 93.34_{±0.05} | 93.61_{±0.10} |
20 | SEA | 85.82_{±3.14} | 87.06_{±0.53} | 87.49_{±0.79} |
20 | AWE | 93.33_{±0.12} | 93.63_{±0.06} | 93.79_{±0.09} |
30 | SEA | 84.23_{±2.01} | 86.94_{±1.62} | 88.14_{±0.27} |
30 | AWE | 93.49_{±0.10} | 93.70_{±0.06} | 93.85_{±0.08} |
The mean of the accuracy on data stream Hyperplane2
Classifier number | Algorithm | Data-chunk size 500 | Data-chunk size 1000 | Data-chunk size 2000 |
---|---|---|---|---|
10 | SEA | 77.06_{±0.97} | 87.91_{±1.97} | 87.36_{±0.78} |
10 | AWE | 84.94_{±3.87} | 85.72_{±0.53} | 89.09_{±0.012} |
20 | SEA | 86.32_{±1.21} | 87.17_{±0.99} | 91.15_{±0.29} |
20 | AWE | 87.96_{±0.52} | 89.27_{±0.96} | 91.42_{±0.26} |
30 | SEA | 89.22_{±1.56} | 88.49_{±0.5} | 90.44_{±0.08} |
30 | AWE | 90.36_{±0.43} | 89.44_{±0.46} | 90.39_{±0.04} |
The mean of the accuracy on data stream Hyperplane3
Classifier number | Algorithm | Data-chunk size 500 | Data-chunk size 1000 | Data-chunk size 2000 |
---|---|---|---|---|
10 | SEA | 77.08_{±9.59} | 78.44_{±11.07} | 77.42_{±4.03} |
10 | AWE | 88.04_{±2.52} | 86.99_{±2.97} | 85.13_{±2.20} |
20 | SEA | 79.4_{±18.41} | 78.58_{±5.78} | 75.8_{±4.98} |
20 | AWE | 88.26_{±2.61} | 87.17_{±3.12} | 85.53_{±2.04} |
30 | SEA | 79.26_{±6.22} | 78.82_{±8.45} | 75.44_{±1.84} |
30 | AWE | 88.75_{±2.36} | 87.86_{±2.70} | 86.44_{±1.84} |
Comparative study on accuracy
Ensemble | Hyperplane 1 | Hyperplane 2 | Hyperplane 3 | Sensor | KDDcup99 |
---|---|---|---|---|---|
Average vote horizontal ensemble | 88.14_{±0.27} | 91.15_{±0.29} | 78.82_{±8.45} | 89.05_{±3.85} | 99.44_{±0.01} |
Accuracy based horizontal ensemble | 93.85_{±0.08} | 91.42_{±0.26} | 88.75_{±2.36} | 87.34_{±2.17} | 99.31_{±0.04} |
Average vote vertical ensemble | 92.4_{±0.01} | 82.5_{±0.07} | 83.9_{±0.04} | 95_{±0.02} | 98.6_{±0.01} |
Accuracy based vertical ensemble | 93.6_{±0.01} | 84.3_{±0.06} | 86.6_{±0.04} | 94_{±0.05} | 98.4_{±0.01} |
Experiments comparing incremental and ensemble learning
In this section, we discuss the relative advantages and disadvantages of incremental and ensemble algorithms. For the sake of comparability, we selected a decision tree as the base classifier in each algorithm. We therefore chose VFDT as the representative incremental algorithm and used an accuracy-based weighting algorithm in both the horizontal and the vertical ensemble algorithms.
The accuracy of different algorithms
Algorithm | Hyperplane 1 | Hyperplane 2 | Hyperplane 3 | Sensor | KDDcup99 |
---|---|---|---|---|---|
VFDT | 90.42_{±0.13} | 78.93_{±0.17} | 82.73_{±3.08} | 92.21_{±2.19} | 99.69_{±0.01} |
Horizontal ensemble | 93.85_{±0.08} | 86.28_{±0.74} | 88.75_{±2.63} | 87.34_{±2.17} | 99.31_{±0.043} |
Vertical ensemble | 93.9_{±0.013} | 84.3_{±0.67} | 86.6_{±0.04} | 95.1_{±0.35} | 98.4_{±0.005} |
Time cost of different algorithms
Algorithm | Hyperplane 1 | Hyperplane 2 | Hyperplane 3 | Sensor | KDDcup99 |
---|---|---|---|---|---|
VFDT | 3254.9 | 3310.6 | 3246.9 | 203.2 | 3479 |
Horizontal ensemble | 67760 | 67913 | 64237 | 5426 | 348114 |
Vertical ensemble | 63145 | 59876 | 62110 | 5897 | 30146 |
In a word, ensemble learning algorithms are more accurate than incremental learning algorithms, and incremental algorithms are more efficient than ensemble algorithms.
Conclusion
The unlimited growth of big stream data and concept drift have been the two most difficult problems in data-stream mining. There are two mainstream solutions to these problems: incremental learning and ensemble learning algorithms. Incremental learning algorithms update a single model by incorporating newly arrived data. Ensemble learning algorithms use a divide-and-conquer approach: they cut a large data stream into small data chunks, train classifiers on each data chunk independently, and then use a heuristic algorithm to combine these classifiers. For incremental algorithms, we mostly discussed how to record previous knowledge and adapt to new knowledge. For ensemble learning algorithms, we mostly discussed how to design a weighting policy for the base classifiers.
Both kinds of algorithms can handle big stream data and concept drift, and each has its own properties. Incremental learning algorithms are more efficient, and ensemble learning adapts better to concept drift. Moreover, ensemble learning algorithms are more stable than incremental algorithms. The size of the data chunk is another important factor in ensemble algorithms and influences their performance. Generally, to achieve high accuracy, the stronger the concept drift, the smaller the data chunk should be.
Therefore, in the case of loose concept drift or no concept drift, an incremental algorithm is recommended, and in the case of rigorous concept drift, an ensemble algorithm is the better choice. In addition, when efficiency is the first consideration we tend to select an incremental algorithm, and when accuracy is the most important factor we choose an ensemble algorithm. Different algorithms can be employed according to the distribution of the actual data stream.
Weighting policy, instance selection, classifier diversity and so on are the main issues discussed in previous research. With the very fast development of the information industry, we have to face the reality of the information explosion. In that situation, more and more classifiers will be trained and real-time processing will become a challenge. The next step is therefore to find out how to manage a large number of classifiers effectively. We can consider pruning methods or indexing techniques for the classifiers, as well as parallel algorithms for organizing them.
Declarations
Acknowledgments
This work was supported by the NSFC (No. 61370025), the 863 projects (No. 2011AA01A103 and 2012AA012502), the 973 projects (No. 2013CB329605 and 2013CB329606), and the Strategic Leading Science and Technology Projects of the Chinese Academy of Sciences (No. XDA06030200).
References
- Widmer G, Kubat M: Learning in the presence of concept drift and hidden contexts. Mach Learn 1996, 23(1): 69–101.
- Zhang P, Zhu X, Shi Y: Categorizing and mining concept drifting data streams. Proceedings of the 14th ACM SIGKDD international conference on knowledge discovery and data mining 2008, 812–820.
- Giraud-Carrier C: A note on the utility of incremental learning. AI Commun 2000, 13(4): 215–223.
- Polikar R: Ensemble based systems in decision making. Circ Syst Mag 2006, 6(3): 21–45.
- Tsymbal A: The problem of concept drift: definitions and related work. Computer Science Department, Trinity College Dublin; 2004.
- Domingos P, Hulten G: Mining high-speed data streams. 2000, 71–80.
- Hulten G, Spencer L, Domingos P: Mining time-changing data streams. 2001, 97–106.
- Hua C, Xiao-gang Z, Jing Z, Li-hua D: A simplified learning algorithm of incremental Bayesian. 2009.
- Deng L, Droppo J, Acero A: Incremental Bayes learning with prior evolution for tracking nonstationary noise statistics from noisy speech data. 2003.
- Alcobé JR: Incremental augmented Naive Bayes classifiers. 2004, 16: 539.
- Laskov P, Gehl C, Krüger S, Müller K-R: Incremental support vector learning: analysis, implementation and applications. J Mach Learn Res 2006, 7: 1909–1936.
- Syed NA, Liu H, Huan S, Kah L, Sung K: Handling concept drifts in incremental learning with support vector machines. 1999, 317–321.
- Fung G, Mangasarian OL: Incremental support vector machine classification. 2002, 247–260.
- Zheng J, Yu H, Shen F, Zhao J: An online incremental learning support vector machine for large-scale data. Artificial neural networks–ICANN 2010, 76–81.
- Ruping S: Incremental learning with support vector machines. 2001, 641–642.
- Xiao R, Wang J, Zhang F: An approach to incremental SVM learning algorithm. 2000, 268–273.
- Tseng C-Y, Chen M-S: Incremental SVM model for spam detection on dynamic email social networks. 2009, 4: 128–135.
- Cauwenberghs G, Poggio T: Incremental and decremental support vector machine learning. Advances in neural information processing systems 2001, 409–415.
- Street WN, Kim Y: A Streaming Ensemble Algorithm (SEA) for large-scale classification.
- Kolter JZ, Maloof MA: Dynamic weighted majority: a new ensemble method for tracking concept drift. 2003, 123–130.
- Yeon K, Song MS, Kim Y, Choi H, Park C: Model averaging via penalized regression for tracking concept drift. J Comput Graph Stat 2010, 19(2).
- Wang H, Fan W, Yu PS, Han J: Mining concept-drifting data streams using ensemble classifiers. 2003, 226–235.
- Zhang P, Zhu X, Shi Y: Categorizing and mining concept drifting data streams. 2008, 812–820.
- Fan W, Chu F, Wang H, Yu PS: Pruning and dynamic scheduling of cost-sensitive ensembles. 2002, 146–151. Menlo Park, CA; Cambridge, MA; London; AAAI Press; MIT Press, 1999.
- Fan W: Systematic data selection to mine concept-drifting data streams.
- Zhao Q-L, Jiang Y-H, Xu M: A fast ensemble pruning algorithm based on pattern mining process. Data Min Knowl Discov 2009, 19(2): 277–292. 10.1007/s10618-009-0138-1
- Lu Z, Wu X, Zhu X, Bongard J: Ensemble pruning via individual contribution ordering.
- Kuncheva LI: Classifier ensembles for changing environments. Multiple classifier systems 2004, 1–15.
- Zhang P, Zhu X, Tan J, Guo L: Classifier and cluster ensembles for mining concept drifting data streams. 2010, 1175–1180.
- Minku LL, White AP, Yao X: The impact of diversity on online ensemble learning in the presence of concept drift. IEEE Trans Knowl Data Eng 2010, 5: 730–742.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.