 Survey Paper
 Open Access
Boosting methods for multiclass imbalanced data classification: an experimental review
Journal of Big Data volume 7, Article number: 70 (2020)
Abstract
Since canonical machine learning algorithms assume that the dataset has an equal number of samples in each class, binary classification on imbalanced datasets became a very challenging task, as discriminating the minority-class samples efficiently is difficult. For this reason, researchers have paid attention to this problem and proposed many methods to deal with it, which can be broadly categorized into the data level and the algorithm level. Moreover, multiclass imbalanced learning is much harder than the binary case and is still an open problem. Boosting algorithms are a class of ensemble learning methods in machine learning that improve the performance of separate base learners by combining them into a composite whole. This paper aims to review the most significant published boosting techniques for multiclass imbalanced datasets. A thorough empirical comparison is conducted to analyze the performance of binary and multiclass boosting algorithms on various multiclass imbalanced datasets. In addition, based on the obtained results for the performance evaluation metrics and recently proposed criteria for comparing metrics, the selected metrics are compared to determine a suitable performance metric for multiclass imbalanced datasets. The experimental studies show that the CatBoost and LogitBoost algorithms are superior to the other boosting algorithms on multiclass imbalanced conventional and big datasets, respectively. Furthermore, the MMCC is a better evaluation metric than the MAUC and G-mean in multiclass imbalanced data domains.
Introduction
Imbalanced dataset classification is a relatively new research line within the broader context of machine learning, which tries to learn from skewed data distributions. A dataset is imbalanced when one class contains many more instances than the rest of the classes, in both two-class and multiclass datasets [1]. Most standard machine learning algorithms show poor performance on this kind of dataset because they tend to favor the majority-class samples, resulting in poor predictive accuracy over the minority class [2]. Therefore, it becomes tough to learn the rare but important instances. In fact, these algorithms assume an equal misclassification cost for all samples while minimizing the overall error rate.
Learning from skewed datasets becomes very important because many real-world classification problems are imbalanced, e.g., fault prediction [3], fraud detection [4], medical diagnosis [5], text classification [6], oil-spill detection in satellite images [7] and cultural modeling [8]. Assuming an equal misclassification cost for each of the classes in these datasets is not realistic. For example, in software fault prediction, if the defective module is regarded as the positive class and the non-defective module as negative, then missing a defect (a false negative) is much more expensive than a false-positive error in the testing phase of the software development process [9].
In recent years, several proposals have been made to deal with the class imbalance problem; they can be broadly categorized into two groups: data level and algorithm level. Data-level approaches rebalance the data distribution with resampling methods. These methods solve the problem by either increasing the minority-class or decreasing the majority-class observations. Many different sampling techniques, such as random oversampling, random undersampling, the synthetic minority oversampling technique (SMOTE), directed oversampling, and other related methods have been proposed, each with their pros and cons [10]. Algorithm-level approaches modify the machine learning algorithm in such a way that it accommodates imbalanced data. One modification is to consider different misclassification costs for each class, also known as cost-sensitive methods, minimizing the cost error instead of maximizing the accuracy rate [11]. Another modification is choosing a suitable inductive bias, for example, adjusting the probability estimate at the leaf when using a decision tree as a learner, or considering multiple minimum supports for different classes in association rules. Many other approaches have been proposed within the learning algorithms aiming to reinforce them towards the minority-class samples, such as ensemble approaches and deep-learning-based algorithms. The ensemble method is one of the most well-known approaches in this group and is our main focus in this study.
Ensemble classifiers, also known as multiple classifier systems, improve the performance of a learning method by combining a collection of base classifiers into one classification system. The output of each base classifier is collected and used to prepare the classification decision for new samples. Specifically, in boosting, a sequential aggregate of base classifiers is constructed on weighted versions of the training data, focusing on the misclassified samples at each stage of generating classifiers; the sample weights are changed according to the performance of the classifier [12].
The imbalanced dataset problem becomes more complicated in multiclass imbalanced classification tasks, in which there may be multiple minority and majority classes that cause a skewed data distribution. In this case, for example, a class may be a minority one when compared to some other classes, but a majority one compared to the rest of them [13]. Therefore, many new challenges are imposed which do not exist in two-class cases. Fortunately, some strategies, such as decomposition methods and introducing new loss functions, have been proposed to deal with the multiclass classification problem [14, 15].
The recent emergence of new technologies, such as the Internet of Things (IoT) and online social networks (OSN), has eased the way to originate huge amounts of data, called big data, which bring difficulties for knowledge extraction by learning algorithms due to their specific features, i.e., volume, velocity, variety, and veracity [16, 17]. Moreover, the imbalanced data problem can often be found in this kind of dataset, and when the multiclass property is also considered, learning from multiclass imbalanced big datasets becomes a very challenging task that has not been explored well by researchers, although this kind of data frequently appears in real-world classification problems [18].
Performance evaluation is another issue that should be addressed in this domain. It is obvious that for evaluating a classifier on imbalanced data, we require metrics that reflect its ability to predict the minority class(es) appropriately. Researchers have considered this issue by proposing metrics for the imbalanced data domain, such as the area under the curve (AUC), the Matthews correlation coefficient (MCC), G-mean, Kappa, and others, some of which have been successfully extended to multiclass problems [19, 20]. However, determining which one is more suitable for describing the performance of a classifier needs to be examined. In this regard, some researchers have proposed frameworks for comparing evaluation metrics based on empirical results [21, 22].
To the best of our knowledge, this is the first work to compare boosting algorithms on various multiclass imbalanced datasets. Therefore, in this study, we examine the performance of the 14 most significant boosting algorithms on 19 multiclass imbalanced conventional and big datasets and compare them with each other. A comprehensive statistical test suite is used to investigate which boosting algorithm performs better in terms of the evaluation metrics. Subsequently, the three most prevalent evaluation metrics used in the imbalanced data literature are compared to determine which metric is more suitable for the multiclass imbalanced data domain.
Based on the aforementioned descriptions, the contributions and objectives of this review study are:

1.
To evaluate the performance of the most significant boosting classifiers on multiclass imbalanced datasets with different imbalance ratios.

2.
To determine which evaluation metric is more informative for showing the performance of a learning algorithm in the context of multiclass imbalanced datasets.

3.
To evaluate the performance of the most significant boosting algorithms on multiclass imbalanced big datasets.

4.
To analyze the computational time of the boosting algorithms.
The rest of this paper is structured as follows: "Handling imbalanced data problem" section reviews imbalanced data handling methods, followed by a summary of the state-of-the-art boosting algorithms in "Boosting methods" section. The experimental results and discussions are presented in "Experimental study" section. Finally, the "Conclusion" section summarizes the findings of this study.
Handling imbalanced data problem
In many real-world situations, the data distribution among the classes of a dataset is not uniform, such that at least one class has fewer samples (the minority class) than the other classes (the majority classes). This poses a difficulty for standard machine learning algorithms, as they will be biased towards the majority classes. In this situation, they will present high accuracy on the majority classes and poor performance on the minority class [23], while the minority class may possess the more valuable knowledge. A multiclass problem can have different styles, including many minority and majority classes, one minority class and many majority classes, or one majority class and many minority classes [24]. In this condition, a class can be considered a majority one compared to some of the other classes, but a minority or well-balanced one compared to the rest [25]. Furthermore, the difficulty increases in the case of multiclass imbalanced datasets. For handling this unwanted circumstance, researchers have proposed methods that can be classified into two major groups: data-level methods and algorithm-level methods. The details of these methods are explained in the following subsections.
Data-level methods
Data-level methods balance the data among classes with resampling approaches. These approaches are categorized into three major techniques: undersampling, oversampling and hybrid-sampling. They include many different forms of resampling, such as random undersampling (samples of the majority classes are randomly removed from the dataset), directed undersampling (the choice of samples to eliminate is informed), random oversampling (new samples for the minority classes are generated randomly), directed oversampling (no new samples are created, but the choice of samples to replicate is informed), and combinations of the above techniques [10]. However, the main drawbacks of these techniques are removing useful samples or introducing meaningless new samples into the datasets.
Undersampling methods remove samples from the majority classes until the minority and majority classes become balanced. Therefore, the training process can be easily implemented, and the run-time and storage requirements are reduced [26]. The most common undersampling method is random undersampling (RUS), which removes samples from the majority classes randomly.
Oversampling approaches create copies of existing samples or add more samples to the minority class [27]. They increase the training set size and thus consume extra time. The simplest method to increase the size of the minority class is random oversampling. This method increases the number of minority-class samples by randomly replicating them, but it is prone to overfitting [28]. To solve these problems, Chawla proposed the Synthetic Minority Oversampling Technique (SMOTE), which creates synthetic samples from the minority class. The SMOTE samples are linear combinations of two similar samples from the minority class [29].
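The interpolation step of SMOTE can be sketched in a few lines of NumPy. This is a simplified illustration of the idea, not the reference implementation; the function name and the single-neighbor choice are ours:

```python
import numpy as np

def smote_sample(minority, n_new, k=5, rng=None):
    """Create n_new synthetic samples by interpolating between a random
    minority point and one of its k nearest minority-class neighbors."""
    rng = np.random.default_rng(rng)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))
        x = minority[i]
        d = np.linalg.norm(minority - x, axis=1)   # distances to all minority points
        neighbors = np.argsort(d)[1:k + 1]         # skip the point itself
        x_nn = minority[rng.choice(neighbors)]
        gap = rng.random()                         # random position on the segment
        synthetic.append(x + gap * (x_nn - x))     # linear combination of the pair
    return np.array(synthetic)

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_sample(minority, n_new=3, k=2, rng=0)
print(new_points.shape)  # (3, 2)
```

Because each synthetic point lies on a segment between two real minority samples, it always falls inside the convex hull of the minority class.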
In hybrid-sampling methods, more than one sampling technique is used together to remove the drawbacks of each individual sampling method [30].
For multiclass datasets, the sampling strategies should adjust their procedures to consider the individual properties of the classes and their mutual relations. Class decomposition schemes are another common solution for handling multiclass data and alleviating the difficulties of this type of dataset [31]. The advantage of such approaches is simplifying the problem into multiple two-class problems. But there are some disadvantages to these kinds of approaches, including losing balanced performance on all classes and discarding the global outlook on the multiclass problem. Despite the mentioned disadvantages, this direction can be considered a promising one. Two commonly used schemes for class decomposition are the one-versus-all (OVA) and one-versus-one (OVO) schemes. In the OVA strategy, the data points of one class are discriminated against the data points of all the other classes combined. In this case, for a K-class dataset, K classification problems are introduced. In the OVO strategy, a binary classification problem is created between every two classes; therefore, K(K−1)/2 classification problems are introduced.
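Both decomposition schemes are available as generic wrappers in scikit-learn; the sketch below shows the expected number of binary subproblems for a 3-class dataset (the base learner is an arbitrary choice for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

X, y = load_iris(return_X_y=True)            # K = 3 classes
base = LogisticRegression(max_iter=1000)

ova = OneVsRestClassifier(base).fit(X, y)    # OVA: K binary problems
ovo = OneVsOneClassifier(base).fit(X, y)     # OVO: K(K-1)/2 binary problems

print(len(ova.estimators_))  # 3
print(len(ovo.estimators_))  # 3  (3 * 2 / 2)
```

For larger K the OVO count grows quadratically, which is the usual trade-off cited against pairwise decomposition.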
Algorithm-level methods
In the algorithm-level category, the existing learner is modified to remove its bias towards the majority classes. The most common approach is cost-sensitive learning, in which the learner is forced to correctly classify the minority-class samples by setting a high cost on misclassified minority-class samples [32]. No penalty is assigned to correctly classified samples, but the misclassification cost for minority samples is higher than that for majority samples. The goal is to minimize the total cost over the training dataset [33]. However, determining the cost values is not an easy task, as they depend on multiple factors that have trade-off relationships.
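In scikit-learn, one simple way to realize such class-dependent misclassification costs is the `class_weight` argument of tree-based learners, which scales each class's contribution to the splitting criterion. The weights and dataset below are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# A 90/10 imbalanced binary problem
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

plain = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
# Errors on the minority class (label 1) cost 10x more during training
costly = DecisionTreeClassifier(max_depth=3, random_state=0,
                                class_weight={0: 1, 1: 10}).fit(X, y)

recall_plain = plain.predict(X[y == 1]).mean()   # minority recall, unweighted tree
recall_cost = costly.predict(X[y == 1]).mean()   # minority recall, cost-sensitive tree
print(recall_plain, recall_cost)
```

The higher weight on class 1 pushes the tree to trade majority-class accuracy for minority-class recall, which is exactly the behavior cost-sensitive methods aim for.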
Hybrid methods
Hybrid methods combine the previously mentioned approaches. Ensemble learning is one of the most frequently used classifiers that combine data-level and algorithm-level methods for handling the imbalanced data problem [34]. The main goal of an ensemble is to obtain better predictive performance than using a single classifier. However, its main drawback is generating more classifiers, which increases the computational complexity [35]. The best-known ensemble methods are bagging and boosting [36, 37]. In bagging, the original training set is divided into N subsets of the same size, and each subset is used to create a classifier; the whole classification model is built by aggregating the particular classifiers. On the other hand, boosting algorithms generate a new weak learner in each round, and after many rounds they combine these weak learners into a single prediction model that is much more accurate than any one of the weak learners.
Ensemble learning has been investigated for handling multiclass imbalanced problems as well. Specifically, hybridizing ensemble methods such as boosting, bagging and random subspace with the previously mentioned solutions (sampling or cost-sensitive learning) has proved to be effective for the imbalanced data problem [38]. For example, Wang and Yao compared the performance of AdaBoost.NC and AdaBoost in combination with random oversampling, with (or without) class decomposition, for multiclass imbalanced datasets [13]. By applying ensemble learning to each subproblem, Zhang et al. proposed a method to improve the performance of binary decomposition in multiclass imbalanced problems [39]. Krawczyk also developed a new approach to handle multiclass imbalanced datasets by combining pairwise OVO decomposition and ensemble learning. He divided the original set into several binary subproblems to simplify the classification task; then an ensemble classifier was designed for handling the simplified binary problems, and the outputs of the individual classifiers were combined with the pairwise coupling method [40]. Feng et al. proposed a new ensemble margin-based algorithm which emphasizes the use of a large number of informative low-margin samples, with the aim of handling the imbalanced data problem [41].
Boosting methods
Boosting, as the most popular category of ensemble learning, builds a strong classifier by combining many weak classifiers [12]. The superiority of boosting lies in its serial learning nature, which results in excellent approximation and generalization. In other words, the weak classifiers are learned sequentially, each aiming to reduce the errors of the previously modeled classifier. Many boosting approaches have been proposed so far; each improves the classification performance by varying some steps of the general boosting scheme. Among the various kinds of boosting approaches, we have included 14 well-known methods in our experimental comparisons: AdaBoost.MH, SAMME, LogitBoost, GradientBoost, XGBoost, LightGBM, CatBoost, SMOTEBoost, RUSBoost, MEBoost, AdaCost, AdaC1, AdaC2 and AdaC3. We have chosen these approaches due to their high citation counts and popularity, selecting representative approaches from all the categories of boosting. AdaBoost.MH, SAMME, LogitBoost, GradientBoost, XGBoost, LightGBM and CatBoost are general boosting algorithms applicable to all applications, whereas SMOTEBoost, RUSBoost, MEBoost, AdaCost, AdaC1, AdaC2 and AdaC3 are specifically developed for imbalanced data classification problems. Among the latter, SMOTEBoost and RUSBoost are data-level approaches and the rest are algorithm-level. The detailed descriptions of these approaches are provided in the "General Boosting approaches" section. Furthermore, a brief description of some other known boosting methods is provided in Table 1. The description of each boosting approach covers its exclusive features, distinguishing points and proposal objectives. The presented techniques are sorted by publication year and cover most of the methods from 1997 to 2019.
General Boosting approaches
AdaBoost.MH
AdaBoost.MH, a boosting approach proposed in 2000, is an extension of the AdaBoost algorithm. In order to deal with multiclass classification, AdaBoost.MH decomposes a multiclass problem into \(K(K-1)/2\) binary problems (\(K\) is the number of classes) and applies a binary AdaBoost procedure to each of the binary datasets [42]. To better understand AdaBoost.MH, the binary AdaBoost algorithm is explained in detail.
AdaBoost (Adaptive Boosting), one of the earliest and best-known boosting approaches, was proposed by Freund and Schapire in 1997. In its procedure, an ensemble of \(M\) classifiers is learned sequentially [43]. The main idea of AdaBoost is to emphasize the effect of misclassified samples by increasing their weights across the boosting iterations. Hence, in all iterations of the algorithm, the sample set is fixed and only the weights are changed. By doing so, new classifiers can learn from the errors of the current classifiers and improve the accuracy. AdaBoost uses the exponential loss function as Eq. (1), \(L\left(y,f\left(x\right)\right)=\mathrm{exp}\left(-yf\left(x\right)\right)\), with \(y\in \left\{-1,+1\right\}\).
The boosting procedure in AdaBoost is as follows. At first, all \(n\) samples get the same weight \({w}_{i}^{1}=1/n\). Then, in each iteration of the boosting, a classifier \({f}^{m}(x)\) is selected with respect to the loss function and fitted to the data points considering their current weights. Next, the error rate of the selected classifier is computed as Eq. (2), \({err}^{m}=\sum_{i=1}^{n}{w}_{i}^{m}{\mathbb{l}}\left({f}^{m}\left({x}_{i}\right)\ne {y}_{i}\right)/\sum_{i=1}^{n}{w}_{i}^{m}\).
where \({f}^{m}\left({x}_{i}\right)\) and \({y}_{i}\) denote the predicted value and the actual output of the sample \({x}_{i}\), respectively, and \({\mathbb{l}}\left({f}^{m}\left({x}_{i}\right)\ne {y}_{i}\right)\) is the indicator function, equal to 1 when the statement is true and zero otherwise.
Then the step size (the coefficient of \({f}^{m}\left({x}_{i}\right)\)) is calculated from the error-rate value as Eq. (3), \({\alpha }^{m}=\mathrm{log}\frac{1-{err}^{m}}{{err}^{m}}\).
According to this equation, the fitted classifier of each iteration gets a high impact in the ensemble if it has a low error rate on the training data. After calculating the error rate and step size of the classifier, the misclassified samples are reweighted according to Eq. (4), \({w}_{i}^{m+1}={w}_{i}^{m}\mathrm{exp}\left({\alpha }^{m}{\mathbb{l}}\left({f}^{m}\left({x}_{i}\right)\ne {y}_{i}\right)\right)\).
When M classifiers have been produced, the final ensemble model is constructed by Eq. (5), \(F\left(x\right)=\mathrm{sign}\left(\sum_{m=1}^{M}{\alpha }^{m}{f}^{m}\left(x\right)\right)\).
The pseudocode of the AdaBoost algorithm is provided in Algorithm 1.
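The procedure of Eqs. (1)–(5) can be sketched as a minimal binary AdaBoost with decision stumps. This is our illustrative re-implementation, not the pseudocode of Algorithm 1; the dataset and the number of rounds are arbitrary:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=20):
    """Binary AdaBoost; y must be in {-1, +1}. Returns (step size, stump) pairs."""
    n = len(X)
    w = np.full(n, 1.0 / n)                        # w_i^1 = 1/n
    ensemble = []
    for _ in range(M):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = w[pred != y].sum() / w.sum()         # weighted error rate, Eq. (2)
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = np.log((1 - err) / err)            # step size, Eq. (3)
        w *= np.exp(alpha * (pred != y))           # upweight mistakes, Eq. (4)
        w /= w.sum()
        ensemble.append((alpha, stump))
    return ensemble

def adaboost_predict(ensemble, X):
    score = sum(a * s.predict(X) for a, s in ensemble)
    return np.sign(score)                          # weighted vote, Eq. (5)

X, y = make_classification(n_samples=300, random_state=0)
y = 2 * y - 1                                      # map {0, 1} -> {-1, +1}
model = adaboost_fit(X, y)
acc = (adaboost_predict(model, X) == y).mean()
print(acc > 0.8)
```

Note how the sample set itself never changes; only the weight vector `w` is updated between rounds, exactly as the text describes.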
SAMME
SAMME [44] is a multiclass variant of AdaBoost, which uses a multiclass exponential loss function as follows.
In which the target variables are encoded using the codewords technique to the form \({y}_{i}={\left({y}_{1}, \ldots ,{y}_{K}\right)}^{T}\), \(i=1, 2,\ldots,N\), where
In this algorithm, the boosting procedure is exactly the same as in AdaBoost, with minor differences. The error-rate calculation for the weak learners and the reweighting phase for the misclassified samples are the same as in AdaBoost. The difference is the extra term \(\mathrm{log}\left(K-1\right)\) in the step-size calculation, as in the following equation: \({\alpha }^{m}=\mathrm{log}\frac{1-{err}^{m}}{{err}^{m}}+\mathrm{log}\left(K-1\right)\).
The final ensemble model has the following form: \(F\left(x\right)=\underset{k}{\mathrm{argmax}}\sum_{m=1}^{M}{\alpha }^{m}{\mathbb{l}}\left({f}^{m}\left(x\right)=k\right)\).
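The SAMME scheme is what scikit-learn's `AdaBoostClassifier` uses for multiclass problems in recent releases (its default weak learner is a depth-1 tree); a quick sketch on a 3-class dataset, with illustrative hyperparameters:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier

X, y = load_iris(return_X_y=True)   # K = 3 classes, so log(K-1) > 0

# 50 boosting rounds of depth-1 trees (the library default weak learner)
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
clf.fit(X, y)
print(round(clf.score(X, y), 2))
```

Because \(\mathrm{log}\left(K-1\right)>0\) for \(K>2\), SAMME only requires each weak learner to beat random guessing (accuracy above \(1/K\)) rather than 50%, which is what makes the multiclass extension work.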
LogitBoost
LogitBoost is one of the state-of-the-art approaches, proposed by Friedman et al. in 1998 [45, 46]. LogitBoost utilizes adaptive Newton steps to fit an additive logistic model. Thus, LogitBoost is a boosting approach with the logistic loss function (negative conditional log-likelihood) as Eq. (10).
where \({P}_{k}\left(x\right)=\frac{\mathrm{exp}\left({F}_{k}\left(x\right)\right)}{\sum_{j=1}^{K}\mathrm{exp}\left({F}_{j}\left(x\right)\right)}\) and \(\sum_{k=1}^{K}{F}_{k}\left(x\right)=0.\)
The boosting procedure in K-class LogitBoost is as follows. At first, all \(n\) samples get the same weight \({w}_{i}^{1}=1/n\). The initial probability of each class for each sample is set to \({P}_{k}\left(x\right)=1/K, k=1, \ldots, K\).
At each iteration of LogitBoost, K independent classifiers are fitted, each minimizing the overall loss function with respect to the kth class. So, an ensemble of classifiers is maintained for each class. The final ensemble model is chosen by Eq. (11).
where \({F}_{k}\left(x\right)=\sum_{m=1}^{M}{f}_{k}^{m}\left(x\right)=\sum_{m=1}^{M}\frac{K-1}{K}\left({f}_{k}^{m}\left(x\right)-\frac{1}{K}\sum_{j=1}^{K}{f}_{j}^{m}\left(x\right)\right)\).
The LogitBoost algorithm was designed as an alternative solution to address the limitations of AdaBoost in handling noise and outliers. It is less sensitive to outliers and noise because it uses a logistic loss function, which grows linearly with the classification error. In contrast, AdaBoost uses an exponential loss function, which grows exponentially with the classification error and makes AdaBoost more sensitive to noise and outliers.
LogitBoost supports both binary and multiclass classification. The algorithm handles multiclass problems by considering the multiclass logistic loss. Other multiclass variants of LogitBoost have been proposed as well, e.g., ABC-LogitBoost [47] and AOSO-LogitBoost [48] (a one-vs-one LogitBoost for multiclass problems).
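The class-probability mapping \({P}_{k}\left(x\right)\) above is simply a softmax over the per-class ensemble scores \({F}_{k}\left(x\right)\). A small NumPy illustration (the score values are made up; the max-subtraction is a standard numerical-stability trick that does not change the result):

```python
import numpy as np

def class_probs(F):
    """Softmax over per-class ensemble scores F (shape: n_samples x K),
    i.e. P_k(x) = exp(F_k(x)) / sum_j exp(F_j(x))."""
    e = np.exp(F - F.max(axis=1, keepdims=True))   # stabilized exponentials
    return e / e.sum(axis=1, keepdims=True)

F = np.array([[2.0, 0.5, -2.5]])   # scores sum to 0, as LogitBoost requires
P = class_probs(F)
print(P.sum())     # probabilities sum to 1
print(P.argmax())  # the class with the largest score gets the largest probability
```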
GradientBoost
GradientBoost is a tree-based method that is applicable to various kinds of loss functions. Like other boosting algorithms, it builds a stage-wise additive model. Unlike traditional boosting, in which the weak learners fit a model to the output values of the samples, in each iteration of this algorithm the decision trees are generated by fitting the negative gradients [49]. The negative gradient is also called the residual error, a function of the difference between the predicted value and the real value of the output. Generally, in GradientBoost the model is initialized with a constant value \(\gamma \) (a tree with just one leaf node) that minimizes the loss function over all the samples, as Eq. (12), \({F}^{0}\left(x\right)=\underset{\gamma }{\mathrm{argmin}}\sum_{i=1}^{n}L\left({y}_{i},\gamma \right)\).
Then, in each iteration, the negative gradient of the loss function is computed with respect to the current ensemble, as in the following equation: \({r}_{i}^{m}=-{\left[\frac{\partial L\left({y}_{i},F\left({x}_{i}\right)\right)}{\partial F\left({x}_{i}\right)}\right]}_{F\left(x\right)={F}^{m-1}\left(x\right)}\).
A decision tree is fitted on the negative gradients (residuals), and then the values of the leaf nodes are computed so as to minimize the loss function, as Eq. (14), \({\gamma }_{jm}=\underset{\gamma }{\mathrm{argmin}}\sum_{{x}_{i}\in {R}_{jm}}L\left({y}_{i},{F}^{m-1}\left({x}_{i}\right)+\gamma \right)\).
where \({J}_{m}\) is the number of leaf nodes in the mth-iteration tree, and \({\gamma }_{jm}\) is the value of the jth leaf node at iteration m. Note that the summation of loss values is calculated over the samples that belong to that leaf node's region \({R}_{jm}\). Finally, the new model is added to the ensemble as the following equation: \({F}^{m}\left(x\right)={F}^{m-1}\left(x\right)+\sum_{j=1}^{{J}_{m}}{\gamma }_{jm}{\mathbb{l}}\left(x\in {R}_{jm}\right)\).
Besides the effective boosting process in GradientBoost, a shrinkage parameter \(\nu \) is also used to control overfitting to the training data. In other words, in each boosting iteration, the selected model is multiplied by a coefficient between 0 and 1, which indicates the learning rate (shrinkage value). GradientBoost supports both binary and multiclass classification.
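In scikit-learn's `GradientBoostingClassifier`, the shrinkage \(\nu \) is exposed as the `learning_rate` parameter; the dataset and hyperparameter values below are chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# A 3-class problem, to show multiclass support
X, y = make_classification(n_samples=500, n_classes=3, n_informative=6,
                           random_state=0)

# Each fitted tree's contribution is multiplied by learning_rate (the shrinkage nu)
gb = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                max_depth=2, random_state=0)
gb.fit(X, y)
print(gb.score(X, y))
```

A smaller `learning_rate` makes each tree contribute less, which typically requires more trees but regularizes the ensemble against overfitting.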
CatBoost
CatBoost is a novel algorithm for gradient boosting on decision trees that can handle categorical features during the training phase. It was developed by Yandex researchers and is used for search, recommendation systems, personal assistants, self-driving cars, weather prediction and many other tasks at Yandex and other companies. CatBoost has both CPU and GPU implementations, which accelerates the training process [50].
Like all gradient-based boosting approaches, CatBoost builds each tree in two phases: choosing the tree structure, and then setting the values of the leaves for the fixed tree. One important improvement in CatBoost is its unbiased gradient estimation, which controls overfitting. To this end, in each boosting iteration, to obtain an unbiased gradient estimate for a sample, CatBoost excludes that sample from the training set of the current ensemble model. The other improvement is the automatic transformation of categorical features into numerical features without any preprocessing phase. CatBoost is applicable to both binary and multiclass problems.
XGBoost
XGBoost is an optimized version of GradientBoost, introduced by Chen in 2016. XGBoost improves traditional GradientBoost by adding efficient techniques for controlling overfitting, split finding and handling missing values in the training phase [51]. To control overfitting, the objective (minimization) function consists of two parts, the loss function and a regularization term that controls the complexity of the model, as Eq. (16).
where \(\Omega \left({f}^{m}\right)\) is the regularization term. In XGBoost, a second-order approximation is used to optimize the objective function. Accordingly, in each iteration, the best tree is selected using Eq. (17).
where \(T\) is the number of leaf nodes, \({G}_{j}\) and \({H}_{j}\) are the sums of the first- and second-order gradient statistics of the loss function over the samples of the jth leaf node, respectively, and \(\lambda \) and \(\gamma \) are the regularization coefficients. Thus, in this approach, the complexity of the tree is chosen and controlled separately in each iteration, and the number of leaves is not fixed across iterations. After the best tree is selected in each iteration, the values of the leaf nodes are calculated from the gradient statistics of each leaf using Eq. (18), \({w}_{j}^{*}=-\frac{{G}_{j}}{{H}_{j}+\lambda }\).
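Given the gradient sums of a leaf, the optimal leaf value from the second-order objective is \(-{G}_{j}/\left({H}_{j}+\lambda \right)\); a tiny numeric check (the gradient values are arbitrary):

```python
import numpy as np

def leaf_value(g, h, lam=1.0):
    """Optimal leaf weight of the second-order objective:
    w_j = -G_j / (H_j + lambda), where G_j and H_j are the sums of the
    first- and second-order gradients over the leaf's samples."""
    G, H = np.sum(g), np.sum(h)
    return -G / (H + lam)

g = np.array([0.5, -0.2, 0.3])    # first-order gradients of the leaf's samples
h = np.array([0.25, 0.25, 0.25])  # second-order gradients (Hessians)
w = leaf_value(g, h, lam=1.0)
print(round(w, 4))  # -0.6 / 1.75 = -0.3429
```

Larger \(\lambda \) shrinks the leaf values towards zero, which is how the regularization term tempers individual trees.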
In addition, an approximate technique is used for split finding in each node of the tree. To this aim, for each feature, all the instances are sorted by that feature's value, and then a linear search is done to find the best split along that feature. The overall best split is chosen by comparing the best splits of all the features. Furthermore, after constructing the tree, in each node the direction with the maximum score is marked to create a default path in the tree for classifying data with missing values.
The XGBoost algorithm has made excellent improvements at the implementation level. It utilizes parallel learning, out-of-core computation and cache-aware access techniques to minimize the time complexity of learning; thus, it is applicable to very large datasets.
LightGBM
LightGBM, an efficient implementation of GradientBoost, was proposed by Ke et al. to solve the efficiency and scalability problems of GradientBoost in the case of high-dimensional feature spaces or large data sizes. In GradientBoost, for each feature, all the data samples need to be scanned and all the possible split points examined; therefore, the most time-consuming part of GradientBoost is the splitting procedure in each tree node. In LightGBM, two novel techniques called Gradient-based One-Side Sampling (GOSS) and Exclusive Feature Bundling (EFB) are proposed. In the GOSS technique, only the small set of samples with large gradients is fully considered in the split-point selection. This works because samples with larger gradients play a more important role in the computation of the information gain. In the EFB technique, to reduce the number of features, mutually exclusive features are bundled using a greedy approach. LightGBM only supports binary problems [52].
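A simplified sketch of the GOSS idea: keep the top-\(a\) fraction of samples by gradient magnitude, randomly sample a \(b\) fraction of the rest, and up-weight the sampled small-gradient rows by \((1-a)/b\) to keep the gain estimate unbiased. The fractions follow the LightGBM paper, but the implementation details below are ours:

```python
import numpy as np

def goss_sample(grad, a=0.2, b=0.1, rng=None):
    """Keep the top-a fraction by |gradient|, randomly sample a b fraction
    of the remainder, and reweight the sampled rows by (1 - a) / b."""
    rng = np.random.default_rng(rng)
    n = len(grad)
    order = np.argsort(-np.abs(grad))          # indices sorted by |gradient|, descending
    top_k = int(a * n)
    top = order[:top_k]                        # large-gradient samples, always kept
    rest = order[top_k:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    idx = np.concatenate([top, sampled])
    weights = np.ones(len(idx))
    weights[top_k:] = (1 - a) / b              # compensate for the undersampling
    return idx, weights

grad = np.linspace(-1, 1, 100)
idx, w = goss_sample(grad, a=0.2, b=0.1, rng=0)
print(len(idx))  # 20 kept + 10 sampled = 30
```

Only 30% of the rows are scanned when evaluating splits, yet the reweighting keeps the information-gain estimate close to the full-data value.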
Boosting for imbalanced data
SMOTEBoost
SMOTEBoost, proposed by Chawla et al., is a data-level method for dealing with the imbalanced data problem. The main steps of the approach are SMOTE sampling and boosting. The algorithm uses the SMOTE technique as its data-level solution: SMOTE adds new minority-class examples to the training dataset by finding the k-nearest neighbors (KNN) of a minority-class example and interpolating between that example and its neighbors to create new examples. In addition, it adopts the AdaBoost algorithm as its ensemble approach. In each iteration of SMOTEBoost, the sampling is done based on the original data, but the training process uses both the original and the synthetic data. As in AdaBoost, after training a classifier the weights of the misclassified samples are updated, but here the update is done only for the original data. Thus, after each boosting step, the weights of the original data change and new synthetic data are generated. One of the main drawbacks of SMOTEBoost is that it generates a large training dataset and is therefore relatively slow in the training phase of the boosting algorithm. SMOTEBoost supports both binary and multiclass problems [53].
RUSBoost
RUSBoost, a data-level solution, was proposed by Seiffert et al. to reduce the computational requirements of SMOTEBoost. RUSBoost combines data sampling and the AdaBoost algorithm. It uses the RUS (random undersampling) approach to balance the initial data; the RUS technique randomly removes examples from the majority classes. Each boosting iteration consists of two phases. In the first phase, random undersampling is performed on all the classes except the minority class, continuing until all the classes reach the same size as the minority class. In the second phase, AdaBoost is applied to the sampled data. So, the training set is not fixed over the iterations. One of the drawbacks of RUSBoost is the data loss in the undersampling step. It supports both binary and multiclass problems [54].
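The multiclass undersampling phase can be sketched as follows (variable names and the toy data are ours):

```python
import numpy as np

def random_undersample(X, y, rng=None):
    """Randomly subsample every class down to the minority-class size."""
    rng = np.random.default_rng(rng)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()                         # minority-class size
    keep = []
    for c in classes:
        rows = np.flatnonzero(y == c)
        keep.extend(rng.choice(rows, size=n_min, replace=False))
    keep = np.array(keep)
    return X[keep], y[keep]

X = np.arange(20).reshape(10, 2)
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 2])     # class sizes 5 / 3 / 2
Xb, yb = random_undersample(X, y, rng=0)
print(np.unique(yb, return_counts=True)[1])      # [2 2 2]
```

The discarded majority-class rows are the "data loss" the text mentions; running the sampler with fresh randomness in each boosting round, as RUSBoost does, mitigates this because different rows are dropped each time.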
MEBoost
MEBoost is a novel boosting approach proposed for handling imbalanced data with an algorithm-level solution. MEBoost mixes two different weak learners with boosting to improve performance and take the benefits of both classifiers. In each boosting iteration, it uses either a decision tree (with information entropy as the splitting metric) or an extra-trees classifier (a randomized tree classifier) as its learner. In this algorithm, the ensemble model is tested according to the AUC (area under the receiver operating characteristic curve) score: it keeps adding weak learners to the model until there is no significant change in the AUC on the test data. Accordingly, MEBoost has an early stopping criterion for the case where the improvement between two boosting iterations is not significant. MEBoost only supports binary problems [55].
AdaC
AdaC was proposed by Sun et al. to improve imbalanced data classification. The main idea of this approach is an algorithm-level solution that adds a cost item into the learning procedure of AdaBoost. The cost item \({C}_{i}\in [0,+\infty )\) (for each sample \(i\)) is added to increase the impact of minority class samples in the training phase and improve performance. To this end, the cost item is set to higher values for minority class samples and lower values for majority class samples; the cost value of each sample reflects the importance of classifying that sample correctly [56].
The AdaC algorithm has three versions, namely AdaC1, AdaC2, and AdaC3. The difference between these three versions lies in how the cost is added to the weight update equation of AdaBoost. In the AdaC1 algorithm, the cost item is embedded inside the exponent part. The weight update equation of AdaC1 is provided in Eq. (19).
Moreover, the step size parameter \({\alpha }^{m}\) is modified to the form shown in Eq. (20).
In the AdaC2 algorithm, the cost item is embedded outside the exponent part. The weight update equation of AdaC2 is shown in Eq. (21).
The step size parameter \({\alpha }^{m}\) is modified to the form shown in Eq. (22).
The AdaC3 algorithm is the combination of AdaC1 and AdaC2, in which the cost item is embedded both inside and outside the exponent part, as in Eq. (23).
The step size parameter \({\alpha }^{m}\) then has the form provided in Eq. (24).
The AdaC algorithms only support binary classification.
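As an illustration of how a cost item enters the boosting update, the following NumPy sketch implements an AdaC2-style step (costs applied outside the exponent), following the description above. It is a simplified sketch, not the authors' implementation; the function name and the test values are ours.

```python
import numpy as np

def adac2_update(w, cost, y, h):
    """One AdaC2-style reweighting step: the cost C_i multiplies the
    sample weight outside the exponent.  y and h are labels in {-1, +1}."""
    correct = (y == h)
    # step size from cost-weighted sums of correctly / wrongly classified weights
    alpha = 0.5 * np.log(np.sum(cost[correct] * w[correct]) /
                         np.sum(cost[~correct] * w[~correct]))
    w_new = cost * w * np.exp(-alpha * y * h)
    return w_new / w_new.sum(), alpha      # renormalise so weights sum to 1
```

Because the cost multiplies the weight directly, a misclassified minority sample (large \(C_i\)) gains weight faster than a misclassified majority sample, which is exactly the behaviour the cost-sensitive variants aim for.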
AdaCost
AdaCost, another algorithm-level cost-sensitive AdaBoost approach, was proposed by Fan et al. in 1999 [57]. The main idea is to emphasize classification accuracy on the minority class by assigning larger cost values to minority class samples. In this approach, the weight of a misclassified minority class sample increases more than that of a misclassified majority class sample; moreover, the weight of a correctly classified sample decreases less if it belongs to the minority class. As a result, the weight update equation is changed by adding a cost adjustment function \(\beta \), which is a function of the cost item \(C\), as in Eq. (25).
where \({\beta }_{+}\) refers to the case \(\mathrm{sign} \left({f}^{m}\left({x}_{i}\right)\cdot{y}_{i}\right)=+1\) and \({\beta }_{-}\) refers to the case \(\mathrm{sign} \left({f}^{m}\left({x}_{i}\right)\cdot{y}_{i}\right)=-1\). In the main reference, the recommended settings are \({\beta }_{+}=-0.5{C}_{i}+0.5\) and \({\beta }_{-}=0.5{C}_{i}+0.5\). Hence, the step size parameter \({\alpha }^{m}\) is modified to Eq. (26).
AdaCost algorithm only supports binary classification.
Experimental study
Dataset characteristics
The aforementioned boosting algorithms are evaluated on 15 conventional and 4 big multiclass imbalanced datasets from UCI^{Footnote 1} and OpenML^{Footnote 2} repositories. The characteristics of these datasets are given in Table 2.
Performance metrics
Using appropriate evaluation metrics is essential for assessing the classification performance of a learning algorithm. Accuracy and error rate are the most widely used metrics for this purpose; however, in imbalanced datasets they are biased toward the majority class, as described in the “Handling imbalanced data problem” section. Therefore, the values of these popular metrics do not reflect the classifier's ability to predict examples from the minority classes.
Alternative criteria have been used for this scenario, which can be roughly categorized into single-class and overall performance measures. The first category shows how well a classifier performs on one class, while the second allows a clear and intuitive interpretation of the classifier's performance over all classes.
Although many evaluation metrics have been proposed in the literature to measure classifier performance on the imbalanced data problem, Precision, Recall, F-measure, AUC, G-mean, Kappa, and MCC are the most prevalently used metrics in the two categories mentioned above. All of these metrics are calculated from the confusion matrix. A confusion matrix is a two-dimensional matrix that presents a clear picture of how a classifier performs in predicting new samples [71]. One dimension is indexed by the actual class of an object and the other by the class that the classifier predicts. Figure 1 shows the confusion matrix for a multiclass classification task. The entry N_{ij}, with \(i\ne j\), is the number of samples classified as class C_{j} that actually belong to class C_{i}. The best classifier will have zero values in the non-diagonal entries.
As in previous studies, the minority class is considered positive, while the majority class is considered negative.
Singleclass measures
Single-class metrics are calculated for each class and are less sensitive to class imbalance; they are therefore naturally better suited to evaluating classifiers on skewed data domains. In this section, the single-class metrics used are briefly reviewed.
The Precision metric measures the proportion of samples predicted as positive that truly belong to the positive class, and is defined as
where TP and FP denote the counts of true positives and false positives, respectively.
Recall measures the proportion of all real positive samples that are correctly identified, and is computed by Eq. (28):
where FN denotes the count of false negatives.
In general, there is a trade-off between precision and recall. For an intuitive view of classifier performance based on both metrics, the F-measure [72] can be used. It is the harmonic mean of precision and recall, in which the user can adjust the relative importance of recall versus precision through the β parameter; usually, this parameter is set to 1. The following equation shows how the F-measure is computed:
The F-measure can be extended to multiclass cases by micro-averaging or macro-averaging. In micro-averaging, the F-measure is computed globally over all classes: its elements (Precision and Recall) are obtained by summing the counts over all classes. In macro-averaging, it is computed locally for each class, and averaging these per-class values yields a single F-measure value for all classes.
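The two averaging schemes can be computed directly from a multiclass confusion matrix, as in the following NumPy sketch (the function name is ours):

```python
import numpy as np

def f_measures(cm):
    """Micro- and macro-averaged F-measure (beta = 1) from a multiclass
    confusion matrix with cm[i, j] = actual class i, predicted class j."""
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp           # predicted as class j but wrong
    fn = cm.sum(axis=1) - tp           # actually class i but missed
    # micro: pool TP/FP/FN over all classes first, then one F1
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    # macro: per-class F1, then an unweighted mean over classes
    per_class = 2 * tp / (2 * tp + fp + fn)
    return micro, per_class.mean()
```

Note that for single-label multiclass problems the pooled FP and FN counts are equal, so the micro-averaged F-measure coincides with overall accuracy, whereas the macro average weights every class equally regardless of its size.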
Because it is insensitive to the imbalanced class distribution, the geometric mean (G-mean) is another suitable single-class metric for imbalanced data problems [19]; it is defined as Eq. (30).
Like the F-measure, this metric can be extended by micro-averaging or macro-averaging for multiclass problems; however, in this study the extension proposed by Sun et al. [19] is used, which takes the geometric mean of the recall values of all classes, as defined in Eq. (31).
Although both G-mean and F-measure can be used to evaluate a learning method on class-imbalanced classification problems, the F-measure becomes undefined when the classifier assigns all samples to the negative class. Therefore, in this study we use G-mean as the single-class performance evaluation criterion.
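The multiclass G-mean of Eq. (31) can be computed directly from the confusion matrix, as in this short sketch (the function name is ours):

```python
import numpy as np

def gmean(cm):
    """Multiclass G-mean: geometric mean of the per-class recalls
    (Sun et al.'s extension), computed from a confusion matrix."""
    recalls = np.diag(cm) / cm.sum(axis=1)   # recall of each class
    return float(recalls.prod() ** (1.0 / len(recalls)))
```

Because the recalls are multiplied, a single class with zero recall drives the whole G-mean to zero, which is why the metric is a useful indicator of entirely unlearned classes.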
Overall metrics
The area under the curve (AUC) of the receiver operating characteristic (ROC) is widely used as an overall evaluation metric, especially for ranking binary classifiers in the presence of class imbalance. The ROC curve shows all possible trade-offs between the true-positive rate (TPR) and false-positive rate (FPR) across decision thresholds, and the AUC metric summarizes this curve as a value in the range [0.5, 1], where 1 indicates a perfect classifier and 0.5 means the classifier performs no better than random guessing [73].
Although extending the AUC to multiclass problems is an open research topic, the MAUC metric [20], which averages the AUC over all pairs of classes, is the one most commonly used in multiclass imbalanced data learning research; it is defined as
where C denotes the number of classes, A(i, j) is the AUC between class i and class j.
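The pairwise averaging can be sketched as follows, using the Mann–Whitney rank formulation of the AUC. This is an illustrative sketch (no tie handling; it assumes the score columns follow the sorted class order, and all names are ours):

```python
import numpy as np

def auc_pair(scores_i, labels, i, j):
    """AUC separating class i from class j, using the score assigned to
    class i, via the rank-sum (Mann-Whitney) formulation."""
    mask = (labels == i) | (labels == j)
    s, y = scores_i[mask], labels[mask]
    ranks = np.argsort(np.argsort(s)) + 1       # ranks 1..n (no tie handling)
    n_i, n_j = np.sum(y == i), np.sum(y == j)
    return (ranks[y == i].sum() - n_i * (n_i + 1) / 2) / (n_i * n_j)

def mauc(score, labels):
    """MAUC: average A(i, j) over all class pairs, with
    A(i, j) = (AUC_i|j + AUC_j|i) / 2.  score[n, c] is the score of
    sample n for the c-th class (columns in sorted class order)."""
    classes = np.unique(labels)
    c = len(classes)
    total = 0.0
    for a in range(c):
        for b in range(a + 1, c):
            i, j = classes[a], classes[b]
            total += 0.5 * (auc_pair(score[:, a], labels, i, j) +
                            auc_pair(score[:, b], labels, j, i))
    return 2 * total / (c * (c - 1))
```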
The Matthews correlation coefficient (MCC) is the third overall evaluation metric used in this study to evaluate the performance of the boosting algorithms; it is especially popular in biomedical imbalanced data analysis [74]. MCC incorporates all entries of the confusion matrix to compute a measure of the correlation between actual and predicted labels. Its value lies between − 1 and + 1, where + 1 means perfect prediction and − 1 complete inverse prediction [75]. This metric can be calculated directly from the confusion matrix using Eq. (33).
Furthermore, MCC is a binary classification evaluation metric, and its behavior on multiclass imbalanced datasets has not yet been well studied; however, it has been shown that the behavior of MCC remains consistent in multiclass settings [76]. Based on these facts and inspired by the MAUC metric, we extend the MCC metric to the multiclass case, called MMCC, by employing the OVO decomposition strategy. Like MAUC, MMCC averages the MCC value over all pairs of classes and is defined as Eq. (34).
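A sketch of the MMCC computation from a multiclass confusion matrix follows. Here each pairwise MCC is read from the 2 × 2 sub-matrix of the class pair, ignoring predictions that fall on other classes; that restriction, like the function names, is an assumption of this illustration.

```python
import numpy as np

def mcc_pair(cm, i, j):
    """Binary MCC for the class pair (i, j), read directly from the
    multiclass confusion matrix (predictions to other classes ignored,
    an assumption of this sketch)."""
    tp, fn = cm[i, i], cm[i, j]
    fp, tn = cm[j, i], cm[j, j]
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return (tp * tn - fp * fn) / denom if denom else 0.0

def mmcc(cm):
    """MMCC: average the pairwise MCC over all class pairs (OVO),
    in the spirit of Eq. (34)."""
    c = cm.shape[0]
    pairs = [(i, j) for i in range(c) for j in range(i + 1, c)]
    return sum(mcc_pair(cm, i, j) for i, j in pairs) / len(pairs)
```

A perfectly diagonal confusion matrix yields MMCC = 1, and a fully anti-diagonal binary matrix yields − 1, matching the stated range of the MCC.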
Parameter settings
The performance of the algorithms is investigated with default parameter settings, except that the nearest-neighbor parameter of SMOTEBoost is set to 3, and the population of each minority class is increased to be approximately equal to the size of the majority class. In addition, since gradient boosting methods usually use between 8 and 32 leaves per tree, the C4.5 decision tree with depth 5 is employed as the base learner in all boosting algorithms. Each boosting model consists of 100 trees with a learning rate of 0.3.
For binary boosting algorithms, the one-versus-one (OVO) decomposition method is adopted to split multiclass datasets into multiple binary problems, and the voting aggregation strategy is then used to determine the class of new data samples from the aggregated classifiers [77].
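The OVO-with-voting scheme can be sketched as follows. The nearest-class-mean learner here is only a stand-in for a real binary booster, and all names are ours:

```python
import numpy as np
from itertools import combinations

def fit_nearest_mean(X, y):
    """A trivial binary learner used only for illustration:
    classify each sample by the nearest class mean."""
    cs = np.unique(y)
    mus = {c: X[y == c].mean(axis=0) for c in cs}
    def predict(Xt):
        return np.array([min(cs, key=lambda c: np.linalg.norm(x - mus[c]))
                         for x in Xt])
    return predict

def ovo_predict(X, y, X_test, fit_binary):
    """One-versus-one decomposition with majority-vote aggregation:
    train one binary model per pair of classes, then let each model
    vote for the class it predicts on every test sample."""
    classes = np.unique(y)
    idx = {c: k for k, c in enumerate(classes)}
    votes = np.zeros((len(X_test), len(classes)), dtype=int)
    for i, j in combinations(classes, 2):
        mask = (y == i) | (y == j)            # keep only this pair's samples
        predict = fit_binary(X[mask], y[mask])
        for n, p in enumerate(predict(X_test)):
            votes[n, idx[p]] += 1
    return classes[votes.argmax(axis=1)]
```

With K classes this trains K(K−1)/2 binary models, which is exactly why the binary boosting algorithms in the experiments must run the boosting process once per class pair.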
In the experiments, the means and standard deviations of the three performance criteria described in the previous section are calculated for the algorithms, and the procedure is repeated 10 times using five-fold cross-validation to reduce any bias due to randomness. For creating the folds, stratified sampling is used instead of simple random sampling to guarantee that each fold contains samples from every class, especially the minority classes.
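Stratified fold assignment can be sketched as follows (illustrative only; the experiments may rely on a library implementation, and the function name is ours):

```python
import numpy as np

def stratified_folds(y, k=5, rng=None):
    """Assign each sample to one of k folds so that every fold keeps
    (approximately) the class proportions of y."""
    rng = np.random.default_rng(rng)
    fold = np.empty(len(y), dtype=int)
    for c in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == c))
        fold[idx] = np.arange(len(idx)) % k   # deal this class round-robin
    return fold
```

Dealing each class's shuffled samples round-robin over the folds guarantees that even a small minority class is spread across all k folds rather than concentrated in one.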
Statistical comparison of boosting algorithms
To compare the differences between the classification algorithms more rigorously, a suite of statistical tests is carried out on the experimental results to validate them further and determine whether significant differences exist among the algorithms.
Methodologically, statistical test procedures can be categorized into parametric methods (e.g., the paired t-test and ANOVA) and non-parametric methods (e.g., the Wilcoxon and Friedman tests). Although parametric tests are more powerful than non-parametric ones, they rest on strict assumptions that are often violated in machine learning studies, making their results unreliable [78]. Therefore, non-parametric tests are preferred for comparing the results of classifiers.
According to the recommendation of Demšar [79], the Friedman test is used in this study to compare multiple classifiers over multiple datasets. The null hypothesis (that there is no significant difference between the performance of the algorithms) is rejected at a specified significance level when the Friedman test statistic (\({\chi }_{F}^{2}\)), computed as in Eqs. (35) and (36), exceeds the critical value from the chi-square table. Rejection of the null hypothesis implies that at least two classification algorithms differ significantly.
where D and M indicate the numbers of datasets and classification algorithms, AR_{j} is the average rank of the jth classifier, and \({r}_{i}^{j}\) is the rank of the jth classifier on the ith dataset.
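With these definitions, the Friedman statistic of Eqs. (35)–(36) reduces to a few lines of NumPy (a sketch; the function name is ours):

```python
import numpy as np

def friedman_stat(ranks):
    """Friedman chi-square statistic from a D x M matrix of ranks
    (D datasets, M algorithms), following Eqs. (35)-(36)."""
    D, M = ranks.shape
    AR = ranks.mean(axis=0)      # average rank of each algorithm over datasets
    return 12.0 * D / (M * (M + 1)) * (np.sum(AR ** 2) - M * (M + 1) ** 2 / 4.0)
```

When every algorithm has the same average rank the statistic is zero, and it grows as the average ranks spread apart, which is what the chi-square comparison detects.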
After rejection of the null hypothesis, a post-hoc test should be conducted to determine whether the control algorithm (usually the best one) presents significant statistical differences from the rest. In this study, the Holm post-hoc test [79] is used for this purpose. Its test statistic is computed as in Eq. (37), and the Z value is used to find the p-value from the normal distribution table at a significance level of \(\alpha =\) 0.05. The null hypothesis is rejected if the p-value is lower than Holm's critical value.
where R_{i} and R_{j} are the ranks of the ith and jth classifiers, respectively.
However, the Holm test does not detect differences between individual pairs of algorithms. For pairwise comparisons, and to determine whether significant differences exist between a pair of classifiers, the Wilcoxon signed-rank test [80] is adopted.
Comparison of evaluation metrics
One of the key challenges in imbalanced data classification is selecting an appropriate evaluation metric for a sound comparison of classification algorithms. To this end, the statistical test proposed in [21] is carried out to compare the evaluation metrics. First, the consistency of the evaluation metrics with each other is examined. Second, the degree of discriminancy of the metrics is investigated. Finally, the degree of indifferency of the metrics is tested, as proposed in [22].
Degree of consistency
For a meaningful comparison of two algorithms by two evaluation metrics, the metrics should at least behave consistently [74]. Two evaluation metrics are consistent with each other in the evaluation of two algorithms A and B if, when metric f states that algorithm A is better than algorithm B, metric g does not state that algorithm B is better than algorithm A. The degree of consistency (DoC) of two evaluation metrics f and g is calculated as follows.
where R and C are defined as Eqs. (39) and (40) in the domain \(\Psi \).
The value C shows the degree of consistency of the two metrics f and g on algorithms A and B; its value lies in the interval \(\left[0,1\right]\).
Degree of discriminancy
Besides the consistency of two evaluation metrics, there are cases where metric f can distinguish between two algorithms (stating, for example, that algorithm A is better than algorithm B) while metric g cannot. In such cases, metric f is more discriminating than metric g. The degree of discriminancy (DoD) of metric f over metric g is calculated as follows.
where P and Q are defined as Eqs. (42) and (43) in the domain \(\Psi \).
The value D shows the degree of discriminancy of metric f over metric g on algorithms A and B. To claim that one metric is better than another, the conditions \(C>0.5\) and \(D>1\) should both be satisfied. In that case, we can conclude that the two metrics f and g are statistically consistent with each other and that metric f is D times better at telling the difference between two algorithms.
Degree of indifferency
There are also cases in which neither of the two metrics can report a difference between two algorithms A and B. The degree of indifferency (DoI) of two evaluation metrics f and g is calculated as follows.
where V and U are defined as Eqs. (45) and (46) in the domain \(\Psi \).
The value I shows the degree of indifferency of metrics f and g on algorithms A and B.
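Under our reading of the definitions above, the three degrees can be estimated from paired metric values as follows (an illustrative sketch; the function name and array layout are ours):

```python
import numpy as np

def degrees(f, g):
    """Degree of consistency (C), discriminancy (D) and indifferency (I)
    between two metrics over paired results.  Rows of f and g are trials;
    column 0 holds the metric value of algorithm A, column 1 that of B."""
    df = np.sign(f[:, 0] - f[:, 1])     # +1: f prefers A, -1: prefers B, 0: tie
    dg = np.sign(g[:, 0] - g[:, 1])
    both = (df != 0) & (dg != 0)
    agree = np.sum(both & (df == dg))
    disagree = np.sum(both & (df != dg))
    C = agree / (agree + disagree) if agree + disagree else np.nan
    P = np.sum((df != 0) & (dg == 0))   # f discriminates where g cannot
    Q = np.sum((dg != 0) & (df == 0))   # g discriminates where f cannot
    D = P / Q if Q else np.inf
    I = np.mean((df == 0) & (dg == 0))  # neither metric discriminates
    return C, D, I
```

The NaN and infinite values that can arise here (when one of the denominators is empty) are exactly the cases that force the experimental section to report the ratio of D > 1 cases instead of averaged D values.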
Experimental results and analysis
In this section, which consists of two parts, we empirically compare the boosting algorithms reviewed in the “General Boosting approaches” section. In the first part, the performance and computation time of the algorithms are compared on 15 multiclass imbalanced conventional datasets through various statistical tests, and the best evaluation metric is then determined based on the obtained results. In the second part, the performance of the algorithms is investigated on 4 multiclass imbalanced big datasets.
The experiments were conducted on a computer with an Intel Core i3 2.13 GHz CPU and 8 GB RAM, running the Microsoft Windows 10 64-bit operating system. The overall results were obtained by averaging the evaluation criteria over 10 runs of the algorithms (\(10\times 5\) cross-validation). For the sake of clarity, the libraries of all algorithms were installed with the pip Python installer (e.g., sudo pip install xgboost), except MEBoost, SMOTEBoost, and AdaCost, whose Python implementations are freely available in GitHub^{Footnote 3} repositories.
Experiments on conventional datasets
The performance of the boosting algorithms is studied on 15 multiclass imbalanced conventional datasets as follows.
MAUC as a measure
Table 3 shows the averages and standard deviations of the MAUC results for the boosting algorithms, with the best result for each dataset in bold. Although CatBoost presents the highest average MAUC, a non-parametric statistical test on the results is needed for assurance. The Friedman test yields a p-value of 0.043, which rejects the null hypothesis. The average ranks from this test are presented in Fig. 2: CatBoost and AdaC1 achieved the best (lowest) and worst (highest) ranks, respectively. Therefore, we select CatBoost as the control method and continue with the Holm post-hoc test. Based on the p-values presented in Table 4, CatBoost statistically outperforms AdaBoost, SAMME, MEBoost, and RUSBoost.
For a comprehensive analysis of the differences between pairs of algorithms, the Wilcoxon signed-rank test is used; its results are presented in Tables 5 and 6. Table 5 shows the ranks computed by the test. Each number below the main diagonal is the sum of ranks for the datasets on which the row algorithm outperforms the corresponding column algorithm (\({R}_{+}\)), while each number above the main diagonal is the sum of ranks for the datasets on which the column algorithm is worse than the corresponding row algorithm (\({R}_{-}\)). Two classifiers are significantly different, and the null hypothesis is rejected, if the test statistic \(\mathrm{min}({R}_{+}, {R}_{-})\) is less than or equal to the critical value, which is 25 for a confidence level of α = 0.05 and N = 15 according to the Wilcoxon table.
The summary of the comparison results is presented in Table 6, in which the mark ● indicates that the row algorithm significantly outperforms the corresponding column algorithm. As can be observed, CatBoost outperforms most of the algorithms, and SMOTEBoost, which holds the second rank in Fig. 2, outperforms four algorithms. The binary boosting algorithms could not outperform any other algorithm, except AdaCost, which outperforms only AdaC1. We can therefore conclude that binary boosting algorithms do not perform better on multiclass classification problems in terms of the MAUC criterion. Another interesting finding is that under-sampling is not a satisfactory solution to the class imbalance problem, judging by the results of RUSBoost, whereas SMOTEBoost, which uses over-sampling to handle imbalanced data, obtained the second Friedman rank and outperformed four algorithms.
Comparing the family of binary boosting algorithms to which the OVO strategy was applied to handle multiclass problems, besides obtaining higher (worse) Friedman ranks, none of them could outperform the other algorithms, except AdaCost, which outperforms only AdaC1 in terms of the MAUC criterion.
Among the cost-sensitive boosting approaches, AdaCost shows better overall performance; however, AdaC3 presents a better average MAUC value than the others and works better than all algorithms on 3 datasets.
MMCC as a measure
Table 7 shows the average cross-validation results for the MMCC measure, reported in the same form as Table 3. Except for AdaBoost.MH, SAMME, RUSBoost, and AdaC1, every algorithm presents the best MMCC on at least one dataset. The p-value computed by the Friedman test is 0.0439, which rejects the null hypothesis, and according to Fig. 3 the CatBoost algorithm achieves the lowest average rank. Selecting CatBoost as the control method, the Holm post-hoc test fails to find a significant difference between CatBoost and the others, as stated in Table 8. In contrast, the Wilcoxon signed-rank test shows that CatBoost outperforms ten algorithms, as presented in Tables 9 and 10.
From the MMCC viewpoint, according to Table 10, both XGBoost and SMOTEBoost can be considered the second-best algorithms, each outperforming four other algorithms; in particular, they are significantly better than the cost-sensitive boosting methods and the older AdaBoost.MH algorithm.
Another interesting point is the superiority of most of the gradient-based boosting algorithms (CatBoost, GradientBoost, XGBoost) and of SMOTEBoost over the other methods; therefore, integrating the sampling method used in SMOTEBoost (the synthetic minority over-sampling technique) with gradient-based boosting algorithms could further improve their performance on imbalanced datasets.
G-mean as a measure
The G-mean results of the algorithms are reported in Table 11. CatBoost presents the best results on more datasets than the other algorithms and, as shown in Fig. 4, achieves the lowest Friedman average rank. Unlike with the previous criteria, however, RUSBoost achieves the worst rank. The Holm post-hoc test cannot find a significant difference between the control method (CatBoost) and the other methods, as shown in Table 12; however, the Wilcoxon test reveals differences between individual pairs of algorithms.
As presented in Tables 13 and 14, CatBoost, LogitBoost, SMOTEBoost, and XGBoost outperform 5, 3, 3, and 1 algorithms, respectively, in terms of the G-mean criterion. In addition, as in the previous assessments with the MAUC and MMCC measures, no binary boosting algorithm outperforms the other methods.
Computational time study
A computation time analysis was performed to determine which boosting algorithm has the lowest computation time; the results are presented in Table 15. Note that the times are recorded for only one fold of cross-validation in the training phase.
From Table 15, the LightGBM algorithm clearly presents the lowest average computation time. This is an interesting observation, because LightGBM is a binary boosting algorithm: with the OVO decomposition mechanism for handling multiclass problems, it must run the boosting process for K(K−1)/2 pairs of classes (where K is the number of classes). Comparing the computation times per dataset, however, XGBoost is faster than LightGBM on most of the datasets. Considering the classification performance of the algorithms as well, CatBoost shows reasonable computation times.
SMOTEBoost shows the highest computation time, due to generating synthetic samples for the minority classes. Furthermore, although it presents the best performance on the largest dataset (Shuttle), it requires a great deal of training time.
Metric evaluation study
In this section, we investigate which metric is the most suitable for evaluating the predictive performance of the algorithms on multiclass imbalanced problems, by comparing the MAUC, MMCC, and G-mean metrics with the comparison framework on the 15 imbalanced conventional datasets.
Running the boosting algorithms on these datasets with 5-fold cross-validation yields 75 test sets (15 × 5) for each pairwise comparison between evaluation metrics (MAUC vs. MMCC, MAUC vs. G-mean, and MMCC vs. G-mean). In this test, all possible pairs of the 14 boosting algorithms (91 pairs) are considered. First, regardless of which metric is best, the degree of consistency of the evaluation metrics is examined. The results of the DoC comparison between metrics are reported in Table 16; the C value shown for each pair of metrics is the average of the C values over the 91 pairs of algorithms. As discussed in the “Comparison of evaluation metrics” section, a C value larger than 0.5 indicates the consistency of two metrics. According to Table 16, the C value obtained for every metric pair is larger than 0.9, which easily satisfies this condition. Thus, MAUC is consistent with MMCC with a consistency degree of 0.9853, MAUC is consistent with G-mean with a degree of 0.9334, and MMCC and G-mean are consistent with each other with a degree of 0.9403. Among these three pairwise comparisons, MAUC and MMCC are the most consistent with each other.
To select the best metric, the degree of discriminancy (DoD) of each metric over the others is examined; the results of the DoD comparison are reported in Table 17. Exceptionally, Table 17 reports the ratio of cases with D > 1 instead of average D values, because the presence of NaN and infinite values would make the averages misleading. According to the table, in the comparison between MAUC and G-mean, the value of D for MAUC over G-mean is larger than 1 in all cases (100%); thus MAUC always discriminates better than G-mean, and G-mean wins in none of the 91 pairs of algorithms (0%). Likewise, in the comparison between MMCC and G-mean, D for MMCC over G-mean is larger than 1 in all cases (100%), and D > 1 for G-mean over MMCC is not observed in any of the 91 pairs. These results show that both MAUC and MMCC are more discriminant than G-mean. The final step in selecting the best metric is comparing the D values between MAUC and MMCC. According to Table 17, the DoD of MAUC over MMCC is larger than 1 in only 12% of the cases (11 out of 91), whereas the DoD of MMCC over MAUC is larger than 1 in 70% of the cases (64 out of 91). Thus, MMCC is more discriminant than MAUC and can better distinguish between classification algorithms. The conclusion of the DoD test is that MMCC and MAUC are clearly better than G-mean, and MMCC acts slightly better than MAUC.
Finally, the degree of indifferency (DoI) of the evaluation metrics is tested; the results, averaged over the 91 pairs of algorithms, are reported in Table 18. Comparing MAUC and MMCC, in nearly 8% of the cases neither metric can distinguish between the classification algorithms. Comparing MAUC and G-mean, the average DoI is 0.0853, so in about 8% of the cases neither MAUC nor G-mean can tell the difference between the classification algorithms. Nearly the same holds for the comparison between MMCC and G-mean, where both metrics fail to distinguish the algorithms in about 8% of the comparisons. In all the pairwise DoI comparisons, the value of I is approximately 8%.
Experiments on big datasets
According to [16], a dataset with at least 100,000 instances can be considered a big dataset. Therefore, 4 big multiclass imbalanced datasets, shown in Table 2, were used to investigate the performance of the boosting algorithms [17, 18]. The results for the three performance metrics on these datasets are presented in Table 19.
Based on the previous experiments, which determined that MMCC is the best evaluation metric, we performed the Friedman test on the MMCC results to rank the algorithms. Figure 5 shows the outcome of this statistical test: overall, the LogitBoost algorithm performs better than the other algorithms on big datasets. However, on big datasets with high imbalance ratios (FARS and Poker), LogitBoost could not present good results. For instance, according to Table 19, AdaC3 performs well on the FARS dataset in terms of the MMCC metric. Moreover, all of the boosting methods perform worse than random prediction on the Poker dataset, and most of them present very low performance on FARS. Considering the obtained G-mean values, we can conclude that none of the algorithms could learn one or more of the classes.
Result discussion
Ranking the algorithms over all performance metrics, CatBoost comes first, outperforming more algorithms than any other method. The second-best algorithm is SMOTEBoost, which outperforms 4, 4, and 3 other boosting methods in terms of MAUC, MMCC, and G-mean, respectively. This observation indicates that over-sampling is an effective approach to imbalanced data classification, whereas RUSBoost, which uses under-sampling, could not outperform any other algorithm.
Considering all evaluation metrics, XGBoost and LogitBoost outperform 7 and 6 algorithms in total, respectively, so XGBoost can be considered the third-best algorithm. It should be noted that better results might be obtained from XGBoost by finding the optimal values of its parameters, but this is not an easy task, and its many parameters increase the complexity of doing so.
One interesting observation is that no binary boosting algorithm with the decomposition mechanism is better than the other methods, except AdaCost, which is better than AdaC1 in terms of MAUC. Thus, multiclass boosting algorithms deal with the imbalanced data problem more satisfactorily, because they use a proper margin loss function for multiclass problems. However, as Tables 3, 7, and 11 show, there are some datasets on which binary boosting methods achieve the best results.
In addition, according to the ranks obtained from the Friedman test, the algorithm with the lowest rank (CatBoost) was selected as the control algorithm for the post-hoc test, but the test failed to reject the null hypothesis, especially for the MMCC and G-mean results. Therefore, there are cases in which the Holm post-hoc test fails to detect differences between algorithms, and pairwise statistical tests become necessary.
The average computation times over all datasets indicate that LightGBM, which belongs to the binary boosting set, is faster than the other algorithms even though it must be run for each pair of classes; the reason is that binary methods show lower computation times on large datasets. Among the multiclass boosting methods, XGBoost has the lowest computation time, owing to its design for speed and performance. Finally, SMOTEBoost has the highest computation time, taking many hours on large datasets, because it generates extra samples for the minority classes.
Finally, comparing the metrics through the comparison framework shows that MMCC is the most suitable of the evaluation metrics used for classification on multiclass imbalanced datasets. The conducted tests show that, although all three metrics are consistent with each other and have the same degree of indifferency, MAUC is more discriminant than G-mean, and MMCC is more discriminant than both MAUC and G-mean. The better discriminancy of MMCC can be explained by its computational formula: MMCC incorporates all elements of the confusion matrix, whereas G-mean is calculated from only the recall values of the classes, and MAUC is computed from the area under the ROC curve, which plots the true-positive rate against the false-positive rate of a classifier.
The results obtained for big datasets are different. Overall, although the LogitBoost algorithm performs well, no algorithm presents good performance on highly imbalanced big datasets. The zero or very small values of the G-mean metric show that these boosting methods cannot learn some classes in this kind of dataset. Therefore, the boosting algorithms studied in this work are not suitable for multiclass imbalanced big datasets with high imbalance ratios.
Conclusion
In this paper, the most significant boosting ensemble algorithms for dealing with multiclass imbalanced data have been reviewed. Many real-life problems are naturally multiclass and imbalanced, and in recent years this issue has been handled using ensemble learning techniques. However, the literature lacked a study investigating and comparing the performance of boosting algorithms, as a class of ensemble learning methods, on this type of dataset.
In this analysis, 14 binary and multiclass boosting algorithms were investigated on 19 multiclass conventional and big datasets with various imbalance ratios. By comparing multiple algorithms on multiple datasets, it is observed that the CatBoost algorithm outperforms most of the other boosting methods in terms of MAUC, Gmean, and MMCC on the 15 conventional datasets. SMOTEBoost can be considered the second-best algorithm, outperforming 4, 4, and 3 algorithms in terms of the MAUC, MMCC, and Gmean metrics, respectively, on the conventional datasets. It can therefore be concluded that oversampling is an effective approach for tackling the imbalanced data problem. Both XGBoost and LogitBoost can be regarded as third-ranked algorithms that could outperform the remaining methods. An interesting observation is that only multiclass algorithms, whose training takes place without class decomposition, were able to outperform the others.
A computational time analysis was also performed as part of this study to show which algorithm is fastest. The results indicate that binary algorithms present low computational time, even though they must be run for each pair of classes.
The evaluation metrics were compared based on the results of the algorithms on the conventional datasets, and it was observed that MMCC is the most suitable metric in the multiclass imbalanced data domain. Nevertheless, the Gmean metric provides useful information: a value of zero indicates that the learner could not classify the samples of one or more of the classes.
Finally, the results for big datasets were different. It was observed that the LogitBoost algorithm performs better than the other algorithms overall. In general, however, boosting methods cannot deliver good prediction performance on multiclass big datasets with a high imbalance ratio. Therefore, one open research issue is how to modify the studied boosting methods so that they perform well on highly imbalanced big datasets.
The other potential future works of this study involve the following: (1) An investigation of the effect of preprocessing operators on the performance of the boosting algorithms. In this study, no preprocessing operator such as normalization or standardization was applied to the data, but better results would likely be obtained by applying such operations. (2) A study of parameter values, determining under which values a boosting algorithm, XGBoost for example, could improve its performance on multiclass imbalanced data. (3) In this study, a decision tree was employed as the base learner in the boosting procedure, but any other learning model can be used. Hence, the performance of boosting approaches with other base learners can be investigated on multiclass imbalanced datasets. (4) Comparing the performance of binary boosting algorithms on multiclass data using other decomposition schemes, such as one-versus-all and all-versus-all, to decompose the multiclass problem into multiple binary decision problems. (5) Comparing other evaluation metrics suitable for the multiclass imbalanced domain, such as Kappa, Weighted Accuracy, and other measures.
Availability of data and materials
The datasets used during the current study are available at https://archive.ics.uci.edu/ml/index.php.
Notes
MEBoost: https://github.com/farshidrayhanuiu/MEBoost
SMOTEBoost, AdaC1, AdaC2, AdaC3, AdaCost: https://github.com/gkapatai/MaatPy
Abbreviations
AUC: area under the curve
MAUC: multiclass area under the curve
ROC: receiver operating characteristic
MCC: Matthews correlation coefficient
MMCC: multiclass Matthews correlation coefficient
SMOTE: synthetic minority oversampling technique
OVA: one-versus-all
OVO: one-versus-one
TP: true positive
FP: false positive
TN: true negative
FN: false negative
TPR: true positive rate
FPR: false positive rate
DoC: degree of consistency
DoI: degree of indifferency
References
 1.
Japkowicz N. Learning from imbalanced data sets: a comparison of various strategies. In: AAAI workshop on learning from imbalanced data sets, Vol. 68. 2000. p. 10–5.
 2.
Batista GE, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explorations Newsl. 2004;6(1):20–9.
 3.
Shatnawi R. Improving software fault-prediction for imbalanced data. In: 2012 international conference on innovations in information technology (IIT); 2012. p. 54–9.
 4.
Di Martino M, Decia F, Molinelli J, Fernández A. Improving electric fraud detection using class imbalance strategies. In: ICPRAM; 2012. p. 135–41.
 5.
Majid A, Ali S, Iqbal M, Kausar N. Prediction of human breast and colon cancers from imbalanced data using nearest neighbor and support vector machines. Comput Methods Programs Biomed. 2014;113(3):792–808.
 6.
Liu Y, Loh HT, Sun A. Imbalanced text classification: a term weighting approach. Expert Syst Appl. 2009;36(1):690–701.
 7.
Kubat M, Holte RC, Matwin S. Machine learning for the detection of oil spills in satellite radar images. Mach Learn. 1998;30(2–3):195–215.
 8.
Su P, Mao W, Zeng D, Li X, Wang FY. Handling class imbalance problem in cultural modeling. In: 2009 IEEE international conference on intelligence and security informatics; 2009. p. 251–6.
 9.
Abdi Y, Parsa S, Seyfari Y. A hybrid oneclass rule learning approach based on swarm intelligence for software fault prediction. Innovations Syst Softw Eng. 2015;11(4):289–301.
 10.
Ganganwar V. An overview of classification algorithms for imbalanced datasets. Int J Emerg Technol Adv Eng. 2012;2(4):42–7.
 11.
Kotsiantis S, Kanellopoulos D, Pintelas P. Handling imbalanced datasets: a review. GESTS Int Trans Computer Sci Eng. 2006;30(1):25–36.
 12.
Ferreira AJ, Figueiredo MA. Boosting algorithms: a review of methods, theory, and applications. In: Ensemble machine learning. Boston: Springer; 2012. p. 35–85.
 13.
Wang S, Yao X. Multiclass imbalance problems: analysis and potential solutions. IEEE Trans Syst Man Cybern. 2012;42(4):1119–30.
 14.
Bi J, Zhang C. An empirical comparison on state-of-the-art multi-class imbalance learning algorithms and a new diversified ensemble learning scheme. Knowl-Based Syst. 2018;158:81–93.
 15.
Wu K, Zheng Z, Tang S. BVDT: A boosted vector decision tree algorithm for multiclass classification problems. Int J Pattern Recognit Artif Intell. 2017;31(05):1750016.
 16.
Leevy JL, Khoshgoftaar TM, Bauder RA, Seliya N. A survey on addressing highclass imbalance in big data. J Big Data. 2018;5(1):42.
 17.
Abu-Salih B, Chan KY, Al-Kadi O, Al-Tawil M, Wongthongtham P, Issa T, Saadeh H, Al-Hassan M, Bremie B, Albahlal A. Time-aware domain-based social influence prediction. J Big Data. 2020;7(1):10.
 18.
Sleeman WC IV, Krawczyk B. Bagging using instance-level difficulty for multi-class imbalanced big data classification on Spark. In: 2019 IEEE international conference on big data (Big Data); 2019. p. 2484–93.
 19.
Sun Y, Kamel MS, Wang Y. Boosting for learning multiple classes with imbalanced class distribution. In: Sixth international conference on data mining (ICDM'06); 2006. p. 592–602.
 20.
Zhen L, Qiong L. A new feature selection method for internet traffic classification using ML. Phys Procedia. 2012;33:1338–45.
 21.
Ling CX, Huang J, Zhang H. AUC: a statistically consistent and more discriminating measure than accuracy. Ijcai. 2003;3:519–24.
 22.
Huang J, Ling CX. Using AUC and accuracy in evaluating learning algorithms. IEEE Trans Knowl Data Eng. 2005;17(3):299–310.
 23.
Singh A, Purohit A. A survey on methods for solving data imbalance problem for classification. Int J Computer Appl. 2015;127(15):37–41.
 24.
Fernández A, López V, Galar M, Del Jesus MJ, Herrera F. Analysing the classification of imbalanced data-sets with multiple classes: binarization techniques and ad-hoc approaches. Knowl-Based Syst. 2013;42:97–110.
 25.
Krawczyk B. Learning from imbalanced data: open challenges and future directions. Prog Artif Intell. 2016;5(4):221–32.
 26.
Tahir MA, Asghar S, Manzoor A, Noor MA. A classification model for class imbalance dataset using genetic programming. IEEE Access. 2019;7:71013–37.
 27.
Ramentol E, Caballero Y, Bello R, Herrera F. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowl Inf Syst. 2012;33(2):245–65.
 28.
Liu A, Ghosh J, Martin CE. Generative oversampling for mining imbalanced datasets. In: DMIN; 2007. p. 66–72.
 29.
Kumari C, Abulaish M, Subbarao N. Using SMOTE to deal with classimbalance problem in bioactivity data to predict mTOR inhibitors. In: Proceedings of the international conference on adaptive computational intelligence (ICACI), Mysuru, India; 2019. p. 1–12.
 30.
Colton D, Hofmann M. Sampling techniques to overcome class imbalance in a cyberbullying context. J ComputerAssist Linguistic Res. 2019;3(3):21–40.
 31.
Esteves VM. Techniques to deal with imbalanced data in multiclass problems: a review of existing methods.
 32.
Ling CX, Sheng VS. Costsensitive learning and the class imbalance problem. Encyclopedia Mach Learn. 2008;2011:231–5.
 33.
Maheshwari S, Agrawal J, Sharma S. New approach for classification of highly imbalanced datasets using evolutionary algorithms. Int J Sci Eng Res. 2011;2(7):1–5.
 34.
Błaszczyński J, Stefanowski J. Neighbourhood sampling in bagging for imbalanced data. Neurocomputing. 2015;150:529–42.
 35.
Rokach L. Ensemblebased classifiers. Artif Intell Rev. 2010;33(1–2):1–39.
 36.
Schapire RE. A brief introduction to boosting. Ijcai. 1999;99:1401–6.
 37.
Breiman L. Bagging predictors. Mach Learn. 1996;24(2):123–40.
 38.
Galar M, Fernandez A, Barrenechea E, Bustince H, Herrera F. A review on ensembles for the class imbalance problem: bagging, boosting, and hybridbased approaches. IEEE Trans Syst Man Cybern. 2011;42(4):463–84.
 39.
Zhang Z, Krawczyk B, García S, Rosales-Pérez A, Herrera F. Empowering one-vs-one decomposition with ensemble learning for multi-class imbalanced data. Knowl-Based Syst. 2016;106:251–63.
 40.
Krawczyk B. Combining onevsone decomposition and ensemble learning for multiclass imbalanced data. In: Proceedings of the 9th international conference on computer recognition systems CORES 2015. Cham: Springer; 2016. p. 27–36.
 41.
Feng W, Huang W, Ren J. Class imbalance ensemble learning based on the margin theory. Appl Sci. 2018;8(5):815.
 42.
Schapire RE, Singer Y. BoosTexter: a boosting-based system for text categorization. Mach Learn. 2000;39(2–3):135–68.
 43.
Freund Y, Schapire RE. A decisiontheoretic generalization of online learning and an application to boosting. In: European conference on computational learning theory. Heidelberg: Springer; 1995. p. 23–37.
 44.
Hastie T, Rosset S, Zhu J, Zou H. Multi-class AdaBoost. Stat Interface. 2009;2(3):349–60.
 45.
Friedman J, Hastie T, Tibshirani R. Additive logistic regression: a statistical view of boosting (with discussion and a rejoinder by the authors). Ann Stat. 2000;28(2):337–407.
 46.
Sun P, Reid MD, Zhou J. An improved multiclass LogitBoost using adaptive-one-vs-one. Mach Learn. 2014;97(3):295–326.
 47.
Li P. ABC-LogitBoost for multi-class classification. arXiv preprint: arXiv:0908.4144. 2009.
 48.
Sun P, Reid MD, Zhou J. AOSO-LogitBoost: adaptive one-vs-one LogitBoost for multi-class problem. arXiv preprint: arXiv:1110.3907. 2011.
 49.
Friedman JH. Greedy function approximation: a gradient boosting machine. Ann Stat. 2001;1:1189–232.
 50.
Prokhorenkova L, Gusev G, Vorobev A, Dorogush AV, Gulin A. CatBoost: unbiased boosting with categorical features. In: Advances in neural information processing systems. 2018. p. 6638–48.
 51.
Chen T, Guestrin C. Xgboost: A scalable tree boosting system. In: Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining; 2016. p. 785–94.
 52.
Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, Ye Q, Liu TY. Lightgbm: A highly efficient gradient boosting decision tree. In: Advances in neural information processing systems; 2017. p. 3146–54.
 53.
Chawla NV, Lazarevic A, Hall LO, Bowyer KW. SMOTEBoost: Improving prediction of the minority class in boosting. In: European conference on principles of data mining and knowledge discovery. Springer: Berlin; 2003. p. 107–19
 54.
Seiffert C, Khoshgoftaar TM, Van Hulse J, Napolitano A. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Trans Syst Man Cybern Syst Hum. 2009;40(1):185–97.
 55.
Rayhan F, Ahmed S, Mahbub A, Jani MR, Shatabda S, Farid DM, Rahman CM. MEBoost: mixing estimators with boosting for imbalanced data classification. In: 2017 11th international conference on software, knowledge, information management and applications (SKIMA); 2017. p. 1–6.
 56.
Sun Y, Kamel MS, Wong AK, Wang Y. Costsensitive boosting for classification of imbalanced data. Pattern Recogn. 2007;40(12):3358–78.
 57.
Fan W, Stolfo SJ, Zhang J, Chan PK. AdaCost: misclassification costsensitive boosting. Icml. 1999;99:97–105.
 58.
Ting KM. A comparative study of costsensitive boosting algorithms. In: Proceedings of the 17th international conference on machine learning. 2000.
 59.
Domingo C, Watanabe O. MadaBoost: A modification of AdaBoost. In: COLT; 2000. p. 180–9.
 60.
Joshi MV, Agarwal RC, Kumar V. Predicting rare classes: can boosting make any weak learner strong? In: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining; 2002. p. 297–306.
 61.
Joshi MV, Kumar V, Agarwal RC. Evaluating boosting algorithms to classify rare classes: comparison and improvements. In: Proceedings 2001 IEEE international conference on data mining; 2001. p. 257–64.
 62.
Vezhnevets A, Vezhnevets V. Modest AdaBoost: teaching AdaBoost to generalize better. Graphicon. 2005;12(5):987–97.
 63.
Mease D, Wyner A, Buja A. Cost-weighted boosting with jittering and over/under-sampling: JOUS-Boost. J Mach Learn Res. 2007;8:409–39.
 64.
Jin X, Hou X, Liu CL. Multiclass AdaBoost with hypothesis margin. In: 2010 20th international conference on pattern recognition. 2010. p. 65–8.
 65.
Chen S, He H, Garcia EA. RAMOBoost: ranked minority oversampling in boosting. IEEE Trans Neural Netw. 2010;21(10):1624–42.
 66.
Saberian MJ, Vasconcelos N. Multiclass boosting: theory and algorithms. In: Advances in neural information processing systems; 2011. p. 2124–32.
 67.
Galar M, Fernández A, Barrenechea E, Herrera F. EUSBoost: enhancing ensembles for highly imbalanced datasets by evolutionary undersampling. Pattern Recogn. 2013;46(12):3460–71.
 68.
Díez-Pastor JF, Rodríguez JJ, García-Osorio C, Kuncheva LI. Random balance: ensembles of variable priors classifiers for imbalanced data. Knowl-Based Syst. 2015;85:96–111.
 69.
Ahmed S, Rayhan F, Mahbub A, Jani MR, Shatabda S, Farid DM. LIUBoost: locality informed underboosting for imbalanced data classification. In: Emerging technologies in data mining and information security. Singapore: Springer; 2019. p. 133–44.
 70.
Kumar S, Biswas SK, Devi D. TLUSBoost algorithm: a boosting solution for class imbalance problem. Soft Comput. 2019;23(21):10755–67.
 71.
Deng X, Liu Q, Deng Y, Mahadevan S. An improved method to construct basic probability assignment based on the confusion matrix for classification problem. Inf Sci. 2016;340:250–61.
 72.
Chicco D, Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics. 2020;21(1):6.
 73.
Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 2015;10(3):e0118432.
 74.
Halimu C, Kasem A, Newaz SS. Empirical Comparison of Area under ROC curve (AUC) and Mathew Correlation Coefficient (MCC) for evaluating machine learning algorithms on imbalanced datasets for binary classification. In: Proceedings of the 3rd international conference on machine learning and soft computing; 2019. p. 1–6.
 75.
Rahman MS, Rahman MK, Kaykobad M, Rahman MS. isGPT: an optimized model to identify sub-Golgi protein types using SVM and Random Forest based feature selection. Artif Intell Med. 2018;84:90–100.
 76.
Jurman G, Riccadonna S, Furlanello C. A comparison of MCC and CEN error measures in multi-class prediction. PLoS ONE. 2012;7(8):e41882.
 77.
Zhang ZL, Luo XG, García S, Tang JF, Herrera F. Exploring the effectiveness of dynamic ensemble selection in the one-versus-one scheme. Knowl-Based Syst. 2017;125:53–63.
 78.
Singh PK, Sarkar R, Nasipuri M. Significance of nonparametric statistical tests for comparison of classifiers over multiple datasets. Int J Comput Sci Math. 2016;7(5):410–42.
 79.
Demšar J. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res. 2006;7:1–30.
 80.
Wilcoxon F, Katti SK, Wilcox RA. Critical values and probability levels for the Wilcoxon rank sum test and the Wilcoxon signed rank test. Selected Tables Math Stat. 1970;1:171–259.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Contributions
YA: conceptualization, methodology, software, writing, designing the experiment. NS: software, writing, designing the experiment. NR: writing, designing the experiment under the supervision of JT and MA as academic supervisors. All authors read and approved the final manuscript.
Ethics declarations
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Tanha, J., Abdi, Y., Samadi, N. et al. Boosting methods for multiclass imbalanced data classification: an experimental review. J Big Data 7, 70 (2020). https://doi.org/10.1186/s40537-020-00349-y
Keywords
 Boosting algorithms
 Imbalanced data
 Multiclass classification
 Ensemble learning