CatBoost for big data: an interdisciplinary review

Hancock, John T.; Khoshgoftaar, Taghi M.

doi:10.1186/s40537-020-00369-8

Survey Paper
Open access
Published: 04 November 2020

Journal of Big Data volume 7, Article number: 94 (2020)

Abstract

Gradient Boosted Decision Trees (GBDT’s) are a powerful tool for classification and regression tasks in Big Data. Researchers should be familiar with the strengths and weaknesses of current implementations of GBDT’s in order to use them effectively and make successful contributions. CatBoost is a member of the family of GBDT machine learning ensemble techniques. Since its debut in late 2018, researchers have successfully used CatBoost for machine learning studies involving Big Data. We take this opportunity to review recent research on CatBoost as it relates to Big Data, and learn best practices from studies that cast CatBoost in a positive light, as well as studies where CatBoost does not outshine other techniques, since we can learn lessons from both types of scenarios. Furthermore, as a Decision Tree based algorithm, CatBoost is well-suited to machine learning tasks involving categorical, heterogeneous data. Recent work across multiple disciplines illustrates CatBoost’s effectiveness and shortcomings in classification and regression tasks. Another important issue we expose in literature on CatBoost is its sensitivity to hyper-parameters and the importance of hyper-parameter tuning. One contribution we make is to take an interdisciplinary approach to cover studies related to CatBoost in a single work. This provides researchers an in-depth understanding to help clarify proper application of CatBoost in solving problems. To the best of our knowledge, this is the first survey that studies all works related to CatBoost in a single publication.

Introduction

Modeling a system with regression or classification is a common way to scientifically investigate phenomena. Since Supervised Machine Learning ($\text {ML}$) [1] provides a way to automatically create regression and classification models from labeled datasets, researchers use Supervised $\text {ML}$ to model all sorts of phenomena in various fields. Hence, it is vital to stay informed on the supervised $\text {ML}$ techniques practitioners currently use to achieve success. This is the first study that takes an interdisciplinary approach to reveal the emerging body of literature that shows CatBoost is an effective tool for supervised $\text {ML}$.

CatBoost is an open source, Gradient Boosted Decision Tree (GBDT) implementation for Supervised $\text {ML}$ bringing two innovations: Ordered Target Statistics and Ordered Boosting. We cover these innovations in detail in the "CatBoost Gradient Boosted Trees Implementation" section. In the seminal paper on CatBoost, “Catboost: unbiased boosting with categorical features” [2], Prokhorenkova et al. recommend using $\text {GBDT}$ algorithms with heterogeneous data. They write, “For many years, it [gradient boosting] has remained the primary method for learning problems with heterogeneous features, noisy data, and complex dependencies: web search, recommendation systems, weather forecasting, and many others...” Heterogeneous datasets contain features with different data types. Tables in relational databases are often heterogeneous. The opposite of heterogeneous data is homogeneous data, that is, data that is all the same type. For example, a dataset of features composed of pixel color intensity values is homogeneous. Such data may be multidimensional, but the components of each dimension are all the same type of data. Some works we survey here give empirical evidence for Prokhorenkova et al.’s claim that $\text {GBDT}$ algorithms yield better performance than other $\text {ML}$ algorithms on tasks for heterogeneous data. Other works we review show that $\text {GBDT}$ algorithms tend not to do as well as $\text {ML}$ alternatives such as neural networks on tasks involving homogeneous data. However, applying neural networks to heterogeneous data [3, 4] is an active area of research. Therefore, researchers should give consideration to the nature of the data they intend to use for $\text {ML}$ implementations. It may be a mistake to consider only $\text {GBDT}$ algorithms if the data is homogeneous, and it may also be a mistake to ignore $\text {GBDT}$ algorithms if the data is heterogeneous.

In the interdisciplinary segment, we provide examples of experiments that will guide the reader in avoiding these mistakes. However, we feel the concept is important enough to merit immediate coverage here. Matsusaka et al., in “Prediction model of aryl hydrocarbon receptor activation by a novel qsar approach, deepsnap–deep learning”, compare the performance of Gradient Boosted $\text {ML}$ algorithms to deep learning algorithms [5]. In their study, the authors report on the results of applying these algorithms to digital image data, that is, homogeneous data. The authors document that a deep learning algorithm gives better performance in terms of Area Under the Receiver Operating Characteristic Curve ($\text {AUC}$) and accuracy. This is not surprising to us, since Matsusaka et al. are evaluating the performance of these algorithms on homogeneous data. Matsusaka et al.’s results serve as a reminder to researchers applying $\text {ML}$ algorithms to homogeneous data that gradient boosted algorithms may not be the best choice. Below, we cover multiple studies that confirm the same idea: CatBoost is a good solution for problems involving heterogeneous data, but may not be the optimal learner for problems involving homogeneous data. To put it succinctly, we find CatBoost is best suited to heterogeneous data.

Apart from the degree of heterogeneity of one’s data, a researcher working with Big Data [6, pp. 12–13] must also consider the time complexity of $\text {ML}$ algorithms. When working with large datasets, small differences in the time required to execute high frequency operations can result in large differences in the total time required to conduct experiments. Three studies we cover in detail, Prokhorenkova et al. [2], Spadon et al. [7] and Anghel et al. [8], show mixed results on the training time consumption of CatBoost and XGBoost [9]. We believe this is due to differences in the hyper-parameters that the authors use to configure the learning algorithms. We also cover scenarios that show where researchers may trade running time for accuracy by using CatBoost or an alternative. Overall, we find mixed results for the running time of CatBoost versus other learners, which we hypothesize are rooted in CatBoost’s sensitivity to hyper-parameter settings.

We find one study that highlights CatBoost’s sensitivity to hyper-parameter settings, and that may shed some light on the discrepancies in the training time performance of CatBoost and other learners that we discover later in this review. This study is “Benchmarking and optimization of gradient boosting decision tree algorithms” by Anghel et al. [8]. In this study the authors document training time and accuracy for CatBoost, XGBoost, and LightGBM [10] on four benchmarking tasks involving large datasets. Figure 1, copied from [8, Fig. 2], contains plots of training times versus maximum validation scores during hyper-parameter optimization. It shows how training times vary widely as Anghel et al. change the algorithms’ hyper-parameters during optimization. We find Panel b interesting. This is where the authors report the results for the algorithms on the Epsilon benchmark (Footnote 1). On the left side of Panel b, we see that for some hyper-parameter configurations, CatBoost yields a maximum validation score sometime between 10 and 100 min, but for other configurations, it takes nearly 1000 min. In [2], Prokhorenkova et al. compare the running time of CatBoost and XGBoost on a task involving the Epsilon dataset. However, XGBoost is missing from Panel b of Fig. 1. Anghel et al. report that they were unable to run XGBoost on the Epsilon benchmark due to memory constraints. That impediment to running XGBoost is an indication that, under the methods of their experiment, XGBoost consumes more memory than CatBoost for the same task. We include this result to emphasize that one may find it necessary to adjust CatBoost’s hyper-parameter settings in order to obtain the best results for CatBoost in terms of training time.

The application of GBDT algorithms for classification and regression tasks to many types of Big Data is well studied [11,12,13]. To the best of our knowledge, this is the first survey specifically dedicated to the CatBoost implementation of $\text {GBDT}$’s. Since its debut at the December 2018 Advances in Neural Information Processing Systems ($\text {NIPS}$) conference [2], researchers have conducted many experiments involving CatBoost. A number of these studies involve either Big Data or techniques that will scale to Big Data. Hence, it is time for a review of these studies from a Big Data perspective. Researchers who work in Big Data environments often do so with a particular distributed framework, such as Apache Spark [14]. Some of these frameworks include GBDT implementations. For example, Spark MLlib’s GradientBoostedTrees module [15] is one such implementation. For examples of GBDT applications in Spark please see [16] and [11]. However, as long as the distributed framework supports a language for which the Gradient Boosted Decision Tree implementation has an application programming interface ($\text {API}$), it is possible to use that implementation in the framework, thus freeing the user to select the most appealing GBDT implementation available. For researchers wishing to employ CatBoost with very large datasets, one viable approach is to fit a CatBoost model to a representative sample using the CatBoost Python API, then apply the fitted model to the larger dataset using a distributed framework such as Spark or Hadoop [17] with CatBoost’s Java API. We provide this one example to show that applying CatBoost to large datasets with popular distributed frameworks is feasible. However, we recognize that there exists a multitude of distributed frameworks suitable for Big Data that, in turn, support a myriad of programming languages. So, there should be many more valid approaches to applying trained CatBoost models to Big Data.

Researchers in disparate domains find applications for CatBoost. We find works in the fields of Astronomy [18], Finance [19,20,21,22], Medicine [23,24,25,26], Biology [27, 28], Electrical Utilities Fraud [29,30,31], Meteorology [32, 33], Psychology [34, 35], Traffic Engineering [7, 36], Cyber-security [37], Bio-chemistry [5, 38], and Marketing [39]. Therefore, a good understanding of CatBoost may provide one the opportunity to participate in interdisciplinary research. Our third finding is that the wide range of subjects where CatBoost is applicable is evidence that it is a general-purpose algorithm that researchers would do well to understand. On the other hand, as the works we survey demonstrate, CatBoost works better in some situations than others. We take an interdisciplinary approach to study different subject areas where researchers use CatBoost. For each of the subject areas we list here, we provide a section that details how researchers use CatBoost in that specific domain.

Before we cover applications of CatBoost in various domains, we discuss our search method, we cover related works, and then provide an overview of the GBDT ensemble technique, and the CatBoost implementation of GBDT’s. We touch on another GBDT implementation, LightGBM [10]. Like CatBoost, LightGBM has built-in support for encoding categorical variables. XGBoost is another GBDT implementation without built-in support for categorical features, so we choose not to give details on it. First of all, we provide details on the method we use to discover articles we cover.

Search method

We used our University library database (OneSearch), Google Scholar [40], and the Web of Science [41] databases to search for the term “CatBoost.” We obtained results with 278 articles from OneSearch, 25 articles from Web of Science, and the first 100 results from Google Scholar. We then conducted a manual review of the 403 records resulting from the search. During the manual review we retained only the studies related to CatBoost and its applications. We do not include works where the authors mention CatBoost, but do not employ it in any experiment. We do not limit our search results to any specific subject area.

Related work

In order to find related work, we reviewed all studies retrieved using the search method detailed in the previous section, looking specifically for surveys on CatBoost. We did not find such a study. To the best of our knowledge, this is the first review that focuses exclusively on research involving the CatBoost implementation of GBDT’s. We therefore expanded our search to surveys on Gradient Boosted techniques, and found two related studies.

Prior to the introduction of CatBoost, Sagi and Rokach published “Ensemble learning: a survey” [42]. This work is broader in scope and covers ensemble methods in general. It was published in 2018, and includes a discussion of Gradient Boosted Decision Tree algorithms, but not CatBoost.

Another related work is “A survey of classification techniques in data mining” by Sujatha and Prabhakar [43]. This study also covers a broader range of $\text {ML}$ algorithms than what we cover here. Sujatha and Prabhakar published this study in 2013, prior to the release of CatBoost. Furthermore, it does not provide the depth of detail on GBDT algorithms that we go into here.

The absence of a survey of research where CatBoost is used, and the abundance of recent work involving CatBoost, indicate to us that a survey of these works is timely. A thorough understanding of GBDT’s and CatBoost is necessary before one delves into the different ways researchers apply CatBoost in various fields. Therefore, we continue with a review of GBDT’s and the CatBoost implementation of GBDT’s. After that, we conduct the interdisciplinary review, grouping coverage of works by field. From this perspective, one may see how to apply CatBoost given a problem in the same domain.

Gradient Boosted Decision Trees

Jerome H. Friedman describes the Gradient Boosting $\text {ML}$ technique in the study titled “Greedy function approximation: a gradient boosting machine” [44]. Since it is a supervised $\text {ML}$ technique, we begin with a set $\left\{ {\mathbf {x}}_i, y_i\right\}$ of input values ${\mathbf {x}}_i$ and expected output values ${y_i}$, $i \in \left\{ 1 \ldots n\right\}$. Gradient boosting takes the approach of iteratively constructing a collection of functions $F^0, F^1, \ldots , F^t, \ldots , F^m$, given a loss function ${\mathcal {L}}\left( y_i, F^t\right)$. Here we would like to emphasize that ${\mathcal {L}}$ has two input values, the ith expected output value $y_i$, and the tth function $F^t$ that estimates $y_i$. Assuming we have constructed function $F^t$, we can improve our estimates of $y_i$ by finding another function $F^{t+1} = F^t + h^{t+1}\left( {\mathbf {x}}\right)$ such that $h^{t+1}$ minimizes the expected value of the loss function. That is,

$$\begin{aligned} h^{t+1} = \underset{h \in H}{\mathrm {argmin}} {\mathbb {E}} {\mathcal {L}}\left( y, F^t + h\right) . \end{aligned}$$

(1)

Where H is the set of candidate Decision Trees we are evaluating to choose one to add to the ensemble. Furthermore, by the definition of $F^{t+1}$, we can write the expected value of the loss function ${\mathcal {L}}$ in terms of $F^t$ and $h^{t+1}$:

$$\begin{aligned} {\mathbb {E}} {\mathcal {L}}\left( y, F^{t+1}\right) = {\mathbb {E}} {\mathcal {L}}\left( y, F^{t}+ h^{t+1} \right) \end{aligned}$$

(2)

One may notice that the right-hand side of Eq. (2) implies we wish to minimize the loss function’s value on y and $F^{t}$ plus something. If we assume ${\mathcal {L}}$ is continuous and differentiable, we can add something related to the rate of change of ${\mathcal {L}}$ to $F^t$ to shift its value in the direction in which ${\mathcal {L}}$ is decreasing. Therefore, if we set $h^{t+1}$ to values in the direction in which ${\mathcal {L}}$ decreases the fastest, that is, the direction of the negative gradient of ${\mathcal {L}}$ with respect to $F^t$, we would have the $h^{t+1}$ that approximately minimizes ${\mathbb {E}} {\mathcal {L}}\left( y, F^t+h^{t+1}\right)$. Writing $\frac{\partial {\mathcal {L}} y}{\partial F^t}$ for the partial derivative of ${\mathcal {L}}\left( y, F^t\right)$ with respect to $F^t$, we can then, under these assumptions, write a reasonable approximation for $h^{t+1}$,

$$\begin{aligned} h^{t+1} \approx \underset{h \in H}{\mathrm {argmin}} {\mathbb {E}}\left( -\frac{\partial {\mathcal {L}} y}{\partial F^t} - h \right) ^2. \end{aligned}$$

(3)

We refer to this technique as Gradient Boosting because we use the partial derivatives (gradients) of the loss function ${\mathcal {L}}$ with respect to the function $F^t$ to find $h^{t+1}$. Prokhorenkova et al. [2] point out that we may not have an easy way to compute the expected value in Approximation (3). This could be because it would be difficult, in general, to say what the probability of specific values of $\frac{\partial {\mathcal {L}} y}{\partial F^t}$ should be, and we may not know what $F^t$ should be because we could be using stochastic techniques, such as some algorithm to construct a Decision Tree to define $F^t$. However, we can assume, as Prokhorenkova et al. suggest, that the average over the n examples in the dataset approximates the expected value:

$$\begin{aligned} \underset{h \in H}{\mathrm {argmin}} {\mathbb {E}}\left( -\frac{\partial {\mathcal {L}} y}{\partial F^t} - h \right) ^2 \approx \underset{h \in H}{\mathrm {argmin}} \frac{1}{n}\sum _{i=1}^{n}\left( -\frac{\partial {\mathcal {L}} y_i}{\partial F^t} - h\left( {\mathbf {x}}_i\right) \right) ^2. \end{aligned}$$

(4)

Although we are covering Friedman’s Gradient Boosting Decision Trees technique in this section, we use this reference to Prokhorenkova et al. in our explanation, since our ultimate goal is to provide the reader a clear understanding of CatBoost.

We can combine Approximations (3) and (4) to obtain a concrete estimate for $h^{t+1}$:

$$\begin{aligned} h^{t+1} \approx \underset{h \in H}{\mathrm {argmin}} \frac{1}{n}\sum _{i=1}^{n}\left( -\frac{\partial {\mathcal {L}} y_i}{\partial F^t} - h\left( {\mathbf {x}}_i\right) \right) ^2. \end{aligned}$$

(5)

For GBDT’s the base case $F^0$ is a Decision Tree, and the $h^1, h^2, \ldots , h^t, \ldots , h^m$ are also Decision Trees. When we add a Decision Tree to construct $F^{t+1}$ in this manner, the expected value of the loss function ${\mathbb {E}} {\mathcal {L}}\left( y, F^{t+1}\left( {\mathbf {x}}\right) \right)$ shrinks, implying that the estimates $F^{t+1}\left( {\mathbf {x}}_i\right)$ are better than the estimates $F^{t}\left( {\mathbf {x}}_i\right)$. CatBoost, as well as the other currently popular GBDT techniques XGBoost and LightGBM, makes refinements to the Gradient Boosting technique Friedman describes in [44]. Researchers who have a good understanding of how the GBDT technique works have a better chance of successfully applying it in any discipline. Similarly, researchers who know how CatBoost carries out the GBDT technique are better equipped to employ it in any domain. Therefore, we provide details on CatBoost in the next section.
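To make the iteration concrete, the following is a minimal sketch of gradient boosting for one numeric feature, written in plain Python. It rests on several assumptions that are not taken from the works cited above: the loss is the squared error $\frac{1}{2}\left( y - F\right) ^2$, whose negative gradient with respect to $F$ is simply the residual $y - F$; the base case $F^0$ is a constant (the mean label) rather than a tree; each $h^{t+1}$ is a one-split Decision Tree (a stump); and a learning rate scales each added tree. The names fit_stump and gradient_boost are hypothetical.

```python
from statistics import mean


def fit_stump(xs, targets):
    """Fit a one-split regression tree to targets by least squares."""
    best = None
    # Candidate thresholds: every distinct value except the largest,
    # so both sides of the split are always non-empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [t for x, t in zip(xs, targets) if x <= threshold]
        right = [t for x, t in zip(xs, targets) if x > threshold]
        l_val, r_val = mean(left), mean(right)
        err = (sum((t - l_val) ** 2 for t in left)
               + sum((t - r_val) ** 2 for t in right))
        if best is None or err < best[0]:
            best = (err, threshold, l_val, r_val)
    _, threshold, l_val, r_val = best
    return lambda x: l_val if x <= threshold else r_val


def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    """Build F^m = F^0 + learning_rate * (h^1 + ... + h^m)."""
    f0 = mean(ys)                      # assumed constant base case F^0
    preds = [f0] * len(xs)
    trees, losses = [], []
    for _ in range(rounds):
        # Negative gradient of 0.5 * (y - F)^2 with respect to F.
        residuals = [y - p for y, p in zip(ys, preds)]
        h = fit_stump(xs, residuals)   # empirical least-squares fit
        trees.append(h)
        preds = [p + learning_rate * h(x) for p, x in zip(preds, xs)]
        losses.append(mean([(y - p) ** 2 for y, p in zip(ys, preds)]))

    def predict(x):
        return f0 + learning_rate * sum(h(x) for h in trees)

    return predict, losses


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 3.1, 2.9, 3.0, 5.2, 4.8]
predict, losses = gradient_boost(xs, ys)
print(losses[0], losses[-1])  # training loss after the first and last round
```

In this toy setting the training loss cannot increase from one round to the next, which mirrors the statement above that the estimates of $F^{t+1}$ improve on those of $F^{t}$.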

CatBoost Gradient Boosted Trees Implementation

In [2], Prokhorenkova et al. propose the CatBoost algorithm, and compare it with XGBoost and LightGBM. In their description of the CatBoost learner, they cover their refinements to the GBDT algorithm Friedman describes in [44]. Here we cover these refinements and some related hyper-parameters that users should be aware of since the related hyper-parameters’ values may also affect the resources CatBoost consumes.

CatBoost’s first refinement to Gradient Boosting is the manner in which it deals with high cardinality categorical variables. For low cardinality categorical variables, CatBoost uses one-hot encoding. The precise definition of low cardinality depends on the computing environment and whether the user is employing CatBoost in any specialized modes. In the current version of CatBoost at the time of this writing, version 0.23.2, the threshold for one-hot encoding has a default value of 255 under some conditions when running on GPU’s, and 2 when running on CPU’s, provided certain other specific conditions are not met. This is an obvious, yet non-trivial example of CatBoost’s sensitivity to hyper-parameters. One may obtain different results in terms of running time and other performance metrics, since changing the hyper-parameter that selects the type of processor alters not only the hardware CatBoost will use, but also, through this default threshold, the manner in which it will encode categorical features. We refer the reader to the CatBoost API documentation (Footnote 2) for further details on how CatBoost sets the threshold for one-hot encoding.
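The effect of the threshold can be illustrated with a small sketch. It assumes the rule given in the CatBoost documentation for the one_hot_max_size parameter, namely that a categorical feature is one-hot encoded when its number of distinct values is less than or equal to the threshold. The helper names choose_encoding and one_hot, and the example columns, are hypothetical and are not part of CatBoost.

```python
def choose_encoding(column, one_hot_max_size):
    """Toy rule: one-hot encode only features with few distinct values."""
    n_distinct = len(set(column))
    if n_distinct <= one_hot_max_size:
        return "one-hot"
    return "target-statistic"


def one_hot(column):
    """One indicator column per distinct category, in sorted order."""
    categories = sorted(set(column))
    return [[1 if value == c else 0 for c in categories] for value in column]


colour = ["red", "blue", "red", "blue"]   # 2 distinct values
city = ["Oslo", "Lima", "Kyiv", "Pune"]   # 4 distinct values

# The two default thresholds mentioned in the text: 2 and 255.
for threshold in (2, 255):
    print(threshold,
          choose_encoding(colour, threshold),
          choose_encoding(city, threshold))

print(one_hot(colour))
```

With a threshold of 2 only the two-valued feature is one-hot encoded, while with a threshold of 255 both features are, so the same dataset can be encoded differently depending on which default applies.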

In [2], Prokhorenkova et al. use the term “Ordered Target Statistic” (Ordered TS) to refer to the technique CatBoost uses for encoding categorical variables when CatBoost is not using one-hot encoding. Micci-Barreca introduces target statistics in “A preprocessing scheme for high-cardinality categorical attributes in classification and prediction problems” [45]. A target statistic is a value we calculate from the ground truth output values associated with particular values of a categorical attribute in a dataset. One strategy for dealing with categorical variables in $\text {ML}$ is to replace the categorical values of a feature with a target statistic.

The most important concept in the Ordered TS calculation is rooted in the distinction between training and test datasets. Let ${\mathcal {D}}$ be the set of all data available to train and evaluate our GBDT ensemble. The Decision Tree $h^{t+1}$ we add to the ensemble is the Decision Tree that minimizes the expected value of the loss function ${\mathbb {E}} {\mathcal {L}}$. We wish to use some data in ${\mathcal {D}}$ for fitting the Decision Tree $h^{t+1}$, and some data for finding the $h^{t+1}$ that minimizes ${\mathbb {E}} {\mathcal {L}}$. Our motivation for using the data in ${\mathcal {D}}$ in this way is to avoid what Prokhorenkova et al. define as “target leakage” [2]. We explain more on target leakage below; however, we finish our description of CatBoost’s encoding technique first. The way CatBoost chooses the data to use for fitting $h^{t+1}$ is to place an arbitrary order on the elements of ${\mathcal {D}}$ with a random permutation $\sigma$. Let $\sigma \left( k\right)$ be the kth element of ${\mathcal {D}}$ under $\sigma$, and ${\mathcal {D}}_k = \left\{ {\mathbf {x}}_1, {\mathbf {x}}_2, \ldots , {\mathbf {x}}_{k-1} \right\}$, ordered by the random permutation $\sigma$. CatBoost uses ${\mathcal {D}}_k$ as the data for fitting the Decision Tree $h^{t+1}$, and ${\mathcal {D}}$ as the data for evaluating whether $h^{t+1}$ is the Decision Tree that minimizes ${\mathbb {E}} {\mathcal {L}}\left( y, F^{t}+ h^{t+1} \right)$. The meaning of the notation ${\mathcal {D}}_k$ is the first important concept for understanding how CatBoost encodes the values of categorical variables.

The second important concept for understanding how CatBoost encodes the values of categorical variables is the indicator function $\mathbb {1}$. The indicator function $\mathbb {1}_{a=b}$ is a function of one variable a that has the value 1 when $a=b$, and 0 otherwise. The indicator function plays an important role in the formula CatBoost applies to map the values of a categorical feature to a numerical value. Specifically, this formula involves the indicator function $\mathbb {1}_{x_j^i=x_k^i}$. This indicator function takes the value 1 when the ith component of CatBoost’s input vector ${\mathbf {x}}_j$ is equal to the ith component of the input vector ${\mathbf {x}}_k$. Here we use k as in the kth element according to the order we put on ${\mathcal {D}}$ with the random permutation $\sigma$, j takes on the integer values 1 through $k-1$, and i identifies the categorical feature.

Understanding these key concepts of the subset ${\mathcal {D}}_k$ of the data and the indicator function $\mathbb {1}_{x_j^i=x_k^i}$ enables us to define the formula for the encoded value, ${\hat{x}}_k^i$, of the ith categorical variable of the kth element of ${\mathcal {D}}$ as:

$$\begin{aligned} {\hat{x}}_k^i = \frac{\sum _{x_j \in {\mathcal {D}}_k} \mathbb {1}_{x_j^i = x_k^i} \cdot y_j + a p}{\sum _{x_j \in {\mathcal {D}}_k} \mathbb {1}_{x_j^i = x_k^i} + a}. \end{aligned}$$

(6)

Prokhorenkova et al. define p as a prior commonly set to the average value of the label in the dataset, and a as a parameter greater than 0. We do not see a clear suggestion for the value of a [2]. However, one can see that setting a to a value greater than 0 in Eq. (6) ensures we will not divide by 0 in the case that none of the values $x_j^i$ equal $x_k^i$. Also, in that case, any value $a > 0$ guarantees ${\hat{x}}_k^i$ gets the value p.

CatBoost applies Eq. (6) when fitting the Decision Tree $h^{t+1}$, but uses a variation of it when evaluating $h^{t+1}$ to determine if it is the Decision Tree that minimizes ${\mathbb {E}} {\mathcal {L}}\left( y, F^{t}+ h^{t+1} \right)$. The variation on Eq. (6) is that instead of using the subset ${\mathcal {D}}_k$, it uses the entire set ${\mathcal {D}}$.
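The calculation in Eq. (6) and its variation over the whole of ${\mathcal {D}}$ can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not CatBoost’s implementation: it handles a single categorical feature, draws the permutation $\sigma$ with Python’s random module, and takes $a = 1$ and an explicit prior p as inputs. The function names ordered_target_statistic and full_target_statistic are hypothetical.

```python
import random


def ordered_target_statistic(categories, labels, a=1.0, prior=0.5, seed=0):
    """Encode each example from the labels that precede it under sigma."""
    order = list(range(len(categories)))
    random.Random(seed).shuffle(order)          # random permutation sigma
    sums, counts = {}, {}                       # running totals over D_k
    encoded = [0.0] * len(categories)
    for idx in order:
        c = categories[idx]
        # Eq. (6): only examples earlier in the permutation contribute,
        # so the example's own label is never used.
        encoded[idx] = (sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a)
        sums[c] = sums.get(c, 0.0) + labels[idx]
        counts[c] = counts.get(c, 0) + 1
    return encoded, order


def full_target_statistic(categories, labels, value, a=1.0, prior=0.5):
    """Variation of Eq. (6) that sums over the entire set D."""
    matches = [y for c, y in zip(categories, labels) if c == value]
    return (sum(matches) + a * prior) / (len(matches) + a)


cats = ["a", "b", "a", "a", "b", "c"]
ys = [1, 0, 1, 0, 1, 1]
encoded, order = ordered_target_statistic(cats, ys, a=1.0, prior=0.5, seed=1)
print(encoded)
print(full_target_statistic(cats, ys, "c"))   # (1 + 0.5) / (1 + 1) = 0.75
print(full_target_statistic(cats, ys, "z"))   # unseen category: the prior, 0.5
```

The first example in the permutation has an empty ${\mathcal {D}}_k$, so its encoded value is the prior p, which matches the observation above about the role of $a > 0$.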

Now that we have an understanding of how CatBoost encodes categorical variables, we can understand why it uses this technique. As we mention above, CatBoost encodes categorical values in this way in order to alleviate the problem of target leakage. Prokhorenkova et al. write that CatBoost avoids target leakage because the technique it uses for encoding categorical variables has a certain property, which they express in Eq. (7):

$$\begin{aligned} {\mathbb {E}} \left( {\hat{x}}^i |y=v \right) = {\mathbb {E}} \left( {\hat{x}}_k^i |y_k=v \right) . \end{aligned}$$

(7)

Interestingly, the way CatBoost’s encoding technique satisfies this property is to ensure we do not use the value $y_k$ in Eq. (6). Prokhorenkova et al. explain that if we use $y_k$ to encode features in ${\mathbf {x}}_k$ we create target leakage [2]. They define target leakage in terms of conditional shift. Noting that Eq. (7) involves conditional expectations, we see that if Eq. (7) does not hold, it means that the expected value of all encoded values for the ith feature given a specific output value v does not equal the expected value of the encoded value for a training example $\left( {\mathbf {x}}_k,y_k\right)$. In other words, when Eq. (7) does not hold, the expected encoded value ${\hat{x}}_k^i$ is shifted under the condition $y_k=v$. This is an overfitting condition in the sense that in the fitting process the model can exploit the correlation between ${\hat{x}}_k^i$ and $y_k$ during training, but the correlation will not exist during testing due to the difference in expected values when Eq. (7) does not hold. The way they suggest avoiding the shifting of the expected values under the conditions $y = v$ and $y_k = v$ is to exclude the value of $y_k$ from the computation of values for ${\hat{x}}^i$ when encoding the value $x_k^i$; hence, the definition of ${\mathcal {D}}_k$ above, and its role in computing the value of ${\hat{x}}_k^i$ in Eq. (6) above.
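A small, deliberately extreme example may help show what target leakage looks like. The sketch below assumes a categorical feature whose value is unique for every training example (such as an identifier), $a = 1$, and prior $p = 0.5$. It compares a target statistic that includes $y_k$ with an ordered one that excludes it. For readability the ordered version uses the order in which the examples are listed instead of a random permutation. The function names greedy_ts and ordered_ts are hypothetical.

```python
def greedy_ts(categories, labels, a, prior):
    """Target statistic that lets example k see its own label y_k."""
    out = []
    for c in categories:
        matches = [y for cj, y in zip(categories, labels) if cj == c]
        out.append((sum(matches) + a * prior) / (len(matches) + a))
    return out


def ordered_ts(categories, labels, a, prior):
    """Ordered variant: example k only sees labels listed before it."""
    out, sums, counts = [], {}, {}
    for c, y in zip(categories, labels):
        out.append((sums.get(c, 0.0) + a * prior) / (counts.get(c, 0) + a))
        sums[c] = sums.get(c, 0.0) + y
        counts[c] = counts.get(c, 0) + 1
    return out


ids = ["u1", "u2", "u3", "u4", "u5", "u6"]   # every category is unique
labels = [1, 0, 1, 1, 0, 0]

greedy = greedy_ts(ids, labels, a=1.0, prior=0.5)
ordered = ordered_ts(ids, labels, a=1.0, prior=0.5)

# A single split at 0.5 on the greedy encoding recovers every training label,
# although the identifier carries no information about unseen examples.
leaky_rule_accuracy = sum(
    (g > 0.5) == bool(y) for g, y in zip(greedy, labels)
) / len(labels)

print(greedy)               # [0.75, 0.25, 0.75, 0.75, 0.25, 0.25]
print(ordered)              # [0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
print(leaky_rule_accuracy)  # 1.0
```

An unseen identifier at test time would be encoded as the prior 0.5 in both schemes, so the perfect training split found on the greedy encoding would not carry over. That is the shift between training and testing that Eq. (7) rules out.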

The second property of the Ordered TS that Prokhorenkova et al. describe is that it eventually uses all training examples $\left( {\mathbf {x}}_k, y_k\right)$. This property ensures that after sufficient iterations, we have encoded categorical values with all the information available in the training data. This second property balances the overfitting protection of the first property, to ensure we are not underfitting, because we are using all the available training data.

The way Prokhorenkova et al. enforce this property is another refinement to Gradient Boosting that they call the Ordered Boosting technique. Target leakage not only causes a conditional shift in the expected value of an encoded variable, but also causes prediction shift in the expected value of the residuals we wish to minimize. To see why this is so, consider Approximation (5), and assume we are using CatBoost’s Ordered Target Statistic technique to encode some categorical variables to build the Gradient Boosted Decision Trees that constitute $F^{t+1}$. Then, because we are using Ordered Target Statistics to encode categorical variables, $\frac{\partial {\mathcal {L}} y}{\partial F^t}$ is also a random variable, because we use the random permutation $\sigma$ to choose the elements of ${\mathcal {D}}_k$ to encode categorical variables that influence the value of $F^t$. Therefore, the distribution of $\frac{\partial {\mathcal {L}} y}{\partial F^t}$ can be shifted under the condition that we calculated $\frac{\partial {\mathcal {L}} y}{\partial F^t}$ with a particular encoding for $x_k^i$. Prokhorenkova et al. explain that this conditional shift leads to bias in the estimate we make for $h^{t+1}$, and that this negatively impacts the metrics we obtain when evaluating $F^{t+1}$ on data we did not use at training time. Prokhorenkova et al. describe this as an impact on the generalization ability of $F^{t+1}$. To combat this impact on $F^{t+1}$’s generalization ability, Prokhorenkova et al. propose Ordered Boosting. The key concept in Ordered Boosting is to use the same examples in ${\mathcal {D}}_k$ that we use to compute the Ordered Target Statistics to compute the estimates for $h^{t+1}$, which means we must use them to compute the values of $\frac{\partial {\mathcal {L}} y}{\partial F^t}$.

The reader will recall that ${\mathcal {D}}_k = \left\{ {\mathbf {x}}_1, {\mathbf {x}}_2, \ldots , {\mathbf {x}}_{k-1}\right\}$ depends on where we are in iterating through the permutation $\sigma$ of the elements of ${\mathcal {D}}$. In other words, when k is small, ${\mathcal {D}}_k$ contains very few elements; for example, when $k=2$, ${\mathcal {D}}_k$ will have only one element in it. This means we will have a high variance in the values we estimate for $\frac{\partial {\mathcal {L}} y}{\partial F^t}$. So, in Ordered Boosting, CatBoost uses multiple, independent permutations $\sigma _1, \sigma _2, \ldots , \sigma _s$ of ${\mathcal {D}}$ to compute a number of sets of residual values that it can use to find $h^{t+1}$, to obtain $F^{t+1}$, and maintain the guarantee that none of the values of $x_k^i$ are used to compute the values of the gradients $\frac{\partial {\mathcal {L}} y}{\partial F^t}$. At the same time, using these multiple sets of residuals reduces the variance in CatBoost’s estimates of $\frac{\partial {\mathcal {L}} y}{\partial F^t}$. This is how Ordered Boosting avoids prediction shift.
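The ordering idea can be illustrated with a toy stand-in for the boosted model. In the sketch below, the "model" that predicts example k is simply the mean of the labels that precede k under a permutation, so the residual for example k never depends on $y_k$, and one set of residuals is produced per permutation. This is only an assumption-laden illustration of the ordering principle. It is not CatBoost’s Ordered Boosting procedure, which fits trees and maintains supporting models. The function name ordered_predictions and the seeds are hypothetical.

```python
import random


def ordered_predictions(ys, seed):
    """Predict each example from the labels preceding it under one permutation."""
    order = list(range(len(ys)))
    random.Random(seed).shuffle(order)      # one permutation sigma_r
    preds = [0.0] * len(ys)
    running_sum, seen = 0.0, 0
    for idx in order:
        # Stand-in model: mean of earlier labels (0.0 when none precede).
        preds[idx] = running_sum / seen if seen else 0.0
        running_sum += ys[idx]
        seen += 1
    return preds


ys = [3.0, 1.0, 4.0, 1.0, 5.0]
seeds = (0, 1, 2)                           # s = 3 independent permutations

# One set of residuals per permutation.
residual_sets = [
    [y - p for y, p in zip(ys, ordered_predictions(ys, seed))]
    for seed in seeds
]
for residuals in residual_sets:
    print(residuals)
```

Changing the label of any single example leaves the prediction made for that example unchanged under every permutation, which is the guarantee the ordering is meant to preserve.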

Another important concept in CatBoost’s process of building Decision Trees is Oblivious Decision Trees ($\text {ODT}$’s). CatBoost constructs an ensemble of ODT’s. ODT’s are full binary trees, so if the ODT has n levels of splits, it will have $2^n$ leaf nodes. Furthermore, all non-leaf nodes at the same level of the ODT will have the same splitting criterion. To assist the reader’s understanding, in Table 1, we include a diagram of an ODT from Lou and Obukhov, “Bdt: gradient boosted decision tables for high accuracy and scoring efficiency” [46]. According to Prokhorenkova et al., ODT’s “...are balanced, less prone to overfitting, and allow speeding up execution at testing time significantly” [2]. We see ODT’s are balanced by definition. Since they are full binary trees, the number of comparisons to reach a leaf node is the minimum number of comparisons needed to reach the maximum number of leaf nodes, so we agree that ODT’s may yield more efficient executions than deeper Decision Trees that are not completely filled. The trade-off is that one must be careful in setting the maximum tree depth in CatBoost, since the number of leaf nodes in every tree of the ensemble doubles with each unit of increase in the maximum tree depth, and the amount of memory CatBoost uses may grow accordingly. This is another example of CatBoost’s sensitivity to hyper-parameter settings that researchers should be aware of, since it can have an impact on the amount of memory and running time their experiments consume. Perhaps the differences in running time we see are rooted in improper values for this hyper-parameter.
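Because every level of an ODT applies one shared split, evaluating the tree amounts to building a binary index from one comparison per level and looking up a leaf value, which is the decision-table view the Table 1 caption refers to. The following sketch assumes numeric features and a greater-than comparison at each level. The function name odt_predict and the example splits and leaf values are hypothetical.

```python
def odt_predict(x, splits, leaf_values):
    """Evaluate an oblivious tree: one (feature, threshold) split per level."""
    index = 0
    for feature, threshold in splits:
        # Every node on this level asks the same question, so the answer
        # contributes exactly one bit to the leaf index.
        index = (index << 1) | (1 if x[feature] > threshold else 0)
    return leaf_values[index]


splits = [(0, 5.0), (1, 0.5)]            # depth 2: two shared splits
leaf_values = [10.0, 11.0, 20.0, 21.0]   # 2**2 leaves, acting as a table

print(odt_predict([3.0, 0.9], splits, leaf_values))  # bits 0,1 -> 11.0
print(odt_predict([7.0, 0.1], splits, leaf_values))  # bits 1,0 -> 20.0

# The number of leaves per tree doubles with each extra level of depth.
print([2 ** depth for depth in (6, 7, 8)])           # [64, 128, 256]
```

The number of comparisons equals the depth regardless of the input, while the size of the leaf table doubles with each added level, which is the memory trade-off noted above.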

Table 1 Oblivious Decision Tree example from Lou and Obukhov demonstrating a Decision Tree and Decision Table that provide equivalent logic [46]
