
Data-driven multinomial random forest: a new random forest variant with strong consistency

Abstract

In this paper, we strengthen the weak-consistency proofs of several previous random forest variants into strong-consistency proofs and improve the data utilization of these variants in order to obtain better theoretical properties and experimental performance. In addition, we propose the Data-driven Multinomial Random Forest (DMRF) algorithm, which has the same complexity as BreimanRF (proposed by Breiman) while satisfying strong consistency with probability 1. It performs better in classification and regression tasks than previous RF variants that only satisfy weak consistency, and in most cases it even surpasses BreimanRF in classification tasks. To the best of our knowledge, DMRF is currently a low-complexity, high-performing random forest variant that achieves strong consistency with probability 1.

Introduction

Random Forest (RF, also called standard RF or BreimanRF) [1] is an ensemble learning algorithm that makes classification or regression predictions by taking the majority vote or average of the results of multiple decision trees. Due to its simple and easy-to-understand nature, rapid training, and good performance, it is widely used in many fields, such as data mining [2,3,4], computer vision [5,6,7], ecology [8, 9], and bioinformatics [10].

Although RF has excellent performance in practical problems, analyzing its theoretical properties is quite difficult due to its highly data-dependent tree-building process. These theoretical properties include consistency, which can be weak or strong. Weak consistency means that the expected value of the algorithm's loss converges to the minimum possible value as the data size tends to infinity, while strong consistency means that the loss itself converges to the minimum value, with probability 1, as the data size tends to infinity [11]. Consistency is an important criterion for evaluating whether an algorithm is sound, especially in the era of big data.

Many researchers have made important contributions to the discussion of consistency-related issues in RF, proposing many weakly consistent variants of RF, such as Denil14 (also called Poisson RF) [12], Bernoulli RF (BRF) [13], and Multinomial RF (MRF) [14]. A common feature of these algorithms is that the selection of split points (a split point consists of a splitting feature and a splitting value) and the determination of the final leaf node labels during tree building are kept independent, i.e., part of the training set is used to train the split points and the remaining part to determine the leaf node labels, which to a large extent causes insufficient growth of the decision trees. In addition, the random distributions introduced in these algorithms, such as the Poisson, Bernoulli, and multinomial distributions, enhance their robustness but also have a certain impact on their performance.

In this paper, building on MRF and BRF, which are only weakly consistent, we propose a new variant of random forest called the Data-driven Multinomial Random Forest (DMRF) that is strongly consistent. The term "data-driven" does not mean that other RF variants do not depend on data, but rather indicates that DMRF makes more effective use of the data than the aforementioned weakly consistent variants. In DMRF, we incorporate a new bootstrapping scheme (slightly different from the standard bootstrapping in BreimanRF) that was not included in the previous variants, and we introduce a Bernoulli distribution when splitting nodes. This Bernoulli distribution determines whether the split point is obtained by the optimal splitting criterion or sampled from two multinomial distributions based on impurity reduction. The reason for introducing the multinomial distribution is that it performs random sampling while still assigning the largest probability to the optimal splitting feature and feature value [14].

Related work

BreimanRF [1], proposed by Breiman, is an ensemble algorithm based on the prediction results of multiple decision trees. It has shown satisfactory performance in practical applications. The basic process of BreimanRF can be divided into three steps: first, use bootstrapping to resample the dataset as many times as the size of the dataset to obtain the training set for each base tree; second, randomly sample a feature subspace of size \(\sqrt D\) without replacement from the entire feature space of size \(D\), and evaluate the importance of each feature and feature value in the subspace based on the reduction in impurity (e.g., information entropy or Gini index) to obtain the optimal splitting point; recursively repeat this process until the stopping condition is met, yielding a decision tree. Finally, repeat the above process to train multiple decision trees, and take the majority vote (for classification problems) or average (for regression problems) of their results to obtain the final prediction.
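For orientation, the standard BreimanRF pipeline described above is available in common libraries; below is a minimal sketch using scikit-learn (assuming it is installed; parameter names follow that library, not this paper).

```python
# Standard BreimanRF for reference: n_estimators trees, each grown on a bootstrap
# sample, with a sqrt(D)-sized feature subspace considered at every split.
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, max_features="sqrt", bootstrap=True)
# Typical usage on a dataset (X_train, y_train), (X_test):
# rf.fit(X_train, y_train); y_pred = rf.predict(X_test)
```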

Since the proposal of the RF model, many variants have been developed, such as Rotation Forest [15], Quantile RF [16], Skewed RF [17], Deep Forest [18], and Neural RF [19]. These variants are proposed to further enhance the interpretability, performance, and efficiency of the random forest. Although the practical research on RF has developed rapidly, the progress of its theoretical exploration is slightly lagging behind. Breiman proved that the performance of BreimanRF is jointly determined by the correlation between decision trees and the performance of decision trees, i.e., the greater the diversity and the better the performance of trees, the better the performance of RF [1].

An important breakthrough in the study of RF consistency was made by Biau et al. [20]. They proposed two simplified versions of BreimanRF: Purely Random Forest and Scale-Invariant Random Forest. Purely Random Forest randomly selects a feature and one of its feature values as the splitting feature and splitting value at each node. Scale-Invariant Random Forest also randomly selects a feature as the splitting feature at each node and randomly divides the samples into two parts according to the order of the values of that feature. They proved that both of these simplified versions are consistent.

Biau [21] analyzed another simplified RF model that is closer to BreimanRF and proved its weak consistency: a feature subspace is randomly selected at each node, and for each candidate feature the midpoint is taken as the splitting value; among the candidate features, the splitting feature and splitting value with the maximum reduction in impurity are selected to grow the tree.

Denil et al. [12] proposed a new RF variant, Denil14. Denil14 divides the training set into a structural part and an estimation part: the structural part is used only for training the split points, and the estimation part is used only to determine the labels of the leaf nodes. In addition, at each node, the size of the feature subspace is drawn from a Poisson distribution, and the optimal splitting feature and splitting value are searched for among \(m\) pre-selected structural-part samples. This variant has been proven to be weakly consistent. Denil14 can be used for classification.

Inspired by the Denil14 model, Yisen Wang et al. [13] proposed Bernoulli RF (BRF), based on the Bernoulli distribution. Similar to Denil14, BRF divides the dataset into a structural part and an estimation part, trains the split points using the structural part, and determines the leaf node labels using the estimation part. However, BRF introduces two Bernoulli distributions at each node: one to determine whether the feature subspace size is 1 or \(\sqrt D\), and the other to determine whether the splitting value of each candidate feature is selected randomly or by the optimal splitting criterion. They proved that BRF is also weakly consistent, but it is closer to BreimanRF and performs better than previous weakly consistent RF variants. BRF can be used for both classification and regression.

Jiawang Bai et al. [14] transformed the impurity reduction of each feature and feature value into probabilities through the softmax function, proposing the Multinomial Random Forest (MRF). MRF also divides the dataset into structural and estimation parts, using the structural part to train the split points and the estimation part to determine the leaf node labels. Training a split point involves two steps: first, the maximum impurity reduction of each feature at a node is computed and converted into a probability, which is treated as the parameter of a multinomial distribution from which the splitting feature is randomly selected; second, the impurity reduction of each feature value of the chosen splitting feature is converted into a probability and treated as the parameter of a multinomial distribution from which the splitting value is randomly selected. To determine a leaf node label, MRF treats the proportion of each class among the estimation-part samples in the leaf as a probability and randomly selects a class from the corresponding multinomial distribution. MRF selects splitting features and values more purposefully, with more reasonable probability allocations, and it currently has the best performance among the consistent RF variants, even surpassing BreimanRF. Its disadvantage is its high computational complexity, which leads to large computational cost. MRF is only used for classification.

Because Denil14, BRF, and MRF all make the training of split points and the determination of leaf node labels independent in order to achieve weak consistency, the performance of the base trees is inevitably affected, reducing the overall performance of the algorithm. Based on this observation, we propose the Data-driven Multinomial Random Forest (DMRF) algorithm, which can be used for both classification and regression problems. DMRF directly uses the samples that train the split points to determine the leaf node labels and introduces bootstrapping to increase the diversity between trees. Moreover, we strengthen the weak consistency of the variants mentioned above to strong consistency by modifying their conditions. We found that although the theoretical basis differs, the proof methods are quite similar to the earlier ones; that is, the methods used to prove weak consistency can be strengthened into methods that prove strong consistency, yielding a stronger conclusion.

The rest of this paper is organized as follows: Sect. 3 provides a detailed introduction to the classification and regression DMRF algorithms and proves their strong consistency; Sect. 4 describes the experimental setup; Sect. 5 presents the experimental results and analysis; Sect. 6 concludes the paper and outlines future work.

The proposed DMRF algorithm

Classification DMRF

Let \(D_{n} = \{ (X_{1} ,Y_{1} ),(X_{2} ,Y_{2} ),...,(X_{n} ,Y_{n} )\}\) denote a dataset, where \(X_{i} \in {\mathcal{R}}^{D}\) is a \(D\)-dimensional feature vector, \(Y_{i} \in \{ 1,2,...,c\}\) is the label, and \(i \in \{ 1,2,...,n\}\); we plan to build \(M\) trees.

Training sample sampling

In DMRF, we use a bootstrapping scheme slightly different from the standard one to sample the training set \(D_{n}^{(j)}\), \(j \in \{ 1,2,...,M\}\), for the \(j\)-th tree. Specifically, we do not resample all samples; instead, each sample is included independently with probability \(q\) (which may depend on \(n\); in this paper we choose a constant value in \((0,1]\)). That is, the inclusion of each sample follows a Bernoulli distribution \(\mathcal{B}(1, q)\), independently across samples. If the sampled training set is empty, we resample.
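The following minimal Python sketch illustrates this sampling scheme (the helper name `sample_training_set` is ours, not from the paper): each sample is kept independently with probability \(q\), and sampling is repeated if the result is empty.

```python
import numpy as np

def sample_training_set(n, q, rng=None):
    """Indices of one tree's training set: each of the n samples is included
    independently with probability q; re-draw if nothing was selected."""
    rng = np.random.default_rng(rng)
    while True:
        mask = rng.random(n) < q        # one Bernoulli(q) draw per sample
        if mask.any():
            return np.flatnonzero(mask)

# Example: with q = 1 - 1/e, roughly 63% of the samples are kept on average.
idx = sample_training_set(n=1000, q=1 - 1 / np.e, rng=0)
```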

Split point training process

First, let's introduce a very important function: the softmax function.

Definition 3.1:

Given an \(n\)-dimensional vector \(v = (v_{1} ,v_{2} ,...,v_{n} )\), the softmax function is defined as follows:

$$soft\max (v) = (e^{v_{1}} ,e^{v_{2}} ,\ldots ,e^{v_{n}} )\Big/\sum\limits_{i = 1}^{n} e^{v_{i}} ,$$

where \(e\) is the base of the natural logarithm. Obviously, after the transformation by softmax function, the elements of the vector are all numbers between 0 and 1, and their sum is equal to 1. Therefore, they can be seen as probabilities.
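As a quick numerical illustration of this transform, here is a NumPy sketch of the definition above:

```python
import numpy as np

def softmax(v):
    """exp(v_i) / sum_j exp(v_j); subtracting max(v) is a standard stability trick."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

softmax(np.array([1.0, 2.0, 3.0]))   # ~ [0.090, 0.245, 0.665], summing to 1
```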

Let us review the growth process of a classification tree in BreimanRF. At a node \({\mathcal{D}}\), \(\sqrt D\) features are randomly selected to form a feature subspace, denoted \(\{ A_{1} ,A_{2} ,...,A_{\sqrt D } \}\) (without loss of generality, we assume \(\sqrt D\) is an integer; otherwise, we round it down to the nearest integer). These selected features are also referred to as candidate splitting features. Let \(V = \{ v_{ij} \}\) denote all possible split points for the node \({\mathcal{D}}\), with \(v_{ij}\) the \(i\)-th feature value (i.e., threshold) of \(A_{j}\) in the feature subspace, \(i \in \{ 1,2,...,m_{j} \}\), \(j \in \{ 1,2,...,\sqrt D \}\), where \(m_{j}\) is the number of feature values of \(A_{j}\). Let \(I_{ij}\) denote the impurity reduction obtained by splitting the node at \(v_{ij}\),

$$I_{ij} = I({\mathcal{D}},v_{ij} ) = T({\mathcal{D}}) - \frac{{|{\mathcal{D}}^{l} |}}{{|{\mathcal{D}}|}}T({\mathcal{D}}^{l} ) - \frac{{|{\mathcal{D}}^{r} |}}{{|{\mathcal{D}}|}}T({\mathcal{D}}^{r} ),$$
(1)

where \(T({\mathcal{D}})\) denotes the impurity of node \({\mathcal{D}}\), measured by a criterion such as the Gini index or information entropy (in this paper, we use the Gini index), and \({\mathcal{D}}^{l}\) and \({\mathcal{D}}^{r}\) respectively denote the left and right child nodes obtained by the split.

The impurity reductions obtained by using the different feature values of feature \(A_{j}\) as split values form a vector \(I^{(j)} = (I_{1,j} ,I_{2,j} ,...,I_{{m_{j} ,j}} )\), \(j \in \{ 1,2,...,\sqrt D \}\).

The maximum impurity reduction for each feature forms a vector

$$I = (I_{1} ,I_{2} ,...,I_{\sqrt D } ) = (\max I^{(1)} ,\max I^{(2)} ,...,\max I^{(\sqrt D )} ).$$

For a classification tree in BreimanRF, the split point is determined by the feature and feature value corresponding to the maximum impurity reduction, i.e., the splitting feature is the \(j\)-th feature with \(j = \arg \max \{ I_{1} ,I_{2} ,...,I_{\sqrt D } \}\), and the splitting value is the \(i\)-th feature value of \(A_{j}\) with \(i = \arg \max I^{(j)}\). After splitting the root node into left and right child nodes, the above process is repeated in each child node until the stopping condition is met and the tree stops growing.
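As a concrete reference for Eq. (1), here is a hedged sketch of the Gini impurity reduction for one candidate split (the helper names are ours; a binary split on a threshold is assumed):

```python
import numpy as np

def gini(y):
    """Gini impurity T(D) of a label array."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_reduction(y, x_feature, threshold):
    """Eq. (1): T(D) - |D_l|/|D| T(D_l) - |D_r|/|D| T(D_r) for the split x <= threshold."""
    left = x_feature <= threshold
    right = ~left
    if not left.any() or not right.any():   # degenerate split: no reduction
        return 0.0
    n = len(y)
    return gini(y) - left.sum() / n * gini(y[left]) - right.sum() / n * gini(y[right])
```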

The construction of a DMRF tree differs from the above process. The following parameters are introduced: \(p\) is a probability, \(k_{n}\) is the minimum number of samples in a node, and \(B_{1}\), \(B_{2}\) are two positive finite parameters.

DMRF first randomly samples \(\sqrt D\) features from the full feature space, then conducts a Bernoulli experiment \(B\) with probability \(p\) when splitting a node, \(B\sim \mathcal{B}(1, p)\):

If \(B = 1\), the split feature and split value at this node are obtained according to the optimal split criterion.

If \(B = 0\), the split feature and split value at this node can be obtained by the following steps:

  1. Split feature selection:

     I. Normalize \(I = (I_{1} ,I_{2} ,...,I_{\sqrt D } ) = (\max I^{(1)} ,\max I^{(2)} ,...,\max I^{(\sqrt D )} )\) as \(\tilde{I} = (I_{1} - \min I,I_{2} - \min I,...,I_{\sqrt D } - \min I)/(\max I - \min I);\)

     II. Compute the probabilities \(\alpha = soft\max (B_{1} \tilde{I})\), where \(B_{1} \ge 0\);

     III. Randomly select a splitting feature according to the multinomial distribution \(M(\alpha )\).

  2. Split value selection:

     Assume \(A_{j}\) is the splitting feature selected in the previous step.

     I. Normalize \(I^{(j)} = (I_{1,j} ,I_{2,j} ,...,I_{{m_{j} ,j}} )\) as \(\tilde{I}^{(j)} = (I_{1,j} - \min I^{(j)} ,I_{2,j} - \min I^{(j)} ,...,I_{{m_{j} ,j}} - \min I^{(j)} )/(\max I^{(j)} - \min I^{(j)} );\)

     II. Compute the probabilities \(\beta = soft\max (B_{2} \tilde{I}^{(j)} )\), where \(B_{2} \ge 0\);

     III. Randomly select a splitting value according to the multinomial distribution \(M(\beta )\).

Repeat the above steps until the stopping condition is met, i.e., the number of samples within a node is less than \(k_{n}\).
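The sketch below summarizes the split-point selection at one node as described above, assuming the impurity reductions of the candidate features have already been computed as in Eq. (1) (function and variable names such as `select_split` are ours, not from the paper):

```python
import numpy as np

def softmax(v):                               # same transform as in Definition 3.1
    e = np.exp(v - np.max(v))
    return e / e.sum()

def normalize(I):
    """Min-max normalization applied before the softmax; constant vectors map to zeros."""
    span = I.max() - I.min()
    return (I - I.min()) / span if span > 0 else np.zeros_like(I)

def select_split(gains, p=0.5, B1=5.0, B2=5.0, rng=None):
    """gains[j]: array of impurity reductions of the j-th candidate feature's values.
    Returns (feature index, value index) within the sqrt(D)-sized subspace."""
    rng = np.random.default_rng(rng)
    best_per_feature = np.array([g.max() for g in gains])       # the vector I
    if rng.random() < p:                                        # B = 1: optimal split
        j = int(best_per_feature.argmax())
        return j, int(gains[j].argmax())
    # B = 0: sample the feature, then the value, from multinomial distributions
    alpha = softmax(B1 * normalize(best_per_feature))
    j = int(rng.choice(len(gains), p=alpha))
    beta = softmax(B2 * normalize(gains[j]))
    i = int(rng.choice(len(gains[j]), p=beta))
    return j, i
```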

Leaf node label determination

When an unlabeled sample \(x\) is given and a prediction is made for it, \(x\) falls into a leaf node of the tree according to the algorithm. In this tree, the probability that \(x\) is predicted to be of class \(k\) is

$$\gamma^{(k)} (x) = \frac{1}{{N({\mathcal{N}}(x))}}\sum\limits_{{(X,Y) \in {\mathcal{N}}(x)}} {{\mathcal{I}}(Y = k)} ,k = 1,2,...,c,$$
(2)

where \({\mathcal{I}}( \cdot )\) equals 1 if its argument is true and 0 otherwise; \({\mathcal{N}}(x)\) is the leaf node into which \(x\) falls, and \(N({\mathcal{N}}(x))\) is the number of samples in \({\mathcal{N}}(x)\). According to the majority voting principle, the prediction for sample \(x\) under this tree is

$$\hat{y}(x) = \mathop {\arg \max }\limits_{k} \{ \gamma^{(k)} (x)\} .$$

The prediction of DMRF is the result of majority voting over the base trees, i.e.,

$$\overline{y}(x) = \mathop {\arg \max }\limits_{k} \sum\limits_{i = 1}^{M} {\mathcal{I}} (\hat{y}^{(i)} (x) = k),$$
(3)

where \(\hat{y}^{(i)} (x)\) is the predicted value of sample \(x\) in the \(i\)-th decision tree. If there are multiple classes with the same number of votes, we randomly select one of them as the final prediction. The pseudo-code of the DMRF algorithm and of the decision tree construction is as follows:

Algorithm 1 DMRF classification tree construction process: Tree()

Algorithm 2 DMRF classification algorithm
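As a complement to the pseudo-code, the following hedged sketch ties together Eqs. (2)-(3), including the random tie-breaking described above (helper names are ours; class labels are assumed to be encoded as 0, ..., c-1):

```python
import numpy as np

def leaf_class_probs(leaf_labels, n_classes):
    """gamma^(k)(x): fraction of each class among the training samples in the leaf."""
    counts = np.bincount(leaf_labels, minlength=n_classes)
    return counts / counts.sum()

def forest_vote(tree_predictions, n_classes, rng=None):
    """Majority vote over the M per-tree predictions; ties are broken at random."""
    rng = np.random.default_rng(rng)
    votes = np.bincount(tree_predictions, minlength=n_classes)
    winners = np.flatnonzero(votes == votes.max())
    return int(rng.choice(winners))
```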

Strong consistency proof of classification DMRF

In this section, we prove the strong consistency of the classification DMRF algorithm, and the detailed proof process is in the appendix.

First, we recall the definition of consistency for a classifier. For a classifier sequence \(\{ g_{n} \}\), the classifier \(g_{n}\) is obtained by training on the data set \(D_{n} = \{ (X_{1} ,Y_{1} ),...,(X_{n} ,Y_{n} )\}\) drawn from the distribution of \((X,Y)\), and its error rate is

$$L_{n} = L(g_{n} ) = P(g_{n} (X,C,D_{n} ) \ne Y|D_{n} ),$$

where \(C\) is the randomness introduced in the training.

Definition 3.2

Given a training set \(D_{n}\) containing \(n\) i.i.d. observations from a certain distribution of \((X,Y)\), the classifier \(g_{n}\) is called weakly consistent if it satisfies

$$\mathop {\lim }\limits_{n \to \infty } EL_{n} = \mathop {\lim }\limits_{n \to \infty } P(g_{n} (X,C,D_{n} ) \ne Y) = L^{*},$$

where \(L^{*}\) denotes the Bayes risk and \(C\) is the randomness introduced in the training. Moreover, \(g_{n}\) is called strongly consistent if it satisfies

$$P(\mathop {\lim }\limits_{n \to \infty } L_{n} = L^{*}) = 1.$$

Definition 3.3

A sequence of classifiers \(\{ g_{n} \}\) is called weakly (strongly) universally consistent if it is weakly (strongly) consistent for all distributions.

Obviously, the condition for strong consistency is stronger than that for weak consistency, so strong consistency implies weak consistency, but the converse is not necessarily true.

Here are some important lemmas that will be used in the proof.

Lemma 3.1

Assume that the classifier sequence \(\{ g_{n} \}\) is (universally) strongly consistent, then the majority voting classifier \(\overline{g}^{(M)}_{n}\) (for any positive integer \(M\)) is also (universally) strongly consistent.

Lemma 3.2

Assume that the classifier sequence \(\{ g_{n} \}\) is strongly consistent; then the bagged majority voting classifier \(\overline{g}^{(M)}_{n}\) (for any positive integer \(M\)) is also strongly consistent if \(\mathop {\lim }\limits_{n \to \infty } nq = \infty\).

Lemma 3.2 is quoted from Theorem 6 [20]. Refer to [20] for more details.

Lemma 3.1 shows that to prove that an ensemble classifier is strongly consistent, we only need to prove that its base classifier is strongly consistent; the universal strong consistency of the ensemble follows from the universal strong consistency of the base classifiers. Lemma 3.2 can be regarded as a corollary of Lemma 3.1, and it shows that the use of bootstrapping does not affect the consistency of the ensemble algorithm. It is worth noting that Lemma 3.1 alone (without bootstrapping) is sufficient to prove the strong consistency of DMRF. However, using the whole dataset as the training set for every tree leads to excessive computational cost when the sample size is large, and the similarity among the trees significantly affects the performance of the algorithm. Therefore, we introduce Lemma 3.2, which adds bootstrapping to reduce the training cost while reducing the similarity between trees.

The strong consistency of a single tree is proved below.

Lemma 3.3

Let \(g_{n}\) be a binary tree classifier (that is, each parent node has exactly two child nodes) obtained by an \(n\)-sample partitioning rule \(\pi_{n}\) each of whose regions contains at least \(k_{n}\) points, with \(k_{n} /\log n \to \infty\) as \(n \to \infty\). Let \(A_{n} (x)\) be the unique cell into which the sample \(x\) falls, and let \(\mu ( \cdot )\) denote the Lebesgue measure. If, for every ball \(S_{r}\) of radius \(r\) centered at the origin and every \(\gamma > 0\), with probability 1 and for all distributions,

$$\mathop {\lim }\limits_{n \to \infty } \mu (\{ x:diam(A_{n} (x) \cap S_{r} ) > \gamma \} ) = 0,$$

then \(g_{n}\) corresponding to \(\pi_{n}\) satisfies

$$\mathop {\lim }\limits_{n \to \infty } L(g_{n} ) = L^{*}$$

with probability one. In other words, \(g_{n}\) is universally strongly consistent.

Lemma 3.3 is quoted from Theorems 21.2 and 21.8 of [11]; refer to [11] for more details. Lemma 3.3 shows that to prove the strong consistency of a tree, we only need to prove that every leaf node is small enough while the number of samples in each leaf node is large enough.

Based on the above lemmas, the strong consistency theorem of DMRF algorithm can be obtained:

Theorem 3.1

Assume that \(X\) is supported on \([0,1]^{D}\) and has non-zero density almost everywhere, and that the cumulative distribution function (CDF) of the split points is left-continuous at 1 and right-continuous at 0. If \(B_{1}\) and \(B_{2}\) are both positive and finite, DMRF is strongly consistent with probability 1 when \(k_{n} /\log n \to \infty\) and \(k_{n} /n \to 0\) as \(n \to \infty\).

Regression DMRF

In the last section we discussed the DMRF algorithm for classification, in this section we will discuss the DMRF algorithm for regression.

Regression DMRF Algorithm

In classification problems, we choose the Gini index to compute the impurity reduction, while in regression problems, we choose mean squared error (MSE) reduction as the metric for measuring the importance of features and feature values.

Denote the MSE of node \({\mathcal{D}}\) as

$$MSE({\mathcal{D}}) = \frac{1}{{N({\mathcal{D}})}}\sum\limits_{{(X,Y) \in {\mathcal{D}}}} {(Y - \overline{Y})^{2} } ,$$
(4)

where \(\overline{Y} = \frac{1}{{N({\mathcal{D}})}}\sum\limits_{{(X,Y) \in {\mathcal{D}}}} Y\), i.e., the mean of the samples in this node; \(N({\mathcal{D}})\) is the sample size of node \({\mathcal{D}}\). Similar to the classification, when the split point is \(v_{ij}\), the MSE reduction is

$$I_{ij} = I({\mathcal{D}},v_{ij} ) = MSE({\mathcal{D}}) - MSE({\mathcal{D}}^{l} ) - MSE({\mathcal{D}}^{r} ),$$
(5)

where \({\mathcal{D}}^{l}\), \({\mathcal{D}}^{r}\) denote the left and right child nodes of \({\mathcal{D}}\) obtained by splitting at \(v_{ij}\).

When making a prediction, the predicted value of a tree is the sample mean of the leaf node \(A_{n} (x)\) into which the sample \(x\) falls; in other words,

$$\hat{y}(x) = \frac{{\sum\nolimits_{i = 1}^{n} {Y_{i} {\mathcal{I}}(X_{i} \in A_{n} (x))} }}{{\sum\nolimits_{i = 1}^{n} {{\mathcal{I}}(X_{i} \in A_{n} (x))} }} = \frac{1}{{N(A_{n} (x))}}\sum\limits_{{(X,Y) \in A_{n} (x)}} Y ,$$
(6)

where \(N(A_{n} (x))\) denotes the number of samples in \(A_{n} (x)\). The prediction of the forest is the mean over the trees, that is,

$$\overline{\hat{y}} = \frac{1}{M}\sum\limits_{i = 1}^{M} {\hat{y}^{(i)} (x)} ,$$
(7)

where \(M\) denotes the number of trees in the forest and \(\hat{y}^{(i)} (x)\) is the prediction of the \(i\)-th tree for \(x\).

The regression DMRF differs from the classification DMRF only in the splitting criterion and the prediction method. To obtain the regression DMRF, we simply replace the impurity reduction criterion with the MSE reduction and the majority-vote prediction with the mean prediction.
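A hedged sketch of these regression-side changes, following Eqs. (4)-(7) as written above (helper names are ours; the unweighted child terms mirror Eq. (5)):

```python
import numpy as np

def node_mse(y):
    """Eq. (4): mean squared deviation of the node's targets from their mean."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def mse_reduction(y, x_feature, threshold):
    """Eq. (5): MSE(D) - MSE(D_l) - MSE(D_r) for the split x <= threshold."""
    left = x_feature <= threshold
    return node_mse(y) - node_mse(y[left]) - node_mse(y[~left])

def forest_predict(leaf_targets_per_tree):
    """Eqs. (6)-(7): each tree predicts its leaf mean; the forest averages the trees."""
    return float(np.mean([np.mean(leaf) for leaf in leaf_targets_per_tree]))
```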

Strong consistency proof of regression DMRF

For a regressor sequence \(\{ f_{n} \}\), the regressor \(f_{n}\) is obtained by training on the data set \(D_{n} = \{ (X_{1} ,Y_{1} ),(X_{2} ,Y_{2} ),...,(X_{n} ,Y_{n} )\}\) drawn from the distribution of \((X,Y)\); the MSE of \(f_{n}\) is

$$R(f_{n} |D_{n} ) = E[(f_{n} (X,C,D_{n} ) - f(X))^{2} |D_{n} ],$$
(8)

where \(C\) is the randomness introduced in the training.

Similar to the classification case, let's first define the strong consistency of a regression problem.

Definition 3.4

Given a training set \(D_{n}\) containing \(n\) i.i.d. observations from a certain distribution of \((X,Y)\), a sequence of regressors \(\{ f_{n} \}\) is called weakly consistent if \(f_{n}\) satisfies

$$\mathop {\lim }\limits_{n \to \infty } E[R(f_{n} |D_{n} )] = \mathop {\lim }\limits_{n \to \infty } E[(f_{n} (X,C,D_{n} ) - f(X))^{2} ] = 0,$$

where \(f(X) = E[Y|X]\) and \(C\) is the randomness introduced in the training. \(\{ f_{n} \}\) is called strongly consistent if \(f_{n}\) satisfies

$$\mathop {\lim }\limits_{n \to \infty } R(f_{n} |D_{n} ) = \mathop {\lim }\limits_{n \to \infty } E[(f_{n} (X,C,D_{n} ) - f(X))^{2} |D_{n} ] = 0$$

with probability one.

Definition 3.5

A sequence of regressors \(\{ f_{n} \}\) is called weakly (strongly) universally consistent if it is weakly (strongly) consistent for all distributions of \((X,Y)\) with \(EY^{2} < \infty\).

Lemma 3.4

Assume that the regressor sequence \(\{ f_{n} \}\) is (universally) strongly consistent, then the averaged regressor \(\overline{f}_{n}^{(M)}\) (for any positive integer \(M\)) is also (universally) strongly consistent.

Lemma 3.5

Assume that the regressor sequence \(\{ f_{n} \}\) is strongly consistent, the bagging averaged regressor \(\overline{f}_{n}^{(M)}\) (for any positive integer \(M\)) is also strongly consistent if \(\mathop {\lim }\limits_{n \to \infty } nq = \infty\).

Lemma 3.4 states that if we want to prove a regression ensemble has strong consistency, we only need to prove that its base regressors have strong consistency. Lemma 3.5 is a corollary of Lemma 3.4 and is similar to the classification case. Bootstrapping is not theoretically necessary but is introduced to reduce computational costs and improve algorithm performance. To prove the consistency of the regression, Lemma 3.4 is sufficient.

Lemma 3.6

Let \(P_{n} = \{ A_{n,1} ,A_{n,2} ,...\}\) be a partition of \(R^{d}\) and for each \(x \in R^{d}\) let \(A_{n} (x)\) denote the cell of \(P_{n}\) containing \(x\). Assume that, for any sphere \(S\) centered at the origin,

$$\mathop {\lim }\limits_{n \to \infty } \mathop {\max }\limits_{{A_{n,j} \cap S \ne \emptyset }} diam(A_{n,j} ) = 0$$

and

$$\mathop {\lim }\limits_{n \to \infty } \frac{{|\{ j:A_{n,j} \cap S \ne \emptyset \} |\log n}}{n} = 0,$$

then the regressor

$$m^{\prime}_{n} = \left\{ \begin{gathered} \frac{{\sum\nolimits_{i = 1}^{n} {Y_{i} {\mathcal{I}}(X_{i} \in A_{n} (x))} }}{{\sum\nolimits_{i = 1}^{n} {{\mathcal{I}}(X_{i} \in A_{n} (x))} }},\sum\nolimits_{i = 1}^{n} {{\mathcal{I}}(X_{i} \in A_{n} (x))} > \log n \hfill \\ 0,{\text{otherwise}} \hfill \\ \end{gathered} \right.$$

is strongly universally consistent.

Lemma 3.6 is quoted from Theorem 23.2 [11], one can refer to [11] for more details.
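For intuition, the truncated partitioning estimate \(m^{\prime}_{n}\) in Lemma 3.6 can be sketched as follows (a toy illustration, not part of the proof; the helper name is ours):

```python
import numpy as np

def truncated_cell_estimate(y_in_cell, n):
    """m'_n(x): the mean of the training targets in the cell A_n(x) if the cell
    contains more than log(n) points, and 0 otherwise."""
    if len(y_in_cell) > np.log(n):
        return float(np.mean(y_in_cell))
    return 0.0
```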

Theorem 3.2

Assume that \(X\) is supported on \([0,1]^{D}\) and has non-zero density almost everywhere, and that the cumulative distribution function (CDF) of the split points is left-continuous at 1 and right-continuous at 0. If \(B_{1}\) and \(B_{2}\) are both positive and finite, DMRF is strongly consistent with probability 1 when \(k_{n} /\log n \to \infty\) and \(k_{n} /n \to 0\) as \(n \to \infty\).

Experiment

For convenience of narration, we use "(SE)" to indicate that an algorithm is from the original paper, e.g., Denil14(SE), BRF(SE), and MRF(SE). We use "(b)" to indicate an algorithm that uses the bootstrapping defined in this paper without separating the structural part and the estimation part, e.g., Denil14(b), BRF(b), and MRF(b).

The experiments are divided into three parts: performance test, standard deviation analysis, and parameter test. The performance test evaluates DMRF on classification and regression problems and compares it with three other consistent RF variants (both weakly and strongly consistent), as well as BreimanRF, to demonstrate DMRF's performance. The standard deviation analysis evaluates the standard deviation of the RF variants on classification and regression problems to measure the randomness of the different RFs. The parameter test discusses the impact of the hyper-parameters \(p\), \(q\), \(B_{1}\), \(B_{2}\) on the performance of DMRF and provides some recommendations for selecting optimal parameters.

Dataset selection

Our datasets are all from the UCI repository. Tables 1 and 2 list the sample number and feature number of the classification and regression datasets respectively; the classification table also reports the number of classes. In the two tables, the datasets are sorted by sample number, and we test datasets that cover a wide range of sample sizes and feature dimensions in order to show the performance of DMRF. In addition, following [14], missing values in all datasets are filled with "-1", and no other preprocessing is performed.

Table 1 The description of benchmark classification datasets
Table 2 The description of benchmark regression datasets

Baselines

We choose three previously proposed weakly consistent RF variants, Denil14(SE), BRF(SE), and MRF(SE), as comparison models for DMRF. Their common feature is that the dataset is divided into a structural part and an estimation part according to the hyper-parameter \(Ratio\); the structural part is used for split-point training and the estimation part for leaf node label determination.

  1. Denil14(SE) randomly selects \(m\) points of the structural part at each node, then selects a feature subspace of size \(\min (1 + Poisson(\lambda ),D)\) without replacement and searches for the optimal split point among the \(m\) preselected points (rather than among all data points).

  2. BRF(SE) introduces a first Bernoulli distribution when selecting the feature subspace: with probability \(p_{1}\), a single feature is randomly selected from the feature set as the splitting feature, and with probability \(1 - p_{1}\), \(\sqrt D\) features are randomly selected as candidate features. A second Bernoulli distribution is introduced when selecting the split value: with probability \(p_{2}\), a value of the splitting feature is selected at random as the split value, and with probability \(1 - p_{2}\), the value with the largest impurity reduction is selected.

  3. MRF(SE) normalizes the vector composed of the maximum impurity reduction of each feature and converts it into probabilities using the softmax function, which defines a multinomial distribution from which the splitting feature is randomly selected. The impurity reductions of the values of the selected splitting feature form another vector, which is likewise normalized and converted into probabilities by the softmax function, defining a multinomial distribution from which the splitting value is randomly selected.

Following our method, we can abandon the separation of the structural part and the estimation part in the three models above and add the bootstrapping defined earlier, yielding Denil14(b), BRF(b), and MRF(b). The experiments examine the performance of these three models with improved data utilization.

Performance test experimental settings

In the performance experiment, for the three models above, i.e., Denil14(SE), BRF(SE), and MRF(SE), we uniformly set \(Ratio = 0.5\). Besides, we set \(k_{n} = 5\) and \(M = 100\) following [14] (the purpose of setting \(k_{n} = 5\) uniformly is to promote extensive tree growth; as long as different algorithms use the same value of \(k_{n}\) for a given dataset, comparability is ensured). In Denil14, we set \(m = 100\), \(\lambda = 10\) following [14]. Following [13], we set \(p_{1} = p_{2} = 0.05\). As [14] suggested, in MRF we set \(B_{1} = B_{2} = 5\) (note that in this paper we use \(B_{1}\) and \(B_{2}\) to compute probabilities, while the authors of [14] use \(B_{1} /2\) and \(B_{2} /2\) and recommend 10 for both, so we set \(B_{1} = B_{2} = 5\)). In DMRF, we choose \(q = 1 - 1/e\), \(p = 0.5\), \(B_{1} = B_{2} = 5\).

Standard deviation analysis

The parameters used in the standard deviation analysis are the same as those used in the performance experiment.

Parameter test experimental settings

In the parameter testing experiment, we explore the influence of the hyper-parameters on DMRF, focusing on \(p\), \(q\), \(B_{1}\), and \(B_{2}\). For \(p\) and \(q\), the test range is [0.05, 0.95] with step size 0.1. For \(B_{1}\) and \(B_{2}\), the test range is the integers in \([1,10]\). In terms of dataset size, we define datasets with fewer than 500 samples as small, datasets with 500-1000 samples as medium, and datasets with more than 1000 samples as large. For classification problems, accuracy is used as the evaluation metric. For regression problems, negative mean squared error (NMSE) is used as the evaluation metric for ease of observation.
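For concreteness, the tested grids can be written out as follows (our own encoding of the ranges stated above):

```python
import numpy as np

p_grid = np.round(np.arange(0.05, 0.96, 0.10), 2)   # 0.05, 0.15, ..., 0.95
q_grid = np.round(np.arange(0.05, 0.96, 0.10), 2)   # same range for q
B1_grid = list(range(1, 11))                         # integers 1..10
B2_grid = list(range(1, 11))
```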

Results and discussion

Performance analysis

Among the RF variants, the best-performing result in each table is highlighted in bold. To compare the performance of DMRF and BreimanRF, we use "*" to indicate which is better.

Classification

The evaluation metric for classification problems is accuracy.

From Table 3, the following conclusions can be drawn:

  • In the majority of cases, the (b)-type models show higher accuracy compared to their corresponding (SE)-type models (for example, MRF(b) achieves a 9% higher accuracy than MRF(SE) on the Winequality (white) dataset). This suggests that when the splitting criterion and leaf node label determination process are independent, a significant amount of information may be lost. Determining the leaf node labels based on the samples used to compute the splitting point helps reduce information loss.

  • Across all datasets, DMRF generally outperforms the other RF variants, and its advantage over MRF(b) is particularly evident, with an improvement of about 1% on some datasets (such as Obesity and Ai4i).

  • In most cases, the accuracy of DMRF is higher than that of BreimanRF. This can be attributed to the fact that sampling split values from a multinomial distribution can be seen as a weakened version of optimal splitting, which enhances robustness. It indicates that introducing some randomness in classification tasks can enhance performance.

Table 3 Accuracy (%) of different RFs on benchmark datasets

Regression

The evaluation metric for regression problems is mean squared error.

From Table 4, the following conclusions can be drawn:

  • Similar to the classification case, in the majority of cases the (b)-type models outperform their corresponding (SE)-type models (for example, there is a 48% reduction in MSE on the Real estate dataset). This suggests that determining the leaf node labels with the samples used to compute the split points utilizes the information better than keeping the two processes independent.

  • In all datasets, DMRF generally shows the best performance among RF variations, but the advantage of DMRF over MRF(b) is not significant.

  • In most cases, the MSE of DMRF is larger than that of BreimanRF. This is because MSE amplifies the impact of noise, and the randomness introduced by the multinomial distribution makes DMRF perform worse in regression compared to BreimanRF. In contrast, DMRF is better suited for classification tasks.

Table 4 Mean square error (%) of different RFs on benchmark datasets

Standard deviation analysis

Given the inherent randomness in the models used for the experiments, it is essential to compare their levels of randomness. In this case, we will evaluate the randomness using the standard deviation as a metric.

Table 5 shows the standard deviations of 10-fold cross-validation results computed 10 times for 8 classification datasets and 8 regression datasets. It can be observed that, in both classification and regression tasks, the (b)-type models in most cases show larger standard deviations than their corresponding (SE)-type models (for example, MRF(b) has a larger standard deviation than MRF(SE) on the Blogger and ALE datasets). This indicates that, under similar conditions, using bootstrapping to obtain the training sets introduces greater randomness than dividing the dataset into a structural part and an estimation part.

Table 5 Standard deviation of different RFs on classification and regression datasets

Furthermore, whether in classification or regression, in cases with small and medium sample sizes, DMRF tends to be more stable than MRF(b) and BreimanRF (for example, on the Wdbc and Las Vegas Strip datasets). However, with large sample sizes, DMRF exhibits higher randomness compared to MRF(b) and BreimanRF (for example, on the Connect-4 and Combined datasets). This can be attributed to the introduction of Bernoulli and multinomial distributions in the process of finding splitting points in DMRF, which results in higher levels of randomness compared to MRF(b) and BreimanRF. Under small sample size cases, adding appropriate randomness helps increase the robustness of the model. However, in large sample size cases, the difference in randomness is amplified, leading to higher standard deviations for DMRF compared to MRF(b) and BreimanRF.

Parameter analysis

In this section, we explore the influence of hyper-parameters on DMRF.

The effect of \(p\), \(q\)

We investigate \(p\) and \(q\) with \(B_{1} = B_{2} = 5\), as recommended in [14].

Classification

Figure 1 shows the performance of DMRF on three classification datasets with small, medium, and large sample sizes under \(B_{1} = B_{2} = 5\). It can be observed that, for a fixed \(p\), the accuracy on the three datasets at \(q = 0.63\) falls into one of two situations: it is close to its maximum and stable, or it begins to decrease. The analysis is as follows: when \(q\) is too small, the sampling probability of each sample is too low, resulting in insufficient training of the trees; as \(q\) increases, the number of sampled samples increases and the training of the trees gradually becomes sufficient, so the performance of DMRF increases. However, once the number of samples reaches a certain value, the performance improvement slows down, and the accuracy tends to stabilize or even starts to decrease. The reason for the decrease is that the number of sampled samples exceeds an appropriate value, resulting in high similarity between the training sets of the trees, which affects the overall performance. Since the optimal value of \(q\) is around 0.63, we set \(q = 1 - 1/e\ (\approx 0.632)\) to reduce the computational burden while obtaining the optimal parameter.

Fig. 1 Accuracy (%) of the DMRF under different \(p\), \(q\) values

It is worth noting that the bootstrapping used in this paper can be seen as a large-sample analogue of standard bootstrapping. In fact, if we draw \(n\) samples with replacement from an \(n\)-sample dataset with equal probability, the probability that any given sample is selected at least once is \(1 - (1 - 1/n)^{n} \to 1 - 1/e\ (n \to \infty )\). This is also one of the reasons why we chose \(q = 1 - 1/e\).
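A quick numerical check of this limit (illustrative only):

```python
import numpy as np

for n in (10, 100, 1000, 100000):
    print(n, 1 - (1 - 1 / n) ** n)   # approx. 0.6513, 0.6340, 0.6323, 0.6321
print(1 - 1 / np.e)                  # 0.63212055...
```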

For the same \(q\), it can be observed that the accuracy first increases with \(p\) and then decreases, reaching an optimum around \(p = 0.5\). In DMRF, if \(p = 0\), the selection of split points at each node depends entirely on the random selection from the two multinomial distributions; if \(p = 1\), DMRF selects the optimal split point from the feature subspace, which is similar to BreimanRF. It can be seen that introducing some randomness when selecting split points in the feature subspace can enhance performance. Additionally, we can conclude that the algorithm is more sensitive to the parameter \(q\) than to \(p\), indicating that the number of training samples is more important than the method of split point selection.

Regression

Figure 2 shows the performance of DMRF on three regression datasets with small, medium, and large sample sizes under \(B_{1} = B_{2} = 5\). The regression case is similar to the classification case: for the same \(p\), the NMSE increases with \(q\) and reaches its maximum at around 0.63 before stabilizing or beginning to decrease. For the same \(q\), the NMSE initially increases with \(p\), then decreases or stabilizes after a certain point. Therefore, the optimal \(q\) is taken to be \(1 - 1/e\) and the optimal \(p\) to be 0.5. Additionally, it is again observed that the DMRF algorithm is more sensitive to parameter \(q\) than to parameter \(p\).

Fig. 2 Negative mean square error of the DMRF under different \(p\), \(q\) values

The effect of \(B_{1}\), \(B_{2}\)

Classification

Figure 3 shows the impact of \(B_{1}\) and \(B_{2}\) on the DMRF algorithm under \(q = 1 - 1/e\) and \(p = 0.5\) for three classification datasets with small, medium, and large sample sizes. For a fixed \(B_{2}\), the accuracy increases slightly as \(B_{1}\) increases from 1, reaching a plateau or peak at around \(B_{1} = 5\). For a fixed \(B_{1}\), the accuracy increases as \(B_{2}\) increases from 1 and stabilizes or peaks at around \(B_{2} = 5\). The reason is that when \(B_{1}\) is close to 1, the probabilities of the features differ little from each other, making it difficult to sample the optimal features. As \(B_{1}\) grows, the differences in probabilities between features become larger and more important features tend to be selected, improving DMRF's performance. However, when \(B_{1}\) grows too large, the selection of features becomes similar to always selecting the optimal feature, making the base decision trees too similar and causing a decrease in performance. The situation for \(B_{2}\) is similar to that of \(B_{1}\).

Fig. 3 Accuracy (%) of the DMRF under different \(B_{1}\), \(B_{2}\) values

It can also be seen from the figure that DMRF is less sensitive to \(B_{2}\) than to \(B_{1}\). Since \(B_{1}\) affects the selection of splitting features and \(B_{2}\) affects the selection of split values, this indicates that the splitting feature has the greater impact. This also makes sense because the splitting value is chosen conditionally on the splitting feature, so in general the influence of the splitting feature is larger than that of the splitting value.

Regression

Figure 4 shows the impact of \(B_{1}\) and \(B_{2}\) on the DMRF algorithm under \(q = 1 - 1/e\) and \(p = 0.5\) for three regression datasets with small, medium, and large sample sizes. Unlike the classification case, for a fixed \(B_{2}\), the NMSE generally increases as \(B_{1}\) decreases from 10; after reaching around \(B_{1} = 5\), the NMSE stabilizes or starts to decrease. For a fixed \(B_{1}\), the NMSE increases as \(B_{2}\) increases from 1 and stabilizes or decreases slightly at around \(B_{2} = 5\). The reason is that when \(B_{1}\) is too large, the probability of selecting the optimal feature is much higher than that of the other features, resulting in high similarity between trees and poor performance. As \(B_{1}\) decreases, more randomness is introduced, preserving the high performance of the trees while increasing diversity, which improves the algorithm's performance. When \(B_{1}\) is close to 1, the probabilities of the features differ little from each other, making it difficult to sample the optimal features and leading to poor performance of the trees and of the algorithm.

Fig. 4 Negative mean square error of the DMRF under different \(B_{1}\), \(B_{2}\) values

The situation for \(B_{2}\) is similar to that of \(B_{1}\). Furthermore, it can be observed from Fig. 4 that DMRF is more sensitive to \(B_{1}\) than to \(B_{2}\), which is similar to the classification case. Since \(B_{1}\) affects the selection of splitting features and \(B_{2}\) affects the selection of split values, it can be concluded that the splitting feature has the greater impact.

The effect of \(M\)

The left side of Fig. 5 shows the accuracy trends on the Blogger, Tic-tac-toe, and Winequality (white) datasets as the number of trees (i.e., \(M\)) increases in classification. The right side shows the mean squared error (MSE) trends on the Real estate, Concrete, and Combined datasets as \(M\) increases in regression.

Fig. 5 Performance of DMRF under different numbers of trees

In the classification case, it can be observed that as \(M\) increases, DMRF shows an upward trend in accuracy for all three datasets and gradually converges after reaching 100 trees. Similarly, in the regression case, as \(M\) increases, DMRF shows a downward trend in MSE for all three datasets and gradually converges after reaching 100 trees.

This demonstrates that DMRF, as a method of Bagging, improves its performance gradually with an increasing number of base learners until convergence.

The effect of \(k_{n}\)

Figure 6 shows the impact of \(k_{n}\) on DMRF under \(q = 1 - 1/e\), \(p = 0.5\), \(B_{1} = B_{2} = 5\) for three classification datasets (the top three plots) and three regression datasets (the bottom three plots). In Fig. 6, \(r\) represents the proportion of the selected dataset that enters training. For example, the sample size of Letter is 20000, so the training sample size is 6000 when \(r = 0.3\).

Fig. 6 Performance of DMRF under different \(k_{n}\)

It can easily be verified that \(\sqrt n > (\log n)^{2}\) when \(n > 5600\) (with \(\sqrt n - (\log n)^{2}\) increasing for sufficiently large \(n\)). Therefore, DMRF with \(k_{n} = (\log n)^{2}\) should perform better than DMRF with \(k_{n} = \sqrt n\) when \(n > 5600\). This can also be observed in Fig. 6: for example, on Letter, \(k_{n} = (\log n)^{2}\) outperforms \(k_{n} = \sqrt n\) starting from \(r = 0.3\) (at which point the number of samples involved in training is 6000), and on Cbm, \(k_{n} = (\log n)^{2}\) performs better than \(k_{n} = \sqrt n\) starting from \(r = 0.5\) (at which point the number of samples involved in training is 5967). This demonstrates that as \(k_{n}\) decreases, the performance of DMRF improves. Since \(k_{n}\) is the minimum sample size of the leaf nodes, a smaller \(k_{n}\) indicates more sufficient tree growth, which theoretically leads to better performance. The experimental results are consistent with the theoretical derivation.
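A small numerical check of the crossover point (using the natural logarithm, which is what yields a threshold of roughly 5600):

```python
import numpy as np

for n in (1000, 5000, 5600, 10000):
    print(n, round(np.sqrt(n), 1), round(np.log(n) ** 2, 1))
# sqrt(n) overtakes (log n)^2 between n = 5000 and n = 5600:
# 1000: 31.6 vs 47.7 | 5000: 70.7 vs 72.5 | 5600: 74.8 vs 74.5 | 10000: 100.0 vs 84.8
```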

Cross-validation

In this section, we use cross-validation to determine the optimal parameters for DMRF, MRF(b), BRF(b), and Denil14(b) on various classification and regression datasets. For all models, we set \(q \in \{ 0.2,0.4,1 - 1/e,0.8\}\). For DMRF and MRF(b), we set \(B_{1} ,B_{2} \in \{ 2,5,8,10\}\) and \(p \in \{ 0.1,0.3,0.5,0.8\}\). For BRF(b), we set \(p_{1} ,p_{2} \in \{ 0.01,0.05,0.35,0.65\}\). For Denil14(b), we set \(\lambda \in \{ 1,5,10,15\}\) and \(m \in \{ 50,100,200,500\}\).

In both classification and regression, comparing DMRF with MRF(b), BRF(b), and Denil14(b) under their optimal parameters, Table 6 (in which bold font indicates the model that works best) shows that DMRF outperforms the other models. Compared to MRF(b), DMRF shows improved performance. The reason can be analyzed as follows: although MRF(b) uses the softmax function to convert feature and feature-value importance into probabilities and samples split points from a multinomial distribution, it considers the entire feature space. The higher the importance of a feature or feature value, the higher its probability, and the softmax function amplifies the differences between features or feature values, so the probability of sampling the optimal feature and optimal feature value remains the highest. This can be regarded as a weakened version of optimal splitting in the full feature space. Although this approach improves performance, it reduces the diversity among trees.

Table 6 Results of cross-validation of different RFs

On the other hand, DMRF selects optimal splits in the feature subspace based on probabilities and performs multinomial distribution sampling, which increases the diversity among trees and thus improves performance (Table 7).

Table 7 Computational complexity of RFs

Computational complexity analysis

Assume that the data set has \(n\) samples and \(D\) features, we prepare to build \(M\) trees. The complexity of random sampling is not considered below.

The best case for tree construction is completely balanced growth, in which case the depth of the tree is \({\mathcal{O}}(\log n)\). All samples at each level of a DMRF tree are involved in the computation, and the number of features evaluated is \(\sqrt D\). Therefore, the complexity of building one DMRF tree is \({\mathcal{O}}(\sqrt D n\log n)\), and the complexity of DMRF is \({\mathcal{O}}\left( {\sqrt D nM\log n} \right)\).

In the same way, the complexity of BreimanRF is \({\mathcal{O}}(\sqrt D nM\log n)\). The feature subspace size of Denil14 is \(\min (1 + Poisson(\lambda ),D)\), and the optimal split point is searched in the pre-selected samples of \(m(m < n)\) samples, so its complexity is \({\mathcal{O}}(\min (1 + Poisson(\lambda ),D) \cdot mM\log n)\). BRF introduces two Bernoulli distributions when choosing split points, the average number of features calculated at each layer is \(p_{1} + (1 - p_{1} )\sqrt D\), and the average number of samples calculated at each layer is \((1 - p_{2} )n\), so the complexity is \({\mathcal{O}}((p_{1} + (1 - p_{1} )\sqrt D )(1 - p_{2} )nM\log n)\). MRF introduces two multinomial distributions when selecting split points, and all features and samples at each node are involved in the calculation, so the complexity is \({\mathcal{O}}(DnM\log n)\).

From Table 7, it can be seen that MRF(b) has the highest complexity, followed by DMRF and BreimanRF. Due to the sampling process involved in selecting split points, DMRF generally takes slightly longer than BreimanRF. Because \(p_{1}\) and \(p_{2}\) are generally small, in most cases the complexity of BRF is slightly lower than that of DMRF and BreimanRF. As for Denil14(b), its ranking in complexity depends on the values of \(\lambda\) and \(m\).

Figure 7 shows the running time of one iteration of cross-validation for three classification datasets (Tic-tac-toe, Winequality(white), Connect-4) and three regression datasets (Alcohol, Flare, Insurance) under parameters mentioned in Sect. 4.3. It can be observed that BRF(b) has shorter runtime in both classification and regression tasks, while MRF(b) has longer runtime in both tasks. As we analyzed earlier, in most cases, BreimanRF has shorter runtime compared to DMRF.

Fig. 7
figure 7

Running time in one iteration of cross-validation for different models

It is worth noting that on Connect-4, a large classification dataset, Denil14(b) has a longer runtime than DMRF, BRF(b), and BreimanRF, while on Insurance, a large regression dataset, Denil14(b) has a shorter runtime than DMRF, BRF(b), and BreimanRF. This is because Connect-4 has 42 features, and the average size of Denil14(b)'s feature subspace is 11, while the average feature subspace size for DMRF, BRF(b), and BreimanRF is 6. Although Denil14(b) selects only 100 points to compute split points, the experimental results indicate that, on Connect-4, the effect of the larger feature subspace on runtime outweighs the saving from computing split points on only 100 samples.

On Insurance, the number of features is 86, and the average size of Denil14(b)'s feature subspace is 11, while the average feature subspace size for DMRF, BRF(b), and BreimanRF is 9, which is not significantly different. However, because Denil14(b) selects only 100 points to compute split points, it runs faster.

Conclusions and future works

The main contributions of this paper are as follows:

By modifying the condition on the number of samples in leaf nodes, the weak consistency proofs of previous RF variants have been strengthened into strong consistency proofs. The previously proposed weakly consistent models Denil14(SE), BRF(SE), and MRF(SE) have accordingly been enhanced into models with strong consistency with probability 1, namely Denil14(b), BRF(b), and MRF(b).

We introduce a novel algorithm called DMRF, which combines Bernoulli and multinomial distributions. DMRF uses a modified bootstrapping to obtain the training sets of the base trees and uses the combination of Bernoulli and multinomial distributions to determine the split points during tree construction. This approach increases diversity while maintaining high performance. Besides, we discuss the parameters involved in DMRF, validate their impact through experiments, and provide recommended values for these parameters based on our findings.

The experiments indicate that DMRF outperforms MRF(b) and BreimanRF in classification tasks. In regression tasks, DMRF performs better than MRF(b), but the difference is not significant, and in most cases its performance is not as good as BreimanRF's, suggesting that DMRF is better suited to classification tasks.

In terms of standard deviation, DMRF has lower standard deviation than MRF(b) and BreimanRF on small and medium-sized datasets. However, on large datasets, DMRF is more likely to have a higher standard deviation compared to MRF(b) and BreimanRF. In terms of time complexity, DMRF has the same complexity as BreimanRF. The complexity of BRF(b) and Denil14(b) is determined by the parameter settings, while MRF(b) has the highest complexity.

The main advantages of DMRF lie in its strong theoretical properties, excellent performance, and low complexity (the same as BreimanRF). It shows clear advantages on small-sample datasets. However, one limitation is that DMRF shows higher randomness than BreimanRF on large datasets. Future research can focus on addressing this increased randomness of DMRF in large-sample cases.

Availability of data and materials

The experimental data are available at https://archive.ics.uci.edu/ml/index.php.

References

  1. Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.


  2. Bifet A, Holmes G, Pfahringer B, Kirkby R, Gavaldà R. New ensemble methods for evolving data streams. In: ACM SIGKDD; 2009. p. 139–48.

  3. Xiong C, Johnson D, Xu R, Corso JJ. Random forests for metric learning with implicit pairwise position dependence. In: ACM SIGKDD; 2012. p. 958–66.

  4. Li Y, Bai J, Li J, Yang X, Jiang Y, Xia S-T. Rectified decision trees: exploring the landscape of interpretable and effective machine learning. arXiv. 2020. https://doi.org/10.48550/arXiv.2008.09413.


  5. Cootes TF, Ionita MC, Lindner C, Sauer P. Robust and accurate shape model fitting using random forest regression voting. Berlin, Heidelberg: Springer; 2012. p. 278–91.

  6. Kontschieder P, Fiterau M, Criminisi A, Rota Bulo S. Deep neural decision forests. In: ICCV; 2015. p. 1467–1475.

  7. Randrianasoa JF, Cettour-Janet P, Kurtz C, Desjardin É, Gançarski P, Bednarek N, Rousseau F, Passat N. Supervised quality evaluation of binary partition trees for object segmentation. Pattern Recognit. 2021. https://doi.org/10.1016/j.patcog.2020.107667.

  8. Prasad AM, Iverson LR, Liaw A. Newer classification and regression tree techniques: bagging and random forests for ecological prediction. Ecosystems. 2006;9(2):181–99.

  9. Cutler DR, et al. Random forests for classification in ecology. Ecology. 2007;88(11):2783–92.

  10. Acharjee A, Kloosterman B, Visser RG, Maliepaard C. Integration of multi-omics data for prediction of phenotypic traits using random forest. Bioinformatics. 2016;17(5):363–73.

  11. Devroye L, Györfi L, Lugosi G. A probabilistic theory of pattern recognition, vol. 31. Berlin, Germany: Springer; 2013.

  12. Denil M, Matheson D, De Freitas N. Narrowing the gap: random forests in theory and in practice. In: ICML; 2014. p. 665–673.

  13. Wang Y, Xia S-T, Tang Q, Wu J, Zhu X. A novel consistent random forest framework: Bernoulli random forests. IEEE Trans Neural Netw Learn Syst. 2017;29(8):3510–23.

  14. Bai J, Li Y, Li J, Yang X, Jiang Y, Xia S-T. Multinomial random forest. Pattern Recognit. 2022. https://doi.org/10.1016/j.patcog.2021.108331.

  15. Rodriguez JJ, Kuncheva LI, Alonso CJ. Rotation forest: a new classifier ensemble method. IEEE Trans Pattern Anal Mach Intell. 2006;28(10):1619–30.

  16. Meinshausen N. Quantile regression forests. J Mach Learn Res. 2006;983–999.

  17. Menze BH, Kelm BM, Splitthoff DN, Koethe U, Hamprecht FA. On oblique random forests. In: Hofmann T, Malerba D, Vazirgiannis M, Gunopulos D, editors. Machine learning and knowledge discovery in databases. Berlin Heidelberg: Springer Berlin Heidelberg; 2011. p. 453–69.

  18. Zhou Z-H, Feng J. Deep forest: towards an alternative to deep neural networks. In: IJCAI; 2017. p. 3553–3559.

  19. Biau G, Scornet E, Welbl J. Neural random forests. Sankhya A. 2019;81:347–86.

  20. Biau G, Devroye L, Lugosi G. Consistency of random forests and other averaging classifiers. J Mach Learn Res. 2008;9:2015–33.

  21. Biau G. Analysis of a random forests model. J Mach Learn Res. 2012;1063–1095.

  22. Györfi L, Kohler M, Krzyzak A, Walk H. A distribution-free theory of nonparametric regression. Berlin, Germany: Springer; 2002.

Acknowledgements

We would like to thank the School of Mathematics, Statistics and Mechanics and the Faculty of Information Technology at Beijing University of Technology for their support of this paper. We also appreciate the valuable comments and suggestions from the editors and reviewers.

Funding

Not applicable.

Author information

Contributions

JHC: investigation, methodology, coding, writing—review & editing; FL: supervision, methodology, funding acquisition, review & editing; XLW: supervision, methodology, validation, writing—review & editing; All authors read and approved the final manuscript.

Corresponding authors

Correspondence to JunHao Chen or XueLi Wang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix

The proof of Lemma 3.1

Denote \(g^{*}(x)\) as the Bayes classifier, then the Bayes risk is

$$L^{*} = P(g^{*}(x) \ne Y).$$

Denote

$$A = \{ k|\gamma^{(k)} (x) = \mathop{\max}\limits_{l} \gamma^{(l)} (x)\} ,$$
$$B = \{ k|\gamma^{(k)} (x) < \mathop{\max}\limits_{l} \gamma^{(l)} (x)\} .$$

Then

$$P(\overline{g}_{n}^{(M)} (x,C,D_{n} ) \ne Y|D_{n} )$$
$$= \sum\limits_{k} {P(\overline{g}_{n}^{(M)} (x,C,D_{n} ) = k|D_{n} )\cdot P(Y \ne k|D_{n} )} \, \le L^{*} \cdot \sum\limits_{k \in A} {P(\overline{g}_{n}^{(M)} (x,C,D_{n} ) = k|D_{n} )} + \sum\limits_{k \in B} {P(\overline{g}_{n}^{(M)} (x,C,D_{n} ) = k|D_{n} )},$$

so it is sufficient to prove that the latter sum converges to 0, i.e., that each of its terms converges to 0 for every \(k \in B\).

For \(\forall k \in B\),

$$P(\overline{g}_{n}^{(M)} (x,C,D_{n} ) = k|D_{n} )\, = P(\sum\limits_{i = 1}^{M} {\mathcal{I}} (g_{n} (x,C^{(i)} ,D_{n} ) = k) > \mathop {\max }\limits_{l \ne k} \{ \sum\limits_{i = 1}^{M} {\mathcal{I}} (g_{n} (x,C^{(i)} ,D_{n} ) = l)\} |D_{n} )$$
$$\le P(\sum\limits_{i = 1}^{M} {\mathcal{I}} (g_{n} (x,C^{(i)} ,D_{n} ) = k) \ge 1|D_{n} )$$
$$\le E(\sum\limits_{i = 1}^{M} {\mathcal{I}} (g_{n} (x,C^{(i)} ,D_{n} ) = k)|D_{n} )$$
$$= M{\cdot}P(g_{n} (x,C,D_{n} ) = k|D_{n} ) \to 0\quad (n \to \infty ).$$

The proof of Lemma 3.4

Every base tree is strongly consistent, i.e.,

\(\mathop {\lim }\limits_{n \to \infty } R(f_{n} |D_{n} ) = \mathop {\lim }\limits_{n \to \infty } E[(f_{n} (X,C^{(i)} ,D_{n} ) - f(X))^{2} |D_{n} ] = 0\) for \(i \in \{ 1,2,...,M\}\).

Then

$$R(\overline{f}_{n}^{(M)} |D_{n} )$$
$$= E[(\frac{1}{M}\sum\limits_{i = 1}^{M} {f_{n} (X,C^{(i)} ,D_{n} )} - f(X))^{2} |D_{n} ]$$
$$\mathop \le \limits^{(c)} \frac{1}{M}\sum\limits_{i = 1}^{M} {E[(f_{n} (X,C^{(i)} ,D_{n} ) - f(X))^{2} |D_{n} ]} \to 0(n \to \infty ).$$

where \(C^{(i)}\) is the randomness introduced in the \(i\)-th tree building, and (c) uses the Cauchy–Schwarz inequality.
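For completeness, step (c) is the elementary inequality (a consequence of the Cauchy–Schwarz inequality)

$$\left( \frac{1}{M}\sum\limits_{i = 1}^{M} a_{i} \right)^{2} \le \frac{1}{M}\sum\limits_{i = 1}^{M} a_{i}^{2} ,\quad a_{i} = f_{n} (X,C^{(i)} ,D_{n} ) - f(X),$$

applied inside the conditional expectation given \(D_{n}\).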

The proof of Theorem 3.1

First, we prove that, with probability 1, each leaf node of a DMRF tree contains at least \(k_{n}\) samples as \(n \to \infty\).

Due to the randomness of split-point selection, the selected split position can be regarded as a random variable \(W\) that follows the uniform distribution on [0,1], with cumulative distribution function

$$F_{W} (x) = x,x \in [0,1].$$

For any \(m \in N^{ + }\), \(\varepsilon > 0\) and a fixed \(0 < \eta < 1\), denote the size of the smaller child node obtained when the root node splits on a feature by \(M_{1} = \min (W,1 - W)\); then we have

$$P(M_{1} \ge \eta^{1/m} ) = P(\eta^{1/m} \le W \le 1 - \eta^{1/m} )$$
$$= F_{W} (1 - \eta^{1/m} ) - F_{W} (\eta^{1/m} )$$
$$= 1 - \eta^{1/m} - \eta^{1/m}$$
$$= 1 - 2\eta^{1/m} .$$

Without loss of generality, we can normalize the values of all features to the range [0, 1] at each node. If the feature is split \(m\) consecutive times (i.e., the tree grows to the \(m\)-th layer), the probability that the smallest child node at the \(m\)-th layer has size at least \(\eta\) is

$$P(M_{m} \ge \eta ) = (1 - 2\eta^{1/m} )^{m} .$$

In this case, if \(0 < \eta < \{ \frac{1}{2}[1 - (1 - \varepsilon )^{1/m} ]\}^{m}\), then \(2\eta^{1/m} < 1 - (1 - \varepsilon )^{1/m}\), and therefore

$$P(M_{m} \ge \eta ) = (1 - 2\eta^{1/m} )^{m} > 1 - \varepsilon .$$

The above results assume that the same feature is selected at every split. If different features are split at different layers, \(P(M_{m} \ge \eta )\) is even larger, so we still have

$$P(M_{m} \ge \eta ) > 1 - \varepsilon .$$

This indicates that, at the \(m\)-th layer, every node has size at least \(\eta\) with probability at least \(1 - \varepsilon\).

Since \(X\) has a non-zero density function, each node at the \(m\)-th layer of the tree has positive measure with respect to \(\mu_{X}\). Define

$$\zeta \, = \,\mathop {\min }\limits_{{\cal N}:\,a\,leaf\,at\,m - th\,level} \,\mu_{X} \left[ {\cal N} \right],$$

\(\zeta > 0\) because the measure of each leaf node is positive and the number of leaf nodes is finite.

The number of samples in the training set is \(n\), and the number of samples \(N({\cal N})\) in the leaf node \({\cal N}\) follows the binomial distribution \(\mathcal{B}(n, \zeta)\); then

$$P(N({\cal N}) < k_{n} ) = P(N({\cal N}) - n\zeta < k_{n} - n\zeta )$$
$$\mathop \le \limits^{(a)} P(|N({\cal N}) - n\zeta | > |k_{n} - n\zeta |)$$
$$\mathop \le \limits^{(b)} \frac{n\zeta (1 - \zeta )}{{|k_{n} - n\zeta |^{2} }}$$
$$= \frac{\zeta (1 - \zeta )}{{n|\frac{{k_{n} }}{n} - \zeta |^{2} }} \to 0(n \to \infty ).$$

(a) uses the fact that \(k_{n} /n \to 0\) as \(n \to \infty\), so \(k_{n} - n\zeta < 0\) for sufficiently large \(n\); (b) uses Chebyshev’s inequality. This shows that the probability of reaching the stopping condition converges to 0 as \(n \to \infty\), which means the tree can split infinitely many times (i.e., \(m \to \infty\)) with probability 1.
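Purely as a numerical illustration (not part of the proof), the sketch below evaluates the final bound for an assumed rate \(k_{n} = n^{0.7}\), which satisfies \(k_{n}/n \to 0\) and \(k_{n}/\log n \to \infty\), and an arbitrary \(\zeta = 0.2\) (so that \(k_{n} < n\zeta\) for the values of \(n\) shown).

```python
# Illustration only: the Chebyshev bound zeta*(1 - zeta) / (n * |k_n/n - zeta|^2)
# for an assumed rate k_n = n**0.7 and an arbitrary zeta = 0.2.
zeta = 0.2
for n in [10**3, 10**4, 10**5, 10**6]:
    k_n = n ** 0.7
    bound = zeta * (1 - zeta) / (n * abs(k_n / n - zeta) ** 2)
    print(f"n = {n:>7}: k_n/n = {k_n / n:.4f}, bound on P(N < k_n) = {bound:.2e}")
```

The printed bound shrinks roughly like \(1/n\), consistent with the convergence claimed above.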

It now suffices to show that the conditions of Lemma 3.3 hold with probability 1. Clearly, we only need to prove that \(diam(A_{n} (x)) \to 0\) as \(n \to \infty\) with probability 1. Let \(V(i)\) denote the size of \(A_{n} (x)\) along the \(i\)-th feature; it suffices to show that \(E[V(i)] \to 0\) for all \(i \in \{ 1,2,...,D\}\).

Without loss of generality, at each node, we will scale each feature to [0, 1].

First, we define the following events: \(E_{1} =\) {the \(i\)-th feature is a candidate feature}, \(E_{2} =\) {the optimal split criterion is used to obtain the split point}, \(E_{3} =\) {the \(i\)-th feature is the splitting feature}. For a given \(i\), denote the largest size among its child nodes by \(V^{*} (i)\).

Let \(W_{i}\) be the position of the split point; then \(W_{i} |E_{3} \sim \mathcal{U}(0, 1)\) and

$$V^{*} (i)|E_{3} = \max (W_{i} |E_{3} ,1 - (W_{i} |E_{3} ))\sim \mathcal{U}[\frac{1}{2},1],$$

so we have

$$E[V^{*} (i)|E_{3} ] = \frac{3}{4}.$$
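Explicitly, since the split position is uniform on [0, 1],

$$E[V^{*} (i)|E_{3} ] = E[\max (W_{i} ,1 - W_{i} )|E_{3} ] = \int_{0}^{1} {\max (w,1 - w)\,dw} = 2\int_{1/2}^{1} {w\,dw} = \frac{3}{4}.$$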

When \(\overline{E}_{2}\) happens, the splitting feature is sampled according to the normalized vector of impurity reductions (denoted \(\hat{I}\), a vector over the \(\sqrt D\) candidate features). The probability of selecting the \(i\)-th feature as the splitting feature is minimized when the \(i\)-th element of \(\hat{I}\) is 0 and all other elements are 1. Therefore,

$$P(E_{3} |\overline{E}_{2} ) \ge \frac{1}{{1 + (\sqrt D - 1){\cdot}e^{{B_{1} }} }}\mathop = \limits^{\Delta } p_{1} ,$$

and

$$P(E_{3} ) = P(E_{2} ){\cdot}P(E_{3} |E_{2} ) + P(\overline{E}_{2} ){\cdot}P(E_{3} |\overline{E}_{2} ) \ge P(\overline{E}_{2} ){\cdot}P(E_{3} |\overline{E}_{2} ) \ge (1 - p)p_{1} .$$
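For intuition (this step is not needed for the rest of the proof), the constant \(p_{1}\) is what one obtains if, as assumed here, the \(i\)-th candidate feature is drawn with probability proportional to \(e^{B_{1} \hat{I}_{i} }\) over the \(\sqrt D\) candidate features; the exact normalization used in the main text may differ slightly. In the worst case \(\hat{I}_{i} = 0\) and \(\hat{I}_{j} = 1\) for \(j \ne i\),

$$P(E_{3} |\overline{E}_{2} ) \ge \frac{{e^{0} }}{{e^{0} + (\sqrt D - 1)e^{{B_{1} }} }} = \frac{1}{{1 + (\sqrt D - 1)e^{{B_{1} }} }} = p_{1} .$$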

So,

$$E[V(i)|E_{1} ] \le P(E_{3} |E_{1} ) \cdot E[V(i)|E_{1} ,E_{3} ] + P(\overline{E}_{3} |E_{1} ){\cdot}E[V(i)|E_{1} ,\overline{E}_{3} ]$$
$$= P(E_{3} ){\cdot}E[V(i)|E_{3} ] + (1 - P(E_{3} )) \cdot 1$$
$$\le P(E_{3} ) \cdot E[V^{*} (i)|E_{3} ] + 1 - P(E_{3} )$$
$$= P(E_{3} ) \cdot \frac{3}{4} + 1 - P(E_{3} )$$
$$= 1 - \frac{1}{4}P(E_{3} )$$
$$\le 1 - \frac{{(1 - p)p_{1} }}{4}.$$

Thus, it can be inferred that

$$E[V(i)] \le P(E_{1} ) \cdot E[V(i)|E_{1} ] + P(\overline{E}_{1} ) \cdot E[V(i)|\overline{E}_{1} ]$$
$$= \frac{\sqrt D }{D} \cdot E[V(i)|E_{1} ] + (1 - \frac{\sqrt D }{D}) \cdot 1$$
$$\le \frac{1}{\sqrt D } \cdot (1 - \frac{{(1 - p)p_{1} }}{4}) + 1 - \frac{1}{\sqrt D }$$
$$= 1 - \frac{{(1 - p)p_{1} }}{4\sqrt D }$$
$$\mathop = \limits^{\Delta } \,A\,\left( {{\text{denote}}\,A = 1 - \frac{{(1 - p)p_{1} }}{4\sqrt D }} \right).$$

The above is the result of a single split. If the \(i\)-th feature is split \(m\) times, iterating the bound above gives:

$$E[V(i)] \le A^{m} .$$

We have shown that \(m \to \infty\) with probability 1 as \(n \to \infty\), and since \(A < 1\), it follows that \(E[V(i)] \to 0\) with probability 1.

In summary, \(diam(A_{n} (x)) \to 0\) as \(n \to \infty\) with probability 1, so each DMRF tree is strongly consistent with probability 1. By Lemma 3.2, the DMRF algorithm is strongly consistent with probability 1.

The proof of Theorem 3.2

By Lemma 3.4, the strong consistency of DMRF in regression follows from the strong consistency of its base trees. The following proves the strong consistency of the base regression tree.

By Lemma 3.6, if we prove

$$\mathop {\lim }\limits_{n \to \infty } diam(A_{n} (x)) \to 0$$

and

$$\mathop {\lim }\limits_{n \to \infty } \frac{{|\{ j:A_{n,j} \cap S \ne \emptyset \} |\log n}}{n} = 0,$$

then the strong consistency of

$$m^{\prime}_{n} = \begin{cases} \dfrac{\sum\nolimits_{i = 1}^{n} {Y_{i} I(X_{i} \in A_{n} (x))} }{\sum\nolimits_{i = 1}^{n} {I(X_{i} \in A_{n} (x))} }, & \sum\nolimits_{i = 1}^{n} {I(X_{i} \in A_{n} (x))} > \log n \\ 0, & \text{otherwise} \end{cases}$$

is obtained. Since the number of samples in each cell is at least \(k_{n}\), i.e.,

$$N(A_{n} (x)) = \sum\nolimits_{i = 1}^{n} {I(X_{i} \in A_{n} (x))} \ge k_{n}.$$

Since \(k_{n} /\log n \to \infty\) as \(n \to \infty\), for sufficiently large \(n\),

$$\sum\nolimits_{i = 1}^{n} {I(X_{i} \in A_{n} (x))} \ge k_{n} > \log n.$$

In this case

$$m^{\prime}_{n} = \hat{y}(x) = \frac{{\sum\nolimits_{i = 1}^{n} {Y_{i} I(X_{i} \in A_{n} (x))} }}{{\sum\nolimits_{i = 1}^{n} {I(X_{i} \in A_{n} (x))} }} = \frac{1}{{N(A_{n} (x))}}\sum\limits_{{(X,Y) \in A_{n} (x)}} Y.$$

That is, the base regression tree is universally strongly consistent, provided the two limit conditions above hold. The former condition has already been proved in the consistency proof of the classification DMRF algorithm, so only the latter remains to be proved.

For a base tree with \(n\) training samples, there are at most \(\frac{n}{k_{n}}\) split regions, so

$$\frac{{|\{ j:A_{n,j} \cap S \ne \emptyset \} |\log n}}{n} \le \frac{n}{{k_{n} }} \cdot \frac{\log n}{n} = \frac{\log n}{{k_{n} }} \to 0(n \to \infty ).$$

Thus the latter condition holds, and the strong consistency of the regression DMRF algorithm follows.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Chen, J., Wang, X. & Lei, F. Data-driven multinomial random forest: a new random forest variant with strong consistency. J Big Data 11, 34 (2024). https://doi.org/10.1186/s40537-023-00874-6

Keywords