Data-driven multinomial random forest: a new random forest variant with strong consistency

In this paper, we strengthen the proof methods of some previously weakly consistent variants of random forest into strongly consistent ones, and improve the data utilization of these variants in order to obtain better theoretical properties and experimental performance. In addition, we propose the Data-driven Multinomial Random Forest (DMRF) algorithm, which has the same complexity as BreimanRF (proposed by Breiman) while satisfying strong consistency with probability 1. It performs better in classification and regression tasks than previous RF variants that only satisfy weak consistency, and in most cases even surpasses BreimanRF in classification tasks. To the best of our knowledge, DMRF is currently a low-complexity, high-performing random forest variant that achieves strong consistency with probability 1.


Introduction
Random Forest (RF, also called standard RF or BreimanRF) [1] is an ensemble learning algorithm that makes classification or regression predictions by taking the majority vote or average of the results of multiple decision trees. Due to its simple and easy-to-understand nature, rapid training, and good performance, it is widely used in many fields, such as data mining [2,3,4], computer vision [5,6,7], ecology [8,9], and bioinformatics [10].
Although the RF algorithm performs excellently on practical problems, analyzing its theoretical properties is quite difficult due to its highly data-dependent tree-building process. These theoretical properties include consistency, which can be weak or strong. Weak consistency means that the expectation of the algorithm's loss converges to the minimum value as the data size tends to infinity, while strong consistency means that the algorithm's loss itself converges to the minimum value, with probability 1, as the data size tends to infinity [11]. Consistency is an important criterion for evaluating an algorithm, especially in the era of big data.
Many researchers have made important contributions to the discussion of consistency-related issues in RF, proposing many weakly consistent variants of RF, such as Denil14 (also called Poisson RF) [12], Bernoulli RF (BRF) [13], and Multinomial RF (MRF) [14]. However, the common feature of these algorithms is that the selection of split points and the determination of the final leaf node labels during tree building are independent, i.e., half of the training samples are used to train the split points and the remaining half to determine the leaf node labels, which to a large extent causes insufficient growth of the base decision trees. In addition, the random distributions introduced in these algorithms, such as the Poisson, Bernoulli, and multinomial distributions, can enhance robustness but also have a certain impact on performance.
In this paper, we propose a new random forest variant called the Data-driven Multinomial Random Forest (DMRF), which has strong consistency and builds on the weakly consistent MRF and BRF. The term "data-driven" here does not mean that other random forest variants are not data-driven, but rather indicates that DMRF makes more effective use of the data than the aforementioned weakly consistent variants. In the DMRF algorithm, we incorporate a new bootstrapping (slightly different from the standard bootstrapping in BreimanRF) that was not included in the previous variants, and introduce a Bernoulli distribution when splitting nodes. This Bernoulli distribution determines whether the split point (a split point consists of a splitting feature and a splitting value) is obtained by the optimal splitting criterion or sampled from two multinomial distributions based on impurity reduction. The reason for introducing the multinomial distribution is that it samples the optimal splitting feature and feature value with the maximum probability [14].

Related work
BreimanRF [1] is an ensemble algorithm based on the prediction results of multiple decision trees, proposed by Breiman. It has shown satisfactory performance in practical applications. The basic process of BreimanRF can be divided into three steps. First, use bootstrapping to resample the dataset as many times as the size of the dataset to obtain the training set for a base decision tree. Second, randomly sample a feature subspace of size √D without replacement from the entire feature space of size D, and evaluate the importance of each feature and feature value in the subspace by the reduction in impurity (e.g., information entropy or Gini index) to obtain the optimal split point; recursively repeat this process until the stopping condition is met, yielding a decision tree. Finally, repeat the above process to train multiple decision trees, and take the majority vote (for classification problems) or average (for regression problems) of their results as the final prediction.
Since the proposal of the RF model, many variants have been developed, such as Rotation Forest [15], Quantile RF [16], Skewed RF [17], Deep Forest [18], and Neural RF [19]. These variants were proposed to further enhance the interpretability, performance, or efficiency of the random forest. Although practical research on RF has developed rapidly, its theoretical exploration lags slightly behind. Breiman proved that the performance of BreimanRF is jointly determined by the correlation between base trees and the strength of the base trees: the smaller the correlation between trees (i.e., the greater the diversity) and the stronger the trees (i.e., the better their individual performance), the better the performance of the RF [1].
An important breakthrough in the study of the consistency of RF was made by Biau et al. [20]. They proposed two simplified versions of BreimanRF: Purely Random Forest and Scale-Invariant Random Forest. Purely Random Forest randomly selects a feature and one of its values as the splitting feature and splitting value at each node. Scale-Invariant Random Forest also randomly selects a feature as the splitting feature at each node and randomly divides the samples into two parts according to the order of the values of that feature. Biau et al. proved that both simplified versions are consistent.
Biau [21] proved the weak consistency of another simplified RF model that is closer to BreimanRF: randomly selecting a feature subspace at each node and, for each candidate feature, taking the midpoint as the splitting value; among the candidate features, the one with the maximum reduction in impurity is selected to grow the tree. Denil et al. [12] proposed a new RF variant, Denil14, which is still closer to BreimanRF. Denil14 divides the training set into a structural part and an estimation part: the structural part is used only to train the split points, and the estimation part is used only to determine the labels of the leaf nodes. In addition, at each node the size of the feature subspace is drawn from a Poisson distribution, and the optimal splitting feature and value are searched over m pre-selected structural-part samples. This variant has been proven to be weakly consistent. Denil14 can only be used for classification.
Inspired by the Denil14 model, Yisen Wang et al. [13] proposed Bernoulli RF (BRF) based on the Bernoulli distribution. Like Denil14, BRF divides the dataset into a structural part and an estimation part, trains the split points using the structural part, and determines the leaf node labels using the estimation part. However, BRF introduces two Bernoulli distributions at each node: one determines whether the feature subspace size is 1 or √D, and the other determines whether the splitting value of each candidate feature is selected randomly or by the optimal splitting criterion. They proved that BRF is also weakly consistent, while being closer to BreimanRF and performing better than previous weakly consistent RF variants. BRF can be used for both classification and regression.
Jiawang Bai et al. [14] transformed each feature and its impurity reduction value into probabilities through the softmax function, proposing the Multinomial Random Forest (MRF). MRF also divides the dataset into structural and estimation parts, using the structural part to train the split points and the estimation part to determine the leaf node labels. Training a split point involves two steps: first, the maximum impurity reduction of each feature at the node is computed and converted into a probability, which is taken as the parameter of a multinomial distribution from which the splitting feature is randomly selected; second, the impurity reduction of each value of the chosen splitting feature is converted into a probability, taken as the parameter of a multinomial distribution from which the splitting value is randomly selected. For the leaf node label, MRF views the proportion of each class among the estimation-part samples in the leaf as a probability and randomly selects a class from the corresponding multinomial distribution. MRF selects splitting features and values more purposefully, with more reasonable probability allocations; it currently has the best performance among the consistent RF variants, even surpassing BreimanRF. Its disadvantage is high computational complexity, which incurs a large computational cost. MRF is only used for classification.
Because Denil14, BRF, and MRF all make the training of split points and leaf node labels independent in order to achieve weak consistency, the performance of the base trees inevitably suffers, reducing the overall performance of the algorithm. Based on this observation, we propose the Data-driven Multinomial Random Forest (DMRF) algorithm, which can be used for both classification and regression problems. DMRF directly uses the samples that train the split points to determine the leaf node labels. Moreover, it introduces bootstrapping to increase the diversity between trees. More importantly, we strengthen weak consistency to strong consistency by modifying the conditions of the variants mentioned above. We found that although the theoretical bases differ, the proof methods are quite similar; this means the methods for proving weak consistency can be strengthened into methods for proving strong consistency, yielding a stronger conclusion.
The rest of this paper is arranged as follows: Section 3 provides a detailed introduction to the classification and regression DMRF algorithms and proves their strong consistency; Section 4 gives the experimental settings; Section 5 presents the experimental results and analysis; Section 6 concludes the paper and outlines future work.

Classification DMRF
Given a training set D_n = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)} drawn from the distribution of (X, Y), we are preparing to build M trees.

Training sample sampling
In DMRF, we use a slightly different bootstrapping from the standard one to sample the training set D_n^(j) for the j-th tree. Specifically, during sampling we do not resample all samples; instead, each sample is included with probability q (which may be related to n; in this paper we choose a constant value in (0, 1]). That is, the inclusion of each sample follows a binomial distribution B(1, q), and the draws for different samples are independent. If the sampled training set is empty, we resample.
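As a concrete illustration, this sampling scheme can be sketched in a few lines of Python. The function name `dmrf_bootstrap` and the resample-until-nonempty loop are our own sketch of the description above, not the authors' code:

```python
import random

def dmrf_bootstrap(samples, q, rng=None):
    """Draw the training set for one tree: each sample is kept
    independently with probability q (one B(1, q) trial per sample).
    If the draw comes back empty, resample until it is not."""
    rng = rng or random.Random(0)
    while True:
        subset = [s for s in samples if rng.random() < q]
        if subset:
            return subset

data = list(range(10))
train = dmrf_bootstrap(data, q=0.632)
assert 0 < len(train) <= len(data)
```

Unlike standard bootstrapping, the drawn set contains no duplicates and its size is random with mean q·n.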

Split point training process
First, let's introduce a very important function: the softmax function.
Definition 3.1: Given an n-dimensional vector v = (v_1, v_2, ..., v_n), the softmax function is defined as

softmax(v)_i = e^{v_i} / Σ_{j=1}^{n} e^{v_j}, i = 1, ..., n,

where e is the base of the natural logarithm. Obviously, after the softmax transformation, the elements of the vector are all numbers between 0 and 1 and sum to 1, so they can be seen as probabilities.
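A minimal implementation of Definition 3.1 follows; the max-shift is a standard numerical-stability trick and not part of the definition:

```python
import math

def softmax(v):
    """Softmax as in Definition 3.1: exponentiate each element and
    normalize so the outputs lie in (0, 1) and sum to 1."""
    m = max(v)                              # shift for numerical stability
    exps = [math.exp(x - m) for x in v]
    s = sum(exps)
    return [e / s for e in exps]

probs = softmax([1.0, 2.0, 3.0])
assert abs(sum(probs) - 1.0) < 1e-12
assert all(0 < p < 1 for p in probs)
```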
Reviewing the growth process of a classification tree in BreimanRF: at a node D, √D features are randomly selected to form a feature subspace.

Leaf node label determination
When an unlabeled sample x is given, x falls into a leaf node of the tree according to the algorithm. In the tree, the probability that x is predicted to be class k is

η_k(x) = (1 / N(A_n(x))) · Σ_{(X_i, Y_i) ∈ A_n(x)} 1{Y_i = k},

where A_n(x) is the leaf node that x falls into and N(A_n(x)) is the number of samples in A_n(x). According to the majority voting principle, the prediction of sample x under this tree is ŷ(x) = argmax_k η_k(x). The prediction of DMRF is the result of majority voting over the base trees, i.e., the class receiving the most votes among ŷ^(1)(x), ..., ŷ^(M)(x), where ŷ^(i)(x) is the predicted value of sample x in the i-th decision tree. If there are multiple classes with the same number of votes, we randomly select one of them as the final prediction.
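The majority-voting rules above, including the random tie-break, can be sketched as follows (the helper names `tree_predict` and `forest_predict`, and the leaf-to-labels mapping, are our own illustration):

```python
from collections import Counter
import random

def tree_predict(leaf_labels, x_leaf, rng=None):
    """Predict the majority class of the leaf where x falls;
    leaf_labels maps a leaf id to the labels of its samples."""
    rng = rng or random.Random(0)
    counts = Counter(leaf_labels[x_leaf])
    top = max(counts.values())
    # break ties uniformly at random, as the text specifies
    winners = [c for c, n in counts.items() if n == top]
    return rng.choice(winners)

def forest_predict(tree_votes, rng=None):
    """Majority vote over the per-tree predictions y_hat^(i)(x)."""
    rng = rng or random.Random(0)
    counts = Counter(tree_votes)
    top = max(counts.values())
    winners = [c for c, n in counts.items() if n == top]
    return rng.choice(winners)

assert forest_predict([0, 1, 1, 1, 0]) == 1
```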
The pseudo-code of the DMRF algorithm and decision tree construction is as follows.

Algorithm 1: DMRF classification tree construction, Tree()
1. Input: a training set D_n^(i).
2. Output: a classification tree in DMRF.
3. While stopping condition is false do
4.   Compute the impurity reduction of all possible split points v_ij at node D.
5.   Sample b from the Bernoulli distribution B(1, p).
6.   If b = 1 then
7.     Select the feature subspace with size √D; use the best split criterion to select the best splitting point.
8.   Else
9.     Normalize the vector I composed of the maximum impurity reduction of each feature in the feature subspace, convert it into probabilities via softmax, and randomly select a splitting feature according to the multinomial distribution M(·); assume this feature is A_j.
10.    Normalize the impurity reduction vector of the feature A_j selected in the previous step, convert it into probabilities via softmax, and randomly select a splitting value according to M(·).
11.  End if
12.  Split the node D into left and right child nodes D_l, D_r by the splitting feature and value obtained.
13. End while
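The split-point selection of the tree-construction procedure, assuming impurity reductions have already been computed, can be sketched as follows. All names, the temperature arguments `b1`/`b2`, and the inverse-CDF sampler are our own illustration of the Bernoulli-plus-two-multinomials scheme, not the authors' code:

```python
import math
import random

def softmax(v, b=1.0):
    m = max(v)
    e = [math.exp(b * (x - m)) for x in v]
    s = sum(e)
    return [x / s for x in e]

def sample_index(probs, rng):
    """One draw from a multinomial distribution M(probs)."""
    r, acc = rng.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def choose_split(feature_gains, value_gains, p, b1, b2, rng=None):
    """feature_gains[i]: max impurity reduction of feature i in the subspace;
    value_gains[i][j]: impurity reduction of the j-th value of feature i.
    With probability p take the best split; otherwise sample the feature
    and value from softmax-based multinomials (temperatures b1, b2)."""
    rng = rng or random.Random(0)
    if rng.random() < p:   # Bernoulli B(1, p): optimal split
        i = max(range(len(feature_gains)), key=feature_gains.__getitem__)
        j = max(range(len(value_gains[i])), key=value_gains[i].__getitem__)
    else:                  # two multinomial draws
        i = sample_index(softmax(feature_gains, b1), rng)
        j = sample_index(softmax(value_gains[i], b2), rng)
    return i, j
```

With p = 1 this degenerates to the optimal split over the subspace; with p = 0 every split is sampled from the two multinomials.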
Algorithm 2: DMRF prediction
1. Input: a sample x, the training set D_n, the number of trees M.
2. Output: DMRF's prediction for sample x.
3. For i = 1 to M do
4.   Use bootstrapping with probability q to get the training set D_n^(i).
5.   If D_n^(i) is empty then
6.     Go to line 4 for resampling.
7.   End if
8.   Train a classification tree with the training set D_n^(i).
9. End for
10. Return: predict the class of x by majority voting.

Strong consistency proof of classification DMRF
In this section, we prove the strong consistency of the classification DMRF algorithm, and the detailed proof process is in the appendix.
First, we recall the consistency definitions for a classifier. For a classifier sequence {g_n}, the classifier g_n is obtained by training on the data set D_n drawn from the distribution of (X, Y), and its error rate is

L_n = P(g_n(X, C, D_n) ≠ Y | D_n),

where C is the randomness introduced in the training. The sequence is weakly consistent if E[L_n] converges to the Bayes risk L*, and strongly consistent if L_n → L* with probability 1 as n → ∞. Obviously, the condition of strong consistency is stronger than that of weak consistency, so strong consistency implies weak consistency, but the converse is not necessarily true.
Here are some important lemmas that will be used in the proof. Lemma 3.2 is quoted from Theorem 6 of [20]; refer to [20] for more details. Proving the strong consistency of each base tree is sufficient to prove the strong consistency of DMRF. However, using the whole dataset as the training set to build every tree would lead to excessive computational cost with large sample sizes; moreover, the similarity among the trees would significantly degrade the performance of the algorithm. Therefore, we introduce Lemma 3.2, which adds bootstrapping to reduce the training cost while reducing the similarity between trees.
The strong consistency of a single tree is proved below.

Lemma 3.3:
Let g_n be a binary tree classifier (that is, each parent node has only two child nodes) obtained by a partitioning rule on the n samples, each of whose regions contains at least k_n points, with k_n / log n → ∞. Let A_n(x) be the unique cell where the sample x falls, let diam(·) denote the diameter of a set, and let μ(·) be the Lebesgue measure. If for all balls S_r with radius r centered at the origin and for all γ > 0,

lim_{n→∞} μ({x : diam(A_n(x) ∩ S_r) > γ}) = 0 with probability 1,

then for all distributions, L_n → L* with probability one. In other words, g_n is universally strongly consistent.
Lemma 3.3 is quoted from Theorem 21.2 and Theorem 21.8 of [11]; refer to [11] for more details. Lemma 3.3 shows that to prove the strong consistency of a tree, we only need to prove that every leaf node is small enough while the sample size in each leaf node is large enough.
Based on the above lemmas, the strong consistency theorem of the DMRF algorithm (Theorem 3.1) can be obtained; its proof is given in the appendix.

Regression DMRF
In the last section we discussed the DMRF algorithm for classification, in this section we will discuss the DMRF algorithm for regression.

Regression DMRF Algorithm
In classification problems, we choose the Gini index to compute the impurity reduction, while in regression problems, we choose mean squared error (MSE) reduction as the metric for measuring the importance of features and feature values.

Denote the MSE of node D as

MSE(D) = (1 / N(D)) · Σ_{(X_i, Y_i) ∈ D} (Y_i − ȳ_D)²,

where ȳ_D is the mean of the samples in this node and N(D) is the sample size of node D. Similar to the classification case, when the split point is v_ij, the MSE reduction is

ΔMSE(v_ij) = MSE(D) − (N(D_l) / N(D)) · MSE(D_l) − (N(D_r) / N(D)) · MSE(D_r),

where D_l, D_r denote the left and right child nodes of D split by v_ij.
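The MSE reduction above can be computed directly; a small sketch with toy data (the data is illustrative, not from the paper):

```python
def mse(ys):
    """Mean squared error of a node's targets around their mean."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys) / len(ys)

def mse_reduction(node_ys, left_ys, right_ys):
    """Weighted MSE reduction used to score a candidate split v_ij."""
    n = len(node_ys)
    return (mse(node_ys)
            - len(left_ys) / n * mse(left_ys)
            - len(right_ys) / n * mse(right_ys))

# splitting [1, 1, 5, 5] into [1, 1] and [5, 5] removes all variance
assert mse_reduction([1, 1, 5, 5], [1, 1], [5, 5]) == 4.0
```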
When making a prediction, the predicted value of a tree is the sample mean of the leaf node A_n(x) where the sample x resides, in other words

ŷ(x) = (1 / N(A_n(x))) · Σ_{(X_i, Y_i) ∈ A_n(x)} Y_i,

where N(A_n(x)) denotes the sample size of A_n(x). The prediction of the forest is the mean over the trees, that is

ŷ(x) = (1 / M) · Σ_{i=1}^{M} ŷ^(i)(x),

where M denotes the number of trees in the forest and ŷ^(i)(x) is the prediction of the i-th tree for x.
The regression DMRF differs from the classification DMRF only in the splitting criterion for the split point and the prediction method. To obtain the regression DMRF, we only need to change the impurity reduction criterion to MSE reduction and the majority voting prediction to mean prediction in the classification DMRF.

Strong consistency proof of regression DMRF
Similar to the classification case, let us first define the strong consistency of a regression estimator. For an estimator sequence {f_n} trained on the data set D_n drawn from the distribution of (X, Y), strong consistency means that

E[(f_n(X, C, D_n) − f(X))² | D_n] → 0 with probability one,

where f(x) = E[Y | X = x] is the regression function and C is the randomness introduced in the training. If this holds for all such distributions, the estimator is strongly universally consistent. Lemma 3.6 is quoted from Theorem 23.2 of [11]; one can refer to [11] for more details.

Experiment
For the sake of convenience in narration, we use "(SE)" to indicate that an algorithm is implemented as in its original paper, such as Denil14(SE), BRF(SE), and MRF(SE).

Dataset selection
Our data sets are all from the UCI database. Table 1 and Table 2 contain the sample number and feature number of the classification and regression data sets respectively, and for the classification data sets the number of classes is also listed. In the two tables, we sort the data sets by sample number and test data sets covering a wide range of sample sizes and feature dimensions in order to show the performance of DMRF. In addition, following [14], missing values in all data sets are filled with "-1", and no other preprocessing was performed. Note: due to the large label values of the CSM, Facebook, and SeoulBikeData datasets, their labels are log-transformed.

Baselines
We compare DMRF with the following weakly consistent baselines. 1) Denil14(SE), which divides the training set into a structural part and an estimation part and draws the feature subspace size from a Poisson distribution. 2) BRF(SE), which introduces two Bernoulli distributions to decide the subspace size and whether the splitting value is chosen randomly or by the optimal criterion. 3) MRF(SE), which, when selecting the splitting feature, normalizes the vector composed of the maximum impurity reduction of each feature and converts it into probabilities using the softmax function, used as a multinomial distribution from which the splitting feature is randomly selected; the impurity reductions of the values of the obtained splitting feature likewise form a vector, which is normalized and converted into probabilities by the softmax function, used as a multinomial distribution from which the splitting value is randomly selected.
Following our method, we can abandon the separation of the structural part and the estimation part in the three models above, and at the same time add the bootstrapping defined earlier, obtaining Denil14(b), BRF(b), and MRF(b). The experiments will examine how much these three models gain from the improved data utilization.

Performance Test Experimental Settings
In the performance experiment, for the three models above, i.e., Denil14(SE), BRF(SE), and MRF(SE), we set Ratio = 0.5 uniformly. Besides, we set k_n = 5 and M = 100 according to [14] (the purpose of setting k_n = 5 uniformly is to promote extensive tree growth; as long as different algorithms use the same value of k_n on a given data set, comparability is ensured). In Denil14, we set m = 100 and λ = 10 according to [14]. Following [13], we set p_1 = p_2 = 0.05 in BRF. As [14] suggested, in MRF we set B_1 = B_2 = 5.

Standard Deviation Analysis
The parameters used in the standard deviation analysis are the same as those used in the performance experiment.

Parameter Test Experimental Settings
In the parameter testing experiment, we explore the influence of hyperparameters on DMRF. We focus on p, q, B_1, and B_2. For p and q, the test range is [0.05, 0.95] with step size 0.1. For B_1 and B_2, the test range is the integers in [1, 10]. In terms of dataset size, we define datasets with fewer than 500 data points as small, datasets with 500-1000 data points as medium, and datasets with more than 1000 data points as large. For classification problems, accuracy is used as the evaluation metric. For regression problems, negative mean squared error (NMSE) is used for ease of observation.

Performance Analysis
Among the RF variants, the best result in each table is highlighted in bold. To compare the performance of DMRF and BreimanRF, we use "*" to mark the better of the two.

Classification
The evaluation metric for classification problems is accuracy. From Table 3, the following conclusions can be drawn:
- In the majority of cases, the (b)-type models show higher accuracy than their corresponding (SE)-type models (for example, MRF(b) achieves 9% higher accuracy than MRF(SE) on the Winequality(white) dataset). This suggests that when the splitting criterion and the leaf node label determination are independent, a significant amount of information may be lost; determining the leaf node labels with the same samples used to compute the split points helps reduce this loss.
- Across all datasets, DMRF generally outperforms the other RF variants, and its advantage over MRF(b) is particularly evident, with an improvement of 1% on some datasets (such as Obesity and Ai4i).
- In most cases, the accuracy of DMRF is higher than that of BreimanRF. This can be attributed to the multinomial sampling of splitting values, which can be seen as a weakened version of optimal splitting that enhances robustness. It indicates that introducing some randomness in classification tasks can enhance performance.

Regression
The evaluation metric for regression problems is mean squared error. From Table 4, the following conclusions can be drawn:
- Similar to the classification case, in the majority of cases the (b)-type models outperform their corresponding (SE)-type models (for example, there is a 48% reduction in MSE on the Real estate dataset). This suggests that determining the leaf node labels with the samples used to compute the split points utilizes information better than keeping the two processes independent.
- Across all datasets, DMRF generally shows the best performance among the RF variants, but its advantage over MRF(b) is not significant.
- In most cases, the MSE of DMRF is larger than that of BreimanRF. This is because MSE amplifies the impact of noise, and the randomness introduced by the multinomial distribution makes DMRF perform worse in regression than BreimanRF. By contrast, DMRF is better suited to classification tasks.

Standard Deviation Analysis
Given the inherent randomness in the models used for the experiments, it is essential to compare their levels of randomness.In this case, we will evaluate the randomness using the standard deviation as a metric.Furthermore, whether in classification or regression, in cases with small and medium sample sizes, DMRF tends to be more stable than MRF(b) and BreimanRF (for example, on the Wdbc and Qsar fish toxicity datasets).However, with large sample sizes, DMRF exhibits higher randomness compared to MRF(b) and BreimanRF (for example, on the Winequality(white), Connect-4, Insurance, and Combined datasets).This can be attributed to the introduction of Bernoulli and multinomial distributions in the process of finding splitting points in DMRF, which results in higher levels of randomness compared to MRF(b) and BreimanRF.Under small sample size cases, adding appropriate randomness helps increase the robustness of the model.However, in large sample size cases, the difference in randomness is amplified, leading to higher standard deviations for DMRF compared to MRF(b) and BreimanRF.

Parameter Analysis
In this section, we explore the influence of hyperparameters on DMRF.

The effect of p , q
We investigate p and q under B_1 = B_2 = 5, as [14] recommended. As q grows, the performance first increases and then falls into one of two situations: staying close to the maximum and stable, or decreasing. Through analysis, when q is too small, the sampling probability of each sample is too low, resulting in insufficient training of the trees; as q increases, the number of sampled points grows and the training of the trees gradually becomes sufficient, so the performance of DMRF increases. However, once the number of sampled points reaches a certain value, the performance improvement slows down, and the accuracy tends to stabilize or even starts to decrease. The reason for the decrease is that the number of sampled points exceeds an appropriate value, resulting in high similarity between the training sets of the trees, which affects the overall performance.

1) Classification
Since the optimal value of q is around 0.63, we set q = 1 − 1/e (≈ 0.6321) to reduce the computational burden while obtaining a near-optimal parameter.
It is worth noting that the bootstrapping used in this paper can be seen as a general form of standard bootstrapping (without repetition) in the large-sample case. In fact, if we draw n samples with replacement from an n-sample dataset with equal probability, the probability of each sample being selected at least once is 1 − (1 − 1/n)^n, which tends to 1 − 1/e as n → ∞. This is also one of the reasons why we chose q = 1 − 1/e. Under the same q, it can be observed that the accuracy first increases with p and then decreases, reaching an optimal value around p = 0.5. In DMRF, if p = 0, the selection of split points at each node depends entirely on the random selection of the two multinomial distributions; if p = 1, DMRF selects the optimal split point from the feature subspace, which is similar to BreimanRF. It can be seen that introducing some randomness when selecting split points in the feature subspace can enhance performance. Additionally, the algorithm is more sensitive to the parameter q than to p, indicating that the number of training samples matters more than the method of split point selection.

2) Regression
The NMSE increases with q and reaches its maximum at around q = 0.63 before stabilizing or beginning to decrease. Under the same q, the NMSE initially increases with p, then decreases or stabilizes after a certain point. Therefore, the optimal q is considered to be 1 − 1/e and the optimal p is considered to be 0.5. Additionally, it is still observed that the DMRF algorithm is more sensitive to parameter q than to parameter p.
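The limit 1 − 1/e quoted above follows from (1 − 1/n)^n → 1/e; a quick numerical check:

```python
import math

def p_selected(n):
    """Probability that a given sample appears at least once when
    drawing n times with replacement from an n-sample dataset."""
    return 1 - (1 - 1 / n) ** n

limit = 1 - 1 / math.e   # ≈ 0.6321
assert abs(p_selected(10**6) - limit) < 1e-6
```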

The effect of B_1, B_2

1) Classification
When B_1 is close to 1, the probabilities of the features are not significantly different from each other, making it difficult to sample the optimal features. As B_1 grows, the differences in probabilities between features become larger and more important features tend to be selected, improving DMRF's performance. However, when B_1 grows too large, feature selection becomes similar to always selecting the optimal feature, making the base decision trees too similar and causing a decrease in performance. The situation for B_2 is similar. It can also be seen from the figure that DMRF is not sensitive to B_2 but is more sensitive to B_1. Since B_1 affects the selection of splitting features and B_2 affects the selection of splitting values, this indicates that the impact of the splitting feature is greater. This also makes sense because the splitting value is obtained based on the splitting feature, so in general the influence of the splitting feature is larger than that of the splitting value. In terms of NMSE, performance generally decreases as B_1 decreases from 10; after reaching around B_1 = 5, NMSE stabilizes or starts to decrease. Under the same B_1, NMSE increases as B_2 increases from 1 and stabilizes or slightly decreases at around B_2 = 5. The reason is that when B_1 is too large, the probability of selecting the optimal feature is much higher than that of the other features, resulting in high similarity between trees and poor performance.

2)regression
As B_1 decreases from a large value, more randomness is introduced, preserving the high performance of the trees while increasing their diversity and thus improving the algorithm's performance. However, when B_1 is close to 1, the probabilities of the features are not significantly different from each other, making it difficult to sample the optimal features and leading to poor performance of the trees and of the algorithm.
The situation for B_2 is similar to that of B_1. Furthermore, it can be observed from Fig. 4 that DMRF is more sensitive to B_1 than to B_2, as in the classification case. Since B_1 affects the selection of splitting features and B_2 affects the selection of splitting values, it can be concluded that the impact of the splitting feature is greater. (The NMSE of the Winequality(white) dataset has been multiplied by 100 for ease of observation.)
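The role of B_1 and B_2 as softmax temperatures, sharpening or flattening the sampling probabilities, can be illustrated numerically (the gains vector is made up, and the parameter `b` stands in for B):

```python
import math

def softmax(v, b):
    """Softmax with temperature parameter b (playing the role of B)."""
    m = max(v)
    e = [math.exp(b * (x - m)) for x in v]
    s = sum(e)
    return [x / s for x in e]

gains = [0.1, 0.4, 0.5]        # illustrative impurity reductions
flat = softmax(gains, b=1)     # B near 1: probabilities close together
sharp = softmax(gains, b=20)   # large B: mass concentrates on the best
assert max(flat) < max(sharp)
assert sharp[2] > 0.85
```

With small B the three features are drawn almost uniformly; with large B the sampling behaves almost like always choosing the optimal feature, which matches the trade-off discussed above.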

The effect of number of trees
In the classification case, it can be observed that as M increases, DMRF shows an upward trend in accuracy for all three datasets and gradually converges after reaching 100 trees.Similarly, in the regression case, as M increases, DMRF shows a downward trend in MSE for all three datasets and gradually converges after reaching 100 trees.
This demonstrates that DMRF, as a method of Bagging, improves its performance gradually with an increasing number of base learners until convergence.

Cross-validation
In this section, we use cross-validation to determine the optimal parameters for DMRF, MRF(b), BRF(b), and Denil14(b) on the classification and regression datasets. For all models, the candidate parameter values are taken from {0.2, 0.…}. In both classification and regression, comparing the performance of DMRF with MRF(b), BRF(b), and Denil14(b) under their optimal parameters, Table 6 shows that DMRF outperforms the other models. Compared to MRF(b), DMRF shows consistent improvements. The reason can be analyzed as follows: although MRF(b) uses the softmax function to convert feature and feature-value importance into probabilities and samples split points from a multinomial distribution, it considers the entire feature space. The higher the importance of a feature or feature value, the higher its probability, and the softmax function amplifies the differences between features or feature values, so the probability of sampling the optimal feature and optimal feature value remains the highest. This can be considered a weakened version of optimal splitting in the full feature space.
Although this approach improves performance, it reduces the diversity among trees.
On the other hand, DMRF selects optimal splits in the feature subspace based on probabilities and performs multinomial distribution sampling, which increases the diversity among trees and thus improves performance.

Computational complexity analysis
Assume that the data set has n samples and D features, we prepare to build M trees.The complexity of random sampling is not considered below.
The best case for tree construction is completely balanced growth, in which case the depth of the tree is O(log n).

The computational complexity of each RF variant is summarized in Table 7.
From Table 7, it can be seen that MRF(b) has the highest complexity, followed by DMRF and BreimanRF. Due to the sampling process involved in selecting split points, DMRF generally takes slightly longer than BreimanRF. Because p1 and p2 are generally small, in most cases the complexity of BRF(b) is slightly lower than that of DMRF and BreimanRF. As for Denil14(b), its complexity ranking is determined by the values of its parameters, including m; because it only selects 100 points to compute splitting points, it is faster.
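The source of Denil14(b)'s speed advantage can be sketched as follows (a simplified illustration, not the paper's implementation: only the idea of evaluating candidate thresholds at m preselected points instead of all n is shown):

```python
import numpy as np

# Sketch of Denil14-style split search: candidate thresholds are taken from
# at most m subsampled points rather than from all n sample values, so the
# per-node split search evaluates O(m) thresholds instead of O(n).
rng = np.random.default_rng(0)

def candidate_thresholds(values, m=100):
    # subsample at most m points; their values serve as candidate thresholds
    if len(values) <= m:
        return np.sort(values)
    return np.sort(rng.choice(values, size=m, replace=False))

x = rng.random(10_000)
thresholds = candidate_thresholds(x, m=100)
print(len(thresholds))  # 100 candidates instead of 10,000
```

This explains why, despite a comparable feature-subspace size, Denil14(b) can run faster on datasets where evaluating all n candidate thresholds dominates the cost.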

Conclusions and future works
The main contributions of this paper are as follows: ① By modifying the condition on the number of samples in leaf nodes, the weak consistency proofs of previous RF variants have been improved to strong consistency proofs.

□
The proof of Lemma 3.4: every base tree is strongly consistent, i.e., its loss converges to the minimum as n → ∞, where C(i) is the randomness introduced in building the i-th tree and step (c) uses the Cauchy inequality.

□
The proof of Theorem 3.1: First, we prove that, with probability 1, each leaf node of a DMRF tree contains at least k_n samples as n → ∞.
Due to the randomness of split point selection, the final selected split point can be regarded as a random variable W that follows the uniform distribution on [0,1], with cumulative distribution function F(w) = w. Without loss of generality, we normalize the values of all attributes to the range [0,1] at each node. For a certain 0 < δ < 1, the smaller child node after the root splits on a feature has size M_1 = min(W, 1 − W). If the same feature is split m times (i.e., the tree grows to the m-th layer), the probability P(M_m ≥ δ^m) that the smallest node in the m-th layer has size at least δ^m can be bounded from below. The above bound assumes that the same feature is selected at every split; if different features are split at different layers, P(M_m ≥ δ^m) is even larger. In either case, every node in the m-th layer has size at least δ^m with probability at least 1 − ε.
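As a quick numerical check of the uniform-split step above: for W uniform on [0,1] and 0 < δ ≤ 1/2, the smaller child min(W, 1 − W) satisfies P(min(W, 1 − W) ≥ δ) = 1 − 2δ, which a Monte Carlo estimate reproduces.

```python
import numpy as np

# Monte Carlo check: with the split point W ~ Uniform[0,1], the smaller child
# node has size min(W, 1 - W), and for 0 < delta <= 1/2
#   P(min(W, 1 - W) >= delta) = 1 - 2 * delta.
rng = np.random.default_rng(0)
W = rng.random(1_000_000)
smaller = np.minimum(W, 1 - W)
delta = 0.2
est = np.mean(smaller >= delta)
print(round(est, 3))  # close to 1 - 2*0.2 = 0.6
```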
Since X has a non-zero density function, each node in the m-th layer of the tree has positive measure with respect to the distribution of X. By Chebyshev's inequality, the probability of reaching the stopping condition converges to 0 as n → ∞, which means the tree can split infinitely many times with probability 1.
It is sufficient to show that the conditions of Lemma 3.3 are satisfied with probability 1. Obviously, we just need to prove that diam(A_n(x)) → 0 as n → ∞ with probability 1. Let V(i) denote the size of the i-th feature of A_n(x); we only need to show that E[V(i)] → 0 for all i ∈ {1, 2, …, D}.
Without loss of generality, at each node, we will scale each feature to [0, 1].
First, we define the following events: E1 = {the i-th feature is a candidate feature}, E2 = {the optimal split criterion is used to choose the splitting point}, E3 = {the i-th feature is the splitting feature}. For a given i, denote the largest size among the child nodes along the i-th feature as V*(i).
Let W_i be the position of the splitting point. When E2 happens, during the selection of the splitting feature, the normalized vector of impurity reductions over the candidate features (denoted Î) is considered. When the i-th element of Î is 0 and all other elements are 1, the probability of selecting the i-th feature as the splitting feature is minimized.
Combining the probabilities of E1, E2, and E3 yields an upper bound on E[V*(i)], and hence on E[V(i)] after one split.
The above is the result of a single split. If the i-th feature is split m times, iterating the bound shows that E[V(i)] decreases geometrically in m. We have proven that m_n → ∞ with probability 1, so E[V(i)] → 0 and the strong consistency of DMRF is obtained with probability 1.
In summary, diam(A_n(x)) → 0 as n → ∞; that is, the base regression tree is universally strongly consistent, provided the two limit conditions above hold. The former condition has been proved in the consistency proof of the classification DMRF algorithm, so only the latter needs to be proved.
For a base tree with n training samples, there are at most n/k_n leaf nodes, since each leaf contains at least k_n samples; hence the latter condition holds. Therefore, the strong consistency of the regression DMRF algorithm is obtained. □

Lemma 3.1: Lemma 3.2:
Lemma 3.1: Assume that the classifier sequence {g_n} is (universally) strongly consistent. Then the majority-voting classifier g_n^(M) (for any value of M) is also (universally) strongly consistent. Lemma 3.2: Assume that the classifier sequence {g_n} is strongly consistent. Then the bagging majority-voting classifier g_n^(M) (for any value of M) is also strongly consistent if lim

Lemma 3.1
shows that to prove that an ensemble classifier is strongly consistent, we only need to prove that its base classifier is strongly consistent; the universal strong consistency of the ensemble classifier is likewise obtained from the universal strong consistency of the base classifiers. Lemma 3.2 can be regarded as a corollary of Lemma 3.1, which shows that the use of bootstrapping does not affect the consistency of the ensemble algorithm. It is worth noting that Lemma 3.1 alone (without bootstrapping)

Theorem 3.2: Assume that X is supported on [0,1]^D and has a non-zero density almost everywhere, and that the cumulative distribution function (CDF) of the split points is left-continuous at 1 and right-continuous at 0. If B1 and B2 are both positive and finite, DMRF is strongly consistent with probability 1 when k_n / log n → ∞.

We use "(b)" to indicate the algorithms that use the bootstrapping defined in this paper without separating the structure part and the estimation part, namely Denil14(b), BRF(b), and MRF(b). The experiments are divided into three parts: performance test, standard deviation analysis, and parameter test. The performance test evaluates DMRF on classification and regression problems and compares it with three other consistent RF variants (both weakly and strongly consistent), as well as BreimanRF, to demonstrate DMRF's performance. The standard deviation analysis evaluates the standard deviation of the RF variants on classification and regression problems to measure the randomness of different RFs. The parameter test discusses the impact of the hyper-parameters p, q, B1, B2 on the performance of DMRF and provides some recommendations for selecting optimal parameters.
We choose three previously proposed RF variants with weak consistency, Denil14(SE), BRF(SE), and MRF(SE), as comparison models for DMRF. Their common feature is that the dataset is divided into a structure part and an estimation part according to the hyper-parameter Ratio; the structure part is used for training the split points and the estimation part for determining the leaf-node labels. 1) Denil14(SE) randomly selects m points of the structure part at each node, then selects a feature subspace of size min(…), and searches for the optimal splitting point within the range defined by the m preselected points (not over all data points). 2) BRF(SE) introduces the first Bernoulli distribution when selecting the feature subspace: a feature is randomly selected from the feature set with probability p1 as the split feature, or √D features are randomly selected from the feature set with probability 1 − p1 as the candidate features. The second Bernoulli distribution is introduced in the selection of split values: a value is randomly selected as the split value from the split feature with probability p2, or the value with the largest impurity reduction is selected with probability 1 − p2.
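BRF's two Bernoulli draws described above can be sketched as follows (function and parameter names are illustrative, and the √D subspace size is an assumption based on the usual BRF setup, not a detail confirmed here):

```python
import math
import numpy as np

# Sketch of BRF-style two-level Bernoulli randomization:
# 1) with probability p1, a single random feature is the split feature;
#    otherwise a random subspace of about sqrt(D) candidate features is used;
# 2) with probability p2, a random value is the split value;
#    otherwise the value with the largest impurity reduction is used.
rng = np.random.default_rng(0)

def brf_choose_feature(D, p1=0.05):
    if rng.random() < p1:
        return [int(rng.integers(D))]             # one random feature
    k = max(1, math.isqrt(D))                     # ~sqrt(D) candidates
    return sorted(rng.choice(D, size=k, replace=False).tolist())

def brf_choose_split(values, gains, p2=0.05):
    if rng.random() < p2:
        return float(rng.choice(values))          # random split value
    return float(values[int(np.argmax(gains))])   # impurity-optimal value

print(brf_choose_feature(16))
```

The small probabilities p1 and p2 inject just enough extra randomness for the consistency argument while the split search remains mostly impurity-driven.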

Fig. 1
Fig. 1 Accuracy (%) of DMRF under different p, q values. Fig. 1 shows the performance of DMRF on three classification datasets with small, medium, and large sample sizes under fixed B1 and B2.

Fig. 2
Fig. 2 Negative mean square error of the DMRF under different p, q values

Fig. 2 shows the performance of DMRF on three regression datasets with small, medium, and large sample sizes under fixed B1 and B2.

Fig. 3
Fig. 3 Accuracy (%) of DMRF under different B1, B2 values. Fig. 3 shows the impact of B1 and B2 on the DMRF algorithm under q = 1 − 1/e and p = 0.5 for three classification datasets with small, medium, and large sample sizes.

Fig. 4
Fig. 4 Negative mean square error of DMRF under different B1, B2 values. Fig. 4 shows the impact of B1 and B2 on the DMRF algorithm under q = 1 − 1/e and p = 0.5 for three regression datasets with small, medium, and large sample sizes.

Fig. 5
Fig. 5 Performance of DMRF under different numbers of trees. The left side of Fig. 5 shows the accuracy trends on the Blogger, Tic-tac-toe, and Winequality (white) datasets as the number of trees (i.e., M) increases in classification. The right side shows the mean squared error (MSE) trends on the Real estate, Winequality (white), and Combined datasets as M increases in regression. It is worth noting that the MSE of the

Fig. 6
Fig. 6 Running time of one iteration of cross-validation for different models. Fig. 6 shows the running time of one iteration of cross-validation for three classification datasets (Tic-tac-toe, Winequality (white), Connect-4) and three regression datasets (Alcohol, Flare, Insurance) under the parameters mentioned in Section 4.3. It can be observed that BRF(b) has the shortest runtime in both classification and regression tasks, while MRF(b) has the longest runtime in both tasks. As analyzed earlier, in most cases BreimanRF has a shorter runtime than DMRF. It is worth noting that on Connect-4, a big classification dataset, Denil14(b) has a longer runtime than DMRF, BRF(b), and BreimanRF, while on Insurance, a big regression dataset, Denil14(b) has a shorter runtime than DMRF, BRF(b), and BreimanRF. This is because Connect-4 has 42 features, and the average size of Denil14(b)'s feature subspace is 11, while the average size of the feature subspaces of DMRF, BRF(b), and BreimanRF is 6. Although Denil14(b) only selects 100 points to compute split points, the experimental results indicate that on Connect-4 the impact of feature-subspace size on runtime outweighs the saving from calculating splitting points on only 100 samples each time. On Insurance, the number of features is 86, and the average size of Denil14(b)'s feature subspace is 11, while the average size of the feature subspaces of DMRF, BRF(b), and BreimanRF is 9, which is not significantly different; however, because Denil14(b) only selects 100 points to compute splitting points, it runs faster.

The previously proposed weak-consistency models, Denil14(SE), BRF(SE), and MRF(SE), have been enhanced to models with strong consistency with probability 1, namely Denil14(b), BRF(b), and MRF(b). ② We introduce a novel algorithm called DMRF, which combines the Bernoulli and multinomial distributions. DMRF utilizes a modified bootstrapping method to obtain training sets for the base trees and uses the combination of Bernoulli and multinomial distributions to determine the splitting points during tree
construction. This approach increases diversity while maintaining high-performance base trees. ③ The experiments indicate that DMRF outperforms MRF(b) and BreimanRF in classification tasks. In regression tasks, DMRF performs better than MRF(b), but the difference is not significant, and in most cases DMRF's performance is not as good as BreimanRF's, suggesting that DMRF is more suitable for classification tasks. ④ In terms of standard deviation, DMRF has a lower standard deviation than MRF(b) and BreimanRF on small and medium-sized datasets; on large datasets, however, DMRF is more likely to have a higher standard deviation than MRF(b) and BreimanRF. In terms of time complexity, DMRF has the same complexity as BreimanRF. The complexity of BRF(b) and Denil14(b) is determined by their parameter settings, while MRF(b) has the highest complexity. The main advantages of DMRF lie in its strong theoretical properties, excellent performance, and low complexity (the same as BreimanRF). It shows clear advantages on small-sample datasets. One limitation is that DMRF shows higher randomness than BreimanRF on large datasets. Future research can focus on addressing this increased randomness of DMRF in large-sample cases.

the measure of each leaf node is positive and the number of leaf nodes is finite. The number of samples in the training set is n, and the number of samples in the leaf node N follows the binomial distribution B(n, ·); then

□
Since diam(A_n(x)) → 0 as n → ∞ with probability 1, the DMRF tree is strongly consistent with probability 1. By Lemma 3.2, the DMRF algorithm has strong consistency with probability 1. The proof of Theorem 3.2: By Lemma 3.4, the strong consistency of DMRF in the regression problem is based on the strong consistency of the trees. The following proves the strong consistency of the base regression tree.

obtained. Since the number of samples in each cell is at least k_n, when n is sufficiently large,

Table 1
The description of benchmark classification datasets

Table 2
The description of benchmark regression datasets

Table 3
Accuracy (%) of different RFs on benchmark datasets

Table 4
Mean square error (%) of different RFs on benchmark datasets

Table 5
Standard deviation of different RFs on classification and regression datasets. It can be observed that, in both classification and regression tasks, in most cases the (b)-type models show larger standard deviations than their corresponding (SE)-type models (for example, MRF(b) has a larger standard deviation than MRF(SE) on the Blogger and ALE datasets). This indicates that, under similar conditions, using bootstrapping to obtain training sets introduces greater randomness than dividing the dataset into a structure part and an estimation part.

Table 6
Results of cross-validation of different RFs

Table 7
Computational complexity of RFs