Alternative stopping rules to limit tree expansion for random forest models

We have presented a number of alternative tree expansion stopping rules for RF models. It seems that for some datasets, especially NHANES, Tasmanian Abalone and Los Angeles Ozone data, the new types of stopping rules we are fitting have a very similar MSPE to the standard stopping rules normally used by RF models (Table 2, Fig. 1). However, for two other data sets, the Boston Housing and MIT Servo data, it is clear that two particular variants of stopping rules fit significantly better than the standard RF model (Table 2, Figure 1). In general, using the 25-75% intercentile range statistic to control tree expansion produces much less variation in MSPE, and MSPE also closer to the optimum. The MSPE for this measure does not exceed 5% of the MSPE for the best tree expansion method for each data set (Fig. 1).

One of the parameters of the RF algorithm is the minimum node size below which the node would remain unsplit. This is very commonly available in RF algorithm implementations, especially in the randomForest package4. The problem of node size selection in RF models is much studied in the literature. In particular Probst et al.seven revisit the topic of hyperparameter tuning in RF models, with a subsection devoted to choosing the size of the terminal node. This has also been discussed from a more theoretical point of view in a related article by Probst et al.6. Like Probst et al. document, the optimal node size is often quite small, and in many packages the default value is set to 1 for classification trees and 5 for regression treesseven. There are a number of packages available that allow alternatives to the standard parent node size limit for splitting nodes. In particular the randomForestSRC8 and the party kit9.10 Both R packages allow for limiting splits by the size of the descendant node. To the best of our knowledge, no statistical package uses the range, variance, or percentile range based limits demonstrated here. It should be noted that the use of parent and descendant node size limits is not equivalent. Although it is obviously true that if the node size of the offspring is at least (not) then the size of the parent node must be at least (2n), the reverse is clearly not the case. For example, it may be that among the candidate divisions of a particular node of size (2n) would in general be descending nodes of sizes (1,2,…,n – 1,n,n + 1,…2n – 1). If one insisted that the terminal nodes be of size (not) only then the split into two nodes each of size (not) would be considered, while without restriction on the size of the terminal nodes the potential candidates would generally include nodes of size (1,2,…,n – 1,n + 1,…2n – 1) also, although split variables in general may not allow all of this to happen.

Many variations of the RF model have been created, many with implementations in R software. For example, RF quantile regression was introduced by Meinshausen11 and combines quantile regression with random forests and its implementation provided in the quantregForest package. Garge et al.12 implemented model-based partitioning of the feature space and developed the associated R software mobForest (although this has now been removed from the CRAN archive). Seibold et al.13 also used recursive partitioning RF models that were fitted to amyotrophic lateral sclerosis data. Seibold et al. have also developed software to adapt such models, in the R package model4you14. Segal and Xiao15 described the use of RF for multivariate results and developed the MultivariateRandomForest R package16 to mount such models. A number of more specialized RF algorithms have also been developed. Pari and Athey17 used concepts from causal inference and introduced the idea of ​​a causal forest. Foster et al.18 also used standard RFs as part of a causal (counterfactual) approach to identifying subgroups in randomized clinical trial data. Li et al.19 applied more standard RF models to analyze data from multicenter clinical trials. An algorithm that combines RF methods and Bayesian Generalized Linear Mixed Models for the analysis of grouped and longitudinal binary results, called the binary mixed model forest was developed by Speiser et al.20, using standard R packages. Quadrianto and Ghahramani21 also proposed a new RF algorithm incorporating Bayesian elements, which they implemented in Matlab, and compared this model with a number of other machine learning approaches in the analysis of a number of sets of data. Ishwaran et al.22 describes a survival RF algorithm applicable to right-censored survival data; a randomSurvivalForestSRC R package (now removed from the CRAN repository) has been written to implement this pattern, among other temporal RF variants. For genomic inference, two R packages implementing standard RF models have been developed by Díaz-Uriarte and de Andrés23 and Diaz-Uriarte24, GeneSrF and varSelRF. RFs were used in the meta-analysis, and a software implementation is provided by the metaforest R package25. The grf:geographical random forest package by Georganos et al.26 provides an implementation of the RF model specifically for geographic analyses.

Our main objective was the improvement of the prediction error, as measured by the MSPE. Attempts have been made to reduce bias in RF models, a related but different problem. Zhang and Lu27 outlines five different methods to achieve this. Song described a different method of bias correction, via residual rotation28. Reducing bias is obviously important, although machine learning methods often prioritize reducing prediction errors, even at the cost of introducing a small amount of bias.29. In principle, it would be possible, although in some computationally tedious cases, to determine the MSPE uncertainties using a dual bootstrap.

We have defined stopping rules with specific application to regression trees. However, the basic idea would obviously be easily transposable to classification trees, using for example Gini or cross-entropy loss functions.

Comments are closed.