Fig. 1 A centered tree at level 2

using forests in place of individual trees and is too simple to explain the mathematical forces driving Breiman's forests.

The rates of convergence of centered forests are discussed in Breiman (2004) and Biau (2012). In their approach, the covariates X^{(j)} are independent and the target regression function m(x) = E[Y | X = x], which is originally a function of x = (x^{(1)}, ..., x^{(p)}), is assumed to depend only on a nonempty subset S (for Strong) of the p features. Thus, letting X_S = (X^{(j)} : j ∈ S), we have m(x) = E[Y | X_S = x_S]. The variables of the remaining set {1, ..., p}\S have no influence on the function m and can be safely removed. The ambient dimension p can be large, much larger than the sample size n, but we believe that the representation is sparse, i.e., that a potentially small number of arguments of m are active, namely the ones with indices matching the set S. Letting |S| be the cardinality of S, the value |S| characterizes the sparsity of the model: the smaller |S|, the sparser m. In this dimension-reduction scenario, Breiman (2004) and Biau (2012) proved that if the probability p_{j,n} of splitting along the jth direction tends to 1/|S| for j ∈ S and m satisfies a Lipschitz-type smoothness condition, then
\[
\mathbb{E}\big[m_{\infty,n}(X) - m(X)\big]^2 = O\big(n^{-0.75/(|S|\log 2 + 0.75)}\big).
\]

This equality shows that the rate of convergence of m_{∞,n} to m depends only on the number |S| of strong variables, not on the dimension p. This rate is strictly faster than the usual rate n^{-2/(p+2)} as soon as |S| ≤ ⌊0.54p⌋ (⌊·⌋ is the floor function). In effect, the intrinsic dimension of the regression problem is |S|, not p, and we see that the random forest estimate adapts itself to the sparse framework. Of course, this is achieved by assuming that the procedure succeeds in selecting the informative variables for splitting, which is indeed a strong assumption.
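The threshold ⌊0.54p⌋ can be checked directly by comparing the two exponents (a short algebraic verification, with log denoting the natural logarithm):
\[
\frac{0.75}{|S|\log 2 + 0.75} > \frac{2}{p+2}
\;\Longleftrightarrow\;
0.75\,(p+2) > 2\,|S|\log 2 + 1.5
\;\Longleftrightarrow\;
|S| < \frac{0.375\,p}{\log 2} \approx 0.541\,p,
\]
so the centered-forest rate improves on n^{-2/(p+2)} precisely when |S| ≤ ⌊0.54p⌋, up to rounding of the constant.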
An alternative model for pure forests, called Purely Uniform Random Forests (PURF), is discussed in Genuer (2012). For p = 1, a PURF is obtained by drawing k random variables uniformly on [0, 1], and subsequently dividing [0, 1] into random sub-intervals. (Note that as such, the PURF can only be defined for p = 1.) Although this construction is not exactly recursive, it is equivalent to growing a decision tree by deciding at each level which node to split with a probability equal to its length. Genuer (2012) proves that PURF are consistent and, under a Lipschitz assumption, that the estimate satisfies
\[
\mathbb{E}\big[m_{\infty,n}(X) - m(X)\big]^2 = O\big(n^{-2/3}\big).
\]
This rate is minimax over the class of Lipschitz functions (Stone 1980, 1982).

It is often acknowledged that random forests reduce the estimation error of a single tree, while maintaining the same approximation error. In this respect, Biau (2012) argues that the estimation error of centered forests tends to zero (at the slow rate 1/log n) even if each tree is fully grown (i.e., k ≈ log n). This result is a consequence of the tree-averaging process, since the estimation error of an individual fully grown tree does not tend to zero. Unfortunately, the choice k ≈ log n is too large to ensure consistency of the corresponding forest, whose approximation error remains constant. Similarly, Genuer (2012) shows that the estimation error of PURF is reduced by a factor of 0.75 compared to the estimation error of individual trees. The most recent attempt to assess the gain of forests in terms of estimation and approximation errors is by Arlot and Genuer (2014), who claim that the rate of the approximation error of certain models is faster than that of the individual trees.

3.2 Forests, neighbors and kernels

Let us consider a sequence of independent and identically distributed random variables X_1, ..., X_n. In random geometry, an observation X_i is said to be a layered nearest neighbor (LNN) of a point x (from X_1, ..., X_n) if the hyperrectangle defined by x and X_i contains no other data points (Barndorff-Nielsen and Sobel 1966; Bai et al. 2005; see also Devroye et al. 1996, Chapter 11, Problem 6). As illustrated in Fig. 2, the number of LNN of x is typically larger than one and depends on the number and configuration of the sample points.

Surprisingly, the LNN concept is intimately connected to random forests that ignore the resampling step. Indeed, if exactly one point is left in the leaves and if there is no resampling, then no matter what splitting strategy is used, the forest estimate at x is a weighted average of the Y_i whose corresponding X_i are LNN of x. In other words,
\[
m_{\infty,n}(x) = \sum_{i=1}^{n} W_{ni}(x)\, Y_i, \qquad (3)
\]
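To make the LNN notion concrete, the following minimal sketch (Python with NumPy; the function name and toy data are illustrative and not taken from the papers cited above) flags the sample points X_i that are layered nearest neighbors of a query point x. In the no-resampling, one-point-per-leaf setting just described, the weights W_{ni}(x) in (3) can be nonzero only at these points.

import numpy as np

def layered_nearest_neighbors(x, X):
    """Boolean mask of the rows X[i] that are LNN of x.

    X[i] is an LNN of x when the axis-aligned hyperrectangle with
    opposite corners x and X[i] contains no other sample point.
    (Closed rectangles; boundary ties have probability zero for
    continuous data.)
    """
    lo, hi = np.minimum(x, X), np.maximum(x, X)   # corners of the n rectangles
    mask = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        inside = np.all((X >= lo[i]) & (X <= hi[i]), axis=1)
        inside[i] = False                         # ignore X[i] itself
        mask[i] = not inside.any()
    return mask

# Toy illustration in dimension p = 2.
rng = np.random.default_rng(0)
X = rng.uniform(size=(10, 2))
x = np.array([0.5, 0.5])
print(np.flatnonzero(layered_nearest_neighbors(x, X)))

The printed indices are those of the LNN of x; as illustrated in Fig. 2, there is typically more than one, and their number depends on the configuration of the sample.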