by olooney
2575 days ago
Individual trees are high variance. The random forest itself is an ensemble of many trees - a "forest" if we're being cute. Each tree in the forest is randomized by training it on a bootstrap sample. This is sometimes called "bagging", a portmanteau of "bootstrap" and "aggregating." Each tree may be further randomized by selecting a different subset of dimensions to consider each time we split a node. The end result is that each tree uses very different rules to make its prediction.

When all of these predictions are combined (by voting for classification, or by averaging for regression), error due to overfitting tends to cancel out, while signal due to the same true pattern being discovered independently by many trees is amplified. Thus we can continue to add trees indefinitely without worrying about overfitting.

Other hyperparameters of random forest, such as max tree depth or the minimum number of samples in a node necessary for splitting, can result in overfitting or underfitting, so they need to be tuned. However, because the forest will eventually fit the data set even if we use so-called "stump" learners (max tree depth=1), we can choose very conservative parameters like max tree depth=3, which makes trees less likely to overfit. And if they underfit, well, that's not a problem; the ensemble will take care of that. The number of trees in a forest can be cranked up as high as we want without worrying about overfitting; the only downsides are that training takes longer, the model takes more space on disk and in memory, and predictions take longer to run.
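To make the bagging recipe above concrete, here is a minimal sketch in plain Python: each "tree" is a depth-1 stump trained on its own bootstrap sample, and the forest predicts by majority vote. The names `bootstrap_sample`, `fit_stump`, `fit_forest`, and `predict` are hypothetical helpers invented for illustration, the toy data is made up, and there is only one feature, so the per-split feature subsetting is left out.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (the 'bootstrap' in bagging)."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows):
    """Fit a depth-1 tree on (x, label) pairs by picking the threshold
    that misclassifies the fewest training rows."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= threshold]
        right = [y for x, y in rows if x > threshold]
        left_label = Counter(left).most_common(1)[0][0]
        # If nothing falls to the right, reuse the left label.
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = (sum(y != left_label for y in left)
                  + sum(y != right_label for y in right))
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    _, threshold, left_label, right_label = best
    return lambda x: left_label if x <= threshold else right_label

def fit_forest(rows, n_trees, seed=0):
    """Train n_trees stumps, each on its own bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(rows, rng)) for _ in range(n_trees)]

def predict(forest, x):
    """Aggregate by majority vote (the 'aggregating' in bagging)."""
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

# Toy, made-up data: x in 0..9 has label 0, x in 10..19 has label 1.
rows = [(x, 0) for x in range(10)] + [(x, 1) for x in range(10, 20)]
forest = fit_forest(rows, n_trees=25)
print(predict(forest, 2), predict(forest, 17))
```

Individual stumps put their threshold in slightly different places depending on which rows landed in their bootstrap sample; the vote smooths those differences out. For regression you would average the trees' outputs instead of voting.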
|
The "variance" won't just magically vanish as you average things out[1]. You need to change the scale, look at the asymptotic distribution of your estimator (CLT, Kolmogorov-Smirnov, etc.), and compare it against your data.
[1] The variance of the estimator itself does vanish thanks to the LLN (when it converges), but that's not actually the quantity of interest.
Edit: don't get me wrong, I'm not saying that RFs are good or bad, just reacting to the bias/variance thing.
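A small simulation of the rescaling point, as a sketch only: it assumes i.i.d. standard normal draws, which is far simpler than the correlated trees in a real forest, and `variance_of_mean` is a hypothetical helper written for this illustration. The variance of the plain average shrinks roughly like 1/n, while the variance after rescaling by sqrt(n) (that is, n times the variance of the mean) stays near the variance of a single draw.

```python
import random
import statistics

def variance_of_mean(n, repeats=2000, seed=0):
    """Empirical variance of the mean of n i.i.d. N(0, 1) draws,
    estimated from `repeats` independent repetitions."""
    rng = random.Random(seed)
    means = []
    for _ in range(repeats):
        draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
        means.append(statistics.fmean(draws))
    return statistics.pvariance(means)

for n in (1, 10, 100):
    v = variance_of_mean(n)
    # v is the variance of the average; n * v is the variance of
    # sqrt(n) * average, i.e. the quantity on the CLT scale.
    print(n, round(v, 4), round(n * v, 4))
```

The second column heads toward zero as n grows; the third does not, which is why the rescaled quantity is the one to compare against data.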