| HN Mirror



	by yters 2575 days ago
	I would have assumed the opposite: with enough trees, aren't you guaranteed to overfit your data? Boosting increases the VC dimension of the aggregate model, which makes it more prone to overfitting.

3 comments

pplonski86 2575 days ago

Random Forest doesn't increase overfit error when adding more trees. I ran an experiment on a toy dataset to check it: https://mljar.com/blog/random-forest-overfitting/

What is more, Leo Breiman wrote on his website: "Random forests does not overfit. You can run as many trees as you want" https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home...

rq1 2575 days ago

These are (false) claims, therefore not proofs.

Deep trees will fortunately overfit your dataset.

Any binary tree of depth log2(P) can completely separate your P points.
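A minimal sketch of the memorization point, assuming 1-D inputs with distinct values and median splits (both are simplifications chosen here, not something stated in the thread). The names `build_tree`, `depth` and `predict` are hypothetical helpers. The labels are pure noise, yet a tree of depth log2(P) reproduces every one of them:

```python
import random

random.seed(0)

def build_tree(points):
    """Split sorted (x, label) pairs at the median until each leaf holds one point."""
    if len(points) == 1:
        return {"label": points[0][1]}
    mid = len(points) // 2
    return {
        # Threshold halfway between the two neighbouring x values
        "threshold": (points[mid - 1][0] + points[mid][0]) / 2,
        "left": build_tree(points[:mid]),
        "right": build_tree(points[mid:]),
    }

def depth(tree):
    """Number of splits on the longest root-to-leaf path."""
    if "label" in tree:
        return 0
    return 1 + max(depth(tree["left"]), depth(tree["right"]))

def predict(tree, x):
    while "label" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
    return tree["label"]

P = 64
xs = random.sample(range(1000), P)            # distinct integer positions
labels = [random.randint(0, 1) for _ in xs]   # pure noise, nothing to learn
points = sorted(zip(xs, labels))

tree = build_tree(points)
train_accuracy = sum(predict(tree, x) == y for x, y in points) / P
print(depth(tree), train_accuracy)  # prints: 6 1.0
```

Perfect training accuracy on random labels says nothing about new data, which is the sense of "overfit" used here.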

ramraj07 2574 days ago

But we are merging hundreds of trees each of which has been handicapped by removal of multiple features and a fraction of the data. Sounds to me like overfitting is not easy (no single data point or feature contributes to every tree so it can't be represented all the time).

False claims though they may be, these are claims I've seen in at least two of the most commonly studied statistical learning textbooks. Given that it makes sense and that it's in the textbooks, it seems reasonably not false to me. Someone else posted that if too many features or data points are very similar then it will overfit, and that totally makes sense. What you're saying doesn't. Clarification would be useful.

yters 2574 days ago

Adding bunches of trees will overfit the accidental patterns in your data.

I have an explanation here why reducing variance is not the same as reducing overfitting: https://news.ycombinator.com/item?id=20089890

thom 2574 days ago

Yeah, at a certain point it's just playing 20 questions with your data.

olooney 2575 days ago

Random Forest doesn't use boosting. Tree models like XGBoost which do use boosting are indeed very prone to overfitting.

gbrown 2575 days ago

Nope, RF works very differently from boosting. RF trees give unbiased fits, but they're high variance. Bootstrapping is used to reduce the variance of the parallel tree fits.

Boosting is sequential, and relies on early stopping to control the magnitude of bias.
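To make the "sequential plus early stopping" part concrete, here is a rough sketch of least-squares boosting with depth-1 stumps on 1-D data. Everything in it is an assumption for illustration: the sin(x) target, the noise level, the learning rate of 0.5, and the helper names `fit_stump` and `boost`. It is not how XGBoost or any particular library implements boosting.

```python
import math
import random

random.seed(0)

def make_data(n):
    """Noisy samples of sin(x); the target function is an arbitrary choice."""
    xs = [random.uniform(0, 6) for _ in range(n)]
    ys = [math.sin(x) + random.gauss(0, 0.3) for x in xs]
    return xs, ys

def fit_stump(xs, residuals):
    """Best depth-1 split by squared error: (threshold, left_mean, right_mean)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best = None
    for k in range(1, len(order)):
        left = [residuals[i] for i in order[:k]]
        right = [residuals[i] for i in order[k:]]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        cost = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or cost < best[0]:
            best = (cost, xs[order[k - 1]], lm, rm)
    return best[1:]

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def boost(train, valid, rounds, lr=0.5):
    (xs, ys), (vx, vy) = train, valid
    preds, vpreds = [0.0] * len(xs), [0.0] * len(vx)
    train_mse, valid_mse = [], []
    for _ in range(rounds):
        # Each round depends on the previous one: fit the current residuals.
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lm, rm = fit_stump(xs, residuals)
        preds = [p + lr * (lm if x <= thr else rm) for x, p in zip(xs, preds)]
        vpreds = [p + lr * (lm if x <= thr else rm) for x, p in zip(vx, vpreds)]
        train_mse.append(mse(ys, preds))
        valid_mse.append(mse(vy, vpreds))
    return train_mse, valid_mse

train_mse, valid_mse = boost(make_data(40), make_data(200), rounds=60)

# Early stopping: keep the number of rounds with the lowest held-out error.
best_round = min(range(len(valid_mse)), key=lambda r: valid_mse[r])
print("stop after round", best_round + 1)
print("train MSE first/last:", round(train_mse[0], 3), round(train_mse[-1], 3))
```

The training error can only go down from round to round, so it cannot tell you when to stop; the held-out error is what picks the stopping point.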

rq1 2575 days ago

> RF trees give unbiased fits, but they're high variance.

Sounds like overfitting.

olooney 2575 days ago

Individual trees are high variance. The random forest itself is an ensemble of many trees - a "forest" if we're being cute. Each tree in the forest is randomized by training on a bootstrap sample. This is sometimes called "bagging", a portmanteau of "bootstrap" and "aggregating." Each tree may be further randomized by selecting a different subset of dimensions to consider each time we split a node. The end result is that each tree uses very different rules to make its prediction.

When all of these predictions are combined (by voting for classification, or by averaging for regression), error due to overfitting tends to cancel out, while signal due to the same true pattern being discovered independently by many trees is amplified. Thus we can continue to add trees indefinitely without worrying about over-fitting.

Other hyperparameters of random forest, such as max tree depth or the minimum number of samples in a node necessary for splitting, can result in overfitting or underfitting, so they need to be tuned. However, because the forest will eventually fit the data set even if we use so-called "stump" learners (max tree depth=1), we can choose very conservative parameters like max-depth=3, which makes trees less likely to overfit. And if they underfit, well, that's not a problem, the ensemble will take care of that. The number of trees in a forest can be cranked up as high as we want without worrying about overfitting; the only downside is that training takes longer, the model takes more space on disk and in memory, and predictions take longer to run.
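A small sketch of the bagging step described above, in plain Python. It assumes a 1-D regression problem (noisy sin(x), an arbitrary choice) and a hypothetical greedy tree helper `grow`; because there is only one input dimension, the per-split feature subsampling of a real random forest is left out, so this is plain bagging rather than a full random forest.

```python
import math
import random

random.seed(1)

def sse(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def grow(points, depth):
    """Greedy 1-D regression tree. Leaves are floats, nodes are (threshold, left, right)."""
    points = sorted(points)
    ys = [y for _, y in points]
    if depth == 0 or len(points) < 2:
        return sum(ys) / len(ys)
    best = None
    for k in range(1, len(points)):
        if points[k - 1][0] == points[k][0]:
            continue  # bootstrap duplicates: no threshold fits between equal x values
        cost = sse(ys[:k]) + sse(ys[k:])
        if best is None or cost < best[0]:
            best = (cost, k)
    if best is None:
        return sum(ys) / len(ys)
    k = best[1]
    return (points[k - 1][0], grow(points[:k], depth - 1), grow(points[k:], depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        threshold, left, right = tree
        tree = left if x <= threshold else right
    return tree

def sample(n):
    data = []
    for _ in range(n):
        x = random.uniform(0, 6)
        data.append((x, math.sin(x) + random.gauss(0, 0.5)))
    return data

train, test = sample(80), sample(300)

# Bagging: every tree sees its own bootstrap resample of the training set.
trees = []
for _ in range(100):
    bootstrap = [random.choice(train) for _ in train]
    trees.append(grow(bootstrap, depth=8))

def forest_mse(n_trees):
    """Test error when averaging the first n_trees trees."""
    total = 0.0
    for x, y in test:
        avg = sum(predict(t, x) for t in trees[:n_trees]) / n_trees
        total += (y - avg) ** 2
    return total / len(test)

test_mse = {b: forest_mse(b) for b in (1, 5, 25, 100)}
for b, err in test_mse.items():
    print(f"{b:3d} trees: test MSE = {err:.3f}")
```

Printing the error for a growing number of trees is the same kind of check as the toy experiment linked earlier in the thread: it shows what happens on held-out data as trees are added, for one dataset and one depth setting only.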

rq1 2574 days ago

Yes that’s exactly overfitting the bootstrapped samples, thus the high variance.

The « variance » won’t just magically vanish as you average things out[1]; you need to change the scale, check the asymptotic law of your estimator (CLT, Kolmogorov-Smirnov… etc.), and compare it against your data.

[1] the variance of the estimator itself vanishes thanks to LLN (in case of convergence), but that’s not actually the quantity of interest

Edit: don't get me wrong, I'm not saying that RFs are good or bad, just reacting to the bias/variance thing.
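One standard way to write down what averaging does and does not remove, under the assumption that the B trees' predictions T_b(x) are identically distributed with variance σ² and a common pairwise correlation ρ (the symbols are introduced here for illustration, they are not from the thread):

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \frac{1}{B^2}\left[B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\right]
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

As B grows, the second term goes to zero but the first term, ρσ², stays. Adding trees only removes the part of the variance that is not shared between trees.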

yters 2574 days ago

Based on the downvotes, it seems people think reducing variance is the same as reducing overfitting.

Think of the bias/variance tradeoff as a spotlight, and we are shining the spotlight on a bunch of cats, who reflect back the spotlight when their eyes are open. Eyes are open or closed randomly. Cat eyes are either green or brown. We want to know the distribution of cat eyes in parts of the population, which in general is an even 50/50 split. We determine the distribution in a certain location by taking the average of the eyes we see.

If variance is large, then the spotlight is very large, and we don't learn anything because we just average the entire population.

If the spotlight is small, then we can learn something, but only if there are enough samples in the region we shine the light.

So, what if we start with a large spotlight, and then when we see a region with a large number of open eyes of one color, we narrow the light down to just that region? Won't that allow us to avoid overfitting, while maximizing our ability to learn?

Unfortunately it does not, because in a large enough, evenly distributed population there will always be pockets that exhibit what appears to be a pattern but is just an accident of which cats happened to open their eyes.

This scenario of starting with the spotlight large and then zooming into a patterned region is the same as reducing variance with the training data. With a large enough dataset it is always possible to find these accidental patterns and then zoom into them by reducing variance.
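A quick simulation of the "accidental pockets" argument, as a sketch under simplifying assumptions: the cats stand in a line, every cat's eyes are open, and the "spotlight" is a window of 20 neighbouring cats. The window size and population size are arbitrary choices made here.

```python
import random

random.seed(42)

N_CATS = 10_000
WINDOW = 20  # width of the narrowed spotlight

# True distribution: each cat is green-eyed (1) or brown-eyed (0) with probability 0.5.
eyes = [random.randint(0, 1) for _ in range(N_CATS)]

# Slide the spotlight along the line and keep the most lopsided window.
count = sum(eyes[:WINDOW])
best_share, best_start = max(count, WINDOW - count) / WINDOW, 0
for start in range(1, N_CATS - WINDOW + 1):
    count += eyes[start + WINDOW - 1] - eyes[start - 1]
    share = max(count, WINDOW - count) / WINDOW
    if share > best_share:
        best_share, best_start = share, start

overall = sum(eyes) / N_CATS
print(f"overall share of green eyes: {overall:.3f}")
print(f"most lopsided window starts at cat {best_start}: "
      f"{best_share:.0%} of one colour")
```

The overall share stays close to 50%, yet searching over many windows turns up one that looks strongly one-coloured. Nothing about that window would be expected to hold for a fresh sample of cats.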

yters 2573 days ago

More thoughts on this seemingly controversial claim.

https://stats.stackexchange.com/questions/20714/does-ensembl...

yters 2575 days ago

Ensemble models increase VC dimension so they are more prone to overfitting.

For example, you could have a tree that individually segments each data point and memorizes your dataset. That's the definition of overfitting.

gbrown 2574 days ago

Sometimes, but even simple trees are high variance given that they're estimated using greedy algorithms rather than some more global optimization. Overfitting in RF does not occur as a function of the number of trees.
