by YeGoblynQueenne 2199 days ago
>> The way I see it, the variance is the part of the error that you can reduce by collecting more data from your distribution and increasing model complexity if needed.

Ah, apologies, I see what you mean. That is true, but this "error" is in-sample error, so increasing your model's variance will increase its ability to interpolate but not extrapolate to out-of-sample data, as I explain in my longer comment.

"In-sample" means all the data you've collected to train and test with. It includes training/validation/test splits. At the end of k-fold cross-validation, your model has "seen" all the data in your sample and the model that performs best is the model that best represents that data.
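To make the "seen all the data" point concrete, here is a rough sketch of k-fold cross-validation using only NumPy. Everything in it is an assumption made for illustration: the sine target, the noise level, the candidate polynomial degrees, and the helper names `k_fold_indices` and `cv_mse`. The point is only that each sample lands in a validation fold exactly once, so the winning model is chosen against the whole sample.

```python
import numpy as np

def k_fold_indices(n, k, rng):
    """Shuffle range(n) and split it into k disjoint validation folds."""
    return np.array_split(rng.permutation(n), k)

def cv_mse(x, y, degree, folds):
    """Mean validation MSE of a degree-`degree` polynomial across the folds."""
    scores = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except the i-th, validate on the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = np.polynomial.Polynomial.fit(x[train_idx], y[train_idx], degree)
        scores.append(np.mean((model(x[val_idx]) - y[val_idx]) ** 2))
    return float(np.mean(scores))

# Hypothetical sample: noisy sine values on [0, 3] (an assumption, not real data).
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 3.0, size=60)
y = np.sin(x) + rng.normal(0.0, 0.1, size=60)

folds = k_fold_indices(len(x), k=5, rng=rng)
scores = {degree: cv_mse(x, y, degree, folds) for degree in (1, 3, 9)}
best_degree = min(scores, key=scores.get)

# Every index appears in exactly one validation fold, so model selection
# has used the entire sample by the time cross-validation finishes.
seen = np.sort(np.concatenate(folds))
```

Whichever degree comes out as `best_degree`, it is the best only with respect to this one sample; nothing in the procedure looks at data from outside it.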

But, because the data was sampled from a distribution that is most likely not the true distribution of the data (since that distribution is unknown), the sampling error (i.e. the differences between the true and sample distributions) will be reflected in the model. A high-variance model will suffer more from this than a high-bias one.
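A toy sketch of that last claim, under loud assumptions: the "true" function is a sine, the sample only covers x in [0, 3], and "out-of-sample" is modelled as x in [3.5, 5], a region the sample never reached. A degree-1 polynomial stands in for the high-bias model and a degree-9 polynomial for the high-variance one. None of these choices come from the discussion above; they are just one way to see the effect.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_target(x):
    """Assumed ground truth (a sine) plus Gaussian observation noise."""
    return np.sin(x) + rng.normal(0.0, 0.1, size=x.shape)

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

# The sample: only covers [0, 3].
x_train = np.linspace(0.0, 3.0, 40)
y_train = noisy_target(x_train)

# Fresh points from the same range (interpolation) and from an
# unseen range (extrapolation).
x_in = rng.uniform(0.0, 3.0, size=200)
x_out = rng.uniform(3.5, 5.0, size=200)
y_in, y_out = noisy_target(x_in), noisy_target(x_out)

high_bias = np.polynomial.Polynomial.fit(x_train, y_train, 1)
high_variance = np.polynomial.Polynomial.fit(x_train, y_train, 9)

errors = {
    "high_bias": {"in": mse(high_bias, x_in, y_in),
                  "out": mse(high_bias, x_out, y_out)},
    "high_variance": {"in": mse(high_variance, x_in, y_in),
                      "out": mse(high_variance, x_out, y_out)},
}
print(errors)
```

In this setup the degree-9 fit has the lower error inside the sampled range but the higher error outside it. The straight line is also wrong out there, just less wildly so.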

Sorry I didn't understand immediately what you meant. The longer comment above is correct but probably doesn't help answer your question directly.

1 comments

Thanks for taking the time to write the detailed responses. They definitely led me to think more closely about these vaguely held intuitions about bias and variance! I think you are exactly right that the crucial aspect is the variance when looking at out-of-sample predictions, not just across several samplings from the original training distribution (a la k-fold cross-validation).