by currymj 1901 days ago
i think this actually gets at what makes applied ML distinct from statistics as a practice, even though there is a ton of overlap.

statisticians make assumptions 1 and 2, and think of themselves as trying to find the "correct" parameters of their model.

people doing applied ML typically assume they don't know 1 (although they might implicitly make some weak assumptions like sub-gaussian to avoid fat tails, etc.) and also typically don't care about being able to do 2. and they don't care about their parameters; in a sense to an ML practitioner, every parameter is a nuisance parameter.

instead you assume you have some reliable way of evaluating performance on the task you care about -- usually measuring performance on an unseen test set. as long as this is actually reliable, then things are fine.

you are right that in the face of a shifting distribution or an adversary crafting bad inputs, ML models can break down -- but there is actually a lot of research on ways to deal with this, which will hopefully reach industry sooner rather than later.

4 comments

> instead you assume you have some reliable way of evaluating performance on the task you care about -- usually measuring performance on an unseen test set. as long as this is actually reliable, then things are fine.

This is the part that often fails in practice. Think of all the benchmarks that show superhuman performance and compare that to how poorly those same models often do in real use. Constructing a good set of holdouts to evaluate on is really hard and gets back to similar issues. In practice, doing what you're describing reliably (in a way that actually implies you should have confidence in your model once you roll it out) is rarely as simple as holding out some random bit of your dataset and checking performance on it.

On the other hand, what you often see is people just holding out a random bunch of rows.
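
To make the random-rows point concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the synthetic data, the per-group "model", and the helper names (`make_rows`, `fit`, `mse`, `random_row_split`, `group_split`) are hypothetical and not from any library. Rows come in groups (think users or sites) that each carry their own offset. A model that memorizes per-group means looks fine on a random row holdout, because every group was already seen in training, and much worse when whole groups are held out.

```python
import random
import statistics

def make_rows(n_groups=50, rows_per_group=20, seed=0):
    """Synthetic (group_id, target) rows; each group has its own offset."""
    rng = random.Random(seed)
    rows = []
    for g in range(n_groups):
        offset = rng.gauss(0.0, 5.0)  # group-specific effect (assumed)
        for _ in range(rows_per_group):
            rows.append((g, offset + rng.gauss(0.0, 1.0)))
    return rows

def fit(train):
    """'Model': per-group mean, with the global mean as fallback."""
    by_group = {}
    for g, y in train:
        by_group.setdefault(g, []).append(y)
    group_means = {g: statistics.fmean(ys) for g, ys in by_group.items()}
    global_mean = statistics.fmean([y for _, y in train])
    return group_means, global_mean

def mse(model, test):
    """Mean squared error; unseen groups fall back to the global mean."""
    group_means, global_mean = model
    errors = [(y - group_means.get(g, global_mean)) ** 2 for g, y in test]
    return statistics.fmean(errors)

def random_row_split(rows, frac=0.2, seed=1):
    """Hold out a random bunch of rows, ignoring group structure."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    k = int(len(shuffled) * frac)
    return shuffled[k:], shuffled[:k]

def group_split(rows, frac=0.2, seed=1):
    """Hold out entire groups, so test groups are never seen in training."""
    rng = random.Random(seed)
    groups = sorted({g for g, _ in rows})
    rng.shuffle(groups)
    held_out = set(groups[: int(len(groups) * frac)])
    train = [r for r in rows if r[0] not in held_out]
    test = [r for r in rows if r[0] in held_out]
    return train, test

rows = make_rows()
train, test = random_row_split(rows)
random_mse = mse(fit(train), test)
train, test = group_split(rows)
group_mse = mse(fit(train), test)
print(f"random-row holdout MSE: {random_mse:.2f}")
print(f"held-out-group MSE:     {group_mse:.2f}")
```

Which split is the honest one depends on deployment: if the model will only ever see known groups, the random split may be fine; if it will meet new ones, only the group split estimates that.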

I disagree: every ML model makes some implicit statistical assumptions, which are often not well understood by practitioners.

At minimum you must assume your underlying process is not fat-tailed. If it is, then your training/validation/test data might never be enough to make reliable predictions and your model might break constantly in prod.

BTW shifting distributions and fat-tailed distributions are sort of equivalent, at least mathematically.
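
A rough sketch of why fat tails undermine a finite sample, in Python with only the standard library. The distributions are assumptions chosen for illustration (an exponential as the thin-tailed case, a Pareto with shape 1.1 as the fat-tailed one), and `max_share` and `half_means` are hypothetical helper names. It compares the means of two halves of the same sample, standing in for "train" and "test", and reports how much of the total a single largest observation accounts for.

```python
import random

def max_share(xs):
    """Fraction of the sample total contributed by the single largest value."""
    return max(xs) / sum(xs)

def half_means(xs):
    """Means of the first and second half, standing in for train and test."""
    mid = len(xs) // 2
    return sum(xs[:mid]) / mid, sum(xs[mid:]) / (len(xs) - mid)

rng = random.Random(0)
n = 100_000
thin = [rng.expovariate(1.0) for _ in range(n)]   # thin-tailed reference
fat = [rng.paretovariate(1.1) for _ in range(n)]  # finite mean, infinite variance

for name, xs in (("exponential", thin), ("pareto(1.1)", fat)):
    first, second = half_means(xs)
    print(f"{name}: first-half mean={first:.2f}, "
          f"second-half mean={second:.2f}, "
          f"largest value's share of total={max_share(xs):.4%}")
```

In the thin-tailed case no single point matters; in the fat-tailed case one draw can carry a visible share of the total, so whichever split happens to contain it looks like a different distribution.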

I don't disagree with any of that, but I still think a responsible, clear-thinking ML practitioner can avoid having to assume the form of the data-generating process, depending on their application.

In some cases, if you care about PAC generalization bounds, the bounds actually hold for all possible distributions.
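
One standard example of such a distribution-free statement (an illustration, not necessarily the bound the comment has in mind) is the uniform convergence bound for a finite hypothesis class, which follows from Hoeffding's inequality plus a union bound. The assumptions here are: a finite class $\mathcal{H}$, a loss bounded in $[0,1]$, and a sample $S$ of $m$ examples drawn i.i.d. from $D$. Then for every distribution $D$ and every $\delta \in (0,1)$:

```latex
\Pr_{S \sim D^m}\!\left[\, \forall h \in \mathcal{H}:\
  \bigl| R_D(h) - \hat{R}_S(h) \bigr|
  \le \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2m}} \,\right] \ge 1 - \delta
```

Here $R_D(h)$ is the expected loss under $D$ and $\hat{R}_S(h)$ is the average loss on $S$. Nothing is assumed about the form of $D$, but the data still has to be i.i.d. from the same $D$ you deploy on, so a shifting distribution falls outside the guarantee.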

I think it's more meaningful to have the discussion in a specific problem domain, since statistical inference and ML are just tools to better model a problem or phenomenon. The domain (prior) knowledge, meaning everything else that's not stats / ML, is the key to building a more robust model. Leave the problem domain out and we are left with just pure mathematical theory, where the points can only be proved with simulated data.

Yes - this is pretty much exactly how I explain the difference between machine learning and statistics.

Despite using similar models, the expertise required for 'doing statistics' (statistical inference) is actually very different from what machine learning requires. Machine learning fits the 'hacker mentality' well: try stuff out, see what works. To do statistical inference effectively, you really do need to spend time learning the theory. They both require deep skills - but the skills are surprisingly different considering it's often the same underlying model.

But without some statistical knowledge, isn’t there a risk of a lack of understanding about the robustness of “what works”?

Statistical knowledge doesn’t remove that risk. The extent to which it even lowers the risk is a question that could be answered empirically.

yeah, agreed - a good understanding of the model's statistical assumptions can often help you make the model more robust and also give you ideas for what types of feature engineering are likely to work.

"Every parameter is a nuisance parameter" is a great way to put it.