by currymj 1901 days ago

i think this actually gets at what makes applied ML distinct from statistics as a practice, even though there is a ton of overlap.

statisticians make assumptions 1 and 2, and think of themselves as trying to find the "correct" parameters of their model.

people doing applied ML typically assume they don't know 1 (although they might implicitly make some weak assumptions, like sub-gaussianity to rule out fat tails) and also typically don't care about being able to do 2. they also don't care about their parameters; in a sense, to an ML practitioner, every parameter is a nuisance parameter.

instead you assume you have some reliable way of evaluating performance on the task you care about -- usually measuring performance on an unseen test set. as long as this is actually reliable, then things are fine.

you are right that in the face of a shifting distribution or an adversary crafting bad inputs, ML models can break down. but there is actually a lot of research on ways to deal with this, which will hopefully reach industry sooner rather than later.
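
A minimal sketch of the point about test sets and shifting distributions, under assumptions that are not from the thread: the task is fitting a straight line to noisy `sin(x)` data, and the helper names (`make_data`, `fit_line`, `mse`) are made up for illustration. The held-out score is only informative when the test data comes from the same distribution as whatever the model will see later.

```python
import math
import random

def make_data(rng, n, lo, hi, noise=0.1):
    """Draw n points with x uniform on [lo, hi] and y = sin(x) + Gaussian noise."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mse(params, xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    slope, intercept = params
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(0)
train = make_data(rng, 500, 0.0, 1.0)
test_iid = make_data(rng, 500, 0.0, 1.0)      # same distribution as training
test_shifted = make_data(rng, 500, 2.0, 3.0)  # inputs moved to a new range

params = fit_line(*train)
iid_mse = mse(params, *test_iid)
shifted_mse = mse(params, *test_shifted)

print(f"held-out MSE, same distribution: {iid_mse:.4f}")
print(f"held-out MSE, shifted inputs:    {shifted_mse:.4f}")
```

The first number is what a standard unseen-test-set evaluation would report. The second is what happens once the input range moves, and nothing in the first number warns you about it.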

6gvONxR4sf7o 1901 days ago

> instead you assume you have some reliable way of evaluating performance on the task you care about -- usually measuring performance on an unseen test set. as long as this is actually reliable, then things are fine.

This is the part that often fails in practice. Think of all the benchmarks that show superhuman performance and compare that to how good those same models really aren't. Constructing a good set of holdouts to evaluate on is really hard and gets back to similar issues. In practice, doing what you're describing reliably (in a way that actually implies you should have confidence in your model once you roll it out) is rarely as simple as holding out some random bit of your dataset and checking performance on it.

On the other hand, what you often see is people just holding out a random bunch of rows.
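
One way a random-row holdout can mislead, as a rough sketch. Everything here is an assumption for illustration: the data is synthetic, rows come in groups (think several records per customer), and the model is a 1-nearest-neighbour lookup that can only "predict" by recognising a group it has already seen.

```python
import random

rng = random.Random(0)

# Synthetic data (illustrative assumption): 200 groups, 5 rows each.
# Each group gets its own feature location and a label unrelated to that
# location, so a label is only predictable if the group was seen in training.
rows = []  # (group_id, feature, label)
for g in range(200):
    center = rng.uniform(0.0, 1000.0)
    label = rng.randint(0, 1)
    for _ in range(5):
        rows.append((g, center + rng.gauss(0.0, 0.001), label))

def one_nn_accuracy(train, test):
    """Accuracy of a 1-nearest-neighbour classifier on the single feature."""
    correct = 0
    for _, x, y in test:
        nearest = min(train, key=lambda r: abs(r[1] - x))
        correct += int(nearest[2] == y)
    return correct / len(test)

# Split 1: hold out a random 25% of rows (groups straddle train and test).
shuffled = rows[:]
rng.shuffle(shuffled)
cut = len(shuffled) // 4
row_acc = one_nn_accuracy(shuffled[cut:], shuffled[:cut])

# Split 2: hold out a random 25% of whole groups (test groups are unseen).
group_ids = list(range(200))
rng.shuffle(group_ids)
held_out = set(group_ids[:50])
train = [r for r in rows if r[0] not in held_out]
test = [r for r in rows if r[0] in held_out]
group_acc = one_nn_accuracy(train, test)

print(f"random-row holdout accuracy: {row_acc:.2f}")
print(f"group holdout accuracy:      {group_acc:.2f}")
```

With this setup the row-level split should look close to perfect while the group-level split should sit near chance, because the first one leaks group identity into the test set. Which split is the honest one depends on whether the deployed model will meet new groups.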

carlosf 1901 days ago

I disagree. Every ML model has some implicit statistical assumptions, which are often not well understood by practitioners.

At minimum you must assume your underlying process is not fat-tailed. If it is, then your training/validation/test data might never be enough to make reliable predictions and your model might break constantly in prod.
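
A small sketch of why more data may not help with fat tails, using the standard Cauchy distribution as a deliberately extreme example (that choice, and the helper names, are assumptions for illustration). It compares how much the sample mean varies across repeated experiments as the sample size grows.

```python
import math
import random

rng = random.Random(0)

def gaussian():
    return rng.gauss(0.0, 1.0)

def cauchy():
    # Standard Cauchy via inverse-CDF sampling; it has no finite mean or variance.
    return math.tan(math.pi * (rng.random() - 0.5))

def iqr_of_sample_means(draw, n, reps=500):
    """Interquartile range of the sample mean over `reps` experiments of size n."""
    means = sorted(sum(draw() for _ in range(n)) / n for _ in range(reps))
    return means[3 * reps // 4] - means[reps // 4]

# For each distribution: spread of the sample mean at n=10 and at n=500.
spread = {}
for name, draw in (("gaussian", gaussian), ("cauchy", cauchy)):
    spread[name] = (iqr_of_sample_means(draw, 10), iqr_of_sample_means(draw, 500))
    small, large = spread[name]
    print(f"{name:8s}  IQR of mean at n=10: {small:.3f}   at n=500: {large:.3f}")
```

For the Gaussian, the spread of the sample mean shrinks as n grows. For the Cauchy it should stay roughly the same, since the mean of n standard Cauchy draws is itself standard Cauchy. Milder fat tails (finite mean, infinite variance) converge, but slowly.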

BTW shifting distributions and fat-tailed distributions are sort of equivalent, at least mathematically.

currymj 1901 days ago

I don't disagree with any of that, but I still think a responsible, clear-thinking ML practitioner can avoid having to assume the form of the data-generating process, depending on their application.

In some cases, for instance if you care about PAC generalization bounds, the bounds actually hold for all possible distributions.
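
One textbook instance of such a distribution-free bound, as a sketch (not necessarily the bound meant above): for a finite hypothesis class and a loss in [0, 1], Hoeffding's inequality plus a union bound gives, with probability at least 1 - delta over n i.i.d. samples, a gap between training and true error of at most sqrt((ln|H| + ln(2/delta)) / (2n)) for every hypothesis at once. The function name `finite_class_bound` is made up here.

```python
import math

def finite_class_bound(num_hypotheses, n, delta):
    """Uniform bound on |true error - empirical error| over a finite class.

    Assumes a loss bounded in [0, 1] and n i.i.d. samples; no assumption is
    made about the shape of the data distribution itself.
    """
    return math.sqrt((math.log(num_hypotheses) + math.log(2.0 / delta)) / (2.0 * n))

# The gap shrinks like 1/sqrt(n), regardless of the data distribution.
for n in (100, 10_000, 1_000_000):
    print(n, round(finite_class_bound(1000, n, 0.05), 4))
```

Note the one assumption that remains: the samples are i.i.d. from the same distribution the model is later evaluated on, so this does not cover the shifting-distribution case.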

dumb1224 1901 days ago

I think it's more meaningful to have the discussion in a specific problem domain, since statistical inference and ML are just tools to better model a problem / phenomenon. The domain (prior) knowledge -- everything else that's not stats / ML -- is the key to building a more robust model. Leave the problem domain out and we are left with just pure mathematical theory, and the points can only be proved with simulated data.

RobinL 1901 days ago

Yes - this is pretty much exactly how I explain the difference between machine learning and statistics.

Despite using similar models, the expertise required for 'doing statistics' (statistical inference) is actually very different from machine learning. Machine learning fits into the 'hacker mentality' well - try stuff out, see what works. To do statistical inference effectively, you really do need to spend time learning the theory. They both require deep skills - but the skills are surprisingly different considering it's often the same underlying model.

nickforr 1901 days ago

But without some statistical knowledge, isn’t there a risk of a lack of understanding about the robustness of “what works”?

splithalf 1901 days ago

Statistical knowledge doesn’t remove that risk. The extent to which it even lowers the risk is a question that could be answered empirically.

RobinL 1901 days ago

yeah, agreed - a good understanding of the model's statistical assumptions can often help you make the model more robust and also give you ideas for what types of feature engineering are likely to work.

QuesnayJr 1901 days ago

"Every parameter is a nuisance parameter" is a great way to put it.
