by carlosf 1902 days ago
I am increasingly worried about people applying ML to everything without any rigour.

Statistical inference generally only works well in very specific conditions:

1 - You know the distribution of the phenomenon under study (or make an explicit assumption and assume the risk of being wrong)

2 - Using (1), you calculate how much data you need so you get an estimation error below x%
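
As a rough illustration of step 2, here is a minimal sketch for the simplest textbook case: estimating a mean when the noise is assumed normal with a known standard deviation. The function name `samples_for_mean` and the example numbers are made up for illustration; real problems rarely come with a known sigma, which is part of the point being made above.

```python
import math
from statistics import NormalDist


def samples_for_mean(sigma: float, margin: float, confidence: float = 0.95) -> int:
    """Smallest n such that z * sigma / sqrt(n) <= margin.

    Assumes i.i.d. samples, a normal model, and a known sigma
    (all three are assumptions, not facts about your data).
    """
    if sigma <= 0 or margin <= 0 or not 0 < confidence < 1:
        raise ValueError("sigma and margin must be positive, confidence in (0, 1)")
    # Two-sided critical value, roughly 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return math.ceil((z * sigma / margin) ** 2)


# Hypothetical numbers: sigma of 15 units, target margin of +/- 1 unit.
n_needed = samples_for_mean(sigma=15.0, margin=1.0)
print(n_needed)
```

Halving the margin roughly quadruples the required sample size, and the whole calculation is only as good as the distributional assumption in step 1.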

Even though most ML models are essentially statistics and have all the same limitations (issues with convergence, fat-tailed distributions, etc.), it seems the industry standard is to pretend none of that exists and hope for the best.

IMO the best moneymaking opportunities in the decade will involve exploiting unsecured IoT devices and naive ML models; we will have plenty of those.

11 comments

As currymj commented, this isn't accurate for ML, only for classical statistics.

In ML (or more specifically deep learning), we make no distribution-based assumptions, other than the fundamental assumption that our training data is "distributed like" our test data. Thus, there aren't issues with fat-tailed distributions since we make no such normality assumptions. Indeed, with the use of autoencoders, we don't assume a single distribution, but rather a stochastic process.

I suppose you could say statistics is less "empirical" than ML in the sense that it is axiom-based, whether that is a normality assumption of predictions about a regression line or stock prices following a Wiener process. By contrast, ML is less rationalist by simply reflecting data.

It is absolutely untrue that DL is immune to fat-tail problems, and it is important that no one operate mission-critical systems under this assumption.

The two fat tail questions one has to engage are:

- is it possible that a catastrophic input might be lurking in the wild that would not be present in a typical training set? Even with a 1M instance training set, a one-in-a-million situation will only appear (and affect your objective function) on average one time, and could very well not appear at all.

- can I bound how badly I will suffer if my system is allowed to operate in the wild on such an input?

DL gives no additional tools to engage these questions.
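
The one-in-a-million arithmetic can be checked directly. This is a small sketch assuming independent samples; `prob_never_seen` is a hypothetical helper name.

```python
def prob_never_seen(p_event: float, n_samples: int) -> float:
    """Probability that an event with per-sample probability p_event
    is completely absent from n_samples independent draws."""
    return (1.0 - p_event) ** n_samples


# A one-in-a-million input and a 1M-instance training set.
p_miss = prob_never_seen(1e-6, 1_000_000)
print(f"{p_miss:.3f}")  # close to 1/e
```

So under the independence assumption there is roughly a 37% chance the catastrophic input never shows up in training at all, even though it appears once on average.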

> It is absolutely untrue that DL is immune to fat-tail problems

In fact, working on fat tail problems is currently a hot topic in ML.

I don't quite follow: is not what you described a flaw fundamental to all forecasting; that is, the occurrence of a gross outlier? I should clarify that DL doesn't suffer from the same problem the normality condition has on fat-tails: a failure to capture the skew of the distribution.

It's not characteristic of all forecasting, only purely empirical forecasting.

Definitionally, the only way to reason about risk that doesn't appear in training data is non-empirical (e.g. a priori assumptions about distributions, or worst cases, or out-of-paradigm tools like refusing to provide predictions for highly non-central inputs).

DL is not any better (or worse) than any other purely empirical method at answering questions about fat-tail risk, and the only way to do better is to use non-empirical/a-priori tools. Obviously the tradeoff here is that your a priori assumptions can be wrong, and that too needs to be included in your risk model (see e.g. Robust Optimization / Robust Control).

I think it's wrong to assume that non-empirical methods can be reliably trusted to give better results. Humans are terrible at avoiding bias or evaluating risks, especially for uncommon events.

Food for thought: if every method for predicting event x is terrible, then you might as well not try to predict x and build your life in such a way that you never expose yourself to the risk of x happening.

I agree that ML tends to put weaker assumptions on the data than classical statistics and that it's a good thing.

However most ML certainly makes distributional assumptions - they are just weaker. When you're learning a huge deep net with an L2 loss on a regression task, you have a parametric conditional gaussian distribution under the hood. It's not because it's overparametrized that there's no distributional assumption. Vanilla autoencoders are also working under a multivariate gaussian setup as well. Most classifiers are trained under a multinomial distribution assumption etc.
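
To make the L2-loss point concrete: under the assumption of independent Gaussian noise with a fixed sigma, the negative log-likelihood is just the sum of squared errors scaled and shifted by constants, so minimizing one minimizes the other. A minimal sketch with made-up numbers (the helper names are hypothetical):

```python
import math


def sse(ys, preds):
    """Sum of squared errors, i.e. the (unnormalized) L2 loss."""
    return sum((y - p) ** 2 for y, p in zip(ys, preds))


def gaussian_nll(ys, preds, sigma=1.0):
    """Negative log-likelihood of ys under y ~ N(pred, sigma^2), independent noise."""
    n = len(ys)
    return 0.5 * n * math.log(2 * math.pi * sigma ** 2) + sse(ys, preds) / (2 * sigma ** 2)


ys = [1.0, 2.1, 2.9, 4.2]        # made-up targets
preds_a = [1.1, 2.0, 3.0, 4.0]   # predictions from hypothetical model A
preds_b = [0.5, 2.5, 3.5, 3.0]   # predictions from hypothetical model B

# The NLL gap between two models equals the SSE gap divided by 2 * sigma^2.
gap_nll = gaussian_nll(ys, preds_a) - gaussian_nll(ys, preds_b)
gap_sse = (sse(ys, preds_a) - sse(ys, preds_b)) / 2.0
print(abs(gap_nll - gap_sse) < 1e-9)
```

The network can be as overparametrized as you like; the Gaussian assumption lives in the loss, not in the architecture.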

And fat-tailed distributions are definitely a thing. It's just less of a concern for the mainstream CV problems on which people apply DL.

> In ML (or more specifically deep learning), we make no distribution-based assumptions, other than the fundamental assumption that our training data is "distributed like" our test data.

Okay, so that's about the same as classical statistics. You're just waiving the requirement to know what the distribution is. You are still assuming there exists a distribution and that it holds in the future when you apply the model. Sure you may not be trying to estimate parameters of a distribution, but it is still there and all standard statistical caveats still apply.

> Indeed, with the use of autoencoders, we don't assume a single distribution, but rather a stochastic process.

Classical statistics frequently makes use of multiple distributions and stochastic processes.

Of course there's a distribution behind the data. The parent commenter was saying not all machine learning techniques need to know that distribution, as a rebuttal to their parent comment.

I know what they're saying, I even reiterate it in my second sentence. My point is that doesn't protect you from the distribution changing, which is a problem that applies to machine learning and classical statistics.

This is in support of the GP comment: while you can loosen your assumptions about what the underlying distribution is and don't literally need to know it, you can't get away from the fundamental limitations of statistics. Which is the original topic we're talking about.

I dunno, there are definitely distribution-based assumptions—good luck working with skewed data. Most old-school techniques are kinda additive, so nobody's really been assuming a single distribution for practical applications.

Current ML techniques just work well for the kinds of problems people are applying them to, which is kind of a tautology. We should definitely seek to understand the theory behind stuff like dropout and not consider our lack of understanding a strength.

> I suppose you could say statistics is less "empirical" than ML in the sense that it is axiom-based, whether that is a normality assumption of predictions about a regression line or stock prices following a Wiener process. By contrast, ML is less rationalist by simply reflecting data.

I don't think that's true (or maybe I misunderstood?). I guess your comment "simply reflecting data" means fitting data with a very flexible function (curve)? There are very flexible distributions that can fit almost any kind of data, e.g. https://en.wikipedia.org/wiki/Gamma_distribution or a composition of them, but as a practitioner you still need to interpret the model and check that it represents the underlying process well. Both statistical inference and ML are getting there using different methods.

The only reason that this may not be accurate for ML is because machine learners generally make no attempt to quantify their uncertainty in their predictions with e.g. confidence intervals or prediction intervals.

And there is a whole field of non-parametric statistics that doesn't make distribution assumptions.

I agree -- as ML becomes increasingly easy for non-experts or people without a heavy math/stats background to apply, I've seen an increasing volume of arguments against the data science profession (someone the other day called DS the "gate-keepers"), but there be dragons.

Anyone can use SOTA deep learning models today, but in my experience, it's more important to understand the answers to "what are the shortcomings/consequences of using a particular method to solve this problem?", "what are (or could be) the biases in this dataset?", etc. It requires a non-trivial understanding of the underlying methodology and statistics to reliably answer these questions (or at least worry about them).

Can you apply deep reinforcement learning to your problem? Maybe. Should you? Well, it depends, and you should understand the pros and cons, which requires more than just the knowledge of how to make API calls. There are consequences to misusing ML/AI, and they may not even be obvious from offline testing and cross validation.

Personally, I think the main problem with ML is simpler: it works well for interpolation, and is crap for extrapolation.

If the outputs you want are well within the bounds of your training data set, ML can do wonders. If they aren't, it'll tell you that in 20 years everyone will be having -0.2 children and all the other species on the planet will start having to birth human babies just so they can be thrown into the smoking pit of bad statistical analysis.
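
A toy version of the "-0.2 children" failure, as a sketch: fit a straight line to a declining series and ask it about a point well outside the data. The numbers are invented so the arithmetic is easy to follow, and `fit_line` is a hypothetical helper, not anyone's actual model.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Invented data: average children per family, one point per decade.
decades = [0, 1, 2, 3, 4, 5]
children = [3.0, 2.6, 2.2, 1.8, 1.4, 1.0]
intercept, slope = fit_line(decades, children)


def predict(x):
    return intercept + slope * x


inside = predict(2.5)   # within the training range: a sensible value
outside = predict(8)    # three decades past the data: a negative family size
print(round(inside, 2), round(outside, 2))
```

The fit is perfect on the training range and still produces nonsense outside it, because nothing in the data says the trend has to flatten out.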

I agree, but that's equivalent to my original claim.

Being bad at extrapolation is a consequence of assuming your training data can describe the distribution of your phenomenon and being wrong.

Outside of simple time series, I'm not aware of any good way to extrapolate.

One way to extrapolate is to use a mechanistic or semi-mechanistic model. The recent advances in neural differential equations are a really cool example of this.

> If they aren't, it'll tell you that in 20 years everyone will be having -0.2 children and all the other species on the planet will start having to birth human babies just so they can be thrown into the smoking pit of bad statistical analysis.

https://xkcd.com/605/

i think this actually gets at what makes applied ML distinct from statistics as a practice, even though there is a ton of overlap.

statisticians make assumptions 1 and 2, and think of themselves as trying to find the "correct" parameters of their model.

people doing applied ML typically assume they don't know 1 (although they might implicitly make some weak assumptions like sub-gaussian to avoid fat tails, etc.) and also typically don't care about being able to do 2. and they don't care about their parameters; in a sense to an ML practitioner, every parameter is a nuisance parameter.

instead you assume you have some reliable way of evaluating performance on the task you care about -- usually measuring performance on an unseen test set. as long as this is actually reliable, then things are fine.

but you are right that in the face of a shifting distribution or an adversary crafting bad inputs, ML models can break down -- but there is actually a lot of research on ways to deal with this, which will hopefully reach industry sooner rather than later.

> instead you assume you have some reliable way of evaluating performance on the task you care about -- usually measuring performance on an unseen test set. as long as this is actually reliable, then things are fine.

This is the part that often fails in practice. Think of all the benchmarks that show superhuman performance and compare that to how good those same models really aren't. Constructing a good set of holdouts to evaluate on is really hard and gets back to similar issues. In practice, doing what you're describing reliably (in a way that actually implies you should have confidence in your model once you roll it out) is rarely as simple as holding out some random bit of your dataset and checking performance on it.

On the other hand, what you often see is people just holding out a random bunch of rows.
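
One common alternative to a random row holdout, sketched under the assumption that each row carries a timestamp and that the model will be used on data newer than anything it was trained on. The function name and the row layout are hypothetical.

```python
def time_ordered_split(rows, time_key, test_fraction=0.2):
    """Hold out the most recent rows instead of a random sample.

    Every training row is no later than every test row, which mimics
    deployment on future data better than shuffling does.
    """
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be in (0, 1)")
    ordered = sorted(rows, key=time_key)
    n_test = max(1, round(len(ordered) * test_fraction))
    cut = len(ordered) - n_test
    return ordered[:cut], ordered[cut:]


# Made-up rows with a timestamp "t" and a label "y".
rows = [{"t": t, "y": t % 3} for t in [5, 1, 9, 3, 7, 2, 8, 4, 6, 0]]
train, test = time_ordered_split(rows, time_key=lambda r: r["t"])
print([r["t"] for r in test])
```

This does not solve the holdout problem in general (grouped data, leakage through features, and shifting populations all need their own care), but it removes one easy way to fool yourself.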

I disagree, every ML model has some implicit statistical assumption, which is often not well understood by practitioners.

At minimum you must assume your underlying process is not fat tailed. If it is, then your training/validation/test data might never be enough to make reliable predictions and your model might break constantly in prod.
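
A small simulation of why this matters, as a sketch: compare how much of the total a single largest observation contributes for a thin-tailed sample versus a fat-tailed one. The choice of an exponential and a Pareto with shape 1.1 (finite mean, infinite variance) is an assumption made purely for illustration.

```python
import random


def max_share(samples):
    """Fraction of the total contributed by the single largest observation."""
    return max(samples) / sum(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000

thin = [rng.expovariate(1.0) for _ in range(n)]    # thin-tailed
fat = [rng.paretovariate(1.1) for _ in range(n)]   # fat-tailed, infinite variance

thin_share = max_share(thin)
fat_share = max_share(fat)
print(f"thin: {thin_share:.6f}  fat: {fat_share:.6f}")
```

In the thin-tailed sample no single point matters; in the fat-tailed one a handful of draws can dominate the sum, so estimates built from a finite sample (including a validation score) can be badly off.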

BTW shifting distributions and fat tailed distributions are sort of equivalent, at least mathematically.

I don't disagree with any of that, but I still think a responsible, clear-thinking ML practitioner can avoid having to assume the form of the data-generating process, depending on their application.

In some cases if you care about PAC generalization bounds, it's even the case that the bounds do actually hold for all possible distributions.
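
For a flavor of what a distribution-free bound looks like, here is a sketch based on Hoeffding's inequality plus a union bound over a finite hypothesis class. It assumes i.i.d. samples and a loss bounded in [0, 1], but nothing about the shape of the distribution. The function name is hypothetical.

```python
import math


def hoeffding_samples(epsilon: float, delta: float, n_hypotheses: int = 1) -> int:
    """Samples needed so that, with probability at least 1 - delta, every
    hypothesis in a finite class has |empirical risk - true risk| <= epsilon.

    From 2 * |H| * exp(-2 * n * epsilon^2) <= delta.
    """
    if not (0 < epsilon < 1 and 0 < delta < 1 and n_hypotheses >= 1):
        raise ValueError("need 0 < epsilon, delta < 1 and n_hypotheses >= 1")
    return math.ceil(math.log(2 * n_hypotheses / delta) / (2 * epsilon ** 2))


print(hoeffding_samples(0.05, 0.05))         # a single fixed model
print(hoeffding_samples(0.05, 0.05, 1000))   # picking the best of 1000 models
```

The bound holds for any distribution, but only for the distribution the samples were drawn from, and it is usually far too loose to be useful for large models.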

I think it's more meaningful to have the discussion in a specific problem domain, since statistical inference and ML are just tools to better model a problem / phenomenon. The domain (prior) knowledge -- everything else that's not stats / ML -- is the key to building a more robust model. Leave the problem domain out and we are left with just pure mathematical theories, and the points can only be proved on simulated data.

Yes - this is pretty much exactly how I explain the difference between machine learning and statistics.

Despite using similar models, the expertise required for 'doing statistics' (statistical inference) is actually very different from machine learning. Machine learning fits into the 'hacker mentality' well - try stuff out see what works. To do statistical inference effectively, you really do need to spend time learning the theory. They both require deep skills - but the skills are surprisingly different considering it's often the same underlying model.

But without some statistical knowledge, isn’t there a risk of a lack of understanding about the robustness of “what works”?

Statistical knowledge doesn’t remove that risk. The extent to which it even lowers the risk is a question that could be answered empirically.

yeah, agreed - a good understanding of the model's statistical assumptions can often help you make the model more robust and also give you ideas for what types of feature engineering are likely to work.

"Every parameter is a nuisance parameter" is a great way to put it.

ML looks (for many people) like a way to circumvent your grumpy statistician saying that the underlying data is worthless and/or you should focus on getting the data pipeline done properly for a logit model on your churn rate.

"Scientist-free science" -- being able to optimize systems without understanding them -- has been a dream of the business world since the dawn of time. There's always been a market for cookbook recipes that automate the collection of data and the interpretation of results. Before ML, there were "design of experiments" and "statistical quality control."

> Before ML, there were "design of experiments," and "statistical quality control."

Statistical quality control, at least the way I know it, is very useful in finding problems in your process. I'm also not sure how this fits with your premise. It's about optimizing systems by first finding out where to look, and then looking there in detail with expert knowledge, i.e. deep understanding of your system.

I'm definitely with you there, but I've also seen the side of it where it turns into a cargo cult and runs headlong into the replication crisis.

Perhaps the good thing is that as the new things gain popular attention, the old techniques such as SPC are under less pressure to support success theater, and revert to being actual useful, solid tools.

Isn't the point of ML exactly that you don't know the underlying distribution? How is this ever assumed in any way? ML is not parametric statistics.

(Some) ML is non-parametric, but there are always some questions you need to be able to answer about your data. At bare minimum, is the generating process ergodic, what is the error of your measurement procedure, how representative of the true underlying distribution is your sampling procedure? All use of data should start with some exploratory analysis before you ever get to the modeling stage.

Once you have a model, at minimum understand how to tune for the tradeoffs of different types of error and don't naively optimize for pure accuracy. At the obvious extremes, if you're trying to prevent nuclear attack, false negatives are much more costly than false positives, if you're trying to figure out whether to execute someone for murder, false positives are much more costly than false negatives. Understand the relative costs of different types of error for whatever you're trying to predict and proceed accordingly.
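
The error-cost tradeoff can be written down in a few lines. This sketch assumes the model outputs a reasonably calibrated probability and that correct decisions cost nothing; the function names and the cost numbers are hypothetical.

```python
def cost_threshold(cost_fp: float, cost_fn: float) -> float:
    """Probability above which predicting 'positive' has lower expected cost.

    Predicting positive costs (1 - p) * cost_fp in expectation,
    predicting negative costs p * cost_fn; they cross at this threshold.
    """
    if cost_fp <= 0 or cost_fn <= 0:
        raise ValueError("costs must be positive")
    return cost_fp / (cost_fp + cost_fn)


def decide(p_positive: float, cost_fp: float, cost_fn: float) -> bool:
    """True means 'act as if positive'."""
    return p_positive >= cost_threshold(cost_fp, cost_fn)


# Missing a real event is 9x worse than a false alarm: alarm early.
print(cost_threshold(1, 9), decide(0.2, 1, 9))
# A false alarm is 9x worse than a miss: demand much more evidence.
print(cost_threshold(9, 1), decide(0.2, 9, 1))
```

The same 20% prediction leads to opposite decisions depending on the costs, which is why optimizing plain accuracy (an implicit 0.5 threshold with equal costs) is rarely the right target.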

Well, all optimization problems are equivalent to a maximum likelihood estimate for a corresponding probability distribution so you may make more implicit assumptions than you think.

Typical ML methods just have a huge distribution space that can fit almost anything from which they pick just 1 option. This has two downsides:

Since your distribution space is several times too large by design you lose the ability to say anything useful about the accuracy of your estimate, other than that it is not the only option by far.

Since you must pick 1 option from your parameter space you may miss slightly less likely explanations that may still have huge consequences, which means your models tend to end up overconfident.

I mean yes, there is parametric ML (maximum likelihood, MAP, GMMs, ...) and there is non-parametric ML (everything neural network, SVM, GBM, random forests, ...).

I'd argue that the latter had bigger success in the past since the prior on the data distribution is usually wrong in real life. Think about a prior for image data distributions or the same in nlp. Forget about it.

[Disclosure: I'm an IBMer - not involved with this work]

With regard to exploitation, IBM research has done some interesting work in the form of an open source "Adversarial Robustness Toolbox" [0]. "The open source Adversarial Robustness Toolbox provides tools that enable developers and researchers to evaluate and defend machine learning models and applications against the adversarial threats of evasion, poisoning, extraction, and inference."

It's fascinating to think through how to design the 2nd and 3rd order side-effects using targeted data poisoning to achieve a specific outcome. Interestingly, poisoning could be to force a specific outcome for a one-time gain (e.g. feed data in a way to ultimately trigger an action that elicits some gain/harm) or to alter the outcomes over a longer time horizon (e.g. teach the bot to behave in a socially unacceptable way).

[0] https://art360.mybluemix.net/

The problem is that in high dimensions, knowing the distribution, or even characterizing it fully with data, is incredibly difficult (curse of dimensionality). I think the real assumption in ML is just that there is some low-dimensional space that characterizes the data well, and ML algorithms find these directions where the data is constant.

Wait until you find out how many studies have been published in medical journals with serious statistical flaws.

> 1 - You know the distribution of the phenomenon under study (or make an explicit assumption and assume the risk of being wrong)

Nonparametric methods say 'hi'.

> You know the distribution of the phenomenon under study

If you know the distribution of the phenomenon under study you don't need ML; that is what probability is for.

> or make an explicit assumption and assume the risk of being wrong

No. You have the bias/variance tradeoff here. You can make an explicit assumption about your model or not.

> Using (1), you calculate how much data you need so you get an estimation error below x%

This is extremely complicated for anything except the most trivial toy examples, probably not solvable at all and definitely not the way biological intelligent systems (aka some humans) do it.