by kvathupo 1907 days ago
As currymj commented, this isn't accurate for ML, only for classical statistics.

In ML (or more specifically deep learning), we make no distribution-based assumptions, other than the fundamental assumption that our training data is "distributed like" our test data. Thus, there aren't issues with fat-tailed distributions since we make no such normality assumptions. Indeed, with the use of autoencoders, we don't assume a single distribution, but rather a stochastic process.

I suppose you could say statistics is less "empirical" than ML in the sense that it is axiom-based, whether that is a normality assumption of predictions about a regression line or stock prices following a Wiener process. By contrast, ML is less rationalist by simply reflecting data.


It is absolutely untrue that DL is immune to fat-tail problems, and it is important that no one operate mission-critical systems under this assumption.

The two fat tail questions one has to engage are:

- is it possible that a catastrophic input might be lurking in the wild that would not be present in a typical training set? Even with a 1M instance training set, a one-in-a-million situation will only appear (and affect your objective function) on average one time, and could very well not appear at all.

- can I bound how badly I will suffer if my system is allowed to operate in the wild on such an input?

DL gives no additional tools to engage these questions.
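
As a rough sanity check on the first question, here is a minimal sketch of the arithmetic. It assumes training instances are independent draws and that the catastrophic input has a fixed per-instance probability; both are simplifying assumptions, and the function names are made up for illustration.

```python
import math

def expected_count(p: float, n: int) -> float:
    """Expected number of times an event with probability p shows up in n draws."""
    return p * n

def prob_never_seen(p: float, n: int) -> float:
    """Probability the event shows up zero times in n independent draws.

    Computes (1 - p) ** n via log1p for numerical stability with tiny p.
    """
    return math.exp(n * math.log1p(-p))

p = 1e-6          # a one-in-a-million input
n = 1_000_000     # a 1M-instance training set

print(f"expected appearances: {expected_count(p, n):.1f}")
print(f"chance it never appears: {prob_never_seen(p, n):.3f}")  # close to 1/e
```

Under those assumptions the input is missing from the training set roughly 37% of the time, so "appears once on average" is not much comfort.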

> It is absolutely untrue that DL is immune to fat-tail problems

In fact, working on fat tail problems is currently a hot topic in ML.

I don't quite follow: isn't what you described, the occurrence of a gross outlier, a flaw fundamental to all forecasting? I should clarify that DL doesn't suffer from the problem that a normality assumption has with fat tails: a failure to capture the skew of the distribution.

It's not characteristic of all forecasting, only purely empirical forecasting.

Definitionally, the only way to reason about risk that doesn't appear in training data is non-empirical (e.g. a priori assumptions about distributions, or worst cases, or out-of-paradigm tools like refusing to provide predictions for highly non-central inputs).

DL is not any better (or worse) than any other purely empirical method at answering questions about fat-tail risk, and the only way to do better is to use non-empirical/a-priori tools. Obviously the tradeoff here is that your a priori assumptions can be wrong, and that too needs to be included in your risk model (see e.g. Robust Optimization / Robust Control).
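
To illustrate how a wrong a priori assumption distorts tail risk, here is a small sketch comparing the tail probability a Gaussian assumption would report against the true tail of a fat-tailed distribution with the same variance. The choice of Student's t with 3 degrees of freedom is purely an assumption for illustration (it has a closed-form CDF and variance 3); the function names are hypothetical.

```python
import math

def normal_two_sided_tail(k: float) -> float:
    """P(|Z| > k) for a standard normal Z."""
    return math.erfc(k / math.sqrt(2.0))

def student_t3_cdf(t: float) -> float:
    """Closed-form CDF of Student's t with 3 degrees of freedom."""
    u = t / math.sqrt(3.0)
    return 0.5 + (u / (1.0 + u * u) + math.atan(u)) / math.pi

def t3_two_sided_tail(k: float) -> float:
    """P(|X| > k * sd) for X ~ t(3), where sd = sqrt(3) is its standard deviation."""
    sd = math.sqrt(3.0)
    return 2.0 * (1.0 - student_t3_cdf(k * sd))

for k in (3, 5):
    assumed = normal_two_sided_tail(k)   # what a Gaussian model would report
    actual = t3_two_sided_tail(k)        # what the fat-tailed source actually does
    print(f"{k}-sigma event: assumed {assumed:.2e}, actual {actual:.2e}, "
          f"ratio {actual / assumed:.0f}x")
```

In this toy setup the Gaussian assumption understates the chance of a 5-sigma event by a factor in the thousands, which is the kind of model error a risk analysis has to budget for.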

I think it's wrong to assume that non-empirical methods can be reliably trusted to give better results. Humans are terrible at avoiding bias or evaluating risks, especially for uncommon events.

Food for thought: if every method for predicting event x is terrible, then you might as well not try to predict x and build your life in such a way that you never expose yourself to the risk of x happening.

From a Bayesian point of view, that amounts to a "prediction" that the probability of event x is so significant that you should build your life around it. But I guess if you knew enough for that sentence to make sense you wouldn't have posted your comment. So, suffice it to say that Bayesian decision theory cuts the knot you're talking about.

I agree that ML tends to put weaker assumptions on the data than classical statistics and that it's a good thing.

However, most ML certainly makes distributional assumptions; they are just weaker. When you're learning a huge deep net with an L2 loss on a regression task, you have a parametric conditional Gaussian distribution under the hood. Being overparametrized doesn't mean there's no distributional assumption. Vanilla autoencoders also work under a multivariate Gaussian setup. Most classifiers are trained under a multinomial distribution assumption, etc.
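
The L2 point can be checked numerically. A minimal sketch, assuming a fixed noise scale `sigma` (the toy targets and predictions are invented for illustration): the average Gaussian negative log-likelihood is a constant plus the mean squared error divided by `2 * sigma**2`, so minimizing one minimizes the other.

```python
import math

def mse(y, y_hat):
    """Mean squared error between targets and predictions."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def gaussian_nll(y, y_hat, sigma=1.0):
    """Average negative log-likelihood of y under N(y_hat, sigma^2)."""
    const = 0.5 * math.log(2.0 * math.pi * sigma ** 2)
    return sum(const + (a - b) ** 2 / (2.0 * sigma ** 2)
               for a, b in zip(y, y_hat)) / len(y)

# Toy targets and two candidate prediction vectors (made up).
y = [1.0, 2.0, 3.5, -0.5]
pred_a = [0.8, 2.1, 3.0, 0.0]
pred_b = [1.5, 1.0, 4.0, -2.0]

sigma = 1.0
const = 0.5 * math.log(2.0 * math.pi * sigma ** 2)
for name, pred in (("a", pred_a), ("b", pred_b)):
    lhs = gaussian_nll(y, pred, sigma)
    rhs = const + mse(y, pred) / (2.0 * sigma ** 2)
    print(f"pred_{name}: nll={lhs:.6f}  const + mse/(2*sigma^2)={rhs:.6f}")
```

Both columns agree, and whichever prediction has the lower MSE also has the lower Gaussian NLL.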

And fat-tailed distributions are definitely a thing. It's just less of a concern for the mainstream CV problems on which people apply DL.

> In ML (or more specifically deep learning), we make no distribution-based assumptions, other than the fundamental assumption that our training data is "distributed like" our test data.

Okay, so that's about the same as classical statistics. You're just waiving the requirement to know what the distribution is. You are still assuming there exists a distribution and that it holds in the future when you apply the model. Sure you may not be trying to estimate parameters of a distribution, but it is still there and all standard statistical caveats still apply.

> Indeed, with the use of autoencoders, we don't assume a single distribution, but rather a stochastic process.

Classical statistics frequently makes use of multiple distributions and stochastic processes.

Of course there's a distribution behind the data. The parent commenter was saying not all machine learning techniques need to know that distribution, as a rebuttal to their parent comment.

I know what they're saying, I even reiterate it in my second sentence. My point is that this doesn't protect you from the distribution changing, which is a problem that applies to both machine learning and classical statistics.

This is in support of the GP comment: while you can loosen your assumptions about what the underlying distribution is and don't literally need to know it, you can't get away from the fundamental limitations of statistics. Which is the original topic we're talking about.
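
A toy sketch of the distribution-changing problem, under stated assumptions: the "true" process is a made-up quadratic, the model is an ordinary least-squares line, and the shift is simply moving the inputs from [0, 1] to [2, 3]. Nothing here is specific to any particular ML method; it only shows a fit that looks fine on data like the training set and degrades once the inputs move.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def line_mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on (xs, ys)."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def truth(x):
    """Hypothetical data-generating process (an assumption for this demo)."""
    return x * x

train_x = [i / 100 for i in range(101)]        # inputs seen in training: [0, 1]
shifted_x = [2 + i / 100 for i in range(101)]  # inputs after the shift: [2, 3]

slope, intercept = fit_line(train_x, [truth(x) for x in train_x])
train_mse = line_mse(train_x, [truth(x) for x in train_x], slope, intercept)
shifted_mse = line_mse(shifted_x, [truth(x) for x in shifted_x], slope, intercept)

print(f"MSE on training-like inputs: {train_mse:.4f}")
print(f"MSE on shifted inputs:       {shifted_mse:.4f}")
```

The error on the shifted inputs is orders of magnitude larger, even though no assumption about the form of the input distribution was ever written down.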

I dunno, there are definitely distribution-based assumptions—good luck working with skewed data. Most old-school techniques are kinda additive, so nobody's really been assuming a single distribution for practical applications.

Current ML techniques just work well for the kinds of problems people are applying them to, which is kind of a tautology. We should definitely seek to understand the theory behind stuff like dropout and not consider our lack of understanding a strength.

> I suppose you could say statistics is less "empirical" than ML in the sense that it is axiom-based, whether that is a normality assumption of predictions about a regression line or stock prices following a Wiener process. By contrast, ML is less rationalist by simply reflecting data.

I don't think that's true (or maybe I misunderstood?). I guess by "simply reflecting data" you mean fitting data with a very flexible function (curve)? There are very flexible distributions that can fit almost any kind of data, e.g. https://en.wikipedia.org/wiki/Gamma_distribution or a composition of them, but as a practitioner you still need to interpret the model and check whether it represents the underlying process well. Both statistical inference and ML are getting there using different methods.

The only reason that this may not be accurate for ML is because machine learners generally make no attempt to quantify their uncertainty in their predictions with e.g. confidence intervals or prediction intervals.

And there is a whole field of non-parametric statistics that doesn't make distribution assumptions.