Hacker News
by bitL 3495 days ago
An honest question - do we even need statistics when we have machine learning? Statistics appears to me to be a hack/aggregation of data we couldn't process all at once in the past; these days ML + Big Data can process it, so we can do computational inference instead of statistics. To me this looks like looking back to the "old ways" for a reference point instead of looking forward to the unknown but more exciting.
5 comments

Sorry you're getting down-voted, I don't think it's an unreasonable question.

In the sense I think you're using it, "statistics" are really methods for dimensionality reduction - we take means, medians and standard deviations in the hope that they will capture the parts of the data we care about. This is important for two reasons. First, for anything of even moderately high dimension we'll never have enough data to forgo some means of aggregation, due to the "curse of dimensionality". Second, the human-machine information bandwidth is annoyingly low, so we need some way to compress information for human consumption. "Statistics" are one way we do so.
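
A minimal Python sketch of both points, using only the standard library. The sample, the grid of 10 bins per axis and the helper names `summarize` and `points_per_cell` are made up for illustration; they are not from the thread.

```python
import random
import statistics

def summarize(sample):
    """Compress a whole sample into three summary statistics."""
    return {
        "mean": statistics.fmean(sample),
        "median": statistics.median(sample),
        "stdev": statistics.stdev(sample),
    }

def points_per_cell(n_points, bins_per_axis, dims):
    """Average number of points per grid cell when each axis is cut into bins."""
    return n_points / bins_per_axis ** dims

rng = random.Random(0)  # fixed seed so the run is repeatable
sample = [rng.gauss(10.0, 2.0) for _ in range(10_000)]
summary = summarize(sample)
print(f"{len(sample)} values compressed to {len(summary)} numbers: {summary}")

# With a million points and 10 bins per axis, the cells empty out quickly
# as the dimension grows: that is the "curse of dimensionality".
for dims in (1, 3, 6, 10):
    print(dims, "dims:", points_per_cell(1_000_000, 10, dims), "points per cell")
```

At 3 dimensions each cell still holds about a thousand points; at 10 dimensions almost every cell is empty, so some form of aggregation is unavoidable.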

"Statistics" is also a field of study based around understanding how multiple data points relate to each other - that is of course critical to machine learning, and I think the terminology collision is why you're getting downvoted.

Machine learning is often considered a statistical technique. The main difference seems to be that in traditional statistics, people derive practice from theory, whereas in ML people will try out techniques and figure out the theory later. That's really just a cultural difference. The techniques for analyzing ML models are all statistical to begin with.

Statistics, as a field, already used general-purpose optimization algorithms before modern ML techniques came about, so in that sense, ML just fits into an existing position in the statistical toolbox (like replacing a chisel with a 3D printer). In the other direction, statistical techniques like cross-validation are necessary for you to get your ML correct.
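
For readers who haven't met cross-validation, here is a rough standard-library sketch of the k-fold variant. The straight-line model, the synthetic data and the function names are assumptions chosen to keep the example short; in practice you would plug in your own model and data.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs; each index is tested exactly once."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def cross_val_mse(xs, ys, k=5):
    """Average held-out mean squared error over k folds."""
    fold_errors = []
    for train, test in k_fold_splits(len(xs), k):
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (a + b * xs[i])) ** 2 for i in test) / len(test)
        fold_errors.append(mse)
    return sum(fold_errors) / k

# Synthetic data: y = 3 + 2x plus unit-variance Gaussian noise.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(200)]
ys = [3.0 + 2.0 * x + rng.gauss(0, 1.0) for x in xs]
cv_mse = cross_val_mse(xs, ys)
print("5-fold CV mean squared error:", cv_mse)
```

The model is always scored on points it was not fitted to, so the estimate should land near the noise variance rather than flattering the fit.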

There is much more to ML than just statistics. I was basically asking why the "statistics filter" is so often on in ML. Neural networks don't seem to be a statistical technique, even if somebody uses them for regression. Yes, there is an overlap, but no, ML != statistics. As you mentioned, non-linear optimization is used in statistics at a meta level; however, nobody claims statistics is operations research or vice versa.
I agree with this comment because ML comprises some techniques that are still not understood [1].

I would like to think that statistics comes more from a pure math approach, loosely, while ML comes from an applied math approach, loosely. ML works spectacularly well on a class of problems. Why it does what it does is (in)conveniently brushed under the rug. How you treat that (in)convenience is left to you.

[1] - https://www.quantamagazine.org/20151203-big-datas-mathematic...

In what way is a neural network not a statistical technique?

Very curious to hear your point of view. Statistics is not just linear regression and ANOVA, or whatever catalogue of techniques is in a freshman book, not even the library of techniques available in R.

Statistics is the application of probability (or more broadly, math) to data.

That said, statisticians did miss the neural net wave because of their flippant reaction to it. They said, "oh well, yet another non-parametric function approximator; we worked out the asymptotics 30 years ago".

To paraphrase someone wise: asymptotically we are all dead. Not enough heed was paid to that. Among the other things they lacked was expertise in algorithms and optimization. Mind you, optimization has been at the core of their craft from its very genesis; it's just that they did not feel it important enough to ride the cutting edge of research on optimization. Note: you cannot do maximum likelihood without solving an optimization problem. Gauss was doing it for statistics a couple of hundred years ago. If I go on a bit further with my rant, they got a bit carried away with their fetish for bias and asymptotic normality. They missed the wave, sure.
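
To make the maximum-likelihood remark concrete, here is a small sketch under assumed conditions: exponentially distributed data and a golden-section search as the optimizer. Both are picked only because they fit in a few lines. This particular model also has a closed-form answer, which lets the numerical result be checked.

```python
import math
import random

def exp_log_likelihood(rate, data):
    """Log-likelihood of i.i.d. exponential data with the given rate."""
    return len(data) * math.log(rate) - rate * sum(data)

def golden_section_max(f, lo, hi, tol=1e-6):
    """Maximize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    while hi - lo > tol:
        c = hi - inv_phi * (hi - lo)
        d = lo + inv_phi * (hi - lo)
        if f(c) > f(d):
            hi = d  # the maximum lies in [lo, d]
        else:
            lo = c  # the maximum lies in [c, hi]
    return (lo + hi) / 2

rng = random.Random(0)
data = [rng.expovariate(2.0) for _ in range(5000)]  # true rate is 2.0

numeric_mle = golden_section_max(lambda r: exp_log_likelihood(r, data), 1e-3, 50.0)
closed_form_mle = len(data) / sum(data)  # known analytic maximizer
print("numeric:", numeric_mle, "closed form:", closed_form_mle)
```

For most models of practical interest there is no closed form, and the optimizer is the only route to the estimate.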

But all said and done, by any accepted definition of statistics, NN is very much also statistics.

Statisticians are either frequentists or Bayesians. Fundamental to both of these approaches is the involvement of probability. A technique that does not have a probabilistic interpretation is not a statistical technique. This is also largely related to the difference in goals between statistics and machine learning: statistics is primarily about inference, and machine learning is primarily about prediction. You can predict without necessarily making any meaningful probabilistic statements about data. There's not really much to infer without saying something probabilistic.
Neural nets have absolutely been associated with probabilistic interpretations. Good resources would be David MacKay and Radford Neal. Both their approaches are Bayesian. A far more trivial way to attach a probabilistic interpretation is to claim that the neural net is an estimate of the conditional expectation.
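
A toy illustration of the conditional-expectation claim. This is not a neural net: a lookup table with one parameter per input value stands in for a flexible model, and the data are synthetic. The point it sketches is only that minimizing squared error by gradient descent drives the prediction for each x toward the average of y given x.

```python
import random

rng = random.Random(0)

# X takes the values 0, 1, 2 and Y = 5 * X + noise, so E[Y | X = x] = 5 * x.
data = []
for _ in range(3000):
    x = rng.randrange(3)
    data.append((x, 5.0 * x + rng.gauss(0.0, 1.0)))

# "Model": one free parameter per input value, trained on mean squared error.
pred = {0: 0.0, 1: 0.0, 2: 0.0}
learning_rate = 0.5
for _ in range(100):
    grad = {0: 0.0, 1: 0.0, 2: 0.0}
    for x, y in data:
        grad[x] += 2.0 * (pred[x] - y) / len(data)  # d(MSE)/d(pred[x])
    for x in pred:
        pred[x] -= learning_rate * grad[x]

# Empirical conditional means, computed directly for comparison.
group_mean = {
    x: sum(y for xi, y in data if xi == x) / sum(1 for xi, _ in data if xi == x)
    for x in pred
}
for x in pred:
    print(x, "trained:", round(pred[x], 4), "conditional mean:", round(group_mean[x], 4))
```

The trained values match the per-group sample means, which in turn sit close to 5 * x.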

And who says frequentist and Bayesian are the only two views? Where would you shelve prequential statistics then? Or nonparametric regression?

Prediction has definitely been a part of statistics but often, as you rightly claim, as a byproduct. And yes, I would characterize the focus of stats and ML exactly as you did.

The point is not whether somebody has ever tried to associate neural nets with probability; sorry if my previous comment made it seem that way. The point is that neural nets are not fundamentally tied to it. You can tie them to probability, but you certainly don't have to: mostly they aren't, they mostly aren't taught that way, and the big open problems in the field don't involve it.

Statistics works the other way. You basically always start with some kind of probabilistic model. And then, if you even bother with prediction, you work towards prediction from the probabilistic model. With stats you don't need to interpret or add probability after the fact, it's already there.

Obviously stats and ML are enormous fields, with quite some overlap. And people tend to go after low-hanging fruit; if many people who study neural nets have formal probability backgrounds, it simply makes sense that someone will write a paper on it. And I'm generalizing here (the same goes for my frequentist & Bayesian comment). But there absolutely is justification for saying "neural nets are not really a statistical technique".

But isn't prediction by itself a probabilistic concept? Isn't one interested in the confidence of the prediction?
Yes we still need statistics. There is a huge overlap between machine learning methods and applied statistics, so much so that often there is not a clear distinction between the two.
> between machine learning methods and applied statistics (...) often there is not a clear distinction between the two.

I would say applied statistics draws a line just prior to implementation concerns (say, real-world resource usage measured in time, space and energy) whereas these would be fully within scope and of interest in machine learning.

As an example, applied statistics could provide a useful approach to a vision/image recognition problem, and this approach might be provably unrealizable in practice using real-world execution units (e.g. CUDA cores). Nonetheless, it might still be a very worthwhile theoretical result in applied statistics, although of no immediate interest within ML except to hint at a potential new area of research.

It may not be true for all branches of machine learning (fuzzy logic, for example?), but the vast majority of modern ML techniques are equivalent to or can be viewed as types of statistical machine learning.
Good point about the fuzzy logic, often the boundaries between it and statistics are... fuzzy
ML + Big Data are a specific application of statistics

To do anything beyond using tools other people have made (and never being sure whether the results are meaningful or not), statistics are required

Of course, to make money from the ML boom you can probably get away with coincidence and correlation

Statistics means aggregating stuff and using simplified characteristics drawn from semi-structured data. ML + Big Data allows you to ask precise questions like Where? How? Which ones?
As user "highd" suggested, I think you are confusing two words. I refer to Wiktionary for definitions:

Statistics: A mathematical science concerned with data collection, presentation, analysis, and interpretation.

Statistic: A quantity calculated from the data in a sample, which characterises an important aspect in the sample (such as mean or standard deviation).

If "statistics" is the term for taking a mathematical approach to understanding data, then "machine learning" is basically an applied subset of that. But you seem to be specifically using the "a statistic" definition to describe what you think the "study of statistics" is entirely concerned with.