Hacker News new | ask | show | jobs
by srean 2903 days ago
Graycat and I have a history of conversations on HN regarding the differences between statistics and machine learning approaches.

ML is a term that came from Russian probabilists and statisticians. ML is applied probability, optimization and algorithms. A traditional statistician will be strong in the former but not necessarily in the other two.

One can of course take the 'what sticks to the wall approach' but it would be a mischaracterization to state that a principled approach does not exist in ML literature. There are theorems stating when to try one over the other and then what to expect.

The difference in stats and ML can be seen in the nature of the theorems in their body of literature.

Lets take the probabilistic setting of ML, because this matches with the assumption a sophisticated statistician would make. ML has techniques where you don't need a stochastic model, those settings are similar to adversarial game theoretic fundamentals that does not involve a stochastic sampling process.

The assumptions:

We have a joint distribution over random variables X,Y and have drawn a sample of size n from it. We have another sample from the same joint distribution but the Ys have been removed and we want to get those back.

The baby statistician's way:

God came to me in my dream and told me every thing about a function f(X, \theta) that would recover the ys. He conveyed infinite amount of information but left out a finite number of parameters (settings of knobs) for me to figure out. I was told that the specific \theta lives in the set \Theta.

The mature statistician's way:

Almost the same as above except that \theta would be living in an infinite dimensional space. In other words he can work with a less powerful god who leaves out infinite amount of information.

The theorems: A statistician would be interested in theorems where they show that their method manages to recover the \theta that god left out.

In other words they would like to find an estimator \hat{\theta} that for an infinitely sized initial sample converges to the true \theta. The crowning accomplishments are when these theorems prove that this happens with probability 1. Weaker ones are those that state that this happens with as high a probability one wants (but never quite reaching 1).

ML way:

A machine learner is not interested in theorems that characterize how close \hat{\theta} is to true \theta. Furthermore asymptotically we are all dead, so he is interested in finite sample results. His point would be that all this god mumbo-jumbo and \theta is a piece of fiction we invented anyway. No one will ever be able see these, even in principle, so he couldn't care less about those. He would care about theorems that show how the function value (and not its parameters) converges uniformly to the best function in a class of functions \cal F_n of his choosing and using finitely many samples. He would also be interested in finding a sequence of function classes so that one can consider richer and richer supersets of function classes as he get more and more data. The regularization business falls out of the function classes he wants to look at. A closely related statistical viewpoint would be those of sieves.

This is not an entirely correct characterization because there existed (and exists) probabilist and statisticians who were interested in prediction theorems rather than parameter recovery theorems, but they were far far fewer in number. An example would be Prof. Dawid.