by jules 4067 days ago
There is quite a big difference between statistics and machine learning. A lot of the most successful machine learning algorithms do not have a statistical grounding or did not when they were invented. E.g. neural networks, SVM, low rank matrix approximation, k-means, decision trees/forests. Statistics is one of the tools in the machine learning toolbox.

SVMs were invented by a couple of statisticians/mathematicians in the 60s. k-means also harkens back to the 60s, by mathematicians and control theorists. Decision trees and random forests were invented by a famous statistician, with the latter related to bootstrapping, a statistical technique. PCA and factor analysis, forms of or closely related to low rank matrix approximation, were pioneered in the early 1900s by some of the most famous statisticians ever.
Something that was invented by a statistician is not necessarily statistics, and that applies even more to something invented by a mathematician. I guess with a broad enough notion of statistics some of these would fall in the field of statistics, but if something does not use at least one probability distribution it's probably far-fetched to classify it as statistics.

It would be a lot more fair to classify machine learning as a subfield of convex optimization. Yet even that classification does not quite fit, so it makes most sense to just accept that it's a separate field which uses techniques from statistics, convex optimization, computer science, and more.

But, to look at one example: neural networks. Neural networks may have been inspired by attempts to recreate the structure of the biological nervous system, but the way in which they are used commonly, e.g. "learning" via back-propagation, is really just a statistical regression for a gigantic equation with many free variables.
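To make the "regression with many free variables" reading concrete, here is a minimal sketch in plain Python. Everything in it is an assumption chosen for illustration, not anything from the thread: a one-input network with four tanh hidden units, a sin(x) target, and hypothetical helper names (`init_params`, `forward`, `mse`, `train`). The loop is ordinary gradient descent on mean squared error, with the gradients computed by the chain rule, which is all back-propagation amounts to here.

```python
import math
import random

# Hypothetical toy setup: one input, a few tanh hidden units, one linear output.
def init_params(hidden=4, seed=0):
    rng = random.Random(seed)
    return {
        "w1": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b1": [0.0] * hidden,
        "w2": [rng.uniform(-1, 1) for _ in range(hidden)],
        "b2": 0.0,
    }

def forward(p, x):
    """Return the prediction and the hidden activations for one input."""
    h = [math.tanh(w * x + b) for w, b in zip(p["w1"], p["b1"])]
    y_hat = sum(w * hj for w, hj in zip(p["w2"], h)) + p["b2"]
    return y_hat, h

def mse(p, data):
    """Mean squared error: the same loss a least-squares regression minimizes."""
    return sum((forward(p, x)[0] - y) ** 2 for x, y in data) / len(data)

def train(p, data, lr=0.01, epochs=2000):
    """Full-batch gradient descent; gradients come from the chain rule."""
    n, hidden = len(data), len(p["w1"])
    for _ in range(epochs):
        g_w1 = [0.0] * hidden
        g_b1 = [0.0] * hidden
        g_w2 = [0.0] * hidden
        g_b2 = 0.0
        for x, y in data:
            y_hat, h = forward(p, x)
            d_out = 2.0 * (y_hat - y) / n          # d(loss)/d(y_hat)
            g_b2 += d_out
            for j in range(hidden):
                g_w2[j] += d_out * h[j]
                # Propagate the error back through tanh: d tanh(z)/dz = 1 - tanh(z)^2
                d_hid = d_out * p["w2"][j] * (1.0 - h[j] ** 2)
                g_w1[j] += d_hid * x
                g_b1[j] += d_hid
        for j in range(hidden):
            p["w1"][j] -= lr * g_w1[j]
            p["b1"][j] -= lr * g_b1[j]
            p["w2"][j] -= lr * g_w2[j]
        p["b2"] -= lr * g_b2
    return p

# Made-up training data: 21 points of sin(x) on [-2, 2].
data = [(x / 5.0, math.sin(x / 5.0)) for x in range(-10, 11)]
params = init_params()
initial_loss = mse(params, data)
train(params, data)
final_loss = mse(params, data)
print(initial_loss, final_loss)
```

The 13 numbers in `params` are the "free variables"; nothing in the fitting procedure differs in kind from nonlinear least squares, only in the shape of the function being fitted.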

My preferred term is "predictive analytics," which I feel kind of straddles statistics and machine learning, and also serves as a nod to a common difference -- "statistical" methods often yield understanding, while "machine learning" methods are often opaque to human insight but yield predictions.
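A rough illustration of the "yield understanding" half of that distinction, as a sketch only: a closed-form least-squares line fit in plain Python. The data and the helper name `fit_line` are made up for the example. The point is that the two fitted numbers can be read directly as a statement about the data, which is much harder to do with the weights of a trained network.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
slope, intercept = fit_line(xs, ys)

# The coefficients are directly interpretable:
# "each extra unit of x is associated with `slope` more units of y".
print(f"slope={slope}, intercept={intercept}")
```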

I feel annoyed by the opaqueness of ML algorithms like neural networks. I hope ML doesn't unwittingly define itself as a field where machines learn but humans may not. I'm referring to predicaments like the story about 42 from The Hitchhiker's Guide to the Galaxy.
That's definitely an interesting problem. Just note that in many cases we're not even interested in learning the tasks ourselves. For example, you don't need any person to actually know that consumers aged 25-29 prefer a certain product 10% more than consumers aged 21-25, and so on.

But humans are still the ones responsible for important high level decisions, so it still makes sense to maximize information transparency to enable good decisions in those contexts.

A neural network that gave a prediction 'X is most likely' and could answer the question "Why?" with 'Because Y' would be amazing.