


by formulaT 4075 days ago
My point was that a book with the title "Foundations of Data Science" should be mostly probability and statistics. Undergrad probability and linear algebra is not a solid foundation in statistics. Statistics is a powerful lens through which to view all data science. E.g. supervised learning is building a model of the conditional probability P(y|x). Again, I am biased, but I think that methods that do not have some statistical interpretation are unlikely to be useful. E.g. if we take the graph of Facebook users and apply some matrix decomposition algorithm, who cares? What can we do with this decomposition? What does it predict?
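
To make the P(y|x) view concrete, here is a minimal sketch that fits a logistic model by gradient descent on the negative log-likelihood. The synthetic data, the names `fit_logistic` and `true_w`, and the step-size settings are all assumptions for illustration; nothing here comes from the book under discussion.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.5, steps=2000):
    """Fit weights w so that sigmoid(X @ w) approximates P(y=1 | x)."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)
        # Gradient of the mean negative log-likelihood of a Bernoulli model.
        grad = X.T @ (p - y) / len(y)
        w -= lr * grad
    return w

# Synthetic data (assumed): one feature plus an intercept column.
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_w = np.array([-0.5, 2.0])  # hypothetical "true" parameters
y = (rng.random(n) < sigmoid(X @ true_w)).astype(float)

w_hat = fit_logistic(X, y)
p_hat = sigmoid(X @ w_hat)  # estimated P(y=1 | x) for each row
print(w_hat)
```

The output of the model is a probability for each x rather than a bare label, which is the sense in which the classifier is "a model of P(y|x)".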

3 comments

jules 4075 days ago

While I would like to agree with this based on aesthetics, it didn't work that way in practice. A lot of the most successful machine learning algorithms do not have a statistical grounding or did not when they were invented. E.g. neural networks, SVM, low rank matrix approximation, k-means, decision trees/forests.
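
For reference, low rank matrix approximation in its purely linear-algebraic form can be sketched as a truncated SVD, with no probabilistic model involved. The helper name `low_rank_approx` and the synthetic matrix are assumptions for illustration.

```python
import numpy as np

def low_rank_approx(A, r):
    """Best rank-r approximation of A in Frobenius norm, via truncated SVD."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only the r largest singular values and their singular vectors.
    return (U[:, :r] * s[:r]) @ Vt[:r]

# Synthetic matrix (assumed): rank-3 signal plus small noise.
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3)) @ rng.normal(size=(3, 40))
A = A + 0.01 * rng.normal(size=(50, 40))

A3 = low_rank_approx(A, 3)
rel_err = np.linalg.norm(A - A3) / np.linalg.norm(A)
print(rel_err)
```

The procedure only minimizes reconstruction error; any statistical reading, such as latent factors, is an interpretation layered on top.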


formulaT 4074 days ago

Yes, the core algorithms do have this tendency (which is the exciting thing about machine learning per se), but statistics provides the context to understand them.

E.g. people used to say "neural networks are a simple, flexible functional form for y = f(X,theta)". This turned out to be wrong: SGD training of neural networks has more advantages than the flexibility of the functional form. But it was a good hypothesis and starting point.

SVMs and decision trees have no statistical justification that I know of. Low-rank matrix approximation and k-means are justified by latent variables and non-parametric kernel methods respectively. I agree these justifications came after the fact, but they do give a way to understand how these models work.

Most importantly, all of the small tasks surrounding training a model are purely statistical, e.g. cross validation, different measures of accuracy, handling endogenous variables, etc.
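
As one example of those surrounding tasks, a minimal k-fold cross-validation sketch is below. It assumes a least-squares model and synthetic data, and the function name `k_fold_mse` is hypothetical; the point is only the split, fit, and held-out scoring loop.

```python
import numpy as np

def k_fold_mse(X, y, k=5, seed=0):
    """Return the held-out mean squared error of a least-squares fit, per fold."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))      # shuffle before splitting
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training folds only.
        w, *_ = np.linalg.lstsq(X[train], y[train], rcond=None)
        # Score on the fold that was held out.
        resid = y[test] - X[test] @ w
        scores.append(float(np.mean(resid ** 2)))
    return scores

# Synthetic regression data (assumed): noise standard deviation 0.5.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=200)

scores = k_fold_mse(X, y)
print(np.mean(scores))
```

The average held-out error estimates how the fitted model would do on new data drawn from the same distribution, which is a statistical question regardless of which algorithm produced the fit.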


mbq 4075 days ago

There are two camps: one for which it is all about points, and a second for which it is all about odds. When they fail, the first fails at validation, while the second fails at performance. Obviously both mostly use the same tools and tricks, though with different ideologies, and overall they produce comparable results.


brational 4075 days ago

> Undergrad probability and linear algebra is not a solid foundation in statistics.

Are you trying to argue that statistics is its own field and not built upon mathematics?


formulaT 4075 days ago

Clarification: probability and linear algebra are the foundations of statistics. Statistics is the foundation of machine learning.

If I have not studied statistics I will not have a solid foundation in statistics, even if I have studied probability and linear algebra.
