by cbgb 4075 days ago
Just for the record, from the first page: "Background material needed for an undergraduate course has been put in the appendix. For this reason, the appendix has homework problems."

The appendix covers Probability and Linear Algebra.

My point was that a book with the title "Foundations of Data Science" should be mostly probability and statistics. Undergrad probability and linear algebra is not a solid foundation in statistics.

Statistics is a powerful lens through which to view all data science. E.g. supervised learning is building a model of the conditional probability P(y|x). Again, I am biased, but I think that methods that do not have some statistical interpretation are unlikely to be useful. E.g. if we take the graph of Facebook users and apply some matrix decomposition algorithm, who cares? What can we do with this decomposition? What does it predict?
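To make the "model of P(y|x)" framing concrete, here is a minimal sketch that estimates the conditional distribution by relative frequency for a single discrete feature. The data and the names `fit_conditional` and `predict` are made up for illustration; a real model would generalize across feature values rather than just count.

```python
from collections import Counter, defaultdict


def fit_conditional(pairs):
    """Estimate P(y | x) from (x, y) pairs by relative frequency."""
    counts = defaultdict(Counter)
    for x, y in pairs:
        counts[x][y] += 1
    return {
        x: {y: n / sum(ys.values()) for y, n in ys.items()}
        for x, ys in counts.items()
    }


def predict(model, x):
    """Return the label with the highest estimated P(y | x)."""
    dist = model[x]
    return max(dist, key=dist.get)


# Hypothetical toy data: (feature, label) pairs.
data = [
    ("rainy", "late"), ("rainy", "late"), ("rainy", "on_time"),
    ("sunny", "on_time"), ("sunny", "on_time"), ("sunny", "late"),
]
model = fit_conditional(data)
```

Here `model["rainy"]` is the estimated distribution over labels given `x = "rainy"`, and `predict` simply picks its mode.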

While I would like to agree with this based on aesthetics, it didn't work that way in practice. A lot of the most successful machine learning algorithms do not have a statistical grounding, or did not when they were invented. E.g. neural networks, SVMs, low-rank matrix approximation, k-means, decision trees/forests.

Yes, the core algorithms do have this tendency (which is the exciting thing about machine learning per se), but statistics provides the context to understand them.

E.g. people used to say "neural networks are a simple, flexible functional form for y = f(X,theta)". This turned out to be wrong: SGD training of neural networks has more advantages than the flexibility of the functional form. But it was a good hypothesis and starting point.

SVMs and decision trees have no statistical justification that I know of. Low-rank matrix approximation and k-means are justified by latent variables and non-parametric kernel methods respectively. I agree these justifications came after the fact, but they do give a way to understand how these models work.
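For reference, k-means as usually stated (Lloyd's algorithm) is purely geometric: assign each point to its nearest center, then move each center to the mean of its points. A minimal 1-D sketch follows; the function name `kmeans_1d`, the data, and the initial centers are assumptions for illustration.

```python
def kmeans_1d(points, centers, iters=10):
    """Lloyd's algorithm on 1-D data, starting from the given centers."""
    centers = list(centers)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster.
        # An empty cluster keeps its previous center.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers


# Hypothetical toy data with two obvious groups.
points = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]
centers = kmeans_1d(points, centers=[0.0, 6.0])
```

Nothing in the loop mentions a probability model; any statistical reading is layered on top of this procedure afterwards.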

Most importantly, all of the small tasks surrounding training a model are purely statistical, e.g. cross validation, different measures of accuracy, handling endogenous variables, etc.
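As an illustration of one of those surrounding tasks, here is a minimal k-fold cross validation sketch in plain Python. The "model" is deliberately trivial (it predicts the mean of the training targets), the helper names are hypothetical, and the folds are contiguous and unshuffled, which you would normally not want on ordered data.

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    folds = []
    start = 0
    for i in range(k):
        # The first n % k folds get one extra element.
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_mse(ys, k):
    """Cross-validated MSE of a model that predicts the training mean."""
    n = len(ys)
    scores = []
    for test_idx in k_fold_indices(n, k):
        held_out = set(test_idx)
        train = [ys[i] for i in range(n) if i not in held_out]
        mean = sum(train) / len(train)
        mse = sum((ys[i] - mean) ** 2 for i in test_idx) / len(test_idx)
        scores.append(mse)
    # Average the per-fold errors into a single estimate.
    return sum(scores) / len(scores)


# Hypothetical toy targets.
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
cv_mse = cross_val_mse(ys, k=3)
```

Each fold is held out exactly once, so every observation contributes to the error estimate without ever being used to fit the model that scores it.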

There are two camps: one for which it is all about points, and a second for which it is all about odds. When they fail, the first fails at validation and the second at performance. Obviously both mostly use the same tools and tricks, though with different ideologies, and overall they produce comparable results.

> Undergrad probability and linear algebra is not a solid foundation in statistics.

Are you trying to argue that statistics is its own field and not built upon mathematics?

Clarification: probability and linear algebra are the foundations of statistics. Statistics is the foundation of machine learning.

If I have not studied statistics I will not have a solid foundation in statistics, even if I have studied probability and linear algebra.