by CuriouslyC 3223 days ago
You absolutely need a solid grounding in multi-variable calculus, linear algebra, probability theory and information theory. It will also be helpful to be well versed in graph theory.

In my opinion one of the best starting points is "Information Theory, Inference and Learning Algorithms" by David MacKay. It's a bit long in the tooth now, but it is still one of the most approachable and well-written books in the field.

Another old book that stands up very well is "Probability Theory: the Logic of Science" by E. T. Jaynes.

"Elements of Statistical Learning" by Tibshirani is also good.

"Bayesian Data Analysis" by Andrew Gelman is another great read.

"Deep Learning" by Ian Goodfellow and Yoshua Bengio is useful for getting caught up with recent advances in that field.

15 comments

Free PDFs of some of the books mentioned:

"Information Theory, Inference and Learning Algorithms" by David MacKaye

http://www.inference.org.uk/itprnn/book.pdf

"Probability Theory: the Logic of Science" by E. T. Jaynes

http://www.med.mcgill.ca/epidemiology/hanley/bios601/Gaussia...

"Elements of Statistical Learning" by Tibshirani

https://web.stanford.edu/~hastie/Papers/ESLII.pdf

"Bayesian Data Analysis" by Andrew Gelman

http://hbanaszak.mjr.uw.edu.pl/TempTxt/(Chapman%20&%20Hall_C...

Note that only MacKay (that’s the correct spelling) and Hastie/Tibshirani/Friedman are legally available online.

edit: Goodfellow/Bengio/Courville, not mentioned in the previous comment, is also available online: http://www.deeplearningbook.org

I'm not super interested in ML but I am very interested in applied mathematics in computer science. I've got a fair bit of linear algebra due to cryptography, but have had virtually no need of any form of calculus (unless I'm relying on it without knowing it) in my career.

So beyond just saying that you'd need grounding in multivariable calculus to do serious ML work, I would be super interested in hearing more about why that is and what kinds of problems crop up in ML that demand it.

Calculus essentially discusses how things change smoothly and it has a very nice mechanism for talking about smooth changes algebraically.

A system which is at an optimum will, at that exact point, be no longer increasing or decreasing: a metal sheet balanced at the peak of a hill rests flat.

Many problems in ML are optimization problems: given some set of constraints, what choice of unknown parameters minimizes error? This can be very hard (NP-hard) in general, but if you design your situation to be "smooth" then you can use calculus and its very nice set of algebraic solutions.

You also need multivariate calculus because typically while you're only trying to minimize "error", you do so by changing many, many parameters at once. This means that you've got to talk about smooth changes in a high-dimensional space.
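
To make that concrete, here is a minimal sketch of gradient descent moving two parameters at once. The error function, step size and iteration count are made up for illustration; nothing here comes from a particular library.

```python
# Minimal sketch: gradient descent on a smooth two-parameter "error" surface.
# The surface, step size and iteration count are illustrative assumptions.

def error(w1, w2):
    # A smooth bowl whose minimum sits at (3, -1).
    return (w1 - 3.0) ** 2 + 2.0 * (w2 + 1.0) ** 2

def gradient(w1, w2):
    # Partial derivatives of error with respect to each parameter.
    return 2.0 * (w1 - 3.0), 4.0 * (w2 + 1.0)

def gradient_descent(w1, w2, step=0.1, iterations=200):
    # Repeatedly take a small step against the gradient in both coordinates.
    for _ in range(iterations):
        g1, g2 = gradient(w1, w2)
        w1 -= step * g1
        w2 -= step * g2
    return w1, w2

w1, w2 = gradient_descent(0.0, 0.0)
print(w1, w2)           # close to 3 and -1
print(gradient(w1, w2)) # both partial derivatives are close to 0 at the optimum
```

With thousands or millions of parameters the loop is the same; only the gradient gets longer.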

--

The other side of calculus is integration which talks about "measuring" how big things are. Most of probability is discussing very generalized ratios: of the total, "how big is this piece" is analogous to "what are the odds this will happen".

The general discussion of measure is complex and essentially the only tool to tackle it involves gigantic (infinite, really) sums of small, well-behaved pieces to form a complex whole.

It just happens to turn out (and this is the big secret of calculus) that this machinery (integration) is dual to the study of smooth changes and you can knock them both out together.

--

So ultimately, ML hinges upon being able to measure things (integration) and talk about how they change (differentiation). Those two happen to be the same concept in a way and they are essentially what you study in calculus.

A lot of probability theory requires it. For instance, ML is largely framed mathematically as a series of optimisation problems, which are then solved by finding the gradient and performing gradient descent; this requires elementary calculus to calculate the gradient.

Additionally, if you want to calculate a probability given a density function, or evaluate an expectation, you need to calculate several integrals. This arises quite often in the theoretical sections of ML papers/textbooks.
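
As a rough illustration of both integrals, the sketch below numerically integrates a standard normal density to get a probability and an expectation. The midpoint rule, the interval bounds and the helper names are my own assumptions for the example.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    # Density of a normal distribution with mean mu and standard deviation sigma.
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(f, a, b, n=100_000):
    # Midpoint rule: sum the areas of n thin rectangles between a and b.
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

# Probability from a density: P(-1 <= X <= 1) is the area under the curve.
prob = integrate(normal_pdf, -1.0, 1.0)

# Expectation: E[X^2] is the integral of x^2 times the density.
# The range [-10, 10] stands in for the whole real line (the tails are negligible).
mean_sq = integrate(lambda x: x * x * normal_pdf(x), -10.0, 10.0)

print(prob)     # approximately 0.683
print(mean_sq)  # approximately 1.0
```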

The use of calculus in ML is probably similar to the use of number theory in crypto- you can do applied work fine without it, but you understand the work a lot better by knowing the math, and are less likely to make dumb mistakes.

Most of ML is fitting models to data. To fit a model you minimise some error measure as a function of its real valued parameters, e.g. the weights of the connections in a neural network. The algorithms to do the minimisation are based on gradient descent, which depends on derivatives, i.e. differential calculus.

If you're doing Bayesian inference you're going to need integral calculus because Bayes' law gives the posterior distribution as an integral.
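
A small sketch of where that integral shows up, assuming a coin with unknown bias theta, a uniform prior, and 7 heads in 10 flips (all of these are made-up choices for the example). The integral is the normalising constant in the denominator of Bayes' law, approximated here on a grid.

```python
def likelihood(theta, heads, flips):
    # Probability of the observed flips given bias theta (up to a constant factor).
    return theta ** heads * (1.0 - theta) ** (flips - heads)

def posterior_on_grid(heads, flips, n=10_000):
    h = 1.0 / n
    grid = [(i + 0.5) * h for i in range(n)]
    prior = 1.0  # uniform prior density on [0, 1] (an assumption of this sketch)
    unnormalised = [likelihood(t, heads, flips) * prior for t in grid]
    # The denominator of Bayes' law: the integral of likelihood * prior over theta.
    evidence = h * sum(unnormalised)
    density = [u / evidence for u in unnormalised]
    return grid, density, h

grid, density, h = posterior_on_grid(heads=7, flips=10)

# Any posterior summary is again an integral, e.g. the posterior mean.
posterior_mean = h * sum(t * d for t, d in zip(grid, density))
print(posterior_mean)  # close to 8/12, the mean of the exact Beta(8, 4) posterior
```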

For ML you just need Calculus 1 and 2. Curl, divergence and Stokes' theorem are Calculus 3, which is a physics thing. You don't need that for ML.

You may need the basics of functional analysis in certain areas of ML, which is arguably Calculus 4.

Could not agree more .......

> Most of ML is fitting models to data. To fit a model you minimise some error measure as a function of its real valued parameters, e.g. the weights of the connections in a neural network. The algorithms to do the minimisation are based on gradient descent, which depends on derivatives, i.e. differential calculus.

> If you're doing Bayesian inference you're going to need integral calculus because Bayes' law gives the posterior distribution as an integral.

The most obvious thing is understanding back-propagation. Backprop is pretty much all partial derivatives / chain rule manipulations. Also a lot of machine learning involves convex optimization which entails some calculus.
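
For a feel of what "chain rule manipulations" means, here is a sketch of backprop by hand on a deliberately tiny network (one input, one tanh hidden unit, one output). The network shape, the squared-error loss and the numbers are assumptions chosen for the example, and the result is checked against a finite-difference estimate.

```python
import math

def forward(w1, w2, x):
    h = math.tanh(w1 * x)  # hidden activation
    y = w2 * h             # network output
    return h, y

def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return (y - target) ** 2

def backprop(w1, w2, x, target):
    # Walk backwards from the loss, multiplying local derivatives (chain rule).
    h, y = forward(w1, w2, x)
    dL_dy = 2.0 * (y - target)
    dL_dw2 = dL_dy * h            # y = w2 * h, so dy/dw2 = h
    dL_dh = dL_dy * w2            # and dy/dh = w2
    dh_dz = 1.0 - h * h           # derivative of tanh(z) is 1 - tanh(z)^2
    dL_dw1 = dL_dh * dh_dz * x    # z = w1 * x, so dz/dw1 = x
    return dL_dw1, dL_dw2

def numeric_grad(w1, w2, x, target, eps=1e-6):
    # Central finite differences, used only to sanity-check the chain rule.
    d1 = (loss(w1 + eps, w2, x, target) - loss(w1 - eps, w2, x, target)) / (2 * eps)
    d2 = (loss(w1, w2 + eps, x, target) - loss(w1, w2 - eps, x, target)) / (2 * eps)
    return d1, d2

analytic = backprop(0.5, -1.5, 2.0, 1.0)
numeric = numeric_grad(0.5, -1.5, 2.0, 1.0)
print(analytic)
print(numeric)  # should agree with the analytic gradient to several decimal places
```
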
Much of ML is optimization. This is linked to calculus by derivatives. There is the simple part that at a minimum or maximum the derivative is 0. However, gradient descent is more relevant in practice. It depends very heavily on calculating derivatives, and it's one of the most universal fast optimization methods.

Beyond that, for iterative methods, convergence is a matter of limits. This again is calculus. Formulating iteration as repeatedly applying a function, the iterates converge to a fixed point from nearby starting values if the derivative at that fixed point lies strictly between -1 and 1, and they are pushed away from it if the derivative's magnitude exceeds 1. Again derivatives come in.
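
A quick sketch of the fixed-point picture, using two functions picked for illustration: cos(x), whose derivative at its fixed point has magnitude below 1, and the logistic map 3.2·x·(1 − x), whose derivative at its fixed point 0.6875 is −1.2.

```python
import math

def iterate(f, x0, steps):
    # Apply f repeatedly, starting from x0.
    x = x0
    for _ in range(steps):
        x = f(x)
    return x

# |d/dx cos(x)| = |sin(x)| is about 0.67 at the fixed point, so iteration converges.
x_cos = iterate(math.cos, 1.0, 100)
print(x_cos, math.cos(x_cos))  # the two values agree: x_cos is a fixed point

def logistic(x):
    return 3.2 * x * (1.0 - x)

# 0.6875 is a fixed point of logistic, but the derivative there is -1.2,
# so a start just next to it drifts away instead of settling.
fixed_point = 0.6875
x_log = iterate(logistic, fixed_point + 0.001, 100)
print(abs(x_log - fixed_point))  # no longer small
```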

Finally, for error estimation, Taylor expansions are often useful. Again, the topic here is calculus. Notably, all I can think of regards limits and derivatives, not integrals. That might just be due to my hatred of integrals though.

I have a pretty good math background, but understanding K-L divergence ([0], a measure of the difference between two probability distributions) required revisiting some calculus. It's needed for understanding models with probabilistic output, used in both generative models and reinforcement learning.

[0] https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diver...

Almost every corner of an ML problem has an optimization problem that needs to be solved: There is a function that you want to minimize subject to constraints. Typically these are everywhere smooth, or sometimes almost everywhere smooth. So calculus shows up in (i) algorithms to find the bottom of these functions (if they exist) or (ii) deriving the location of the minima in closed form. These functions would be "how close am I to the correct parameter", "What losses would these settings rake up on average" etc etc.

The reason why this differs from a purely optimization / mathematical programming problem is that we can only approximately evaluate the actual function (the performance of our model on new / unseen data) that we care to optimize. Great optimization algorithms need not be (and often are not) good ML algorithms. In ML we have to optimize a function that's getting revealed to us slowly, one datapoint at a time. The true function typically involves a continuum of datapoints. This is where we can bring probability into the picture (another option is to treat it as an adversarial game with nature). In the probabilistic approach, we make the assumption that the function being revealed to us is in some probabilistic proximity of the true function and the sample is closing in on it slowly. We have to be careful not to be too eager to model the revealed function; our goal is to optimize the function where these revealed functions are ultimately headed.

Those things aside, if you have to choose just one prereq, I think it has to be linear algebra and you already have that in your bag. Without it, a lot of multivariate calculus will not make much sense anyway. Then one can push things a little bit and go for the linear algebra where your vectors have infinite dimension. This becomes important because often your data would have far more information than you can encode in a finite dimensional vector. Thankfully a lot of intuition carries over to infinite dimension (except when it does not). This goes by the name functional analysis. Not absolutely essential, but a lack of intuition here can hold you back from doing certain kinds of work. You will just get a better (at times spatial or geometric) understanding of the picture, etc.

Other than their motivating narratives, there is not much difference between probability/stats and information theory. There is a one-to-one mapping between many if not all of their core problems. A lot of this applies to signal processing too. Many of the problems that we are stuck at in these domains are the same. Sometimes a problem seems better motivated in one narrative over the other. Some will call it finding the best code for the source, others will call it parameter estimation, yet others will call it learning.

Or, if I may paraphrase for the CS audience, blame the reals \mathbb{R}. Otherwise it would have been the problem of reverse engineering a noisy Turing machine that we can access only through its input and output. Pretty damn hard even if we don't get into reals. In those situations you could potentially get by without calculus; algebra by itself should go a long way, but as I said it gets frigging hard. Learning even the lowly regular expression from examples is hard. Calculus would still be helpful because many combinatorial / counting problems that come up can be dealt with using generating function techniques, where you would run into integral calculus with complex numbers.

Very well said ..

> Almost every corner of an ML problem has an optimization problem that needs to be solved: There is a function that you want to minimize subject to constraints. Typically these are everywhere smooth, or sometimes almost everywhere smooth. So calculus shows up in (i) algorithms to find the bottom of these functions (if they exist) or (ii) deriving the location of the minima in closed form. These functions would be "how close am I to the correct parameter", "What losses would these settings rake up on average" etc etc.

> The reason why this differs from a purely optimization / mathematical programming problem is that we can only approximately evaluate the actual function (the performance of our model on new / unseen data) that we care to optimize. Great optimization algorithms need not be (and often are not) good ML algorithms. In ML we have to optimize a function that's getting revealed to us slowly, one datapoint at a time. The true function typically involves a continuum of datapoints. This is where we can bring probability into the picture

The optimization techniques required to actually fit models are almost all powered by some form of gradient descent, and integration is usually required in truly probabilistic models to go from a density function to predictions.
Eh, optimization and the occasional bits of analysis show up in ML more often than traditional vector-calculus.
All of statistics and machine learning involves lots of integrals and derivatives. For example: expected values are integrals, and model fitting is done by hill climbing in the direction of the derivative.
> "Bayesian Data Analysis" by Andrew Gelman is another great read.

If you want to read that book you need real analysis, more specifically measure theory (unless you count that subject as part of probability theory). You cannot get into the last few chapters without it. Dirichlet processes are described using measures.

I don't believe you need multivar calc or info theory. Info theory is used, but not as often. I believe you're slanted toward a researcher/PhD position. Gini index, entropy, etc. are taken as given when needed.

You don't really need measure theory. It's true that the last chapter in the book (in the 3rd edition) uses measure theory, but it's the only one.

http://andrewgelman.com/2017/08/02/seemingly-intuitive-low-m...

My recollection is that you need neither real analysis nor measure theory to appreciate it, but it's been a while since I read it. You might get more out of it if you have studied those.

I disagree on multivar calc. Statistics often makes use of matrix derivatives. I have found it helpful to know.
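
One standard example of such a matrix derivative is ordinary least squares (the choice of example and the symbols X, y, w are mine, not from the comment above). With design matrix X, targets y and weights w:

```latex
\begin{aligned}
L(w) &= \lVert Xw - y \rVert^2 = (Xw - y)^\top (Xw - y), \\
\nabla_w L(w) &= 2\, X^\top (Xw - y), \\
\nabla_w L(w) = 0 \;&\Longrightarrow\; X^\top X\, w = X^\top y .
\end{aligned}
```

The last line is the normal equations; getting there takes nothing more than the multivariable chain rule written in matrix form.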

What's required as a prereq to Measure Theory? Any suggestions on good resources for learning Measure Theory? I have a vague notion that Probability and Measure Theory are intertwined / related somehow, but have never studied the latter specifically.
The relationship is that measure theory provides the theoretical framework for making probability theory rigorous.

The only formal prerequisite for learning measure theory is that you should know series and sequences. For a reference, I'm not so sure, maybe Halmos's book. The important parts are probably:

- Monotone convergence theorem

- Dominated convergence theorem

- The construction of the Lebesgue integral

- Fubini's theorem and Tonelli's theorem

I would probably try not to get bogged down in details of construction of measures (unless you like that) and take the Lebesgue measure (essentially length) as given. Also check out the Radon-Nikodym theorem which states that we can always (ish) work with density functions.

The typical prerequisite for measure theory is a two-semester real analysis course, a la Rudin or any of its alternatives (I particularly like Pugh's book). A solid topological background is also a good idea, although you can probably get away with whatever you learned in real analysis. Two standard measure theory texts are Folland's Real Analysis and the first half of Rudin's Real and Complex Analysis.
In measure-theoretic terms, probability theory is the study of measures whose total mass is 1. There are some good resources that mtzet mentions, but I just wanted to note that a lot of the integration terminology which you take for granted reading about probability theory is formally defined in measure theory. It's also very nice for making signal processing math more formal.
I'm taking a measure theory course right now, and we primarily use some set theory and some topology of R^n.
Great class and great professor. One of my favorite classes from my degree.
I disagree that you need a solid grounding in information theory. Almost all that I've seen about IT in ML is minimizing the KL divergence, which can be learned by browsing the wiki page.
Well, information theory isn't much more than the logarithm of probability theory, so it doesn't hurt to learn it anyway. The only thing you need to know is that given a probability distribution P there exists a compression scheme to encode a value X with a message of P_length(X) = log(1/P(X)) bits. This can be summarised as BITS = log(1/PROBABILITY). Entropy is just the average number of bits you need to encode a random value from distribution P with the compression scheme of distribution P, i.e. E_P[P_length(X)]. The KL(P,Q) divergence is when you encode a random value from distribution P with the compression scheme of distribution Q. Say you're compressing English text but you're using a compressor tailored to Spanish. The KL divergence is how many extra bits you need (on average) compared to encoding the English text with the English compressor:

KL(P,Q) = E_P[Q_length(X)] - E_P[P_length(X)]
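
A small sketch of the BITS = log(1/PROBABILITY) reading, using two made-up three-symbol distributions. The dictionaries P and Q and the helper names are illustrative assumptions, not anything from a library.

```python
import math

def code_length(probability):
    # BITS = log2(1 / PROBABILITY)
    return math.log2(1.0 / probability)

def entropy(P):
    # E_P[P_length(X)]: average bits using the code matched to P.
    return sum(p * code_length(p) for p in P.values() if p > 0)

def cross_entropy(P, Q):
    # E_P[Q_length(X)]: average bits when data from P is coded with Q's scheme.
    return sum(p * code_length(Q[x]) for x, p in P.items() if p > 0)

def kl_divergence(P, Q):
    # Extra bits paid, on average, for using the wrong code.
    return cross_entropy(P, Q) - entropy(P)

P = {"a": 0.5, "b": 0.25, "c": 0.25}  # the "English" source
Q = {"a": 0.25, "b": 0.25, "c": 0.5}  # the "Spanish" compressor's model

print(entropy(P))           # 1.5 bits per symbol
print(cross_entropy(P, Q))  # 1.75 bits per symbol
print(kl_divergence(P, Q))  # 0.25 extra bits per symbol
```

Note that kl_divergence(P, P) is 0: the matched compressor wastes nothing.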

> information theory isn't much more than the logarithm of probability theory

stealing

It depends. All that is essential for an automobile engineer is not essential for a taxi driver.
Or maybe: all that is essential for a molecular biologist isn't necessary for a general practitioner? It's just... those conference calls where you're explaining that just because the classifier is working really well now doesn't mean that we can use it in production, those calls can get difficult and annoying, and sometimes the "other side" wins - with predictable results.

ha ha ha!

You bring up a very important point and a difficult one which is, if the decision making is in the hands of someone who does not understand the nuances too well nor has the time or inclination, what do you do ?

If your salary is going to depend on how many models you pushed out and not how well they continued to perform, many will optimize over the number of models pushed out.

A major source of problems (and sometimes a gift) is that you cannot prove an empirical statistical claim true or false in finite time. There is always this non-zero probability that the weirdest thing would happen. It could be just sheer bad luck that the model did so poorly in this cycle.

That's not because you need little background in information theory. That's because KL-divergences are such a universal info-theoretic quantity that if you deeply understand them, you understand much to most of information theory.

This is like saying, "You don't need to really know calculus, just integrals."

Information theory is pretty central to model selection.
Information theory and probability are basically the same thing.
You can actually get the latest edition of Elements of Statistical Learning for free as a (legal) pdf from the author!

https://web.stanford.edu/~hastie/ElemStatLearn/

I disagree about the graph theory as well. Unless you are doing things with learning on networks you won't need it.

I think a solid background in linear algebra, multivariate calculus, and convex optimization will take you really far.

A lot of data is best represented graphically, and while you can shoehorn this sort of data into a vector space by projecting using a graph distance metric, the results are likely to be inferior.
I agree a lot of data can be represented graphically, but if you look at the literature it mostly is getting shoehorned into vector spaces. This doesn't mean people shouldn't learn about Graph Laplacians and friends, but I don't think it's an entry requirement.
Probability Theory: the Logic of Science is mindblowing, not a page turner, but if you can digest it, it is very good.
For calculus, I'd skip the more physics-like practice of finding integrals and derivatives by hand. What matters is understanding the concepts of integrals and derivatives, and knowing properties like the chain rule. It pays much less to know that the integral of 1/x is ln(x) (or the other way round).

The linear algebra and probability theory are most important imho. I'd also distinguish between probability theory and statistics. Both are important, but they are distinct disciplines.

These are all brilliant books, but I feel like anyone who is ready for them wouldn't need to be asking this question.
Do you have any good guides for the calculus required to do ML? Is it just the basic Calc AB from high school?
What about the Pattern Recognition book by Bishop? I am reading it now and it's more approachable than the Elements of Statistical Learning book
+1 for "Elements of Statistical Learning", this is the basis for most rigorous intro to ML classes
Game Theory would probably be more valuable to understand than graph theory. Just my 2 cents.
What about "Pattern Recognition and Machine Learning" by Christopher Bishop?
you mean i cant just bang out some ipythons and the matrix forms around me?

thanks for the list! the only roadblock i've ran into getting into many of these topics are book prices :O usually they are pretty steep

Actually, the MacKay, Jaynes and Goodfellow books are available for free online. Enjoy!