Hacker News new | ask | show | jobs
by delluminatus 3809 days ago
Great post, as an AD tutorial and as a (an?) Haskell exercise. Having known nothing about AD before, I feel like I have a good understanding of what it is -- as he says, it's so simple -- but I don't understand why the algorithm is so much faster. Just looking at the differentiator function and the AD function, it actually appears that the AD should take longer because it does more computation per step (both the function and the derivative). But it seems like every article or paper is talking about how to implement AD, not why the algorithm is so efficient. Does anyone happen to know of a good article or paper about that? Ideally, one just as nice and comprehensible as this!
4 comments

The first alternative builds a large tree structure, and then evaluates the whole tree structure afterwards.

So first it blows up the size of the expression to process and then it calculates the result. A lot of those calculations will be redundant

The second one not only avoids evaluating the tree separately, but "prunes" a lot of the evaluation automatically by effectively short-circuiting the whole process. Consider (with the caveat that my Haskell understanding is rudimentary at best and it was about 20 years since I last did anything involving symbolic differentiation) :

   Product e e' -> Product e (diff e') `Sum` Product (diff e) e'
Followed by a separate run with:

    Product e e' -> ev e * ev e'
vs

    Product e e' -> let (ex, ed)   = ev e
                            (ex', ed')       = ev e'
                    in (ex * ex', ex * ed' + ed * ex')
(I pick the "Product" rule as an example because it is one of the ones that blows up the size of the tree)

Lets say you do something simple like Product X X. You get Sum (Product X One) (Product One X) out, and then you have to evaluate each node.

In the second case, you match Product e e'. You process X and assign (x,1) to (ex,ed), and process the second X and assign (x,1) to (ex', ed'), and then return (ex * ex', ex * ed' + ed * ex').

In the first case, you've first differentiated 3 nodes (Product + 2x "X"), then evaluated the 7 nodes that was produced as output, for a total of ten nodes processed.

In the second you've evaluated/differentatiated 3 nodes in one go without the intermediate step of having to evaluate a much larger tree.

In a large example, the number of nodes in the differentiated output quickly explodes and subsequent evaluation would increase rapidly in cost.

Memoizing the symbolic differentiator ought to eliminate the blowup, as I commented in the other thread the other day.
From an asymptotic complexity viewpoint, I don't see any difference between the two algorithms (AD versus building an expression tree and doing it symbolically, then evaluating). Both are linear in the "size" of the expression. So I don't understand what you mean by "quickly explodes".
See here:

http://h2.jaguarpaw.co.uk/posts/why-is-naive-symbolic-differ...

In summary, yes both functions are linear, but the size of the symbolic derivative is quadratic.

The result of the symbolic derivation grows faster than linearly, as you can see from the Product rule for example.

As a result, while the evaluation is linear with input tree size, the tree it is evaluating is typically much larger than the input expression.

The clearest article I've seen on the subject is this one:

https://colah.github.io/posts/2015-08-Backprop/

> as a (an?) Haskell exercise

In american english, this would be "a Haskell exercise," pronouncing "a" to rhyme with "hey" or "ah." In british english, it would be "an 'askell exercise."

I'm american, though. British people please weigh in.

> I don't understand why the algorithm is so much faster

I hope this helps!

http://h2.jaguarpaw.co.uk/posts/why-is-naive-symbolic-differ...