I'll forward this to some of my GPU-expert coauthors in the morning to see if they have a take on your questions. I think there are a few interesting facets here, though, so here's my take.

> What's the most notable way the GPU in particular comes into play?

Forward-mode and reverse-mode AD have different expressibility constraints on the kinds of programs that each can efficiently target in the face of dynamism, and GPUs also have fairly constrained programming models compared to the CPU. For me, a big part of the paper was the exploration of the intersection of these two sets of programmability constraints.

Section 2.2.4 and the experimental sections explain some of this in detail, but I think one of the more surprising results was that the benefits of fusing dynamic control flow into the broadcasted derivative kernel outweighed potential detriments such as warp divergence. It turns out newer GPU architectures give you more leeway in that regard than any of us on the team expected.
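To make "fusing dynamic control flow into the broadcasted derivative kernel" a bit more concrete, here's a rough CPU-only sketch in Python. It is not the paper's implementation and not ForwardDiff's API: `Dual`, `dsin`, `f`, and `broadcast_derivative` are names I made up for illustration, and the list comprehension just stands in for a GPU broadcast. The point is only that a data-dependent branch lives inside the per-element function, so the value and its derivative come out of a single pass.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """Hypothetical single-partial dual number: value + partial * epsilon."""
    value: float
    partial: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.partial + other.partial)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule applied to the partial derivative.
        return Dual(self.value * other.value,
                    self.partial * other.value + self.value * other.partial)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants (zero partial)."""
    return x if isinstance(x, Dual) else Dual(float(x), 0.0)


def dsin(x):
    """sin for dual numbers, via the chain rule."""
    return Dual(math.sin(x.value), math.cos(x.value) * x.partial)


def f(x):
    # Data-dependent branch: different elements may take different paths,
    # which is what would show up as warp divergence on a GPU.
    if x.value > 0:
        return x * x
    return dsin(x)


def broadcast_derivative(fn, xs):
    """Apply fn elementwise to seeded duals; one pass yields f(x) and f'(x)."""
    out = [fn(Dual(x, 1.0)) for x in xs]
    return [d.value for d in out], [d.partial for d in out]


values, partials = broadcast_derivative(f, [2.0, -1.0, 0.5])
```

Here the first and third elements take the `x * x` branch and the second takes the `sin` branch, all inside the same "kernel".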

> How does caching come into play?

Depends on which kind of "caching" you're referring to.

If you mean tape-level partial derivative caching/memory usage:

Broadcasting a forward-mode derivative operator, as presented in this paper, can save on memory when it enables better fusion than reverse-mode on complicated kernels (resulting in fewer temporaries).

However, there is also a question of when this technique should actually be employed: during the forward pass, or during the reverse pass? If employed in the forward pass, then the primal and partial derivative calculations can be fused, reducing compute cost. However, doing so means that the memory required to store the partial derivatives is held captive until those derivatives can be backpropagated in the reverse pass. Conversely, employing the technique in the reverse pass allows you to free the partial derivative storage quickly, but at the cost of some redundant computation, since the primal work gets repeated when the partials are computed. Section 2.2.3 of the paper discusses this a bit.
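A toy Python sketch of that tradeoff, under some loud assumptions: the op is a simple elementwise `y = f(x)`, `f_with_partial` is a hand-written stand-in for one dual-number evaluation, and the class names `EagerPartials` and `LazyPartials` are mine, not anything from the paper. The lazy variant also assumes the inputs are kept alive until the reverse pass anyway.

```python
import math


def f_primal(x):
    """Primal-only evaluation of f(x) = x * sin(x)."""
    return x * math.sin(x)


def f_with_partial(x):
    """Stand-in for one dual-number evaluation: returns (f(x), f'(x))."""
    s = math.sin(x)
    return x * s, s + x * math.cos(x)


class EagerPartials:
    """Compute partials in the forward pass, fused with the primal."""

    def forward(self, xs):
        pairs = [f_with_partial(x) for x in xs]
        # Partials stay in memory until backward() runs.
        self.partials = [p for _, p in pairs]
        return [v for v, _ in pairs]

    def backward(self, ybar):
        xbar = [g * p for g, p in zip(ybar, self.partials)]
        self.partials = None  # only now can the storage be released
        return xbar


class LazyPartials:
    """Compute partials in the reverse pass, right before they are used."""

    def forward(self, xs):
        self.xs = xs  # assumption: inputs are retained regardless
        return [f_primal(x) for x in xs]

    def backward(self, ybar):
        # Re-evaluates f (redundant primal work), but each partial is
        # consumed immediately and never stored across passes.
        return [g * f_with_partial(x)[1] for g, x in zip(ybar, self.xs)]


xs = [0.5, 1.0, 2.0]
ybar = [1.0, 1.0, 1.0]
eager, lazy = EagerPartials(), LazyPartials()
ys_eager, ys_lazy = eager.forward(xs), lazy.forward(xs)
xbar_eager, xbar_lazy = eager.backward(ybar), lazy.backward(ybar)
```

Both variants produce the same outputs and gradients; they differ only in when the partials exist in memory and how many times `f` is evaluated.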

If you mean instruction-level caching, i.e. efficient pipelining of memory into registers:

On the CPU, it's quite easy to thrash cache for high-arity dual number calculations (i.e. calculations where dual number instances carry around a long stack-allocated array of partial derivatives). Our experiment in Section 3.4.1 tries to characterize the analogous GPU behavior by measuring how occupancy scales with target calculation arity.
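For anyone unfamiliar with what "high-arity" means here, a small Python sketch (again illustrative only: `MultiDual`, `seed`, and `product` are hypothetical names, and a Python tuple is just a stand-in for a stack-allocated array). Each dual number carries one partial per input, so every intermediate value in an N-argument calculation drags along 1 + N floats.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class MultiDual:
    """Hypothetical dual number carrying one partial per input variable."""
    value: float
    partials: Tuple[float, ...]

    def __mul__(self, other):
        # Product rule, applied to every partial slot.
        return MultiDual(
            self.value * other.value,
            tuple(self.value * q + p * other.value
                  for p, q in zip(self.partials, other.partials)),
        )


def seed(xs):
    """Seed input i with the i-th unit vector of partials."""
    n = len(xs)
    return [
        MultiDual(x, tuple(1.0 if i == j else 0.0 for j in range(n)))
        for i, x in enumerate(xs)
    ]


def product(args):
    """Arity-N target calculation: multiply all arguments together."""
    acc = args[0]
    for a in args[1:]:
        acc = acc * a
    return acc


result = product(seed([2.0, 3.0, 4.0]))
floats_per_intermediate = 1 + len(result.partials)  # grows with arity
```

With three inputs, `result.partials` holds the full gradient of the product, and every intermediate along the way is four floats wide instead of one.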

Also, there was definitely a bit of implementation work to ensure that loads from our GPU-backed "dual number" arrays coalesced properly, that indexing calculations were compiled away when possible, etc. The cool part is that the dual numbers themselves were just the implementation provided by the ForwardDiff package (https://github.com/JuliaDiff/ForwardDiff.jl), which contains no GPU-specific specialization, and they're automagically JIT-compiled for the GPU by CUDAnative (https://github.com/JuliaGPU/CUDAnative.jl).

> What about intrinsic condensing functions?

Hmm...I'm not positive I know what "intrinsic condensing functions" are. Apologies!