| Reverse mode autodiff is best implemented with a non-local transformation of the program: First you run the original operations forward; then you run corresponding operations in reverse order. You can do this with a library by implementing "number" types that, as a side effect of arithmetic operations, record those operations onto a "tape", so that corresponding (different) operations can be played back later in reverse order. Unfortunately, those side effects have a cost during the forward pass. And during the backwards pass you are essentially running a little interpreter to execute your instructions. Putting this stuff into the compiler lets the forward pass just be normal, native code on doubles/floats, and the same for the reverse-pass code. Moreover, all of this can now be worked on by the optimizing compiler. This is especially important for embedded applications where you can't afford all this interpretation. I guess the original TensorFlow "computation graph" approach is a little different to the "tape" I described because the graph is built explicitly rather than through side effects, but that just makes even more clear that you're really assembling an AST for some /other/ language and (when not using XLA) running an interpreter. In principle some suitably powerful macro language with full access to the AST might be able to do good compile-time reverse-mode AD, but I am not aware of any language with powerful enough macros. Again, this is because the transformation is not just a local pattern replacement; it involves flipping the code upside down. What's funny is that physicists have been able to do this kind of program transformation in a slightly clunky(?) way -- FORTRAN in, FORTRAN out -- since the 70s, supposedly. It all makes me think that our ideas about what a "language" is, what a "compiler" is, what a "library" is, are all stuck in convention and prematurely ossified. The first compilers, which we celebrate, looked to their users like codegen (which we detest, I think)! I would love for the boundary between "language" and "compiler" to be broken down more so that AD could be more easily done "within the language"; maybe one day that will happen in Julia, or Nim, or Jai, or Terra. But for now I think I agree with the designers that the best hope for good results, and really the most straightforward way, is to just do it in the compiler. |
[1] http://blog.rogerluo.me/2019/07/27/yassad/
[2] https://docs.julialang.org/en/v1/manual/metaprogramming/inde...
[3] https://mikeinnes.github.io/IRTools.jl/latest/#Evaluating-IR...