Hacker News
by Muximize 2733 days ago
Isn't this also what the Julia team is trying to do? https://julialang.org/blog/2018/12/ml-language-compiler

We are building tooling which will allow you to put together differential equations and neural networks in any way you please. This is just one way of putting the two together. The applications we were looking at are different, but they utilize similar tooling. We have forward-mode, reverse-mode, forward sensitivity analysis, and adjoint sensitivity analysis (the method mentioned here, which actually dates from at least the 1990s) implemented and tested with the DifferentialEquations.jl integrators. We had a recent paper exploring the timing differences between the different sensitivity analysis (gradient calculation) methods:

https://arxiv.org/abs/1812.01892

The neural ODE falls into the case where the number of parameters is large, and so the traditional adjoint methods seem to do best there when optimized. So in some sense, we aren't "trying" to do it, but have had library implementations for these gradient calculations for a few years now. The documentation is here: http://docs.juliadiffeq.org/latest/analysis/sensitivity.html
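
To make the forward-versus-adjoint distinction concrete, here is a minimal sketch in plain Python. It is not the DifferentialEquations.jl implementation. The toy ODE du/dt = -p*u, the loss L = u(T), the fixed-step RK4 integrator, and all function names are assumptions chosen for illustration. Forward sensitivity adds one extra equation per parameter to the forward solve. The adjoint approach does a single backward solve whose size does not grow in that way, which is why it tends to win when the parameter count is large.

```python
import math

def rk4(rhs, y, t0, t1, steps):
    """Fixed-step RK4 on a list-valued state; integrates backward if t1 < t0."""
    h = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        k1 = rhs(t, y)
        k2 = rhs(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k1)])
        k3 = rhs(t + h / 2, [yi + h / 2 * k for yi, k in zip(y, k2)])
        k4 = rhs(t + h, [yi + h * k for yi, k in zip(y, k3)])
        y = [yi + h / 6 * (a + 2 * b + 2 * c + d)
             for yi, a, b, c, d in zip(y, k1, k2, k3, k4)]
        t += h
    return y

def forward_sensitivity(p, u0, T, steps=200):
    """Solve u together with s = du/dp, where ds/dt = (df/du)*s + df/dp."""
    def rhs(t, y):
        u, s = y
        return [-p * u, -p * s - u]
    u, s = rk4(rhs, [u0, 0.0], 0.0, T, steps)
    return u, s  # s is dL/dp because L = u(T)

def adjoint_sensitivity(p, u0, T, steps=200):
    """Solve forward for u(T), then integrate (u, lambda, g) from T back to 0."""
    uT = rk4(lambda t, y: [-p * y[0]], [u0], 0.0, T, steps)[0]
    def rhs(t, y):
        u, lam, g = y
        # dlambda/dt = -lambda * df/du = p * lambda
        # dg/dt      = -lambda * df/dp = lambda * u
        return [-p * u, p * lam, lam * u]
    _, _, grad = rk4(rhs, [uT, 1.0, 0.0], T, 0.0, steps)  # lambda(T) = dL/du(T) = 1
    return uT, grad

p, u0, T = 1.5, 2.0, 1.0
exact = -T * u0 * math.exp(-p * T)  # closed-form dL/dp for this toy problem
print(forward_sensitivity(p, u0, T)[1], adjoint_sensitivity(p, u0, T)[1], exact)
```

Both routes should agree with the closed-form gradient up to integration error. This sketch re-solves the state backward in time to supply u(t) to the adjoint equation, which is only one of several ways to do that.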

But we are heading in new directions with this as well. The traditional adjoints seem to do best here because of inefficiencies with tracing-based reverse-mode autodiffs like the kind found in Flux.jl or Autograd. Basically, those kinds of tape-based autodiffs rely on each operation being expensive, which isn't true for the internals of an ODE solver (but tends to be true for applications like neural nets where large matrix multiplications dominate). This means that the overhead of building the tape is really noticeable, and thus it doesn't do well in this application. Additionally, the tape is dependent on the input because the trace of the operations is dependent on what branches are taken. This means that, even if the ODE solver is fully optimized and every piece is JIT compiled, the backprop tape itself cannot be compiled and saved. Again, this isn't an issue for large matmuls, but this is another factor that comes into play when the operations are not sufficiently costly.
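
The input dependence of the tape can be seen in a toy tracer. This is a minimal sketch of tape-based reverse mode, not how Flux.jl or Autograd are implemented. The `Tape` and `Var` classes and the branching function `f` are hypothetical names made up for this example.

```python
class Tape:
    def __init__(self):
        # Each entry: (op name, output Var, [(input Var, local partial), ...])
        self.ops = []

class Var:
    def __init__(self, value, tape):
        self.value = value
        self.tape = tape
        self.grad = 0.0

    def _record(self, name, value, parents):
        out = Var(value, self.tape)
        self.tape.ops.append((name, out, parents))  # per-operation bookkeeping
        return out

    def __mul__(self, other):
        return self._record("mul", self.value * other.value,
                            [(self, other.value), (other, self.value)])

    def __neg__(self):
        return self._record("neg", -self.value, [(self, -1.0)])

def backprop(tape, output):
    """Walk the recorded tape in reverse, accumulating gradients."""
    output.grad = 1.0
    for _name, out, parents in reversed(tape.ops):
        for parent, partial in parents:
            parent.grad += partial * out.grad

def f(x):
    # The branch is chosen from the runtime value, so the trace differs by input.
    if x.value > 0:
        return x * x * x
    return -x

def trace_and_grad(x0):
    tape = Tape()
    x = Var(x0, tape)
    y = f(x)
    backprop(tape, y)
    return [name for name, _, _ in tape.ops], x.grad

print(trace_and_grad(2.0))   # tape recorded for the positive branch
print(trace_and_grad(-1.0))  # a different tape for the other branch
```

The two calls record different tapes, so a tape built for one input cannot simply be compiled and reused for another. Every scalar operation also pays for an allocation and a list append, which is cheap next to a large matmul but not next to a single multiply.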

However, the blog post you link to discusses source-to-source AD as an alternative. Source-to-source builds the backprop code directly from the full source, so it covers both sides of each branch rather than only the one a particular input happened to take. This means you can AD the source and then compile it. This methodology solves the issues we found with tracing-based autodifferentiation for this application, so I'll be happy when we get to testing it. Zygote.jl isn't quite ready for source-to-source AD on the full DifferentialEquations.jl code, but it's getting close.
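
As a rough illustration of the idea, here is a hand-written stand-in for the kind of code a source-to-source tool might generate for a small branching function. It is not actual Zygote.jl output, and the name `f_with_pullback` is hypothetical. The point is only that the derivative code is ordinary source containing both branches, so it can be compiled once and reused for any input.

```python
def f_with_pullback(x):
    """Return f(x) and a pullback mapping dL/dy to dL/dx.

    Original function (assumed for illustration):
        f(x) = x*x*x  if x > 0
        f(x) = -x     otherwise
    """
    if x > 0:
        v = x * x          # forward pass keeps the intermediates it needs
        y = v * x

        def pullback(ybar):
            vbar = ybar * x        # from y = v * x, with respect to v
            xbar = ybar * v        # from y = v * x, with respect to x
            xbar += vbar * 2 * x   # from v = x * x
            return xbar
    else:
        y = -x

        def pullback(ybar):
            return -ybar           # from y = -x
    return y, pullback

y_pos, back_pos = f_with_pullback(2.0)
y_neg, back_neg = f_with_pullback(-1.0)
print(y_pos, back_pos(1.0))
print(y_neg, back_neg(1.0))
```

No tape is built at runtime here. The branch is resolved by ordinary control flow inside code that already exists before any input is seen.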