| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by cl3misch 1966 days ago
	I don't see how static AD removes the need to store the network state. Is this a fundamental property of static AD? Also, your statement sounds like pyTorch/TF are doing AD numerically, which is not the case. They build the analytical gradient from the traced computation graph.

2 comments

wsmoses 1966 days ago

Reverse mode AD can always get into situations where it needs to store original values (i.e. network state).

One advantage, however, of doing a more whole-program approach to AD rather than individual operators is that one might be able to avoid caching values unnecessarily. For example if an input isn't modified (and still exists) by the time the value is needed in the reverse pass, you don't need to cache it but can simply use the original input without a copy.

And yes PyTorch/TF tend to perform a (limited) form of AD as well, rather than do numerical differentiation (though I do think there may be an option for numerical?)

I wouldn't really position a tool like Enzyme as a competitor to PyTorch/TF (they may have some better domain-specific knowledge after all), but rather a really nice complement. Enzyme can take derivatives of arbitrary functions, in any LLVM-based language rather than the DSL of operators supported by PyTorch/TF. In fact, we built a plugin for PyTorch/TF that uses Enzyme to import custom foreign code as a differentiable layer!

link

cl3misch 1966 days ago

> One advantage, however, of doing a more whole-program approach to AD rather than individual operators

I was under the impression that the big ML frameworks (and surely JAX with jit) are doing optimization on the complete compute graph, too.

I didn't want to make this discussion too TF/pyTorch focused (I'm not even a ML researcher). But your optimization claims sound like the other AD frameworks are not doing any optimization at all, which is not the case.

I was also thinking about derivatives of functions which are doing something iterative on the inside, like a matrix decomposition (combined with linear solve and/or matrix inversion). While a "high level" AD tracer can identify an efficient derivative of these operations, your LLVM introspection would only be able to compute the derivative through all the internal step of the matrix decomposition?

link

wsmoses 1964 days ago

Oh for sure, any ML framework worth its salt should do some amount of graph rewriting / transformations.

I was (perhaps poorly) trying to explain how while yes AD (regardless of implementation in Enzyme, PyTorch, etc) _can_ avoid caching values using clever tricks, they can't always get away with it. The cache-reduction optimizations really depend on the abstraction level chosen for what tools can do. If a tool can only represent the binary choice of whether an input is needed or not, it could miss out on the fact that perhaps only the first element (and not the whole array/tensor) is needed.

Regarding Enzyme v JaX/etc, again I think that's the wrong way to think about these tools. They solve problems at different levels and in fact can be used together for mutual benefit.

For example a high-level AD tool in a particular DSL might know that algebraically you don't need to compute the derivative of something since from the domain knowledge it is always a constant. Without that domain knowledge, a tool will have to actually compute it. On the other side of the coin, there's no way such a high level AD tool would do all the drudgery of invariant code motion, or even lower level scheduling/register allocation (and see Enzyme paper for reasons why these can be really useful optimizations for AD).

In an ideal world you want to combine all this together and have AD done in part whenever there's some amount of meaningful optimization (and ideally remove abstraction barriers like say a black box call to cudnn). We demonstrate this high and low level AD in a minimal test case against Zygote [high level Julia AD], replacing a scalar code which is something Zygote is particularly bad at. This thus enables both the high level algebraic transformations of Zygote and the low level scalar performance of Enzyme, which is what you'd really want to do.

It looks like the discussion of this has dropped off for now, but I'm sure shoyer would be able to do a much better job of listing interesting high-level tricks JaX does [and perhaps low level ones it misses] as a consequence of its choice of where to live on the abstraction spectrum.

Also thanks for reminding me about matrix decomposition, I actually think there's a decent chance of doing that somewhat nicely at a low level from various loop analyses, but I got distracted by a large fortran code for nuclear particles.

link

6d65 1966 days ago

You might be right, I seem to have confused the pytorch(and tf eager mode) differentiation approaches with numerical methods.

I'm making a Pytorch inspired ML framework, and indeed, each op node, defines also a backward pass, which is a manual definition of a derivative. And going backwards over the ops graph, and combine derivatives for each op via chain rule to get the final gradient, looks indeed like a runtime analytical method rather than a numerical one.

The advantage of an automatic AD is not having to define the backward pass for each op, and the function that calculates the derivative being generated at compile time.

I've left the project marinate a bit, so the little knowledge I had is fading away.

link