by cl3misch
1966 days ago
I don't see how static AD removes the need to store the network state. Is this a fundamental property of static AD? Also, your statement sounds like PyTorch/TF are doing AD numerically, which is not the case. They build the analytical gradient from the traced computation graph.
One advantage, however, of taking a more whole-program approach to AD, rather than differentiating individual operators, is that one might be able to avoid caching values unnecessarily. For example, if an input hasn't been modified (and still exists) by the time its value is needed in the reverse pass, you don't need to cache it; you can simply use the original input without a copy.
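To make the caching point concrete, here is a rough hand-written sketch in plain Python (not Enzyme or PyTorch code). The function names `forward_with_cache`, `forward_no_cache`, and `backward` are hypothetical, and the "analysis" that proves the input is unmodified is simply assumed rather than implemented.

```python
def forward_with_cache(xs):
    """Operator-style AD: defensively copy the input for the reverse pass."""
    saved = list(xs)  # extra allocation and copy
    y = sum(x * x for x in xs)
    return y, saved


def forward_no_cache(xs):
    """Whole-program style. ASSUMPTION: some analysis has shown that xs is
    not mutated before the reverse pass, so a reference is enough."""
    y = sum(x * x for x in xs)
    return y, xs  # same object, no copy


def backward(saved, grad_y):
    """Reverse pass for y = sum(x_i^2): dy/dx_i = 2 * x_i."""
    return [2.0 * x * grad_y for x in saved]


xs = [1.0, 2.0, 3.0]
y_cached, saved_copy = forward_with_cache(xs)
y_direct, saved_ref = forward_no_cache(xs)
grads_cached = backward(saved_copy, 1.0)
grads_direct = backward(saved_ref, 1.0)
```

Both variants produce the same gradient; the second just skips the copy. If `xs` were overwritten between the forward and reverse passes, only the cached variant would stay correct, which is why the optimization depends on knowing what the rest of the program does.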
And yes, PyTorch/TF tend to perform a (limited) form of AD as well, rather than numerical differentiation (though I do think there may be an option for numerical?).
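For anyone unsure about the AD vs. numerical distinction, a toy comparison in plain Python may help. This is a minimal forward-mode sketch using dual numbers, not how PyTorch/TF are implemented internally; the `Dual` class, `f`, and the step size `h` are all made up for illustration.

```python
class Dual:
    """Value plus derivative, propagated exactly through + and *."""

    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    @staticmethod
    def _lift(other):
        return other if isinstance(other, Dual) else Dual(float(other))

    def __add__(self, other):
        other = Dual._lift(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual._lift(other)
        # Product rule: (uv)' = u v' + u' v
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

    __rmul__ = __mul__


def f(x):
    return x * x * x + 2 * x  # f'(x) = 3x^2 + 2


# AD: seed the input derivative with 1 and read off the result.
ad_grad = f(Dual(2.0, 1.0)).dot

# Numerical: central finite difference, which carries truncation
# and round-off error that depends on the step size h.
h = 1e-5
fd_grad = (f(2.0 + h) - f(2.0 - h)) / (2 * h)
```

The AD result comes from applying the derivative rules to each operation, so it is exact up to floating point, while the finite-difference result is only an approximation.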
I wouldn't really position a tool like Enzyme as a competitor to PyTorch/TF (they may have some better domain-specific knowledge, after all), but rather as a really nice complement. Enzyme can take derivatives of arbitrary functions in any LLVM-based language, rather than only the DSL of operators supported by PyTorch/TF. In fact, we built a plugin for PyTorch/TF that uses Enzyme to import custom foreign code as a differentiable layer!