| > Most autodiff packages (such as Pytorch) use something not much more advanced than this pytorch absolutely does not use the dual number formulation - there are absolutely no magic epsilons anywhere in pytorch's (or tensorflow's) code base. what you're calling duals are the adjoints, which are indeed stored/cached on every node in pytorch graphs. there's a reason no one uses dual numbers (non-standard analysis) for anything (neither autodiff nor calculus itself): because manipulating infinitesimals like this is fraught formal manipulation (it's algebra...) whereas limits are much more rigorous (bounds, inequalities, convergence, etc.). my favorite question to ask the non-standard analysis n00bs is: please tell me under what conditions this is true (dx/dy)(dy/dz)(dz/dx) = 1 edit: anyone that thinks i'm wrong and this other guy is right should go and do some reading, eg where this guy tried to make this same point and got shot down: https://math.stackexchange.com/a/341550 spoiler alert: there's a reason you had to learn epsilon-delta proofs and limits and it's not because your math professors are mean. this is why i hate this kind of "TIL, gee whiz" math tidbits - they're full of exclamation marks and fancy sounding words ("non-archimedean rings" oooo fancy) but almost always come from a wikipedia level understanding, not actual research. |
Pytorch does things slightly differently in that it is mostly focused on reverse-mode autodiff, and so it stores adjoints relative to the overall output rather than partial derivatives relative to the input, but this isn't really an entirely different thing, in the same way that the FFT isn't entirely different from the DFT.
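To make "adjoints relative to the overall output" concrete, here is a rough sketch of reverse-mode in plain Python. The `Var` class, the `sin` helper, and `backward` are names I made up for illustration; this is not Pytorch's API or its internals, just the same idea at toy scale.

```python
import math

class Var:
    """Node in a toy reverse-mode graph (hypothetical, not Pytorch's API)."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent node, local partial derivative)
        self.adjoint = 0.0      # d(output)/d(this node), filled in by backward()

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))

def backward(output):
    # Build a topological order so each node is processed after all its users.
    order, seen = [], set()
    def visit(node):
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)
    visit(output)

    output.adjoint = 1.0  # d(output)/d(output)
    for node in reversed(order):
        for parent, local in node.parents:
            # Chain rule, accumulated from the output back towards the inputs.
            parent.adjoint += node.adjoint * local

x, y = Var(2.0), Var(3.0)
z = x * y + sin(x)
backward(z)
# x.adjoint now holds dz/dx = y + cos(x); y.adjoint holds dz/dy = x.
```

One backward pass fills in the adjoint of every input at once, which is why reverse-mode suits functions with many inputs and a single output.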
There seems to be some confusion about the relationship between dual numbers and smooth infinitesimal analysis. Both have nilpotent elements, but with dual numbers the background logic is classical, whereas it isn't with smooth infinitesimal analysis.
EDIT: I see you've edited your post to try to get in some extra criticism after I've already responded. That's terrible form, so I'll just respond here.
Dual numbers are a nice way to get started with forward-mode autodiff; the two are so closely related that they are essentially the same thing with different labels. Pytorch instead uses reverse-mode autodiff. Reverse-mode and forward-mode autodiff are different, but not so different that they are entirely different things. Reverse-mode is, as I put it in my OP, "not much more advanced" than forward-mode, even if not identical.
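For anyone who hasn't seen it, this is roughly what "dual numbers are forward-mode autodiff" means, as a minimal sketch in plain Python. The `Dual` class and the `sin` and `f` helpers are hypothetical names for illustration and are not taken from any library.

```python
import math

class Dual:
    """a + b*eps with eps**2 == 0; the eps part carries the derivative."""
    def __init__(self, real, eps=0.0):
        self.real = real
        self.eps = eps

    def __add__(self, other):
        return Dual(self.real + other.real, self.eps + other.eps)

    def __mul__(self, other):
        # (a + b*eps)(c + d*eps) = ac + (ad + bc)*eps, since eps**2 == 0
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)

def sin(d):
    # sin(a + b*eps) = sin(a) + cos(a)*b*eps
    return Dual(math.sin(d.real), math.cos(d.real) * d.eps)

def f(x, y):
    return x * y + sin(x)

# Seed eps = 1 on the input we differentiate with respect to.
dfdx = f(Dual(2.0, 1.0), Dual(3.0, 0.0)).eps  # y + cos(x)
dfdy = f(Dual(2.0, 0.0), Dual(3.0, 1.0)).eps  # x
```

Each evaluation gives the derivative with respect to one seeded input, so forward-mode costs one pass per input. No limits or epsilon-delta arguments appear anywhere; it is the product and chain rules applied mechanically.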
What is entirely different, much more advanced, and what Pytorch really doesn't do, is anything like the "epsilon-delta proofs" you keep hanging your hat on. If Pytorch did that, it would be useless. The entire point of autodiff is to avoid such things.
Beyond that, I would suggest slowing down a bit as you are mixing quite a few things up. Nonstandard analysis has nothing to do with dual numbers at all, for instance. And you're very much misinterpreting that MSE post of mine you linked to (thanks!).