Hacker News
by fpgamlirfanboy 811 days ago
> and if addition and multiplication work the way you expect and propagate derivatives correctly - then you are using dual numbers

you literally started out your miraculous comment with

> This new algebra is called the ring of "dual numbers." The difference is that instead of adding a new element "i" with i² = -1, we add one called "h" with h² = 0!

not some observation about caching derivatives.

so i'll repeat myself for the 3rd time: there are no magical numbers anywhere in pytorch or tensorflow or caffe or any other serious autodiff implementation that abide by the rules you so jubilantly exclaim about.

See my comment to Mike. I think you're making a valid point here, which is that dual numbers by themselves are not powerful enough to automatically generate derivatives of arbitrary functions, especially given that those functions could be implemented in an FPU core, or using methods like lookup tables that don't lend themselves to dual number differentiation.

Dual numbers help us automatically differentiate things when the functions themselves are implemented as analytic power series that we have to explicitly compute without accelerator help. In such cases we can indeed use them. But to your point, serious forward AD engines need to differentiate functions that are computed in one shot by accelerator hardware.
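To make the "power series" case concrete, here is a minimal sketch in Python. The `Dual` class, the `_lift` helper and `exp_series` are names I made up for illustration, not anything from a real library. The point is only that when a function is built from nothing but + and *, the derivative comes along without anyone writing it down.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """a + b*h with h*h = 0; `re` is the value, `eps` the derivative part."""
    re: float
    eps: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.re + other.re, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # (a + b h)(c + d h) = ac + (ad + bc) h, because h*h = 0
        return Dual(self.re * other.re,
                    self.re * other.eps + self.eps * other.re)

    __rmul__ = __mul__


def _lift(x):
    """Treat a plain number as a dual number with zero derivative part."""
    return x if isinstance(x, Dual) else Dual(float(x))


def exp_series(x, terms=30):
    """exp(x) as a truncated Taylor series, using only + and *."""
    total = _lift(1.0)
    term = _lift(1.0)
    for n in range(1, terms):
        term = term * x * (1.0 / n)  # x**n / n!
        total = total + term
    return total


# Seed the derivative part with 1 to ask for d/dx at x = 1.
y = exp_series(Dual(1.0, 1.0))
# Both differences should be close to zero, since exp is its own derivative.
print(y.re - math.e, y.eps - math.e)
```

Nothing in `exp_series` knows about derivatives. That only works because the whole computation is visible as additions and multiplications, which is exactly the assumption that fails for hardware-computed functions.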

However Mike makes a very valid counterpoint when he shows forward mode AD in Torch. I believe a careful analysis of Torch's implementation here could bring this conversation to a productive and satisfying conclusion for all participants and our public audience.

My big question here is to what degree did the implementers try to respect the dual number approach? Did they implement a dual tensor class for instance? Do they automatically lift some ordinary computations into dual tensor computations? I honestly have my doubts there.

I have confidence that we can get to the bottom of this. I think that Mike actually does care about automatic differentiation, and would be receptive to discussing this point of subtlety that naive dual number implementations may not be enough for industrial strength AD systems, with clear examples of code and clear reasoning as to how dual numbers fail in important cases.

Thank you for repeating yourself three times. It seems like you think that the dual number algebra involves "magic woo numbers." It seems like you haven't really worked through this stuff too much. I would suggest reading some of the resources above, such as the MIT lecture series. The rest of your points I think I have already addressed, though you ignored them in your reply - I've said PyTorch does reverse mode diff several times at this point.

> It seems like you haven't really worked through this stuff too much

yup not at all - i just wandered in off the street and knew accidentally that you were talking about non-standard analysis.

> The rest of your points I think I have already addressed

please show me the source line number in pytorch or tensorflow that defines this number

> we add one called "h" with h² = 0!

Sure, right here: https://github.com/pytorch/pytorch/blob/main/torch/autograd/...

Here's the documentation: https://pytorch.org/tutorials/intermediate/forward_ad_usage....

> When an input, which we call “primal”, is associated with a “direction” tensor, which we call “tangent”, the resultant new tensor object is called a “dual tensor” for its connection to dual numbers[0].

This could help settle the objection that torch doesn't implement dual number based Forward Accumulation.

But I'm wondering if it does it by implementing dual tensors and automatically 'lifting' ordinary tensor computations into dual tensor computations? That would be a little surprising to me.

The more common approach I have seen is that we decorate existing operations with additional logic to accumulate and pass on a derivative value as well as the actual value during evaluation. This can be important for instance for transcendental functions, which might be computed with methods like lookup tables and approximate series, which do not necessarily lend themselves to accurate dual number computations, but do have straightforward formulas for the derivative. It can also be a requirement when our transcendentals are computed in the FPU, which does not expose any power series to automatically thread our dual numbers through.

It would make sense in the case of something like pytorch if this were the case, since it could be a bit of a stretch to expect the correct numbers to appear if only we just compute everything with dual numbers. Indeed, the original torch functions certainly exploit the FPU, so we very likely have to explicitly formulate a derivative in at least some cases.
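Here is a rough sketch of what I mean by "decorating" an operation, in plain Python. To be clear, this is not how pytorch does it; `Dual` and `with_derivative` are hypothetical names for illustration. The primitive stays an opaque call into `math.sin`, and the derivative rule is supplied by hand.

```python
import math


class Dual:
    """Value plus tangent (derivative part); only * is implemented here."""

    def __init__(self, value, tangent=0.0):
        self.value = value
        self.tangent = tangent

    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (uv)' = u v' + u' v
        return Dual(self.value * other.value,
                    self.value * other.tangent + self.tangent * other.value)


def with_derivative(derivative):
    """Hypothetical decorator: attach a hand-written derivative to a primitive."""
    def wrap(primitive):
        def lifted(x):
            if isinstance(x, Dual):
                # chain rule: f(a + b h) = f(a) + f'(a) b h
                return Dual(primitive(x.value),
                            derivative(x.value) * x.tangent)
            return primitive(x)
        return lifted
    return wrap


@with_derivative(math.cos)
def sin(x):
    # Opaque to us: computed by libm or the FPU, no series to thread duals through.
    return math.sin(x)


x = Dual(0.5, 1.0)  # tangent 1.0 means "differentiate with respect to x"
y = sin(x) * x      # d/dx [x sin x] = sin x + x cos x
print(y.value, y.tangent)
```

The result still obeys dual number arithmetic, but every primitive had to be extended by hand to get there.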

I wonder if this observation could help heal the rift between the two positions here - it seems like your counterpart could be satisfied with the view that most forward mode AD is not quite as "pat" as just injecting a dual numbers library into existing code, but requires careful extension to accurately accumulate the derivatives of each operation in the system.

I believe that reaching common ground around that fact could help your counterpart reach a satisfying conclusion here. The methods are clearly dual number in spirit, but may require more subtle implementation details than the traditional dual number story, which states "dual numbers get you free derivatives with no need to extend your functions". Pointing out how this does break down for general functions is not only true, but could serve as an olive branch and opportunity to advance this discussion.

Besides that remark, which I made with the intent of resolving a conflict and potentially fostering communication, I wanted to thank you for this amazing link that you shared. I did not know that pytorch had forward mode AD! I may just have to dig into it and see how they pull it off!

You seem somewhat obsessed with the idea that reverse-mode autodiff is not the same technique as forward-mode autodiff. It makes you,,, angry? Seems like such a trivial thing to act a complete fool over.

What's up with that?

Anyway, here's a forward differentiation package with a file that might interest you

https://github.com/JuliaDiff/ForwardDiff.jl/blob/master/src/...

I can't reply to the guy saying julia is the only one. But there are others.

Ceres uses dual numbers

https://github.com/ceres-solver/ceres-solver/blob/master/inc...

This library from google is used everywhere in robotics, so it's hardly some backwater little side project.

So does c++ autodiff https://github.com/autodiff/autodiff/blob/main/autodiff/forw...

So does Eigen: https://eigen.tuxfamily.org/dox/unsupported/AutoDiffScalar_8...

it's amazing to me that pointing out a straight up mathematically factual inaccuracy is considered "angry" and "acting a fool".

> there's a reason no one uses dual numbers (non-standard analysis) for anything (neither autodiff nor calculus itself)

wrong

> there are no [dual] numbers anywhere in [...] any other serious autodiff implementation

wrong

> please show me the source line number

did

> i'm wrong and this other guy is right

correct.

> this is why i hate this kind of "TIL, gee whiz" math tidbits

acting a fool. angrily so.

is this like some kind of version of truman show? the full sentence is

> please show me the source line number in pytorch or tensorflow that defines this number

why? because the original comment makes a claim about pytorch. it's all right there in black and white.