Hacker News
by frankling_ 750 days ago
That's right, plain autodiff just ignores branches. Our canonical "why is this even needed" example is a program like "if (x >= 0) return 1; else return 0", x being the input.

The autodiff derivative of this is zero, wherever you evaluate it, so if you sample x and run your program on each x as in a classical ML setup, you'd be averaging over a series of zero-derivatives. This is of course not helpful to gradient descent. In more complex programs, it's less blatant, but the gist is that just averaging sampled gradients over a program's (input-dependent!) branches yields biased or zero-valued derivatives. The traffic light optimization example shown on GitHub is a more complex example where averaged autodiff-gradients are always zero.
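To make the zero-derivative point concrete, here's a rough sketch in plain Python. The `Dual` class is a toy forward-mode autodiff value made up for this illustration (it is not DiscoGrad or any real library), and `step` is the program from above. The comparison only looks at the primal value, so whichever branch is taken returns a constant and the tangent is always 0.

```python
import random

class Dual:
    """Toy forward-mode autodiff value: a primal value plus its tangent."""
    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __ge__(self, other):
        # A comparison only inspects the primal value; the tangent is dropped.
        other_val = other.val if isinstance(other, Dual) else other
        return self.val >= other_val

def step(x):
    # The program from the comment: if (x >= 0) return 1; else return 0
    if x >= 0:
        return Dual(1.0)  # constant result, tangent 0
    else:
        return Dual(0.0)  # constant result, tangent 0

def autodiff_derivative(f, x):
    # Seed the input tangent with 1 and read the tangent of the output.
    return f(Dual(x, 1.0)).dot

random.seed(0)
samples = [random.gauss(0.0, 1.0) for _ in range(1000)]
grads = [autodiff_derivative(step, x) for x in samples]
mean_grad = sum(grads) / len(grads)
print(mean_grad)  # 0.0: every sampled gradient is zero, so the average is too
```

No matter how many x's you sample, the average stays at zero, even though the expected output clearly depends on where the samples sit relative to 0.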

2 comments

Plain autodiff gives the correct derivative, but you modify the derivative to push people towards your global minimum?
Thanks, but could you briefly expand on what's happening in the minimal if (x >= 0) case with the DiscoGrad modification? What source code modification could the user have made to achieve the same effect?
In DiscoGrad, smoothing would be applied by adding Gaussian noise with some configurable variance to x and running the program on those x's. The gradient would then be calculated based on the branch condition's derivative wrt. x (which is 1) and an estimate of the distribution of the condition (which is Gaussian).

In this specific example, the smoothed program happens to be exactly the Gaussian cumulative distribution function (and its derivative the Gaussian density), so the user could just replace the program with that function. However, for more complex programs, it'd be hard to find such correspondences manually.
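For the step program the closed form is easy to write down, so here's a small sketch in plain Python (stdlib only). It is not how DiscoGrad computes anything internally; the function names and the values of `x` and `sigma` are just picked for illustration. Adding Gaussian noise with standard deviation `sigma` to x turns the step into the Gaussian CDF, whose derivative is the Gaussian density, and a Monte Carlo average over noisy runs of the original program should land close to that closed form.

```python
import math
import random

def step(x):
    # The original program: if (x >= 0) return 1; else return 0
    return 1.0 if x >= 0 else 0.0

def smoothed_step(x, sigma):
    # E[step(x + sigma * eps)] with eps ~ N(0, 1) equals Phi(x / sigma).
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def smoothed_step_grad(x, sigma):
    # d/dx Phi(x / sigma): the Gaussian density with std sigma, evaluated at x.
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def monte_carlo_smoothed_step(x, sigma, n, rng):
    # Run the unmodified program on n noisy copies of x and average the outputs.
    return sum(step(x + rng.gauss(0.0, sigma)) for _ in range(n)) / n

rng = random.Random(0)
x, sigma = 0.3, 0.5
mc = monte_carlo_smoothed_step(x, sigma, 100_000, rng)
exact = smoothed_step(x, sigma)
print(mc, exact)                     # sampled vs. closed-form smoothed output
print(smoothed_step_grad(x, sigma))  # nonzero, unlike the plain autodiff result
```

The smoothed gradient is nonzero everywhere and largest near the branch point, which is the signal gradient descent was missing.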