| HN Mirror



	by frankling_ 750 days ago
	That's right, plain autodiff just ignores branches. Our canonical "why is this even needed" example is a program like "if (x >= 0) return 1; else return 0", with x being the input. The autodiff derivative of this is zero wherever you evaluate it, so if you sample x and run your program on each x as in a classical ML setup, you'd be averaging over a series of zero derivatives. This is of course not helpful to gradient descent. In more complex programs it's less blatant, but the gist is that just averaging sampled gradients over a program's (input-dependent!) branches yields biased or zero-valued derivatives. The traffic light optimization example shown on GitHub is a more complex case where the averaged autodiff gradients are always zero.
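
To make the zero-derivative point concrete, here is a minimal sketch in Python using a hand-rolled forward-mode dual number. This is not DiscoGrad code; the names `Dual`, `step`, `averaged_gradient` and `averaged_output` are made up for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Dual:
    """Forward-mode autodiff value: val carries f(x), dot carries df/dx."""
    val: float
    dot: float


def step(x: Dual) -> Dual:
    # The branch is decided by x.val alone. Both branches return a
    # constant, so the derivative part is 0.0 whichever branch is taken.
    if x.val >= 0:
        return Dual(1.0, 0.0)
    return Dual(0.0, 0.0)


def averaged_gradient(mean: float, sigma: float, n: int, seed: int = 0) -> float:
    """Average the per-sample autodiff derivatives over sampled inputs."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = Dual(rng.gauss(mean, sigma), 1.0)  # seed dx/dx = 1
        total += step(x).dot
    return total / n


def averaged_output(mean: float, sigma: float, n: int, seed: int = 0) -> float:
    """Average the program output over the same kind of samples."""
    rng = random.Random(seed)
    return sum(step(Dual(rng.gauss(mean, sigma), 1.0)).val for _ in range(n)) / n


if __name__ == "__main__":
    print(averaged_gradient(0.0, 1.0, 2000))   # 0.0: every sample contributes zero
    print(averaged_output(-1.0, 1.0, 2000))    # mostly the else-branch
    print(averaged_output(1.0, 1.0, 2000))     # mostly the if-branch
```

The averaged output clearly changes as the input mean moves across zero, yet the averaged autodiff derivative stays exactly zero, which is the mismatch described above.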

2 comments

Y_Y 750 days ago

Plain autodiff gives the correct derivative, but you modify the derivative to push people towards your global minimum?


HarHarVeryFunny 750 days ago

Thanks, but could you briefly expand on what's happening in the minimal if (x >= 0) case with the DiscoGrad modification? What source code modification could the user have made to achieve the same effect?


frankling_ 750 days ago

In DiscoGrad, smoothing would be applied by adding Gaussian noise with some configurable variance to x and running the program on those perturbed x's. The gradient would then be calculated from the branch condition's derivative w.r.t. x (which is 1) and an estimate of the distribution of the condition (which is Gaussian).
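
A rough sketch of that recipe for the minimal example, in Python. This is only an illustration of the idea as described here, not DiscoGrad's actual implementation; the function name `smoothed_gradient`, the Gaussian fit via sample mean and standard deviation, and the default sample count are all assumptions.

```python
import math
import random
import statistics


def smoothed_gradient(x: float, sigma: float, n: int = 10_000, seed: int = 0) -> float:
    """Estimate d/dx E[step(x + noise)] for step(x) = 1 if x >= 0 else 0."""
    rng = random.Random(seed)

    # Evaluate the branch condition c(x) = x on the perturbed inputs.
    cond = [x + rng.gauss(0.0, sigma) for _ in range(n)]

    # Estimate the distribution of the condition as a Gaussian.
    mu = statistics.fmean(cond)
    sd = statistics.stdev(cond)

    # Density of the fitted Gaussian at the branch threshold c = 0.
    density_at_threshold = math.exp(-0.5 * (mu / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

    dcond_dx = 1.0    # derivative of the condition w.r.t. x
    jump = 1.0 - 0.0  # output of the if-branch minus output of the else-branch

    return jump * density_at_threshold * dcond_dx


if __name__ == "__main__":
    for x in (-1.0, 0.0, 1.0):
        print(x, smoothed_gradient(x, sigma=1.0))
```

Unlike the plain autodiff derivative, this estimate is nonzero and largest near x = 0, where a small change in x is most likely to flip the branch.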

In this specific example, the smoothed program happens to be exactly the Gaussian cumulative distribution function (and its derivative the Gaussian density), so the user could just replace the program with that function. However, for more complex programs, it'd be hard to find such correspondences manually.
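
For the minimal example, the manual replacement could look roughly like the following Python sketch. The names `smooth_step`, `smooth_step_grad` and `monte_carlo_smooth_step` are hypothetical, and sigma stands in for the configurable noise level. The expected output of the step program under Gaussian input noise is the Gaussian CDF, and its derivative is the Gaussian density.

```python
import math
import random


def step(x: float) -> float:
    """The original branching program."""
    return 1.0 if x >= 0 else 0.0


def smooth_step(x: float, sigma: float) -> float:
    """Closed form of E[step(x + noise)], noise ~ N(0, sigma^2): the Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))


def smooth_step_grad(x: float, sigma: float) -> float:
    """Derivative of smooth_step w.r.t. x: the Gaussian density."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def monte_carlo_smooth_step(x: float, sigma: float, n: int = 20_000, seed: int = 0) -> float:
    """Sampling-based estimate of the same expectation, for comparison."""
    rng = random.Random(seed)
    return sum(step(x + rng.gauss(0.0, sigma)) for _ in range(n)) / n


if __name__ == "__main__":
    x, sigma = 0.5, 1.0
    print(smooth_step(x, sigma), monte_carlo_smooth_step(x, sigma))
    print(smooth_step_grad(x, sigma))
```

The closed form and the sampled average should agree up to sampling noise, and `smooth_step_grad` gives gradient descent a usable, nonzero signal.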
