| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by HarHarVeryFunny 753 days ago
	What do you mean by plain autodiff being mostly useless with normal/discrete branching? Wouldn't branches normally just be "ignored" by autodiff - each training sample being a different computational graph (but with parts in common) due to branching points, so the only effect of branching is which computational graph gets executed and backpropagated through? What's the general type of use case where this default behavior is useless, and "non-discrete" (stochastic?) branching helps?

1 comments

frankling_ 753 days ago

That's right, plain autodiff just ignores branches. Our canonical "why is this even needed" example is a program like "if (x >= 0) return 1; else return 0", x being the input.

The autodiff derivative of this is zero, wherever you evaluate it, so if you sample x and run your program on each x as in a classical ML setup, you'd be averaging over a series of zero-derivatives. This is of course not helpful to gradient descent. In more complex programs, it's less blatant, but the gist is that just averaging sampled gradients over programs (input-dependent!) branches yields biased or zero-valued derivatives. The traffic light optimization example shown on Github is a more complex example where averaged autodiff-gradients are always zero.

link

Y_Y 753 days ago

Plain autodiff gives the correct derivative, but you modify the derivative to push people towards your global minimum?

link

HarHarVeryFunny 753 days ago

Thanks, but could you briefly expand on what's happening in the minimal if (x >= 0) case with the discograd modification? What source code modification could the user could have made to achieve the same effect?

link

frankling_ 753 days ago

In DiscoGrad, smoothing would be applied by adding Gaussian noise with some configurable variance to x and running the program on those x's. The gradient would then be calculated based on the branch condition's derivative wrt. x (which is 1) and an estimate of the distribution of the condition (which is Gaussian).

In this specific example, the smoothed derivative happens to be exactly the Gaussian cumulative distribution function, so the user could just replace the program with that function. However, for more complex programs, it'd be hard to find such correspondences manually.

link