by HarHarVeryFunny
753 days ago
What do you mean by plain autodiff being mostly useless with normal/discrete branching? Wouldn't branches normally just be "ignored" by autodiff - each training sample being a different computational graph (but with parts in common) due to branching points, so the only effect of branching is which computational graph gets executed and backpropagated through? What's the general type of use case where this default behavior is useless, and "non-discrete" (stochastic?) branching helps?
The autodiff derivative of this is zero wherever you evaluate it, so if you sample x and run your program on each x, as in a classical ML setup, you'd be averaging over a series of zero derivatives. That is of course no help to gradient descent. In more complex programs it's less blatant, but the gist is that just averaging sampled gradients over a program's (input-dependent!) branches yields biased or zero-valued derivatives. The traffic light optimization example shown on GitHub is a more complex case where the averaged autodiff gradients are always zero.
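
To make the zero-gradient effect concrete, here is a minimal sketch. It is not the example the comment refers to: the `program` function, the tiny forward-mode `Dual` class, and the choice of Gaussian noise around a parameter `theta` are all assumptions made up for illustration. Each sampled run takes one branch and returns a constant, so its derivative is exactly zero, yet the expected output P(x > 0) clearly depends on `theta`.

```python
import math
import random


class Dual:
    """Minimal forward-mode autodiff value: val carries the value, dot the derivative."""

    def __init__(self, val, dot=0.0):
        self.val = val
        self.dot = dot

    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)

    __radd__ = __add__

    def __gt__(self, other):
        # Comparisons only look at the value; the derivative plays no part
        # in deciding which branch runs.
        other_val = other.val if isinstance(other, Dual) else other
        return self.val > other_val


def program(x):
    # Hypothetical branching program: the output is constant on each branch,
    # so d(output)/dx is 0 on whichever branch is taken.
    if x > 0:
        return Dual(1.0)
    return Dual(0.0)


def averaged_autodiff_gradient(theta, n=10_000, seed=0):
    """Average the per-sample derivative d program(x) / d theta over sampled x."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        # x = theta + noise, so dx/dtheta = 1 is seeded in the dual part.
        x = Dual(theta, 1.0) + rng.gauss(0.0, 1.0)
        total += program(x).dot
    return total / n


def true_gradient(theta):
    # E[program(x)] = P(theta + noise > 0) = Phi(theta), whose derivative
    # is the standard normal density evaluated at theta.
    return math.exp(-0.5 * theta ** 2) / math.sqrt(2.0 * math.pi)


if __name__ == "__main__":
    print("averaged autodiff gradient:", averaged_autodiff_gradient(0.5))
    print("gradient of the expectation:", true_gradient(0.5))
```

The averaged per-sample gradient comes out as exactly zero no matter how many samples are drawn, while the gradient of the expectation is positive. All of the dependence on `theta` sits in which branch gets taken, and that is the part sample-wise autodiff never sees.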