|
|
|
|
|
by frankling_
750 days ago
|
|
That's right, plain autodiff just ignores branches. Our canonical "why is this even needed" example is a program like "if (x >= 0) return 1; else return 0", x being the input. The autodiff derivative of this is zero, wherever you evaluate it, so if you sample x and run your program on each x as in a classical ML setup, you'd be averaging over a series of zero-derivatives. This is of course not helpful to gradient descent. In more complex programs, it's less blatant, but the gist is that just averaging sampled gradients over programs (input-dependent!) branches yields biased or zero-valued derivatives. The traffic light optimization example shown on Github is a more complex example where averaged autodiff-gradients are always zero. |
|