by scarecrow112
2418 days ago
The post lists automatic differentiation as one of the techniques that can overcome non-differentiable loss functions. Can someone explain how this is even possible? After all, automatic differentiation[1] is a way to compute gradients/derivatives that would be costly to obtain by other means (e.g. symbolic differentiation[2]). The function, or the operations defined in the function (in the case of source-to-source differentiation[3]), still needs to be differentiable.
[1] - https://www.youtube.com/watch?v=z8GyNneq5D4
[2] - https://stackoverflow.com/a/45134000
[3] - https://github.com/google/tangent
EDIT: 1. Added reference. 2. Formatting
|
The original TensorFlow works the same way, but instead of running the graph every time, it embeds the non-differentiable control mechanism in the code of the graph, which can more efficiently (without needing the host language to build a new one every time) create the correct differentiable tape for each run based on its input. Source-to-source differentiation works exactly the same way, except that instead of having to use a DSL (like the TensorFlow graph API) and compile it, it simply uses the host language and compiler directly (so you don't effectively need two languages). This is the case for Julia's Zygote and Swift for TensorFlow.
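To make the "differentiable tape for each run" idea concrete, here is a minimal pure-Python sketch of a tape-style reverse-mode pass. It is not how TensorFlow, Zygote or Swift for TensorFlow are implemented; `Var`, `f` and `grad_f` are made-up names for illustration. The point is only that the `if` is decided by the input and never differentiated, so each run differentiates just the branch it actually took.

```python
class Var:
    """A scalar that remembers how it was computed (a tiny tape node)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __mul__(self, other):
        other = other if isinstance(other, Var) else Var(other)
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    __rmul__ = __mul__  # multiplication is commutative, so reuse __mul__

    def backward(self, seed=1.0):
        # Accumulate the incoming gradient, then push it to the parents.
        # (Path-by-path recursion: fine for a sketch, slow for big graphs.)
        self.grad += seed
        for parent, local in self.parents:
            parent.backward(seed * local)


def f(x):
    # Ordinary host-language control flow: the comparison is not recorded
    # on the tape, only the arithmetic of the branch that runs is.
    if x.value > 0:
        return x * x
    return -3.0 * x


def grad_f(x0):
    x = Var(x0)        # a fresh tape is recorded on every call
    f(x).backward()
    return x.grad


print(grad_f(2.0))   # derivative of x*x at x = 2
print(grad_f(-1.0))  # derivative of -3*x at x = -1
```

The result is a piecewise derivative: correct inside each branch, with nothing said about what happens exactly at the switch point `x = 0`.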
The only alternative to this piecewise differentiation that I know of would be creating a soft version of the discrete operators, for example replacing step functions with sigmoids and case/switch/elsif operators with softmax selectors. That is not what any of those libraries do (it would not be easy to make it converge, as the graph would be much more complex at each backward pass). In that case you could have one single graph that includes every branch, though.
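As a rough sketch of that soft alternative (again, not something the libraries above do), the hard `if` can be replaced by a sigmoid gate that blends both branches. The two-branch function, the `temperature` parameter and all names below are assumptions chosen for illustration; with only two branches, a softmax selector reduces to a sigmoid.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def hard_f(x):
    # Discrete branch: the choice itself has no useful derivative.
    return x * x if x > 0 else -3.0 * x


def soft_f(x, temperature=0.1):
    # Soft selector: both branches are always evaluated and blended.
    gate = sigmoid(x / temperature)
    return gate * (x * x) + (1.0 - gate) * (-3.0 * x)


def soft_f_grad(x, temperature=0.1):
    # Hand-derived derivative of soft_f (product rule on both terms).
    gate = sigmoid(x / temperature)
    dgate = gate * (1.0 - gate) / temperature
    return (dgate * (x * x) + gate * (2.0 * x)
            - dgate * (-3.0 * x) + (1.0 - gate) * (-3.0))


print(hard_f(2.0), soft_f(2.0))  # nearly equal far from the switch point
print(soft_f_grad(0.0))          # gradient now exists at the switch point
```

Note that every branch contributes to every forward and backward pass, which is the extra graph complexity mentioned above, and the result depends on how sharp the gate is made.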