Hacker News new | ask | show | jobs
by memexy 2198 days ago
Differentiating through control flow has never made sense to me. What does it mean to differentiate the following function: "f(x) = x > 0 ? x : -x"? If you plot this function you get a sharp corner at 0 which means it's not differentiable there because the limit from the left is -1 and the limit from the right is 1. Since 1 =/= -1 the derivative does not exist at 0.

So how are AD libraries claiming to differentiate such functions? Is there an implicit assumption that the user knows the derivative does not make sense at 0?

Edit: I just tried this and it gives the wrong answer without any hint that it's incorrect:

"""

julia> f

f (generic function with 1 method)

julia> f(0), f'(0)

(0, -1)

julia> f'(1), f'(-1)

(1, -1)

"""

6 comments

Yes, you can just consider it a piecewise differentiable function. That's not usually a problem in practice if you think of the derivatives more as heuristics in a search problem than absolute truth. Obviously how well your optimization works does depend on the differentiability attributes of your underlying function and how well your optimization scheme can handle it.
If subgradients are enough (-1 is correct subgradient at 0 in your example) then there are valid approaches for AD subgradient, see https://arxiv.org/abs/1809.08530
Thanks for the link.
if you want to be rigorous about it, you can often talk in terms of elements of the set of subgradients rather than gradients, and the convergence proofs for many of the popular optimization algorithms still go through.
Any references for sub-gradients and convergence proofs I can take a look at?
Depending on the level you're looking for, I'm always a huge fan of Rockafellar [0] for the more mathematical expositions (they're lovely, clear, and very well thought through).

Even in this case, though, subgradients may not exist (they're only guaranteed to exist for convex functions and any nonconvex function has, almost by definition, at least one point for which there exists no subgradient), in which case one talks about sub-derivatives instead. These always exist whenever a function is semicontinuous (such as the "if" statement example given in the GGP comment), even if it is not convex [1].

All of this is to say, the subject of general variational analysis is mathematically nice, but computationally extraordinarily difficult.

The first proofs of even local convergence of GD to local optima (not stationary points!) with high probability, even in the case of smooth, differentiable functions, only recently emerged as well, and the results are rather weak in the sense of limits [2]. Query-hardness has also been shown for many examples (even with unbounded computational time) in [3] even when the functions are smooth.

The non-smooth case becomes even harder and you have exponential-size query complexity in the number of dimensions even when you restrict that the function cannot "vary too much," locally (i.e., if it is required to be L-Lipschitz) [4].

-----

[0] For example, Variational Analysis covers a good amount of this stuff https://sites.math.washington.edu/~rtr/papers/rtr169-VarAnal...

[1] c.f., chapter 8 in [0].

[2] https://arxiv.org/abs/1602.04915

[3] https://arxiv.org/abs/1912.02365 in the stochastic case, and https://arxiv.org/abs/1710.11606 along with https://arxiv.org/abs/1711.00841 in the nonstochastic case

[4] See, e.g., 1.1.3 in Nesterov's Introductory Lectures in Convex Optimization (can be easily found online).

wow, the rockafellar book is fantastic. i can't believe i'd never run into it before. thanks!
Thanks.
The kinks correspond to a set of measure zero, which you will likely never hit during execution, so one can safely ignore the problem as not physically relevant. One way to think of the problem is that the cost function we’re differentiating is approximate/fake, and whatever it needs to be (at some special neighborhoods) to give us derivatives we consider sensible (in large regions).

After all, there’s nothing so special about the ReLU... It would be very very weird/unstable if our algorithms worked for ReLU, but not the link-smoothed version of ReLU.

Hmm... I'm not sure I agree.

All optimal points (for, say, optimizing a linear function) will lie on the extremal points of the feasible domain, many of which will be points where the constraint functions are not differentiable. In all cases you can turn nonlinear objective function optimization (say over f) into linear objective function optimization by adding a constraint f(x) ≤ t and moving t to the objective.

Now, I will agree that smooth optimization algorithms will work ok, but try optimizing abs(x) with GD; you'll find that the best possible error you can achieve (other than by sheer luck) will be ~O(L) where L is your stepsize.

> All optimal points

Sorry I screwed up a bit when writing that: an optimal point exists which lies on the extremal points of the feasible domain. In many cases, it can be shown that the optimal point lies on the extremal points of the feasible domain (when the objective is linear)

Yes, but we’re going to be restricted to O(L) final accuracy no matter what, for gradient descent (we could choose second order optimizations, etc, but that’s an orthogonal point — we’re happy to get within an epsilon ball of the answer).
> Yes, but we’re going to be restricted to O(L) final accuracy no matter what

This is not, in general, true for smooth functions so long as L is small enough (you can reach arbitrary accuracy with GD if L is smaller than ~ the reciprocal of the Lipchitz constant of a differentiable objective function but it need not be arbitrarily small).

My question wasn't about the theoretical aspects of measurability since any countable set of points will have measure 0 but about all AD libraries sweeping this kind of issue under the rug. Where in the Zygote docs is it mentioned that the absolute value function will give the wrong answer when differentiated?
In the same way that the ReLU derivative is not defined at x=0. Most of the time, in practice, this all doesn't really matter and you can still get gradient descent to work in a useful way.
the gradient doesn't exist but subgradients do exist (at points of non-differentiability) and that is still useful (case in point as someone else mentions ReLU)

https://see.stanford.edu/materials/lsocoee364b/01-subgradien...

Thanks. That's a nice tutorial.