Hacker News new | ask | show | jobs
by wenc 2478 days ago
I'm curious (for folks in the know): how does differentiable programming handle non-differentiable points? Can it detect non-differentiable/non-smooth functions?

Non-smooth functions like abs(), max(), min() have points where derivatives do not exist. ReLU functions are non-differentiable at their hinge points.

Disjoint IF-THEN-ELSE conditions are discontinuities in the function space, and are traditionally handled in optimization with mixed-integer formulations (i.e. split up the space and do something clever like branch-and-bound to find the optimum).

4 comments

Other people have answered what they do, but this is the big gap between people talking about 'differentiable programming' in theory, and having it actually work in practice.

It's true that once you have control flow, the gradient quickly becomes meaningless. I posted an example here: https://news.ycombinator.com/item?id=20892287

That's also the biggest reason I tend to find much of this "differentiable programming" stuff to be overhyped. It's hard to reformulate programs in a way s.t. the derivative can mean something meaningful. And I'm not convinced traditional languages will benefit.

That's not to say that there isn't cases where your program can be formulated to have a meaningful derivative.

See this differentiable ray tracer: https://people.csail.mit.edu/tzumao/diffrt/

> It's hard to reformulate programs in a way s.t. the derivative can mean something meaningful.

Really? The gradients computed by AD are the exact answer to the following question: if I were to change this input or parameter an infinitesimal amount, how much would it change the output of my function? That is always meaningful (when it is defined), and means what I just said. You can easily make functions where it is not defined, of course, just like you can make a sphere into two spheres with the Banach-Tarski theorem!

But there are vast, vast forests of numerical computation employed in industry, science, finance, engineering, where it is almost always defined.

And even for more “chunky” computations where the non-differentiability is more severe, there are algorithms like REINFORCE that you can use to estimate gradients through these parts.

It's not meaningful in the sense that it won't correspond with your intuition of "will increasing the input increase my output". Presumably the point of differentiable programming isn't just getting the derivative for fun, it's for optimizing some quantity.

For example, take this code.

     x,y
     for (int i=0; i<x; i++)
         y += 1
     return y
It's technically true that the gradient of 0 is correct (modulo boundaries). But if someone was trying to optimize this function, that's not very helpful.

I believe REINFORCE is not of much help either - it's not magic. I'm not aware of any stochastic gradient estimators that are helpful in this case (although if there is a method I'd like to hear about it).

While the gradients don’t technically exist, you usually use the limiting value from one side (e.g for the commonly used ReLU activation, you can use either 0 or 1). This is justified by the observation that randomly initialized networks are extremely unlikely to encounter these particular values without some kinder of deeper conspiracy happening. You could put this on a mathematical footing by saying the set of non differentiable points has measure zero, for example.
> randomly initialized networks

But "networks" here, you're thinking of ANNs, yes?

But in the context of proposing differential programming as an addition to a general purpose language (and where the proposal explicitly brings up a bunch of cases outside of deep learning), is it fair to justify behavior based on what makes sense in a popular but narrow application?

It’s a good question what the plans are for DP languages to handle situations where non-differentiability shouldn’t be ignored.

For sensitivity analysis it might be disastrous to conclude that an output is sensitive to an input when it is actually not, merely because an intermediary ReLU hit 0, for example.

A conservative approach could be to define versions of the relevant functions that threw exceptions at such points, or that also calculated the trusted margin of the resulting gradients; non-differentiability would then produce a zero trust margin.

he/she addressed that - the points at which the function isn't differentiable has measure zero. besides this isn't some kind of new hack - one sided limits (and therefore derivatives) were invented exactly for such cases (min, max, abs) and have been used by mathematicians probably since just about when calculus was invented.
> on-smooth functions like abs(), max(), min() where derivatives do not exist

While these functions are not differentiable they are sub-differentiable. Which is an extension of differentiability.

Subderivatives are set, if the set only contain one point the function is differentiable. Otherwise it doesn't matter for gradient descent you can just use any element of the set.

> Disjoint IF-THEN-ELSE conditions are discontinuities in the function space

Same, it doesn't matter, they mathematically are equivalent to indicator functions and are subdifferentiable.

https://en.wikipedia.org/wiki/Subderivative

My understanding is you don't explicitly handle that. For whichever place you're evaluating, you have some computation graph, and you use the elementary operators on that graph. So, if you have defined ReLU with an `if x >= 0 return x` clause, then if you evaluate the derivative at 0, that's the branch you go down, and you say the answer is 1.