by taliesinb 2478 days ago
While the gradients don’t technically exist, you usually use the limiting value from one side (e.g., for the commonly used ReLU activation, you can use either 0 or 1). This is justified by the observation that randomly initialized networks are extremely unlikely to encounter these particular values without some kind of deeper conspiracy happening. You could put this on a mathematical footing by saying the set of non-differentiable points has measure zero, for example.
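To make the "limiting value from one side" idea concrete, here is a minimal sketch in plain Python. The names `relu_grad`, `at_zero`, and `one_sided_slopes` are made up for illustration and are not taken from any particular framework; the point is only that the two one-sided slopes at 0 disagree, so the value returned there is a convention.

```python
def relu(x: float) -> float:
    """ReLU activation: max(x, 0)."""
    return x if x > 0.0 else 0.0


def relu_grad(x: float, at_zero: float = 0.0) -> float:
    """Derivative of ReLU away from 0; at x == 0 return a chosen convention.

    `at_zero` is a hypothetical knob: 0.0 picks the left-hand limit,
    1.0 picks the right-hand limit.
    """
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return at_zero


def one_sided_slopes(f, x: float, h: float = 1e-6):
    """Finite-difference estimates of the left and right slopes of f at x."""
    left = (f(x) - f(x - h)) / h
    right = (f(x + h) - f(x)) / h
    return left, right


left, right = one_sided_slopes(relu, 0.0)
print(left, right)                   # the two slopes differ at the kink
print(relu_grad(0.0))                # left-limit convention
print(relu_grad(0.0, at_zero=1.0))   # right-limit convention
```

Either choice agrees with the true derivative everywhere except at the single point x = 0.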

> randomly initialized networks

But "networks" here, you're thinking of ANNs, yes?

But in the context of proposing differentiable programming as an addition to a general-purpose language (and where the proposal explicitly brings up a bunch of cases outside of deep learning), is it fair to justify behavior based on what makes sense in a popular but narrow application?

It’s a good question what the plans are for DP languages to handle situations where non-differentiability shouldn’t be ignored.

For sensitivity analysis it might be disastrous to conclude that an output is sensitive to an input when it is actually not, merely because an intermediary ReLU hit 0, for example.

A conservative approach could be to define versions of the relevant functions that throw exceptions at such points, or that also calculate a trust margin for the resulting gradients; non-differentiability would then produce a zero trust margin.
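A rough sketch of what those two variants might look like for a single ReLU, in plain Python. Everything here is hypothetical (`NonDifferentiableError`, `relu_grad_strict`, `relu_grad_with_margin` are illustration-only names), and "trust margin" is interpreted as the distance from the input to the nearest kink, which is just one possible reading of the idea.

```python
class NonDifferentiableError(ArithmeticError):
    """Raised when a gradient is requested at a point where none exists."""


def relu_grad_strict(x: float) -> float:
    """ReLU derivative that refuses to pick a convention at the kink."""
    if x == 0.0:
        raise NonDifferentiableError("ReLU has no derivative at x == 0")
    return 1.0 if x > 0.0 else 0.0


def relu_grad_with_margin(x: float):
    """Return (gradient, margin).

    The margin is how far x can move before crossing the kink at 0,
    i.e. the radius within which the reported gradient stays valid.
    At the kink itself the margin is 0, flagging the gradient as untrusted.
    """
    grad = 1.0 if x > 0.0 else 0.0
    return grad, abs(x)


print(relu_grad_with_margin(-2.5))  # gradient 0 with a comfortable margin
print(relu_grad_with_margin(0.0))   # margin 0: do not trust this gradient

try:
    relu_grad_strict(0.0)
except NonDifferentiableError as err:
    print("refused:", err)
```

A sensitivity analysis could then discard or re-examine any result whose margin is zero (or below some tolerance) instead of silently reporting a slope.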

He/she addressed that: the set of points at which the function isn't differentiable has measure zero. Besides, this isn't some kind of new hack: one-sided limits (and therefore one-sided derivatives) were invented exactly for such cases (min, max, abs) and have been used by mathematicians probably since just about when calculus was invented.
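For reference, the one-sided derivatives in the abs case work out as follows (standard textbook definitions, with f(x) = |x| evaluated at x = 0):

```latex
f'_{-}(0) = \lim_{h \to 0^{-}} \frac{|0 + h| - |0|}{h} = \lim_{h \to 0^{-}} \frac{-h}{h} = -1,
\qquad
f'_{+}(0) = \lim_{h \to 0^{+}} \frac{|0 + h| - |0|}{h} = \lim_{h \to 0^{+}} \frac{h}{h} = +1.
```

Both limits exist but differ, so the ordinary derivative at 0 doesn't exist, and the only non-differentiable point is the single point {0}, which has measure zero.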