by taliesinb
2478 days ago
While the gradients don’t technically exist, you usually use the limiting value from one side (e.g., for the commonly used ReLU activation, you can use either 0 or 1). This is justified by the observation that randomly initialized networks are extremely unlikely to encounter these particular values without some kind of deeper conspiracy happening. You could put this on a mathematical footing by saying the set of non-differentiable points has measure zero, for example.
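To make the one-sided convention concrete, here is a minimal Python sketch. The names `relu_grad`, `value_at_zero`, and `one_sided_slope` are made up for illustration and are not taken from any particular framework; the point is only that the two one-sided slopes at 0 disagree, so an implementation has to pick a value there.

```python
def relu(x: float) -> float:
    """ReLU activation: max(0, x)."""
    return x if x > 0.0 else 0.0


def relu_grad(x: float, value_at_zero: float = 0.0) -> float:
    """Derivative of ReLU, with an explicit (hypothetical) convention at the kink.

    Away from 0 the derivative is well defined. At exactly 0 it is not,
    so we return whatever the caller chose, typically 0 or 1.
    """
    if x > 0.0:
        return 1.0
    if x < 0.0:
        return 0.0
    return value_at_zero


def one_sided_slope(f, x: float, h: float) -> float:
    """Finite-difference slope of f at x; the sign of h selects the side."""
    return (f(x + h) - f(x)) / h


# The two one-sided limits at the kink disagree, so no true derivative exists.
left = one_sided_slope(relu, 0.0, -1e-6)   # slope approaching from below
right = one_sided_slope(relu, 0.0, 1e-6)   # slope approaching from above
```

Either choice of `value_at_zero` matches one of the two one-sided slopes. Whether that is acceptable depends on how likely an input is to land exactly on the kink, which is the measure-zero argument above.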
But "networks" here, you're thinking of ANNs, yes?
But in the context of proposing differentiable programming as an addition to a general-purpose language (and where the proposal explicitly brings up a bunch of cases outside of deep learning), is it fair to justify behavior based on what makes sense in a popular but narrow application?