Hacker News new | ask | show | jobs
by backprojection 4623 days ago
I think it's worth noting that the problem with numerical differentiation, fundamentally, is that differentiation is an unbounded operator. In finite-differences, (the more obvious approach), you assume that your data are samples of some, general, function.

The problem then, is that that general functions have no (essential) bandlimit [1]. Remember that differentiation acts as a multiplication by a monomial, in the frequency domain [2]. Non-constant polynomials always eventually blow up away from 0, so in differentiating, you're multiplying a function by something that blows up, in the frequency domain. This means that, in the result, higher frequencies are going to dominate over lower frequencies, at a polynomial rate.

Let me be clear, the problem with numerical differentiation is not just that rounding errors accumulate, it's that differentiation is fundamentally unstable, and not something you want to apply to real-world data.

It depends very much on what your application is, however, I think generally a better approach to AD is to redefine your differentiation, by composing it with a low-pass filter. If designed properly, your low-pass filter will 'go to zero' faster (in the frequency-domain) than any polynomial, thus making this new operator bounded, and hence numerically more stable. It's not a panacea, but it begins to address the fundamental problem.

One example of such a filter is Gamma(n+1, n x^2)/Factorial[n], where Gamma is the incomplete gamma function [3].

In Python:

scipy.special.gammaincc(n+1,nx2) or mpmath.gammainc(n+1,nx2, regularized=True)

To see why this is a nice choice, notice item 2 in [4]. This filter is simply the product of exp(- x^2) (the Gaussian) multiplied by the first n-terms of the Taylor series of exp(+ x^2), (1/ the-Gaussian). Since this series converges unconditionally everywhere, as n-> +infinity, this filter converges to 1 for a fixed x (as you increase n), however, since it's still a gaussian times a polynomial, it always converges to 0 as you increase x, but fix n.

This is my area of research, so if anyone's interested I can give more details.

[1] https://en.wikipedia.org/wiki/Band-limit [2] https://en.wikipedia.org/wiki/Fourier_transform#Analysis_of_... [3] https://en.wikipedia.org/wiki/Incomplete_gamma_function [4] https://en.wikipedia.org/wiki/Incomplete_gamma_function#Spec...

7 comments

Read the article, it's not about computing derivatives of real world data (using finite differences or whatever), it's about exact derivatives of rational functions specified by a computer program. While what you wrote is interesting, none of it applies to this article.
^ Totally this. It's not meant for differentiating a function known only by its points. It's meant for evaluating nth-order derivatives of functions you can already evaluate at arbitrary values. The functions don't even have to be rational. As long as the function can be calculated by a computer by some method, so can its derivatives.

This comes in handy in all sorts of things where you might have designed this fancy kernel function to use in some process where you need to be able to calculate the value of the function and also of some derivatives--backpropagation for example [1].

As I was told by someone in the field, at one point, people used to generate machine learning publications simply by finding functions that required fancy mathematical tricks to find closed-form derivatives of chosen functions so that they would be usable in learning algorithms. But in many cases, this work is unnecessary if you use automatic differentiation.

It's a really cool concept, applicable in specific situations. If you need to know the derivative of a function that's not fully specified, you need numerical differentiation [2]. If you need a closed-form expression for your derivative function, that's when you need symbolic differentiation [3].

[1] http://en.wikipedia.org/wiki/Backpropagation [2] http://en.wikipedia.org/wiki/Numerical_differentiation [3] http://en.wikipedia.org/wiki/Symbolic_differentiation

> It's not meant for differentiating a function known only by its points. It's meant for evaluating nth-order derivatives of functions you can already evaluate at arbitrary values

This is correct. So my comment "I think generally a better approach to AD is to redefine your differentiation", doesn't really apply to AD. Essentially, AD is about differentiating functions, not data.

I find it funny how you think of differentiation in terms of "frequency domain", "bandlimit" and "filter". Very signal-processy, very engineery. A mathematician would speak about unbounded operators in a Banach space. :-)

Edit: Reminds me of the list on page 3 here: http://arxiv.org/pdf/math/9404236v1

> I find it funny how you think of differentiation in terms of "frequency domain", "bandlimit" and "filter". Very signal-processy, very engineery. A mathematician would speak about unbounded operators in a Banach space. :-)

This more or less depends on what kind of mathematician you're talking to. A mathematician working on Frames and signals almost certainly uses that language. But someone working in PDEs probably doesn't, even if the theory intersects a lot. It's probably a sign that there's nothing wrong with mathematicians and engineers sharing concepts :).

This is interesting. I am generally very skeptical about taking derivatives of functions computed from real-world data for exactly this reason. What I normally end up doing is applying some form of kernel smoothing (e.g. nearest K points with a Gaussian kernel) to approximate the function, and then compute derivatives of that instead.

Would you recommend another way?

This is pretty much exactly what I'm saying. So you're convolving with a gaussian kernel. That's the exact same thing as multiplying, in the frequency domain, by another gaussian. (convolution theorem, and the FT of a gaussian is a gaussian). The Gamma function filter goes one step further, it throws in a polynomial to make that gaussian 'flatter', which is more suitable for a low-pass filter.

Try the script at the end of this comment. Here's the output http://imgur.com/5948cW6

Notice that the 0-th one is just a gaussian. If you're interested, these things have traditionally been called HDAFs, I have a library to do this kind of thing https://bitbucket.org/nmaxwell/hdaf it may need more documentation. I don't mind if anyone provides constructive feedback on it.

The functionality you'd be interested in is hdaf.hdaf_periodic_samples, - you can convolve with that to low-pass filter, and use hertz_to_sigma to get the sigma parameter (from cutoff frequency).

  import numpy, scipy.special
  import matplotlib.pyplot as plt

  x=numpy.linspace(-2,5,1000)
  for n in [0,1,4,10]:
      s=1 if n==0 else n
      h=scipy.special.gammaincc(n+1,s*x**2)
      plt.plot(x, h)
  plt.savefig("filters")
EDIT: formatting.
Thanks for the insight! It's interesting to read a mathematician's more formal take on this -- I'm from an engineering background, and I've often seen the problem in control theory, where we approximate signals using observers and then differentiate the observed signal.

Edit: I should clarify that it's cool to read a mathematician's insight into the problem as seen by engineers, written in a way that engineers can understand.

Low pass filters are absolute miracles when working with numerical derivatives. I have had instances where that was the only reason I ever got anything meaningful out of them. It really comes down to the fact that the numerical representation of a function is in general not truly analytic. It is has kinks and bends all over the place.
Do you have any references for the approach that you recommended? You're saying that differentiation (not even just numerical differentiation) is "not something you want to apply to real-world data" ... This is surprising since differentiation obviously is used in the real-world in some very critical applications.
I believe you, but what applications do you have in mind?
[EDIT] As others have said below, this comment, while maybe useful, doesn't really apply to the article (or AD), since AD is about differentiating functions, not data.