Hacker News new | ask | show | jobs
by Ragib_Zaman 2651 days ago
An 'online' mean calculation might resolve some of the common issues. Something like -

  def mean(ls):
      mu = 0.0
      for i, x in enumerate(ls):
          mu = i/(i+1) * mu + x/(i+1)
      return mu
4 comments

Even better if you can afford to sort the data and use a high precision type for mu. The algorithm will not be online any more and for python real numbers are usually double in practice. But in a constraint system where one has to work with float32, just using float64 for mu helps.

    mu = mu + (x - mu) / (i+1)
This is more natural when written in C:

    mu += (x - mu) / (i+1)
How much precision loss is to be expected here?
Good point. The precision loss will be pretty bad if the numbers added above have greatly different magnitudes, which could occur if the next number (x) is very different to the current mean (mu), or if the list is very long. The first issue could be partially mitigated by sorting the list first. If the list is very long, the above algorithm will necessarily add numbers of quite different magnitudes, but employing something like Kahan summation - https://en.wikipedia.org/wiki/Kahan_summation_algorithm , could partially mitigate that issue as well.
It is proportional to the number of terms in the sum. Kahan summation has a tighter error bound.
How is that different from the second example?
That you don't have to know the number of elements to sum at start. But I think the precision loss is even greater this way.