Hacker News new | ask | show | jobs
by imurray 4673 days ago
For casual purposes st may be convenient, but it doesn't have state of the art numerical stability:

    my $variance = $count > 1 ? ($sum_square - ($sum**2/$count)) / ($count-1) : undef;
Taking the difference between two similar numbers loses precision, and in extreme cases squaring the raw numbers could cause overflow. For comparison, see the recently posted: http://www.python.org/dev/peps/pep-0450/ and https://en.wikipedia.org/wiki/Algorithms_for_calculating_var...
2 comments

Thanks! I changed the algorithm to online variance, hope it is more stable:

https://github.com/nferraz/st/commit/d0fb1bf814fc5940c5aae39...

If you need numerical stability, I'd use Gary Perlmann's [|stat](http://oldwww.acm.org/perlman/stat/history.html). It's older and somewhat harder to get a copy of, but it's as reliably correct as a piece of software can be...
I've just had a look at the |STAT source, and it computes the variance with

        double  M       = Sum/N;                /* mean */
        double  var     = (s2 - M*Sum)/(N-1);   /* variance */
where s2 is the sum of squares. In most reasonable situations, this approach will work fine. It just doesn't take as wide a variety of inputs as is easily possible to achieve with normal floating point doubles. In fairness, the |STAT terms and conditions state:

|STAT PROGRAMS HAVE NOT BEEN VALIDATED FOR LARGE DATASETS, HIGHLY VARIABLE DATA, NOR VERY LARGE NUMBERS.