by phdelightful 583 days ago
I have worked on a performance-portable math library. We implement BLAS, sparse matrix operations, a variety of solvers, some ODE stuff, and various utilities for a variety of serial and parallel execution modes on x86, ARM, and the three major GPU vendors.

The simplest and highest-impact tests are all the edge cases - if an input matrix/vector/scalar is 0/1/-1/NaN, that usually tells you a lot about what the outputs should be.

It can be difficult to determine a sensible numerical limit for the error in the algorithms. The simplest example is a dot product: summing floats is not associative, so doing it in parallel is not bitwise the same as doing it serially. For dot in particular it's relatively easy to come up with an error bound, but for anything more complicated it takes a particular expertise that is not always available. This has been a work in progress, and sometimes (usually) we just picked a magic tolerance out of thin air that seems to work.
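A tiny illustration of the non-associativity point, in Python (chosen only for illustration; nothing here is the library's actual code). The two groupings stand in for a serial and a tree-shaped parallel reduction, and the `rel_tol` value is an arbitrary placeholder rather than a derived bound.

```python
import math

# Two groupings of the same three-term sum, standing in for a serial
# reduction and a parallel (tree-shaped) reduction.
serial = (0.1 + 0.2) + 0.3
tree = 0.1 + (0.2 + 0.3)

# The two results differ in the last bit, so a bitwise comparison fails ...
bitwise_equal = serial == tree

# ... while a tolerance-based comparison passes. The tolerance is an
# arbitrary placeholder, not a derived error bound.
close_enough = math.isclose(serial, tree, rel_tol=1e-12)

print(serial, tree, bitwise_equal, close_enough)
```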

Solvers are tested using analytical solutions and by inverting them, e.g. if we're solving Ax = y for x, then Ax computed from the returned x should come out "close" to the original y (see error tolerance discussion above).
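A minimal sketch of that "multiply it back" check. `solve` is a hypothetical stand-in (plain Gaussian elimination with partial pivoting) for whatever solver is under test, `relative_residual` is an illustrative helper, and the `1e-12` threshold is exactly the kind of magic tolerance discussed above, not a derived bound.

```python
def solve(A, y):
    """Hypothetical stand-in solver: Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, y)]  # augmented matrix [A | y]
    for k in range(n):
        # Swap in the row with the largest pivot candidate in column k.
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def relative_residual(A, x, y):
    """max|Ax - y| scaled by ||A||_inf * ||x||_inf + ||y||_inf."""
    r = [sum(a * xi for a, xi in zip(row, x)) - yi for row, yi in zip(A, y)]
    norm_A = max(sum(abs(a) for a in row) for row in A)
    norm_x = max(abs(v) for v in x)
    norm_y = max(abs(v) for v in y)
    return max(abs(v) for v in r) / (norm_A * norm_x + norm_y)


A_demo = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
y_demo = [1.0, 2.0, 3.0]
x_demo = solve(A_demo, y_demo)
residual_demo = relative_residual(A_demo, x_demo, y_demo)
```

Scaling the residual by the norms of A, x, and y makes the check insensitive to the overall magnitude of the problem, which is why it is preferred here over comparing Ax and y with an absolute threshold.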

One of the most surprising things to me is that the suite has identified many bugs in vendor math libraries (OpenBLAS, MKL, cuSparse, rocSparse, etc.) - a major component of what we do is wrap up these vendor libraries in a common interface so our users don't have to do any work when they switch supercomputers, so in practice we test them all pretty thoroughly as well. Maybe I can let OpenBLAS off the hook due to the wide variety of systems they support, but I expected the other vendors would do a better job since they're better-resourced.

For this reason we find regression tests to be useful as well.

> sometimes (usually) we just picked a magic tolerance out of thin air that seems to work.

Probably worth mentioning that in general the tolerance should be relative error, not absolute, for floating point math. Absolute error tolerance should only be used when there’s a maximum limit on the magnitude of inputs, or the problem has been analyzed and understood.

I know that doesn’t stop people from just throwing in 1e-6 all over the place, just like the article did. (Hey I do it too!) But if the problem hasn’t been analyzed then an absolute error tolerance is just a bug waiting to happen. It might seem to work at first, but then catch and confuse someone as soon as the tests use bigger numbers. Or maybe worse, fail to catch a bug when they start using smaller numbers.

But then relative error is also not a panacea. If I compute 1 + 1e9, then producing 1e9 - 1 instead would fall within a relative error bound of 1e-6 easily. More generally, relative error works only if your computation scales "multiplicatively" from zero; if there's any additive component, it's suspect.

Of course, as you say, absolute error is also crap in general: it's overly restrictive for large inputs and overly permissive for small ones.
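To make the three failure modes concrete, here is a small sketch using Python's `math.isclose`; the specific numbers (`1e-6`, `1e9`, `1e-9`) are just illustrative choices.

```python
import math

# Relative tolerance: 1e9 - 1 is accepted in place of 1 + 1e9, because the
# allowed difference scales with the magnitude (about 1e3 here).
rel_accepts_wrong_sum = math.isclose(1 + 1e9, 1e9 - 1, rel_tol=1e-6, abs_tol=0.0)

# Absolute tolerance on large values: a result that agrees to about 12
# significant digits is rejected, because 1e9 * 1e-12 = 1e-3 exceeds 1e-6.
big = 1e9
abs_rejects_good_big = not math.isclose(
    big, big * (1 + 1e-12), rel_tol=0.0, abs_tol=1e-6
)

# Absolute tolerance on small values: a result that is off by a factor of
# two is accepted, because the difference (1e-9) is below 1e-6.
small = 1e-9
abs_accepts_bad_small = math.isclose(small, 2 * small, rel_tol=0.0, abs_tol=1e-6)

print(rel_accepts_wrong_sum, abs_rejects_good_big, abs_accepts_bad_small)
```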

I'm not a numerics person, but I do end up needing to decide on something sensible for error bounds on computations sometimes. How does one do this properly? Interval arithmetic or something?

This is what ULPs are for (https://en.wikipedia.org/wiki/Unit_in_the_last_place).

It's easier for most building blocks (like transcendental functions) to be discussed in terms of worst case ULP error (e.g., <= 1 everywhere, <= 3, etc.). For example, SPIR-V / OpenCL has this section on the requirements to meet the OpenCL 3.0 spec (https://registry.khronos.org/OpenCL/specs/3.0-unified/html/O...). NVIDIA includes per-PTX-revision ULP information in their docs (e.g., https://docs.nvidia.com/cuda/parallel-thread-execution/#floa... for Divide and https://docs.nvidia.com/cuda/parallel-thread-execution/#floa... more broadly).
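For testing, a ULP-based comparison can be sketched by reinterpreting doubles as integers so that adjacent floats map to adjacent integers. `ulp_distance` is a hypothetical helper name, and this sketch handles 64-bit doubles only.

```python
import math
import struct


def _ordered_bits(x):
    """Map a double to an integer so that adjacent floats differ by exactly 1."""
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Doubles with the sign bit set unpack as negative int64; mirror their
    # magnitude bits so the mapping is monotonic across zero.
    return bits if bits >= 0 else -(bits & 0x7FFFFFFFFFFFFFFF)


def ulp_distance(a, b):
    """Number of representable doubles between a and b (0 if equal)."""
    if math.isnan(a) or math.isnan(b):
        raise ValueError("ULP distance is undefined for NaN")
    return abs(_ordered_bits(a) - _ordered_bits(b))


print(ulp_distance(1.0, 1.0 + 2.0 ** -52))  # neighbouring doubles
print(ulp_distance(0.1 + 0.2, 0.3))
print(ulp_distance(0.0, -0.0))
```

A test can then assert something like `ulp_distance(result, reference) <= k`, where `k` comes from the documented ULP guarantee of the building block being tested.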

> More generally, relative error works only if your computation scales “multiplicatively” from zero; if there’s any additive component, it’s suspect.

IEEE floating point inherently scales from zero: the absolute error in any computation is proportional to the magnitude of the input numbers, whether you’re adding or multiplying or doing something else. It’s the reason that subtracting two large numbers has a higher relative error than subtracting two small numbers; cf. catastrophic cancellation.

> How does one do this properly?

There’s a little bit of an art to it, but you can start by noting that the actual result of any accurate operation has a maximum error of 0.5 LSB (least significant bit) simply as a byproduct of having to store the result in 32 or 64 bits; essentially just think about every single math instruction being required to round the result so it can fit into a register. Now write an expression for your operation in terms of perfect math. If I’m adding two numbers it will look like a[+]b = (a + b)*(1 + e), where e is your 0.5 LSB epsilon value. For 32-bit float, |e| <= 2^-24. Here I differentiate between digital addition with a finite-precision result and perfect addition, using [+] for digital addition and + for perfect addition.

This gets hairy and you need more tricks for anything complicated, but multiplying each and every operation by (1+e) is the first step. It quickly becomes apparent that the maximum error is bounded by |e| * (|a|+|b|) for addition or |e| * (|a| * |b|) for multiply… substitute whatever your operation is.
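The (1+e) model can be checked empirically with exact rational arithmetic. This sketch uses Python doubles, so the epsilon is 2^-53 rather than the 2^-24 of 32-bit floats; the input range and sample count are arbitrary choices, and overflow/underflow are assumed not to occur.

```python
import random
from fractions import Fraction

U = Fraction(1, 2 ** 53)  # 0.5 LSB relative error for 64-bit doubles


def relative_rounding_error(computed, exact):
    """|computed - exact| / |exact|, evaluated exactly with rationals."""
    return abs(Fraction(computed) - exact) / abs(exact)


rng = random.Random(0)
worst_add = Fraction(0)
worst_mul = Fraction(0)
for _ in range(10_000):
    a = rng.uniform(-1e3, 1e3)
    b = rng.uniform(-1e3, 1e3)
    exact_sum = Fraction(a) + Fraction(b)    # perfect a + b
    exact_prod = Fraction(a) * Fraction(b)   # perfect a * b
    if exact_sum != 0:
        worst_add = max(worst_add, relative_rounding_error(a + b, exact_sum))
    if exact_prod != 0:
        worst_mul = max(worst_mul, relative_rounding_error(a * b, exact_prod))

# Both observed worst cases should stay at or below U, i.e. |e| <= 2^-53.
print(float(worst_add / U), float(worst_mul / U))
```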

When doing more complicated order-dependent error analysis, it’s helpful to use bounds and to allow error estimates to grow slightly in order to simplify expressions. This way you can prove the error is less than a certain expression, but the expression might be conservative.

A 3d dot product is a good example to work through using (1+e). Typically it’s reasonable to drop e^2 terms, even though it will technically compromise your error bound proof by some minuscule amount.

    a[*]x [+] b[*]y [+] c[*]z = ((ax(1+e) + by(1+e))(1+e) + cz(1+e))(1+e)
    = ((ax+axe + by+bye)(1+e) + cz+cze)(1+e)
    = (ax+by+(ax+by)e + axe+bye+(ax+by)ee + cz+cze)(1+e)
    = (ax + by + 2e(ax+by) + e^2(ax+by) + cz+cze)(1+e)
    = ax + by + 2e(ax+by) + e^2(ax+by) + cz + cze + axe + bye + 2e^2(ax+by) + e^3(ax+by) + cze + cze^2
    = ax+by+cz + 3e(ax+by) + 2e(cz) + 3e^2(ax+by) + e^2cz + e^3(ax+by)
Now drop all the higher order terms of e.

    = ax+by+cz + 3e(ax+by) + 2e(cz)
Now also notice that 2e|cz| <= 3e|cz|, so we can say the total error bound:

    <= (ax + by + cz) + 3e( |a||x| + |b||y| + |c||z| )
And despite the intermediate mess, this suddenly looks very conceptually simple and doesn’t depend on the order of operations. If the input values are all positive, then we can say the error is proportional to 3 times the magnitude of the dot product. And it’s logical too because we stacked 3 math operations, one multiply for each element of the sum and two adds.
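The same reasoning extends to n terms (one multiply per element plus n-1 adds), and the order-independence can be checked numerically. A sketch in Python doubles with exact rationals as the reference: the function names are made up for illustration, and the bound uses n*u/(1 - n*u), the usual rigorous replacement for the first-order n*e that keeps the dropped higher-order terms covered.

```python
import random
from fractions import Fraction

U = Fraction(1, 2 ** 53)  # 0.5 LSB epsilon for 64-bit doubles


def dot_serial(x, y):
    """Left-to-right accumulation."""
    total = 0.0
    for a, b in zip(x, y):
        total += a * b
    return total


def dot_pairwise(x, y):
    """Tree-shaped reduction, a stand-in for a parallel summation order."""
    terms = [a * b for a, b in zip(x, y)]
    if not terms:
        return 0.0
    while len(terms) > 1:
        paired = [terms[i] + terms[i + 1] for i in range(0, len(terms) - 1, 2)]
        if len(terms) % 2:
            paired.append(terms[-1])  # odd element carries over unchanged
        terms = paired
    return terms[0]


def dot_exact(x, y):
    return sum((Fraction(a) * Fraction(b) for a, b in zip(x, y)), Fraction(0))


def dot_error_bound(x, y):
    """gamma_n * sum(|x_i||y_i|), with gamma_n = n*u / (1 - n*u)."""
    n = len(x)
    gamma = n * U / (1 - n * U)
    magnitude = sum((abs(Fraction(a) * Fraction(b)) for a, b in zip(x, y)), Fraction(0))
    return gamma * magnitude


rng = random.Random(1)
xs = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
ys = [rng.uniform(-1.0, 1.0) for _ in range(1000)]
exact = dot_exact(xs, ys)
bound = dot_error_bound(xs, ys)
err_serial = abs(Fraction(dot_serial(xs, ys)) - exact)
err_pairwise = abs(Fraction(dot_pairwise(xs, ys)) - exact)
```

Both summation orders land inside the same bound, which is what makes it usable as a test tolerance for a parallel dot product. Note that the bound scales with the sum of |x_i||y_i|, not with the dot product itself, so with mixed signs it can be much larger than the result.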

Sorry if that was way too much detail… I got carried away. :P I glossed over some topics, and there could be mistakes, but that’s the gist. I’ve had to do this for my work on a much more complicated example, and it took a few tries. There is a good linear algebra book about this, I think called Accuracy and Stability of Numerical Algorithms (Nicholas Higham). The famous PBR graphics book by Pharr et al. also talks about error estimation techniques.

Right, this is if you know the exact operations that your computation does, and that list is small enough.

My use case is testing an autodiff algorithm. So I have larger programs (for which doing this process would be quite cumbersome already), and then run them through a code transformation that makes it compute a gradient. What's an appropriate error bound for that gradient?

Ideally I would even want to be able to randomly generate input programs, differentiate them, and test correctness of the computed gradient. I feel like generalising your approach (in particular the dropping of higher-order e terms) smacks of interval arithmetic, but even with a proper error estimate based on interval arithmetic, one would have to incorporate an estimate of the accuracy of their derivatives, too.

And to make life harder, I'd like to do this in a parallel setting where e.g. reductions (sums, products, etc.) have non-deterministic order to improve parallelism. I don't know how to approach this!

The Higham book does work toward error analysis of matrix multiplies, it would be useful to see how that’s done.

In the case of autodiff, you do presumably know the exact computations that are done, there just might be so many of them that it’s infeasible to work it out analytically.

It depends on your requirements, so I’m not sure if this suggestion will work for you, but one strategy to consider would be to build the error bound computation as a function into your math operations. It’s much easier to compute error bounds than it is to write an expression for them or to prove them. That strategy won’t give you conservative bounds, and if your input is non-deterministic, the answer will vary on every run. But you could sample your error bounds enough times to have some confidence in the statistical answer.
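A rough sketch of what "building the error bound into the math operations" could look like. `Tracked` is a made-up class for illustration, not the PBR book's implementation and not any autodiff library's API. The bound arithmetic is itself done in floating point, so as noted above the result is not strictly conservative.

```python
from fractions import Fraction

U = 2.0 ** -53  # 0.5 LSB epsilon for 64-bit doubles


class Tracked:
    """A value plus a running bound on its absolute error (illustrative only)."""

    def __init__(self, value, err=0.0):
        self.value = float(value)
        self.err = float(err)

    def __add__(self, other):
        value = self.value + other.value
        # Incoming errors add up; rounding the result adds at most U * |value|.
        err = self.err + other.err + U * abs(value)
        return Tracked(value, err)

    def __mul__(self, other):
        value = self.value * other.value
        # Product rule on the incoming errors, plus the rounding of the result.
        err = (abs(self.value) * other.err + abs(other.value) * self.err
               + self.err * other.err + U * abs(value))
        return Tracked(value, err)


# The 3d dot product, with the inputs treated as exact doubles.
a, b, c = Tracked(0.1), Tracked(0.2), Tracked(0.3)
x, y, z = Tracked(3.0), Tracked(-7.0), Tracked(11.0)
result = a * x + b * y + c * z

# Exact reference using rationals built from the same double inputs.
exact = Fraction(0.1) * 3 + Fraction(0.2) * -7 + Fraction(0.3) * 11
actual_error = abs(Fraction(result.value) - exact)
print(result.value, result.err, float(actual_error))
```

Because the bound is computed from the values actually seen at run time, it adapts to control flow and array sizes for free, at the cost of being a per-run number rather than a single closed-form expression.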

I’m assuming in both paragraphs above that you have control over the autodiff implementation and can modify it. If that’s not true, if it’s not yours and not open source, then the only alternative is to ask the maintainer.

IIRC this is what the PBR book does, it weaves an error bound function into the base class of a math operation and then you can query the error from a parse tree of different math ops, or something like that.

Context: I'm writing an autodiff algorithm, and that's the thing I want to test.

> one strategy to consider would be to build the error bound computation as a function into your math operations

That's autodiff! :D The error (change) in a function's output given an error (change) in its input is the definition of its derivative.

So I guess I can use my autodiff algorithm to compute the error bounds for testing my autodiff algorithm! ... Uhm.

Actually, though, I want _one_ error bound, not one that varies per run. But I guess you can do the same thing symbolically: you just end up with

1. a slightly-too-optimistic bound because you will (for practicality) discard higher-order e terms;

2. a conservative bound because you'll be pessimistic in the face of control flow, array sizes, etc. where the precise operations performed depend on the input.

I guess this symbolic approach works until you have unbounded sequential loops (which pessimistically have unbounded error, because it may accumulate indefinitely). Or perhaps it breaks down already with arbitrary-size arrays; what is the error bound on `lambda x. sum([x]*n)`, assuming n is unknown? (Using python syntax as "universal syntax".)
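For that specific example, the bound can at least be written as a function of n, which shows why there is no single number when n is unknown. A sketch assuming a serial left-to-right sum in doubles (`repeated_sum` and `repeated_sum_bound` are illustrative names); the explicit loop is deliberate, since Python's builtin `sum` may use a compensated algorithm for floats on newer CPython versions.

```python
from fractions import Fraction

U = Fraction(1, 2 ** 53)  # 0.5 LSB epsilon for 64-bit doubles


def repeated_sum(x, n):
    """Serial left-to-right evaluation of sum([x] * n)."""
    total = 0.0
    for _ in range(n):
        total += x
    return total


def repeated_sum_bound(x, n):
    """Worst-case absolute error of repeated_sum, as a function of n."""
    if n < 2:
        return Fraction(0)  # 0.0 + x is exact, nothing gets rounded
    k = n - 1               # number of additions that can round
    gamma = k * U / (1 - k * U)
    return gamma * n * abs(Fraction(x))


rows = []
for n in (10, 1_000, 100_000):
    error = abs(Fraction(repeated_sum(0.1, n)) - n * Fraction(0.1))
    rows.append((n, error, repeated_sum_bound(0.1, n)))
    print(n, float(rows[-1][1]), float(rows[-1][2]))
```

The bound grows roughly like n^2 * u * |x| (about n * u relative to the result), so any fixed tolerance is eventually exceeded by the worst case as n grows.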

> then you can query the error from a parse tree of different math ops, or something like that.

If there is no dynamic control flow nor variable-size data structures etc. in that parse tree, I suspect they do kind of what I hand-wavingly described above.

It's not perfect enough that I'm going to implement this approach immediately (variable-size arrays are kind of core to what I want to do). But perhaps there's some trick I can pull out of this. Thanks for the ideas!

I've also been surprised many times by issues in numerical libraries. In addition to matrices with simple entries, I've found plenty of bugs just testing small matrices, with dimensions in {0,1,2,3,4}. Many libraries/routines fall over when the matrix is small, especially when one dimension is 0 or 1.

Presently, I am working on cuSPARSE and I'm very keen to improve its testing and correctness. I would appreciate anything more you can share about bugs you've seen in cuSPARSE. Feel free to email me, eedwards at nvidia.com

This is one of the reasons I argue that it's almost always better to prioritize speed and stability over accuracy specifically. No one actually knows what their thresholds are (including library authors), but the sky isn't falling despite that. Instabilities and nondeterminism will blow up a test suite pretty quickly though.

> No one actually knows what their thresholds are (including library authors)

If low-level numerical libraries provided documentation for their accuracy guarantees, it would make it easier to develop software on top of those libraries. I think numerical libraries should be doing this, when possible. It's already common for special-function (e.g. sin, cos, sqrt) libraries to specify their accuracy in ULPs. It's less common for linear algebra libraries to specify their accuracy, but it's still quite doable for BLAS-like operations.

What I'm trying to convey is that the required accuracies for the application are what's unclear. To give an example of a case where accuracy matters, I regularly catch computational geometry folks writing code that branches differently on positive, negative, and 0 results. That application implies 0.5 ulp, which obviously doesn't match the actual implementation accuracy even if it's properly specified, so there's usually a follow-up conversation trying to understand what they really need and helping them achieve it.

Yeah, we really just try to come up with very loose bounds since the analysis is hard. Even so, it does occasionally stop us from getting things way way wrong.

Nx0, 0xN, and 0x0 matrices are great edge cases.

0-length vectors, too.

Then, do the same with Nx1, 1xN, and 1x1 matrices.
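A sketch of sweeping those shapes in a test. `matvec` is a toy row-major matrix-vector product standing in for the routine under test; the explicit `rows`/`cols` arguments are there because a nested list cannot represent a 0xN shape.

```python
import itertools


def matvec(rows, cols, data, x):
    """Toy y = A @ x with A stored row-major in a flat list (stand-in routine)."""
    if len(data) != rows * cols or len(x) != cols:
        raise ValueError("shape mismatch")
    return [sum(data[r * cols + c] * x[c] for c in range(cols)) for r in range(rows)]


def check_small_shapes(dims=(0, 1, 2, 3, 4)):
    """Run every rows x cols combination, including zero-sized ones."""
    checked = 0
    for rows, cols in itertools.product(dims, repeat=2):
        y = matvec(rows, cols, [1.0] * (rows * cols), [1.0] * cols)
        assert len(y) == rows             # Nx0 still yields N outputs
        assert all(v == cols for v in y)  # all-ones input: each entry is cols
        checked += 1
    return checked


print(check_small_shapes())
```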

100% agree that picking numeric tolerances can be tricky. It is especially tricky when writing a generic math library like you are. If you're doing something more applied it can help to take limits from your domain. For the example in the blog, if you're using GPS to determine your position on Earth, you probably know how precise physics allows that answer to be, and you only need to test to that tolerance (or an order of magnitude more strict, to give some wiggle room).