| > More generally, relative error works only if your computation scales “multiplicatively” from zero; if there’s any additive component, it’s suspect. IEEE floating point inherently scales from zero, the absolute error in any computation is proportional to the magnitude of the input numbers, whether you’re adding or multiplying or doing something else. It’s the reason that subtracting two large numbers has a higher relative error than subtracting two small numbers, c.f. catastrophic cancellation. > How does one do this properly? There’s a little bit of an art to it, but you can start by noting the actual result of any accurate operation has a maximum error of 0.5 LSB (least significant bit) simply as a byproduct of having to store the result in 32 or 64 bits; essentially just think about every single math instruction being required to round the result so it can fit into a register. Now write an expression for your operation in terms of perfect math. If I’m adding two numbers it will look like a[+]b = (a + b)*(1 + e), where e is your 0.5 LSB epsilon value. For 32 bit float, e == +/- 1e^-24. In this case I differentiate between digital addition with a finite precision result, and perfect addition, using [+] for digital addition and + for perfect addition. This gets hairy and you need more tricks for anything complicated, but multiplying each and every operation by (1+e) is the first step. It quickly becomes apparent that the maximum error is bounded by |e| * (|a|+|b|) for addition or |e| * (|a| * |b|) for multiply… substitute whatever your operation is. When doing more complicated order-dependent error analysis, it’s helpful to use bounds and to allow error estimates to grow slightly in order to simplify expressions. This way you can prove the error is less than a certain expression, but the expression might be conservative. A 3d dot product is a good example to work though using (1+e). Typically it’s reasonable to drop e^2 terms, even though it will technically compromise your error bound proof by some minuscule amount. a[*]x [+] b[*]y [+] c[*]z = ((ax(1+e) + bx(1+e)) + cx(1+e))(1+e)
= ((ax+axe + by+bye)(1+e) + cz+cze)(1+e)
= (ax+by+(ax+by)e + axe+bye+(ax+by)ee + cz+cze)(1+e)
= (ax + bx + 2e(ax+by) + e^2(ax+by) + cz+cze)(1+e)
= ax + by + 2e(ax+by) + e^2(ax+by) + cz + cze + axe + bye + 2e^2(ax+by) + e^3(ax+by) + cze + cze^2
= ax+by+cz + 3e(ax+by) + 2e(cz) + 3e^2(ax+by) + e^2cz + e^3(ax+by)
Now drop all the higher order terms of e. = ax+by+cz + 3e(ax+by) + 2e(cz)
Now also notice that 2e|cz| <= 3e|cz|, so we can say the total error bound: <= (ax + by + cz) + 3e( |a||x| + |b||y| + |c||z| )
And despite the intermediate mess, this suddenly looks very conceptually simple and doesn’t depend on the order of operations. If the input values are all positive, then we can say the error is proportional to 3 times the magnitude of the dot product. And it’s logical too because we stacked 3 math operations, one multiply for each element of the sum and two adds.Sorry if that was way too much detail… I got carried away. :P I glossed over some topics, and there could be mistakes but that’s the gist. I’ve had to do this for my work on a much more complicated example, and it took a few tries. There is a good linear algebra book about this, I think called Accuracy and Stability of Numerical Algorithms (Nicholas Higham). The famous PBR graphics book by Pharr et. al. also talks about error estimation techniques. |
My usecase is testing an autodiff algorithm. So I have larger programs (for which doing this process would be quite cumbersome already), and then run them through a code transformation that makes it compute a gradient. What's an appropriate error bound for that gradient?
Ideally I would even want to be able to randomly generate input programs, differentiate them, and test correctness of the computed gradient. I feel like generalising your approach (in particular the dropping of higher-order e terms) smacks of interval arithmetic, but even with a proper error estimate based on interval arithmetic, one would have to incorporate an estimate of the accuracy of their derivatives, too.
And to make life harder, I'd like to do this in a parallel setting where e.g. reductions (sums, products, etc.) have non-deterministic order to improve parallelism. I don't know how to approach this!