Hacker News new | ask | show | jobs
by StefanKarpinski 382 days ago
Pretty sure that’s not possible. More accurate for some inputs will be less accurate for others. There’s a very tricky tension in float optimization that the most predictable operation structure is a fully skewed op tree, as in naive left-to-right summation, but this is the slowest and least accurate order of operations. Using a more balanced tree is faster and more accurate (great), but unfortunately which tree shape is fastest depends very much on hardware-specific factors like SIMD width (less great). And no tree shape is universally guaranteed to be fully accurate, although a full binary tree tends to have the best accuracy, but has bad base case performance, so the actual shape that tends to get used in high performance kernels is SIMD-width parallel in a loop up to some fixed size like 256 elements, then pairwise recursive reduction above that. The recursive reduction can also be threaded. Anyway, there’s no silver bullet here.
1 comments

I think a restricted version might be possible to implement: only allow transformations if the transformed version has strictly fewer numerical rounding errors on some inputs. This will usually only mean canceling terms and collecting expressions like "x+x+x" into 3x.

In general, rules that allow fewer transformations are probably easier to understand and use. Trying to optimize everything is where you run into trouble.