Hacker News
by emn13 389 days ago
I get the feeling that the real problem here are the IEEE specs themselves. They include a huge bunch of restrictions that each individually aren't relevant to something like 99.9% of floating point code, and probably even in aggregate not a single one is relevant to a large majority of code segments out in the wild. That doesn't mean they're not important - but some of these features should have been locally opt-in, not opt-out. And at the very least, standards need to evolve to support hardware realities of today.

Not being able to auto-vectorize seems like a pretty critical bug given hardware trends that have been going on for decades now; on the other hand sacrificing platform-independent determinism isn't a trivial cost to pay either.

I'm not familiar with the details of OpenCL and CUDA on this front - do they have some way to guarantee a specific order-of-operations such that code always has a predictable result on all platforms and nevertheless parallelizes well on a GPU?


Not being able to auto-vectorize is not the fault of the IEEE standard, but the fault of those programming languages which do not have ways to express that the order of some operations is irrelevant, so they may be executed concurrently.

Most popular programming languages have the defect that they impose a sequential semantics even where it is not needed. There have been programming languages without this defect, e.g. Occam, but they have not become widespread.

Because nowadays only a relatively small number of users care about computational applications, this defect has not been corrected in any mainline programming language, though for some programming languages there are extensions that can achieve this effect, e.g. OpenMP for C/C++ and Fortran. CUDA is similar to OpenMP, even if it has a very different syntax.

The IEEE standard for floating-point arithmetic has been one of the most useful standards in all history. The reason is that both hardware designers and naive programmers have always had the incentive to cheat in order to obtain better results in speed benchmarks, i.e. to introduce errors in the results with the hope that this will not matter for users, who will be more impressed by the great benchmark results.

There are always users who need correct results more than anything else, and it can even be a matter of life and death. For the few narrow uses where correctness does not matter, i.e. mainly graphics and ML/AI, it is better to use dedicated accelerators, GPUs and NPUs, which are designed by prioritizing speed over correctness. For general-purpose CPUs, not being fully compliant with the IEEE standard is a serious mistake, because in most cases the consequences of such a choice are impossible to predict, especially by the people without experience in floating-point computation who are the most likely to attempt to bypass the standard.

Regarding CUDA, OpenMP and the like, by definition if some operations are parallelizable, then the order of their execution does not matter. If the order matters, then it is impossible to provide guarantees about the results, on any platform. If the order matters, it is the responsibility of the programmer to enforce it, by synchronization of the parallel threads, wherever necessary.

Whoever wants vectorized code should never rely on programming languages like C/C++ and the like, but they should always use one of the programming language extensions that have been developed for this purpose, e.g. OpenMP, CUDA, OpenCL, where vectorization is not left to chance.
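
To make the point about enforcing order concrete: one way a programmer can pin down the order of a parallel reduction is to fix the chunking and the combine order up front, so the result no longer depends on how many workers happen to run. A minimal Python sketch, assuming a fixed chunk size (`CHUNK`, `chunk_sum` and `reproducible_sum` are illustrative names, not any library's API):

```python
from concurrent.futures import ThreadPoolExecutor
import random

CHUNK = 1024  # assumed fixed chunk size: part of the algorithm, not of the machine

def chunk_sum(chunk):
    # Fixed left-to-right order inside one chunk.
    total = 0.0
    for x in chunk:
        total += x
    return total

def reproducible_sum(values, workers):
    # The chunk boundaries depend only on CHUNK, never on the worker count.
    chunks = [values[i:i + CHUNK] for i in range(0, len(values), CHUNK)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in input order, whichever chunk finishes first.
        partials = list(pool.map(chunk_sum, chunks))
    # Combine the partial sums in the same fixed order every time.
    return chunk_sum(partials)

rng = random.Random(0)
data = [rng.uniform(-1e6, 1e6) for _ in range(100_000)]
results = {reproducible_sum(data, w) for w in (1, 2, 8)}
print(len(results))  # 1: identical bits for every worker count
```

The answer may differ from a plain left-to-right sum over the whole list, but it is the same for every worker count, which is roughly the kind of guarantee the question upthread asks about.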

If you care about absolute accuracy, I'm skeptical you want floats at all. I'm sure it depends on the use case.

Whether it's the standard's fault or the language's fault for following the standard in terms of preventing auto-vectorization is splitting hairs; the whole point of the standard is to have predictable and usually fairly low-error ways of performing these operations, which only works when the order of operations is defined. That very aim is the problem; to the extent the standard is harmless when ordering guarantees don't exist, you're essentially applying some of those tricky -ffast-math sub-optimizations.

But to be clear in any case: there are obviously cases whereby order-of-operations is relevant enough and accuracy altering reorderings are not valid. It's just that those are rare enough that for many of these features I'd much prefer that to be the opt-in behavior, not opt-out. There's absolutely nothing wrong with having a classic IEEE 754 mode and I expect it's an essential feature in some niche corner cases.

However, given the obviously huge application of massively parallel processors and algorithms that accept rounding errors (or sometimes conversely overly precise results!), clearly most software is willing to generally accept rounding errors to be able to run efficiently on modern chips. It just so happens that none of the computer languages that rely on mapping floats to IEEE 754 floats in a straightforward fashion are any good at that, which seems like a bad trade-off.

There could be multiple types of floats instead; or code-local flags that delineate special sections that need precise ordering; or perhaps even expressions that clarify how much error the user is willing to accept and then just let the compiler do some but not all transformations; and perhaps even other solutions.

> Most popular programming languages have the defect that they impose a sequential semantics even where it is not needed. There have been programming languages without this defect, e.g. Occam, but they have not become widespread.

We have memory ordering functions to let compilers know the atomic operation preference of the programmer… couldn’t we do the same for maths and in general a set of expressions?

An example of programming language syntax that avoids specifying sequential execution where it is not needed is to specify that a sequence of expressions separated by semicolons must be executed sequentially, but a sequence of expressions separated by commas may be executed in any order or concurrently.

This is just a minor change from the syntax of the most popular programming languages, because they typically already specify that the order of evaluation of the expressions used for the arguments of a function, which are separated by commas, can be arbitrary.

Early in its history, the C language was close to specifying this behavior for its comma operator, but unfortunately its designers changed their minds and made the comma operator behave like a semicolon, in order to be able to use it inside for statement headers, where the semicolons have a different meaning. A much better solution for C, instead of making both comma and semicolon have the same behavior, would have been to allow a block to appear in any place where an expression is expected, giving it the value of the last expression evaluated in the block.

The precise requirements of IEEE-754 may not be important for any given program, but as long as you want your numbers to have any form of well-defined semantics beyond "numbers exist, and here's a list of functions that do Something™ that may or may not be related to their name", any number format that's capable of (approximately) storing both 10^20 and 10^-20 in 64 bits is gonna have those drawbacks.

AFAIK GPU code is basically always written as scalar code acting on each "thing" separately, that's, as a whole, semantically looped over by the hardware, same way as multithreading would (i.e. no order guaranteed at all), so you physically cannot write code that'd need operation reordering to vectorize. You just can't write an equivalent to "for (each element in list) accumulator += element;" (or, well, you can, by writing that and running just one thread of it, but that's gonna be slower than even the non-vectorized CPU equivalent (assuming the driver respects IEEE-754)).

A CUDA "kernel" is the same thing as what has been called "parallel DO" or "parallel FOR" since 1963, or perhaps even earlier.

This is slightly obfuscated by not using a keyword like "for" or "do", by the fact that the body of the loop (the "kernel") is written in one place and the header of the loop (which gives the ranges for the loop indices) is written in another place, and by the fact that the loop indices have standard names.

A "parallel for" may just as well have a syntax identical to that of a sequential "for". The difference is that for the "parallel for" the compiler knows that the iterations are independent, so they may be scheduled to be executed concurrently.

NVIDIA has always been greatly annoying in inventing a huge number of new terms that are just new words for old terms that have been used for decades in the computing literature, with no apparent purpose except obfuscating how their GPUs really work. Worse, AMD has imitated NVIDIA by inventing their own terms that correspond to those used by NVIDIA, but they are once again different.
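
For what it's worth, the "kernel plus index range" shape of a parallel for can be sketched in plain Python. This is only an illustration with made-up names (`saxpy_kernel`, `parallel_for`); a thread pool stands in for the hardware scheduler, and none of this is how CUDA itself is implemented:

```python
from concurrent.futures import ThreadPoolExecutor

def saxpy_kernel(i, a, x, y, out):
    # Loop body (the "kernel"): reads and writes only element i,
    # so the iterations are independent of each other.
    out[i] = a * x[i] + y[i]

def parallel_for(n, body, *args, workers=4):
    # Loop header: the index range lives here, separate from the body.
    # The iterations may run in any order or concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(lambda i: body(i, *args), range(n)))

n = 1000
x = [float(i) for i in range(n)]
y = [1.0] * n
out = [0.0] * n
parallel_for(n, saxpy_kernel, 2.0, x, y, out)

# Same per-element arithmetic as a sequential for, so the results match exactly.
expected = [2.0 * x[i] + y[i] for i in range(n)]
print(out == expected)  # True
```

No reordering question comes up here because no iteration depends on another; the trouble only starts with reductions such as a running sum.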

xargs does a parallel for too. And of course Forth people could probably do that too in a breeze.
That's right and the same is done by the improved version of xargs, GNU "parallel".
How does IEEE 754 prevent auto-vectorisation?
The spec doesn’t prevent auto-vectorization, it only says the language should avoid it when it wants to opt in to producing “reproducible floating-point results” (section 11 of IEEE 754-2019). Vectorizing can be implemented in different ways, so whether a language avoids vectorizing in order to opt in to reproducible results is implementation dependent. It also depends on whether there is an option to not vectorize. If a language only had auto-vectorization, and the vectorization result was deterministic and reproducible, and if the language offered no serial mode, this could adhere to the IEEE spec. But since C++ (for example) offers serial reductions in debug & non-optimized code, and it wants to offer reproducible results, then it has to be careful about vectorizing without the user’s explicit consent.
If you write a loop `for x in array { sum += x }` Then your program is a specification that you want to add the elements in exactly that order, one by one. Vectorization would change the order.
The bigger problem there is the language not offering a way to signal the author’s intent. If an author doesn’t care about the order of operations in a sum, they will still write the exact same code as the author who does care. This is a failure of the language to be expressive enough, and doesn’t reflect on the IEEE spec. (The spec even does suggest that languages should offer and define these sorts of semantics.) Whether the program is specifying an order of operations is lost when the language offers no way for a coder to distinguish between caring about order and not caring. This is especially difficult since the vast majority of people don’t care and don’t consider their own code to be a specification on order of operations. Worse, most people would even be surprised and/or annoyed if the compiler didn’t do certain simplifications and constant folding, which change the results. The few cases where people do care about order can be extremely important, but they are rare nonetheless.
Yup, because of the imprecision of floating point, you cannot just assume that “(a + c) + (b + d)” is the same as “a + b + c + d”.
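
A small Python sketch of exactly that regrouping, assuming IEEE 754 doubles (which is what CPython floats are on common platforms). The two-lane split below mimics what a 2-wide vectorized sum would do; the values are hand-picked to make the difference visible, and the function names are illustrative:

```python
def sequential_sum(xs):
    # The order the source loop literally specifies: ((a + b) + c) + d.
    total = 0.0
    for x in xs:
        total += x
    return total

def lanewise_sum(xs, lanes=2):
    # What a `lanes`-wide vectorized reduction effectively does:
    # one accumulator per lane, combined at the end.
    acc = [0.0] * lanes
    for i, x in enumerate(xs):
        acc[i % lanes] += x
    total = 0.0
    for partial in acc:
        total += partial
    return total

a, b, c, d = 1e16, 1.0, -1e16, 1.0
values = [a, b, c, d]

print(sequential_sum(values))  # 1.0: the first 1.0 is lost when added to 1e16
print(lanewise_sum(values))    # 2.0: computed as (a + c) + (b + d)
```

Neither answer is "the" IEEE result; each grouping is rounded correctly at every step, they are just different computations.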

It would be pretty ironic if at some point fixed point / bignum implementations end up being faster because of this.

They are, just check anything fixed-point for the 486SX vs anything floating-point under a 486DX. It's faster to scale, sum, and print at the desired precision than to operate on floats.
Is that also the case for modern architectures? Eg is there SIMD fixed precision?
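
Leaving speed aside, the property that matters for vectorization is that integer addition is associative, so a sum over scaled integers can be regrouped freely. A minimal Python sketch, assuming a made-up fixed-point format with two decimal digits (`SCALE` and `to_fixed` are illustrative, and overflow is ignored because Python integers are unbounded):

```python
from itertools import permutations

SCALE = 100  # assumed fixed-point format: two decimal digits

def to_fixed(x):
    # Convert to an integer count of 1/SCALE units.
    return round(x * SCALE)

values = [0.1, 0.2, 0.3]
fixed = [to_fixed(v) for v in values]  # [10, 20, 30]

# Sum the floats in every possible order.
float_sums = set()
for order in permutations(values):
    total = 0.0
    for v in order:
        total += v
    float_sums.add(total)

# Sum the fixed-point integers in every possible order.
fixed_sums = {sum(order) for order in permutations(fixed)}

print(len(float_sums) > 1)  # True: the float result depends on the order
print(fixed_sums)           # {60}: the integer result does not
```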
I wonder... couldn't there just be some library type for this, e.g. `associative::float` and `associative::double` and such (in C++ terms), so that compilers can ignore non-associativity for actions on values of these types? Or attributes one can place on variables to force assumption of associativity?
IIRC reordering additions can cause the result to change which makes auto-vectorisation tricky.
Floating point arithmetic is neither commutative nor associative so you shouldn’t.
While it is technically correct to say this, it also gets the wrong point across because it leaves out the fact that ordering changes create only a small difference. Other examples where arithmetic is not commutative, e.g. matrix multiplication, can create much larger differences.
> ordering changes create only a small difference.

That can’t be assumed.

You can easily fall into a situation like:

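  large_float_value = 1e18  # assumed value; any float whose spacing exceeds 0.02 behaves the same
  total = large_float_value
  for _ in range(1_000_000_000):
    total += .01  # each addition rounds straight back to total
  assert total == large_float_value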
Without knowing the specific situation, it’s impossible to say whether that’s a tolerably small difference.
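
A scaled-down version of the same effect that finishes instantly, again assuming IEEE 754 doubles. The magnitudes (`1e16`, `0.01`, 1000 terms) are arbitrary choices for illustration; `math.fsum` is the standard-library routine that returns the correctly rounded sum:

```python
import math

large = 1e16   # adjacent doubles are 2.0 apart at this magnitude
small = 0.01
count = 1000

# Large value first: every small addend is rounded away.
naive = large
for _ in range(count):
    naive += small

# Small values first: they accumulate to about 10 before meeting the large one.
small_total = 0.0
for _ in range(count):
    small_total += small
small_first = large + small_total

# Correctly rounded sum of all the terms.
exact = math.fsum([large] + [small] * count)

print(naive == large)       # True: all 1000 additions were lost
print(small_first - large)  # 10.0
print(exact - large)        # 10.0
```

Here reordering changes the answer by the entire contribution of the small terms, so how small "a small difference" is depends on the data.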
Floating-point arithmetic is non-associative, but it is commutative for the operations that are algebraically commutative: x + y == y + x and x*y == y*x. And x - y = -(y - x) so subtraction is properly anti-commutative.

The only very marginal exception to this is that when both arguments are NaN, the return value will be NaN, but which NaN payload is returned can depend on argument order. But no one ever uses this because it's not specified, so it can't be used reliably for anything useful. The behavior I wish IEEE 754 had specified for this is to define a standard NaN value (or two), and when the return value of an op is NaN, and some of the arguments are non-standard NaNs, then one of those non-standard NaN values must be returned. This doesn't depend on argument order and allows NaN payloads to be reliably propagated, which would let you encode useful debugging information in NaN payloads and know that it will flow through the program.

IEEE-754 addition and multiplication are commutative. They aren't distributive, though.
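
A quick Python check of those three properties, assuming IEEE 754 doubles. The random pairs are only a spot check of commutativity, not a proof, and the counterexamples for the other two properties are hand-picked values:

```python
import random

# Commutativity: swapping operands gives the same bits for non-NaN inputs.
rng = random.Random(42)
pairs = [(rng.uniform(-1e9, 1e9), rng.uniform(-1e9, 1e9)) for _ in range(10_000)]
commutative = all(x + y == y + x and x * y == y * x for x, y in pairs)
print(commutative)  # True

# Associativity fails: 1e16 + 1.0 rounds back to 1e16, but 1e16 + 2.0 is exact.
a, b, c = 1e16, 1.0, 1.0
print((a + b) + c == a + (b + c))  # False

# Distributivity fails: the 1.0 is lost before scaling on the left-hand side,
# but survives as 3.0 on the right-hand side.
k, p, q = 3.0, 1e16, 1.0
print(k * (p + q) == k * p + k * q)  # False
```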
Why is it not commutative?
It actually is commutative according to IEEE-754, except that in the case of a NaN result you might get a different NaN representation.
having multiple NaNs and no spec for how they should behave feels like such an unforced error to me
For mathematical use, NaN payloads shouldn’t matter, and behave identically (aside from quiet vs. signaling NaNs). It also doesn’t matter for equality comparison, because NaNs always compare unequal.
> I get the feeling that the real problem here are the IEEE specs themselves.

Well, all standards are bad when you really get into them, sure.

But no, the problem here is that floating point code is often sensitive to precision errors. Relying on rigorous adherence to a specification doesn't fix precision errors, but it does guarantee that software behavior in the face of them is deterministic. Which 90%+ of the time is enough to let you ignore the problem as a "tuning" thing.

But no, precision errors are bugs. And the proper treatment for bugs is to fix the bugs and not ignore them via tricks with determinism. But that's hard, as it often involves design decisions and complicated math (consider gimbal lock: "fixing" that requires understanding quaternions or some other orthogonal orientation space, and that's hard!).

So we just deal with it. But IMHO -ffast-math is more good than bad, and projects should absolutely enable it, because the "problems" it discovers are bugs you want to fix anyway.
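
On the gimbal lock aside: the lost degree of freedom is easy to see numerically. A small Python sketch, assuming the common yaw-pitch-roll (Z-Y-X) Euler convention; `rot_zyx` and `close` are illustrative helpers written for this example:

```python
import math

def rot_zyx(yaw, pitch, roll):
    # Rotation matrix R = Rz(yaw) * Ry(pitch) * Rx(roll).
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ]

def close(m1, m2, tol=1e-12):
    # Element-wise comparison with a small absolute tolerance.
    return all(math.isclose(u, v, abs_tol=tol)
               for row1, row2 in zip(m1, m2)
               for u, v in zip(row1, row2))

# At pitch = 90 degrees only (roll - yaw) matters: shifting both by the
# same amount gives the same rotation, so one degree of freedom is gone.
locked = math.pi / 2
print(close(rot_zyx(0.3, locked, 0.5), rot_zyx(0.5, locked, 0.7)))  # True

# Away from the singularity the same shift gives a different rotation.
print(close(rot_zyx(0.3, 0.1, 0.5), rot_zyx(0.5, 0.1, 0.7)))  # False
```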

> (consider gimbal lock: "fixing" that requires understanding quaternions or some other orthogonal orientation space, and that's hard!)

Or just avoiding gimbal lock by other means. We went to the moon using Euler angles, but I don't suppose there's much of a choice when you're using real mechanical gimbals.

That is the "tuning" solution. And mostly it works by limiting scope of execution ("just don't do that") and if that doesn't work by having some kind of recovery method ("push this button to reset", probably along with "use this backup to recalibrate"). And it... works. But the bug is still a bug. In software we prefer more robust techniques.

FWIW, my memory is that this was exactly what happened with Apollo 13. It lost its gyro calibration after the accident (it did the thing that was the "just don't do that") and they had to do a bunch of iterative contortions to recover it from things like the sun position (because they couldn't see stars out the iced-over windows).

NASA would have strongly preferred IEEE doubles and quaternions, in hindsight.