by adrian_b 382 days ago
Not being able to auto-vectorize is not the fault of the IEEE standard, but the fault of those programming languages which do not have ways to express that the order of some operations is irrelevant, so they may be executed concurrently.

Most popular programming languages have the defect that they impose a sequential semantics even where it is not needed. There have been programming languages without this defect, e.g. Occam, but they have not become widespread.

Because nowadays only a relatively small number of users care about computational applications, this defect has not been corrected in any mainline programming language, though for some programming languages there are extensions that can achieve this effect, e.g. OpenMP for C/C++ and Fortran. CUDA is similar to OpenMP, even if it has a very different syntax.

The IEEE standard for floating-point arithmetic has been one of the most useful standards in all history. The reason is that both hardware designers and naive programmers have always had an incentive to cheat in order to obtain better results in speed benchmarks, i.e. to introduce errors in the results in the hope that this will not matter to users, who will be more impressed by the great benchmark results.

There are always users who need correct results more than anything else, and it can even be a matter of life and death. For the very narrow set of uses where correctness does not matter, i.e. mainly graphics and ML/AI, it is better to use dedicated accelerators, GPUs and NPUs, which are designed to prioritize speed over correctness. For general-purpose CPUs, not being fully compliant with the IEEE standard is a serious mistake, because in most cases the consequences of such a choice are impossible to predict, least of all by the people without experience in floating-point computation, who are the most likely to attempt to bypass the standard.

Regarding CUDA, OpenMP and the like, by definition if some operations are parallelizable, then the order of their execution does not matter. If the order matters, then it is impossible to provide guarantees about the results, on any platform. If the order matters, it is the responsibility of the programmer to enforce it, by synchronization of the parallel threads, wherever necessary.

Whoever wants vectorized code should never rely on plain C/C++ and similar languages, but should always use one of the programming language extensions that have been developed for this purpose, e.g. OpenMP, CUDA, OpenCL, where vectorization is not left to chance.
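As a rough illustration of what such an extension looks like, here is a minimal C sketch using OpenMP's `simd` directive with a `reduction` clause, which tells the compiler that the additions into `sum` may be reordered. The function name `dot` is made up for this example. If OpenMP is not enabled at compile time (e.g. `-fopenmp` or `-fopenmp-simd` with GCC or Clang), the pragma is ignored and the loop simply runs sequentially.

```c
#include <stddef.h>

/* Dot product of two float arrays of length n.
 * The reduction clause states that the partial sums may be combined
 * in any order, so the compiler is free to vectorize the loop.
 * The result may therefore differ in the last bits from a strictly
 * sequential sum, depending on the data. */
float dot(const float *a, const float *b, size_t n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

The point of the design is that the permission to reorder is stated by the programmer at one specific loop, instead of being granted globally by a compiler flag.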

2 comments

If you care about absolute accuracy, I'm skeptical you want floats at all. I'm sure it depends on the use case.

Whether it's the standard's fault or the languages' fault for following the standard in terms of preventing auto-vectorization is splitting hairs; the whole point of the standard is to have predictable and usually fairly low-error ways of performing these operations, which only works when the order of operations is defined. That very aim is the problem; to the extent the standard is harmless when ordering guarantees don't exist, you're essentially applying some of those tricky -ffast-math suboptimizations.
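To make the ordering issue concrete, here is a small C sketch showing that the same three values can sum to different results depending on the order of the additions. It assumes IEEE 754 binary64 `double` with round-to-nearest-even and a compiler that is not allowed to reassociate (no -ffast-math); the function names are invented for the example.

```c
#include <stddef.h>

/* Adds x[0], x[1], ..., x[n-1] strictly left to right. */
double sum_forward(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += x[i];
    }
    return s;
}

/* Adds the same values strictly right to left. */
double sum_backward(const double *x, size_t n) {
    double s = 0.0;
    for (size_t i = n; i > 0; --i) {
        s += x[i - 1];
    }
    return s;
}
```

Under those assumptions, for the array `{1.0, 1e16, -1e16}` the forward sum is 0.0 (the 1.0 is lost when it is added to 1e16), while the backward sum is 1.0 (the two large values cancel first). A vectorizer that reorders additions is choosing between such results on the programmer's behalf.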

But to be clear in any case: there are obviously cases where order of operations is relevant enough that accuracy-altering reorderings are not valid. It's just that those are rare enough that for many of these features I'd much prefer that to be the opt-in behavior, not opt-out. There's absolutely nothing wrong with having a classic IEEE 754 mode and I expect it's an essential feature in some niche corner cases.

However, given the obviously huge application of massively parallel processors and algorithms that accept rounding errors (or sometimes conversely overly precise results!), clearly most software is willing to generally accept rounding errors to be able to run efficiently on modern chips. It just so happens that none of the computer languages that rely on mapping floats to IEEE 754 floats in a straightforward fashion are any good at that, which seems like a bad trade-off.

There could be multiple types of floats instead; or code-local flags that delineate special sections that need precise ordering; or perhaps even expressions that clarify how much error the user is willing to accept and then just let the compiler do some but not all transformations; and perhaps even other solutions.

> Most popular programming languages have the defect that they impose a sequential semantics even where it is not needed. There have been programming languages without this defect, e.g. Occam, but they have not become widespread.

We have memory ordering functions to let compilers know the atomic operation preference of the programmer… couldn’t we do the same for maths and in general a set of expressions?

An example of programming language syntax that avoids specifying sequential execution where it is not needed is to specify that a sequence of expressions separated by semicolons must be executed sequentially, but a sequence of expressions separated by commas may be executed in any order or concurrently.
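C has no such comma syntax, but as a loose analogy the same intent can be expressed today with OpenMP `sections`, where each section is declared independent of the others. This is only a sketch of the idea, not the syntax proposed above; `sum_range` and `sum_two_halves` are hypothetical names. Without OpenMP enabled at compile time the pragmas are ignored and the two statements run one after the other.

```c
/* Sum of the integers in [lo, hi). */
static long sum_range(long lo, long hi) {
    long s = 0;
    for (long i = lo; i < hi; ++i) {
        s += i;
    }
    return s;
}

/* Computes the sum of 0..n-1 as two independent halves.
 * The two sections write to different variables, so they may run
 * in either order or concurrently without changing the result. */
long sum_two_halves(long n) {
    long left = 0;
    long right = 0;
    #pragma omp parallel sections
    {
        #pragma omp section
        left = sum_range(0, n / 2);
        #pragma omp section
        right = sum_range(n / 2, n);
    }
    /* The sections construct ends with an implicit barrier,
     * so both halves are complete here. */
    return left + right;
}
```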

This is just a minor change from the syntax of the most popular programming languages, because they typically already specify that the order of evaluation of the expressions used for the arguments of a function, which are separated by commas, can be arbitrary.

Early in its history, the C language came close to specifying this behavior for its comma operator, but unfortunately its designers changed their minds and made the comma operator behave like a semicolon, so that it could be used inside for statement headers, where semicolons have a different meaning. A much better solution for C, instead of giving comma and semicolon the same behavior, would have been to allow a block to appear anywhere an expression is expected, giving it the value of the last expression evaluated in the block.
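For reference, a short C sketch of the three cases discussed here: function arguments (evaluation order unspecified), the comma operator (left operand sequenced before the right), and a block used as an expression. The last one is not standard C; it uses the GNU C "statement expression" extension, which GCC and Clang accept, so it is guarded by `__GNUC__`. All names (`mark`, `order_log`, and so on) are invented for the example.

```c
/* Records the order in which mark() calls happen. */
int order_log[2];
int order_len = 0;

static int mark(int id) {
    if (order_len < 2) {
        order_log[order_len] = id;
    }
    order_len++;
    return id;
}

static int add(int a, int b) {
    return a + b;
}

/* The commas here separate arguments: the compiler may evaluate
 * mark(1) and mark(2) in either order, so order_log is not
 * predictable, only the returned sum is. */
int args_any_order(void) {
    order_len = 0;
    return add(mark(1), mark(2));
}

/* The comma operator: mark(1) is evaluated first and its value is
 * discarded, then mark(2) is evaluated and becomes the result. */
int comma_in_order(void) {
    order_len = 0;
    return (mark(1), mark(2));
}

#ifdef __GNUC__
/* GNU extension, not ISO C: a block in expression position whose
 * value is that of its last expression statement. */
int block_as_expression(void) {
    order_len = 0;
    return ({ mark(1); mark(2); });
}
#endif
```

Calling `comma_in_order()` always leaves `order_log` as `{1, 2}` and returns 2, whereas after `args_any_order()` only the return value 3 is guaranteed.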