>It's obvious that IEEE floats have several issues: low precision, excessive dynamic range, underflow to 0 causing infinite loss of precision (ditto for overflow), denormals, non-portability, lack of a total order, redundancy of NaN bit patterns, poor scaling to low precision... I can go on!

From your list of supposed defects, only one can be considered a defect, namely that there are too many NaN values. However, there is a justification for this: using fewer values as NaNs would have required more expensive hardware for detecting which operands are NaNs.
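
To illustrate the point about cheap NaN detection, here is a minimal Python sketch for FP64. The helper names (`bits_to_double`, `is_nan_bits`) are made up for this example. The test is just "exponent field all ones and fraction nonzero", which is why so many bit patterns end up being NaNs.

```python
import math
import struct

def bits_to_double(bits: int) -> float:
    """Reinterpret a 64-bit integer as an IEEE 754 binary64 value."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def is_nan_bits(bits: int) -> bool:
    """NaN test on the raw encoding: exponent all ones, fraction nonzero."""
    exponent = (bits >> 52) & 0x7FF
    fraction = bits & ((1 << 52) - 1)
    return exponent == 0x7FF and fraction != 0

# Distinct binary64 NaN encodings: 2 signs * (2**52 - 1) nonzero fractions.
NAN_COUNT = 2 * (2**52 - 1)

# Three different NaN encodings, and the one infinity encoding next to them.
samples = [0x7FF8000000000000, 0x7FF0000000000001, 0xFFFFFFFFFFFFFFFF]
all_nan = all(is_nan_bits(b) and math.isnan(bits_to_double(b)) for b in samples)
inf_is_not_nan = not is_nan_bits(0x7FF0000000000000)
```

A format with a single NaN encoding would need a full-width comparison instead of this simple field check.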

> low precision, excessive dynamic range

This is a single item, because choosing the dynamic range determines the precision and vice versa.

Contrary to your claim, a greater dynamic range was one of the greatest improvements of the IEEE 754 format over the earlier FP formats, like that of IBM and that of DEC.

With a lower dynamic range, overflows in intermediate computation results become extremely frequent and unavoidable. This was a major problem before the Intel 8087 and the IEEE 754 standard. The dynamic range of the IEEE FP64 format is great enough to make overflows very unlikely in typical technical/engineering computations, and this is a very desirable property.
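
As a rough illustration of the difference in range, the largest finite value of a binary IEEE format with p significand bits and maximum exponent emax is (2 - 2^(1-p)) * 2^emax. The sketch below uses the standard parameters for FP32 (p = 24, emax = 127) and FP64 (p = 53, emax = 1023); the function name and the sample value are just for the example.

```python
import sys

def largest_finite(precision_bits: int, emax: int) -> float:
    """Largest finite value of a binary format: (2 - 2**(1 - p)) * 2**emax."""
    return (2.0 - 2.0 ** (1 - precision_bits)) * 2.0 ** emax

FP32_MAX = largest_finite(24, 127)     # about 3.4e38
FP64_MAX = largest_finite(53, 1023)    # about 1.8e308

# An intermediate result such as the square of 1e25 is already beyond the
# FP32 range, while it is nowhere near the FP64 limit.
intermediate = 1e25 * 1e25
overflows_fp32 = intermediate > FP32_MAX
fits_fp64 = intermediate < FP64_MAX
```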

> underflow to 0 causing infinite loss of precision

I do not know why you have written this false statement, but there exists no such thing in the IEEE standard for floating-point arithmetic.

Underflows have only two standard behaviors: they either generate exceptions that must be handled by the programmer, or they generate denormal numbers, which minimize the loss of precision. It is impossible to have "infinite loss of precision" on underflows in a standard-conforming processor.
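
A small Python sketch of the second behavior (gradual underflow), assuming the usual case where Python floats are IEEE FP64 with the default non-trapping underflow handling:

```python
import sys

smallest_normal = sys.float_info.min                        # 2**-1022
smallest_denormal = smallest_normal * sys.float_info.epsilon  # 2**-1074

# Going below the normal range does not jump to zero: the result is a
# denormal that keeps as many significand bits as still fit.
half_of_min = smallest_normal / 2.0

# With denormals, the difference of two distinct numbers is never zero.
x = 1.5 * smallest_normal
y = smallest_normal
difference = x - y          # equals 0.5 * smallest_normal, a denormal

# Only below the smallest denormal does the result finally round to zero.
below_everything = smallest_denormal / 2.0
```

Without denormals, `difference` would be flushed to zero even though `x != y`.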

> ditto for overflow

Overflows have two possible standard behaviors: they can generate either exceptions or infinities. Both behaviors allow the programmer to detect that there is a bug in the program, which must be fixed. There is no method of handling overflow that can avoid the loss of precision (when using fixed-length numbers), so the only thing that can be done, and the standard does it, is to ensure that the user is made aware that an overflow happened.
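
Both styles can be seen from Python, as a sketch. Plain float arithmetic uses the non-trapping default and returns an infinity, while `math.exp` reports the overflow as a Python `OverflowError` (this is Python's own exception, not a hardware trap, but the effect for the programmer is similar).

```python
import math
import sys

# Non-trapping behavior: the overflowed product becomes +infinity.
overflowed = sys.float_info.max * 2.0

# The infinity does not silently disappear in later computations.
still_inf = overflowed / 3.0 + 1.0
became_nan = overflowed - overflowed    # inf - inf is an invalid operation

# Exception behavior: the overflow is reported immediately.
try:
    math.exp(1000.0)
    raised = False
except OverflowError:
    raised = True
```

Either way the final result cannot be mistaken for a valid finite number.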

> denormals

Denormals are an optional feature of the standard, like infinities and NaNs. They can be completely avoided by enabling the underflow exception.

Denormals offer a choice to the programmer, to avoid handling the underflow exceptions. If the programmer chooses to use denormals, they minimize the loss of precision when underflows happen.

> non-portability

Huh?!

The IEEE FP numbers are the most portable FP format known in history. Before this standard, every computer manufacturer had its own FP formats, which were incompatible with the others. The conversion of numeric data between different computers was a very complex problem.

Now this is far in the past. Even if some processors, mainly GPUs, do not support all features of the standard, at least the numeric formats are the same everywhere, so no conversions are required.

> poor scaling to low precision

This has nothing to do with the IEEE standard. Any kind of floating-point number must scale poorly towards very low precision, because when very few bits are available for the complete number, even fewer bits can be used for the significand and for the exponent, which makes the tradeoff between precision and dynamic range difficult.

Even so, the IEEE standard FP16 format has adequate precision and dynamic range for its main application, which is storing color component values in pixels, in graphics and video applications.
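
A quick way to check this for 8-bit color levels, as a sketch: Python's `struct` module supports the IEEE FP16 format through the `"e"` format code, so a value can be rounded to FP16 by packing and unpacking it. The helper name `to_fp16` is made up for this example.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE FP16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

FP16_MAX = to_fp16(65504.0)                 # largest finite FP16 value
FP16_MIN_DENORMAL = to_fp16(2.0 ** -24)     # smallest positive FP16 value

# Store all 256 normalized 8-bit color levels (0/255 .. 255/255) in FP16.
levels = [i / 255 for i in range(256)]
stored = [to_fp16(v) for v in levels]

# Worst rounding error, to be compared with the level spacing of 1/255.
worst_error = max(abs(s - v) for s, v in zip(stored, levels))
all_distinct = len(set(stored)) == 256
```

The worst rounding error stays below 2^-11, about 0.0005, while adjacent levels are about 0.0039 apart, so no two levels collapse into one.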

For ML/AI applications, where even lower precision is desired, 8-bit to 4-bit floating-point numeric formats have become preferred to fixed-point numbers, despite the "poor scaling" of floating-point numbers to low precision.

It should also be noted that when FP numbers are scaled to low precision, the use of denormals becomes absolutely necessary to avoid the loss of precision around zero. That is why, more than a decade before the IEEE 754 standard, denormals were used in the 8-bit floating-point numbers that encode audio samples in digital telephony, in the USA/Japan and in Europe (mu-law and A-law). In computing, denormals were first used by the Intel 8087, but they were already established in low-precision floating-point numbers.

So your list contains no defect of the IEEE standard floating-point numbers, especially not in comparison with other FP number formats.

The only point that is specific to the IEEE format, and about which one could argue for some specific application, is the choice of the dynamic ranges, which for FP64 and FP128 are greater than those of the older FP formats that were in use before 1980.

I started using computers on IBM mainframes and DEC minicomputers, so I have practical experience with those older formats.

Switching to IBM PCs with the 8087 and its successors was a great improvement, because it eliminated the problem of overflows. On older computers, in order to avoid overflows it was frequently necessary to introduce well-chosen scale factors in various formulae and equations. The necessity of handling those scale factors removed much of the advantage that floating-point numbers have over fixed-point numbers. Floating-point numbers were invented precisely to free programmers from the chore of dealing with scale factors.

For solving practical engineering problems, like the design of electronic devices or integrated circuits, the dynamic range of IEEE FP64 numbers is good, while the dynamic ranges of the older FP formats (and also that of IEEE FP32 numbers) are insufficient.