| The IEEE floats do not have any serious design issue. They are much better than any other floating-point formats that have ever been used. There have been many examples of badly designed floating-point numbers, e.g. those used by the IBM mainframes and those used by the DEC minicomputers. In both cases, minimizing the costs for computer vendors was prioritized over the properties needed by the end users. In scientific and technical computations, a constant relative error over the entire range of the numbers is optimal. Only logarithms would be better than IEEE floats, but the addition and subtraction of numbers stored as logarithms is too slow, so the IEEE floating-point numbers are a compromise between the speed of multiplication and the speed of addition. Posits redistribute the relative error over the range of numbers, increasing the precision of numbers close to unity by reducing the precision of numbers far from unity. This is a property that may seem desirable for input data and output data, but it is extremely undesirable for the intermediate values that are generated in computations. Even for input data and output data, in engineering there are many characteristic values of parts used in manufacturing, such as electronic components, where constant relative errors are needed over a range greater than 10^9 or even greater than 10^12 (e.g. the values of standard resistors, with a given tolerance, may vary from milliohms to gigaohms and those of standard capacitors from femtofarads to farads; it would not be acceptable to represent such nominal values as posits, with a tolerance depending on the nominal value). There may exist some applications where this is useful, but all such applications are among those that need low precision numbers, of 32 bits or less. So 16-bit or even 32-bit posits might be useful in certain circumstances, but it is pretty certain that 64-bit or bigger posits will always be inferior to IEEE floating-point numbers. 
The problem is that while the IEEE floating-point numbers are either optimal or at least acceptable in almost all applications, the fact that posits might be better only in certain special applications makes it unlikely that the development of dedicated fast hardware for them can be worthwhile. Even if some applications might indeed like a better precision for numbers close to unity, for those applications posits must contend not only with floating-point numbers but also with fixed-point numbers, which do not need any special hardware and can be implemented on any standard CPU with a simple software library. Fixed-point numbers have even better precision than posits for numbers close to unity; posits have a greater range, but their precision drops quickly towards the extremities of that range. So the application domain of posits is squeezed between those of floating-point numbers and those of fixed-point numbers, leaving a very small number of applications where posits are optimal. I am not aware of any application that would justify the additional cost for posit-processing hardware. I think that the only chance for posits would be if someone showed that some posit format is better for certain ML/AI/LLM computations than narrow FP formats like BF16, FP8, FP4. That would be the only use case that could find people willing to pay enough money for the design of posit-processing hardware, in some kind of NPU for training or for inference. Like I have said, for technical/scientific computing posits are much inferior to FP64, graphics applications are happy with FP32 and FP16 and do not have any reason to change, while the applications that need high precision around unity are happy with fixed-point numbers and also do not have any incentive to change. |
It's obvious that IEEE floats have several issues: low precision, excessive dynamic range, underflow to 0 causing infinite loss of precision (ditto for overflow), denormals, non-portability, lack of a total order, redundancy of NaN bit patterns, poor scaling to low precision... I can go on! Whether you regard those as relevant or as unimportant, or perhaps as unavoidable, is your opinion. But they do exist!
To re-iterate: claiming that IEEE floats are superior to any existing or proposed alternative is a claim you can attempt to make, although I disagree with it. But claiming "IEEE floats do not have any problems whatsoever" is simply not a good-faith conversation...
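For anyone who wants to poke at a few of these themselves, here is a minimal Python sketch. It assumes Python floats are IEEE 754 binary64, which holds on all mainstream platforms; the helper name `bits64` is just something made up for this illustration. It only demonstrates that the behaviours exist, not how much they matter.

```python
import math
import struct

def bits64(x: float) -> int:
    """Raw binary64 bit pattern of x (hypothetical helper for this demo)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

nan = float("nan")

# No total order: NaN is neither less than, greater than, nor equal to anything.
nan_unordered = not (nan < 1.0 or nan > 1.0 or nan == nan)

# Two zeros: they compare equal, yet their bit patterns differ.
zeros_equal = (0.0 == -0.0)
zeros_same_bits = (bits64(0.0) == bits64(-0.0))

# Underflow and overflow: nonzero finite inputs, result is 0 or inf.
underflow = 1e-200 * 1e-200
overflow = 1e200 * 1e200

# Redundant NaNs: every pattern with an all-ones exponent and a nonzero
# fraction is a NaN, so binary64 has 2 * (2**52 - 1) of them.
nan_pattern_count = 2 * (2**52 - 1)
other_nan = struct.unpack("<d", struct.pack("<Q", 0x7FF0000000000001))[0]

print(nan_unordered, zeros_equal, zeros_same_bits)
print(underflow, overflow, math.isnan(other_nan), nan_pattern_count)
```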
> In scientific and technical computations, a constant relative error over the entire range of the numbers is optimal.
What could you possibly mean by "optimal"? For instance, inference on a neural network using a 16-bit (or even 8-bit!) posit type tailored to the distribution of the weights in that neural network can yield better results than with 32-bit floats! Obviously floats are not "optimal" in every conceivable situation (nor are posits "more optimal" than floats in every conceivable situation).
Even in "traditional" HPC applications, like weather modelling, experiments have shown 16-bit posits to be acceptable replacements for 32 or even 64-bit computations.
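To make the precision trade-off concrete, here is a rough sketch of a posit decoder in Python. Assumptions: it follows the sign / regime / exponent / fraction layout as I understand it, with `es=2` exponent bits (the value used by the 2022 Posit Standard); the function name `decode_posit` and its parameters are invented for this example, and it is an illustration rather than a reference implementation.

```python
from fractions import Fraction

def decode_posit(bits, n=8, es=2):
    """Decode an n-bit posit pattern to an exact Fraction (None for NaR)."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        return None                      # NaR, the single exceptional value
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask            # two's-complement negation
    body = bits & ((1 << (n - 1)) - 1)   # the n-1 bits after the sign
    # Regime: a run of identical bits, ended by the opposite bit (if any).
    first = (body >> (n - 2)) & 1
    run, i = 0, n - 2
    while i >= 0 and ((body >> i) & 1) == first:
        run += 1
        i -= 1
    k = run - 1 if first else -run
    i -= 1                               # skip the terminating bit
    rem = max(i + 1, 0)                  # bits left for exponent + fraction
    tail = body & ((1 << rem) - 1)
    if rem >= es:
        exp = tail >> (rem - es)
        frac_bits = rem - es
        frac = tail & ((1 << frac_bits) - 1)
    else:
        exp = tail << (es - rem)         # truncated exponent bits count as 0
        frac_bits, frac = 0, 0
    scale = k * (1 << es) + exp
    mantissa = Fraction((1 << frac_bits) + frac, 1 << frac_bits)
    value = mantissa * Fraction(2) ** scale
    return -value if sign else value

# Ratio between neighbouring 8-bit posits near 1 and near the top of the range.
step_near_one = decode_posit(0x41) / decode_posit(0x40)
step_near_max = decode_posit(0x7F) / decode_posit(0x7E)
print(step_near_one, step_near_max)
```

Near 1 the neighbours differ by a factor of 9/8, while at the top of the range they differ by a factor of 16. That is the tapered precision both sides of this argument are talking about; whether it helps or hurts depends on where your values actually live.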
> Like I have said, for technical/scientific computing posits are much inferior to FP64
Repeating something does not make it true :) Like I have said, 32 and 16-bit posits can replace 64 and 32-bit floats in many important applications (obviously not all). HPC and ML workloads are largely memory-bound nowadays; halving the number of bits can yield a doubling of performance, roughly speaking.
> it is pretty certain that 64-bit or bigger posits will always be inferior to IEEE floating-point numbers.
> The problem is that while the IEEE floating-point numbers are either optimal or at least acceptable in almost all applications, the fact that posits might be better only in certain special applications
You assert this repeatedly as dogma, without proof x) I don't get it, do you have a dog in this race somehow? Bizarre.
In fact, the simplest way to see that this is wrong is to consider a "posit"-like format with no regime bits, but with otherwise the same structure: two's-complement representation, no underflow or overflow, deterministic rounding, a quire. This format is essentially an improved version of an IEEE float, without most of its warts, but still with a constant relative error (actually constant, unlike IEEE with its subnormals!), similar hardware encode/decode implementations, etc.
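The subnormal remark is easy to check. A small sketch, assuming Python 3.9+ (for `math.ulp`) and binary64 floats; `relative_spacing` is a name made up for this example:

```python
import math
import sys

def relative_spacing(x: float) -> float:
    """Gap to the next representable float above x, relative to x."""
    return math.ulp(x) / x

# Normal range: the relative gap stays within a factor of two everywhere.
normal = [relative_spacing(x) for x in (1.0, 1.5, 1e-300, 1e300)]

# Subnormal range: the gap is fixed in absolute terms, so the relative
# gap grows as the value shrinks, reaching 100% at the smallest subnormal.
tiny = sys.float_info.min                # smallest normal number, 2**-1022
subnormal = [relative_spacing(x) for x in (tiny / 2**10, tiny / 2**40, 5e-324)]

print(normal)
print(subnormal)
```

So in the normal range the relative gap sits between 2^-53 and 2^-52, but below the smallest normal number it degrades steadily, which is the sense in which the IEEE relative error is not constant over the whole range.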