| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jlgustafson 1698 days ago

I hope I'm not too late to the party to correct some things I see here. The big accomplishment of Kahan and IEEE 754 was to get companies to agree on where the sign, exponent, and fraction should go, so that data interchange finally became possible between different computer brands.

Kahan wanted decimal floats, not binary, and he wanted Extended Precision to be 128, not 80. I've had many hours of conversation with the man about how Intel railroaded that Standard to express the design decisions that had already been made for the i8087 coprocessor. John Palmer, who I also worked with for years, was proud of this, and told me "Whatever the i8087 is, THAT is the IEEE Standard."

Posits have a single exception value, Not a Real (NaR) for all things that fall through the protections of C and Java and all the other modern languages for things like division by zero, and the square root of a negative value. Kahan wanted the quadrillions of Not a Number (NaN) patterns to be used to encode the address of the instruction in the program to pinpoint where it happened, but the support for this in computing languages never happened. By around 2005, vendors noticed they could trap the exceptions and spend hundreds of clock cycles handling them with microcode or software, so the FLOPS claims only applied to normal floats, not subnormals or NaN or infinities, etc. This is true today for all x86 and ARM processors, and SPARC for that matter. Only the POWER series from IBM can still claim to support IEEE 754 in hardware; hardware support for IEEE 754 is all but extinct.

There are over a hundred papers published comparing posits and floats, both for accuracy on applications and difficulty of implementation. LLNL and Oxford U have definitively showed that posits are much more accurate than floats on a range of applications, so much so that a lower (power-of-two) precision can be used. Like 32-bit posits instead of 64-bit floats for shock hydrodynamics, and 16-bit posits instead of 32-bit floats for climate and weather prediction. For signal processing, 16-bit posits are about 10 dB more accurate (less noise) than 16-bit floats, which means they can perform lossless Fast Fourier Transforms (FFTs) on data from 12-bit A-to-D convertors.

For the same precision, posit hardware add/subtract units appear slightly more expensive than float add/subtract, and multiplier units are slightly cheaper for posits than for floats. This echos what was found comparing the speed of the Berkeley SoftFloat emulator with that of Cerlane Leong's SoftPosit emulator. Naive studies say posits are more expensive because they first decode the posit into float subfields, apply time-honored float algorithms, then re-encode the subfields into posit format. This does not exploit the perfect mapping of posits to 2's complement integers.

Float comparison hardware is quite complicated and expensive because there are redundant representation like –0 and +0 that have to test as equal, and redundant NaN exceptions that have to test as not equal even when their bit patterns are identical. Posit comparison hardware is unnecessary because they test exactly the same way as 2's complement integers. NaR is the 2's complement integer that has no absolute value and cannot be negated, 1000...000 in binary. It is equal to itself and less than any real-valued posit.

The name is NaR because IEEE 754 incorrectly states that imaginary numbers are not numbers, and sqrt(–1) returns NaN. The Posit Standard is more careful to say that it is not a _real_.

The Posit Standard is up to Version 4.13 and close to full approval by its Working Group. Don't use any Version 3 or earlier. The one on posithub.org may be out of date. In Version 4, the number of eS bits was fixed at 2, greatly simplifying conversions between different precisions. Unlike floats, posit precision can be changed simply by appending bits or rounding them off, without any need to decode the fraction and the scaling. It's like changing a 16-bit integer to a 32-bit integer; it costs next to nothing, which really helps people right-size the precision they're using.