by adrian_b 1340 days ago
For general computational applications (i.e. not for special graphics cases), implementing double-precision operations using single-precision operations is considerably more complicated than implementing quadruple-precision operations using double-precision operations.

The reason is that it is not enough to extend the precision of the 32-bit FP numbers. The exponent range must also be extended. The standard double-precision numbers have an exponent range that is large enough to make underflow and overflow very unlikely in most algorithms. With the very small exponent range of FP32 numbers, underflow and overflow are very likely, and this must be corrected in any double-precision implementation.

So it is not enough to use two FP32 numbers to represent one FP64 number. One must either use a third number for the exponent, or make at least one of the two 32-bit numbers an integer that is partitioned into exponent and significand parts.
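To make the point concrete, here is a minimal sketch of the "two FP32 numbers" idea, emulated in plain Python. The helpers `f32` and `split_two_f32` are my own hypothetical names, and FP32 rounding is emulated with the standard `struct` module rather than real FP32 hardware. It shows that the pair gains precision but keeps the FP32 exponent range.

```python
import struct

def f32(x):
    """Round a Python float (FP64) to the nearest FP32 value (emulation)."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # Finite value too large for FP32: behaves like overflow to infinity.
        return float("inf") if x > 0 else float("-inf")

def split_two_f32(x):
    """Hypothetical helper: approximate x by an unevaluated sum hi + lo of two FP32 values."""
    hi = f32(x)
    lo = f32(x - hi)
    return hi, lo

# Precision: the pair is much closer to x than a single FP32 value.
x = 3.141592653589793
hi, lo = split_two_f32(x)
err_one = abs(x - hi)          # error of one FP32 number
err_two = abs(x - (hi + lo))   # error of the FP32 pair

# Range: the pair does not help, because hi already overflows or underflows.
big_hi, big_lo = split_two_f32(1e40)     # hi is inf
tiny_hi, tiny_lo = split_two_f32(1e-50)  # both parts are 0.0

print(err_one, err_two, big_hi, tiny_hi, tiny_lo)
```

So 1e40 and 1e-50, which are unremarkable FP64 values, cannot be stored in the pair at all unless the exponent is carried separately.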

Both approaches will lead to much more complex algorithms and a much worse speed ratio for FP64 implemented with FP32 vs. FP128 implemented with FP64.

1 comments

It's interesting that you find the idea of "only" being able to represent numbers as small as 10^-38 and as large as 10^38 as having a "very small exponent range."

In deep learning, this is huge! If you have numbers this big, then something is definitely already wrong. If you have numbers that small, then you definitely don't care.

I wonder if deep learning will save us from poorly conditioned linear algebra too.

In physics there are many universal constants or material constants with magnitudes between 10^10 and 10^40, and their reciprocals are between 10^-10 and 10^-40.

Some of these cannot be represented in single precision, while for the others one or two multiplications or divisions are enough to cause underflows and overflows. Such wide ranges are unavoidable in complex physical simulations, because their origin is in the ratios between quantities at human or astronomic sizes and quantities at atomic or molecular sizes.
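As a small illustration, assuming FP32 arithmetic is emulated by rounding every result through the standard `struct` module (the helpers `f32`, `mul32` and `div32` are hypothetical names of mine), a single multiplication of well-known constants already leaves the FP32 range, even when the final result would fit:

```python
import struct

def f32(x):
    """Round a Python float (FP64) to the nearest FP32 value (emulation)."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # Finite value too large for FP32: behaves like overflow to infinity.
        return float("inf") if x > 0 else float("-inf")

def mul32(a, b):
    """Emulated FP32 multiplication: inputs and result rounded to FP32."""
    return f32(f32(a) * f32(b))

def div32(a, b):
    """Emulated FP32 division: inputs and result rounded to FP32."""
    return f32(f32(a) / f32(b))

h = 6.62607015e-34    # Planck constant, J*s
m_e = 9.1093837e-31   # electron mass, kg (approximate)
n_a = 6.02214076e23   # Avogadro constant, 1/mol

# h^2 / m_e is about 4.8e-37, which FP32 can represent ...
exact = h * h / m_e
# ... but the intermediate h^2 (about 4.4e-67) underflows to zero in FP32.
in_f32 = div32(mul32(h, h), m_e)

# One multiplication is enough to overflow: n_a^2 is about 3.6e47.
n_a_sq32 = mul32(n_a, n_a)

print(exact, in_f32, n_a_sq32)
```

The FP64 result is fine, while the emulated FP32 computation silently returns 0 for the first expression and infinity for the second.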

Single-precision values are perfectly adequate to represent the input data and the final results of any computation, because 24 bits is about the limit for any analog-to-digital or digital-to-analog conversion, and the exponent range is also sufficient for the physical quantities that can be measured directly. But when you simulate any semiconductor device, or even just an electrical circuit with discrete components, it is very common to have intermediate results far outside the range that can be represented in single precision, even up to 10^60 or down to 10^-60. When evaluating a high-order polynomial to approximate the solution of some problem, some intermediate values may fall even outside that range.

In theory it is possible to avoid underflows and overflows by introducing a large number of scale factors in the equations, in appropriate places.

However, handling those scale factors in a program is extremely tedious and error-prone. The floating-point format was invented precisely to avoid the need to deal with scale factors. If someone introduces scale factors in a program, they might as well use fixed-point numbers, because the main advantage of floating-point numbers is lost.
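For a sense of what even one scale factor looks like, here is a sketch of the simplest case I can think of: the length of a 2-vector in emulated FP32. The function names are hypothetical and FP32 is emulated through `struct`, so treat it as an illustration under those assumptions, not as how any particular library does it.

```python
import math
import struct

def f32(x):
    """Round a Python float (FP64) to the nearest FP32 value (emulation)."""
    try:
        return struct.unpack("f", struct.pack("f", x))[0]
    except OverflowError:
        # Finite value too large for FP32: behaves like overflow to infinity.
        return float("inf") if x > 0 else float("-inf")

def norm_naive32(a, b):
    """sqrt(a^2 + b^2) with every intermediate rounded to FP32, no scaling."""
    aa = f32(f32(a) * f32(a))
    bb = f32(f32(b) * f32(b))
    return f32(math.sqrt(f32(aa + bb)))

def norm_scaled32(a, b):
    """Same quantity, but divided by a scale factor s first and multiplied back at the end."""
    a, b = f32(a), f32(b)
    s = max(abs(a), abs(b))   # the scale factor
    if s == 0.0:
        return 0.0
    x = f32(a / s)
    y = f32(b / s)
    r = f32(math.sqrt(f32(f32(x * x) + f32(y * y))))
    return f32(s * r)

naive = norm_naive32(3e30, 4e30)    # a^2 is about 9e60, so this overflows
scaled = norm_scaled32(3e30, 4e30)  # stays in range, close to 5e30

print(naive, scaled)
```

That is one scale factor for one two-term expression. In a real simulation every such expression needs its own factor, chosen and undone by hand, which is exactly the bookkeeping the floating-point exponent was supposed to do for you.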