by dnautics 2782 days ago
Wow! It's kind of a weird feeling to see some research I worked on get some traction in the real world!! The ELMA lookup problem for 32-bit could be fixed by using the posit standard, which just has "simple" adders for the section past the Golomb-encoded section, though you may have to worry about spending transistors on the barrel shifter.

The ELMA LUT problem is in the log -> linear approximation to perform sums in the linear domain. This avoids the issue that LNS implementations have had in the past, which is in trying to keep the sum in the log domain, requiring an even bigger LUT or piecewise approximation of the sum and difference non-linear functions.
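To make the distinction concrete, here is a minimal Python sketch of the two ways to add numbers stored as base-2 logs. It uses ordinary floats rather than fixed point or a LUT, handles positive values only, and the function names are hypothetical (they are not from the paper).

```python
import math

def log_domain_add(a: float, b: float) -> float:
    """Keep the sum in the log domain: log2(2^a + 2^b).

    This needs the nonlinear term log2(1 + 2^(lo - hi)), which is the
    function classic LNS designs tabulate or approximate piecewise.
    """
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))

def linear_domain_accumulate(log_values) -> float:
    """Convert each log value to linear (2^x), then use an ordinary adder.

    Only the log -> linear conversion is nonlinear here; the accumulation
    itself is a plain sum.
    """
    acc = 0.0
    for v in log_values:
        acc += 2.0 ** v  # log -> linear conversion (a LUT in hardware)
    return acc

if __name__ == "__main__":
    print(log_domain_add(3.0, 1.0))
    print(linear_domain_accumulate([3.0, 1.0]))
```

In the second form the only table needed is for 2^x on the fractional part, which is the LUT being discussed here.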

This is independent of any kind of posit or other encoding issue (i.e. it has nothing to do with posits).

(I'm the author)

Thanks for your work!! (And citing us ofc)

Do you think there might be an analytic trick that you could use for larger ELMA numbers that yields semi-accurate results for machine learning purposes? Although to be honest I still think that with a Kulisch FMA and an extra operation for fused exponent add (softmax, e.g.) you can cover most things you'll need 32 bits for with 8.

I've thought of that, but the problem is that it needs to linearly interpolate between the more accurate values, and depending upon how finely grained the linear interpolation is, you would need a pretty big fixed point multiplier to do that interpolation accurately.

If you didn't want to interpolate with an accurate slope, and just use a linear interpolation with a slope of 1 (using the approximations 2^x ~= 1+x and log_2(x+1) ~= x for x \in [0, 1)), then there's the issue that I discuss with the LUTs.
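As a rough illustration of how far the slope-1 approximations stray, this Python sketch samples both of them on [0, 1) and reports the worst absolute error. The function name and the sample count are arbitrary choices for illustration.

```python
import math

def max_slope_one_errors(samples: int = 4096):
    """Worst absolute error of 2^x ~= 1 + x and log2(1 + x) ~= x on [0, 1)."""
    worst_exp = 0.0
    worst_log = 0.0
    for i in range(samples):
        x = i / samples
        worst_exp = max(worst_exp, abs(2.0 ** x - (1.0 + x)))
        worst_log = max(worst_log, abs(math.log2(1.0 + x) - x))
    return worst_exp, worst_log

if __name__ == "__main__":
    exp_err, log_err = max_slope_one_errors()
    print(f"max |2^x - (1+x)|     = {exp_err:.4f}")
    print(f"max |log2(1+x) - x|   = {log_err:.4f}")
```

Both errors peak in the interior of the interval, at a bit under 0.09, and vanish at the endpoints.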

In the paper I mention that you need at least one more bit in the linear domain than the log domain (i.e., the `alpha` parameter in the paper is 1 + log significand fractional precision) for the values to be unique (such that log(linear(log_value)) == log_value) because the slope varies significantly from 1, but if you just took the remainder bits and used that as a linear extension with a slope of 1 (i.e., just paste the remainder bits on the end, and `alpha` == log significand fractional precision), then log(linear(log_value)) != log_value everywhere. Whether this is a real problem is debatable, but it probably has some effect on numerical stability if you don't preserve the identity.
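The uniqueness argument can be checked with a small Python sketch. It is not the paper's LUT construction: it assumes round-to-nearest conversions computed in floating point, and the function and parameter names are hypothetical. It counts how many log fractions fail the round trip log(linear(log_value)) == log_value for a given number of linear fractional bits.

```python
import math

def round_trip_failures(log_frac_bits: int, alpha: int) -> int:
    """Count log fractions k / 2^log_frac_bits whose round trip fails.

    linear: round 2^x to `alpha` fractional bits (round to nearest).
    log:    round log2(linear) back to `log_frac_bits` fractional bits.
    """
    failures = 0
    log_scale = 1 << log_frac_bits
    lin_scale = 1 << alpha
    for k in range(log_scale):
        x = k / log_scale                      # log-domain fraction in [0, 1)
        lin = round(2.0 ** x * lin_scale)      # fixed-point linear significand
        back = round(math.log2(lin / lin_scale) * log_scale)
        if back != k:
            failures += 1
    return failures

if __name__ == "__main__":
    f = 4
    print("alpha = f + 1:", round_trip_failures(f, f + 1))
    print("alpha = f    :", round_trip_failures(f, f))
```

Under these assumptions the count is zero with one extra linear bit, while with `alpha` equal to the log precision neighbouring log values near x = 0 collide on the same linear value (the slope there is about ln 2 ≈ 0.69 grid steps per log step), so some round trips must fail.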

Based on my tests I'm skeptical about training in 8 bits for general problems even with the exact linear addition; it doesn't work well. If you know what the behavior of the network should be, then you can tweak things enough to make it work (as people can do today with simulated quantization during training, or with int8 quantization for instance), but generally today when someone tries something new and it doesn't work, they tend to blame their architecture rather than the numerical behavior of IEEE 754 binary32 floating point. There are some things even today in ML (e.g., Poincaré embeddings) that can have issues even at 32 bits (in both dynamic range and precision). It would be a lot harder to know what the problem is in 8 bits when everything is under question if you don't know what the outcome should be.

This math type can and should also be used for many more things than neural network inference or training though.

> It would be a lot harder to know what the problem is in 8 bits when everything is under question if you don't know what the outcome should be.

I might have a solution for that: I work on methods to both quantify the impact of your precision on the result and locate the sections of your code that introduced the significant numerical errors (as long as your numeric representation respects the IEEE standard).

However, my method is designed to test or debug the numerical stability of a code, not to be used in production (as it impacts performance).

None of the representations considered in the paper (log or linear posit or log posit) respect the IEEE standard, deliberately so :)

You drop denormals and change the distribution, but do you keep the 0.5 ULP (round to nearest) guarantee from the IEEE standard? And are your rounding errors exact numbers in your representation (can you build Error Free Transforms)?