Hacker News new | ask | show | jobs
by remexre 605 days ago
Isn't this just taking advantage of "log(x) + log(y) = log(xy)"? The IEEE754 floating-point representation stores floats as sign, mantissa, and exponent -- ignore the first two (you quantitized anyway, right?), and the exponent is just an integer storing log() of the float.
2 comments

Not quite: It's taking advantage of (1+a)(1+b) = 1 + a + b + ab. And where a and b are both small-ish, ab is really small and can just be ignored.

So it turns the (1+a)(1+b) into 1+a+b. Which is definitely not the same! But it turns out, machine guessing apparently doesn't care much about the difference.

You might then as well replace the multiplication by the addition in the original network. In that case you're not even approximating anything.

Am I missing something?

They're applying that simplification to the exponent bits of an 8 bit float. The range is so small that the approximation to multiplication is going to be pretty close.
Plus the 2^-l(m) correction term.

Feels like multiplication shouldn't be needed for convergence, just monotonicity? I wonder how well it would perform if the model was actually trained the same way.

This trick is used a ton when doing hand calculation in engineering as well. It can save a lot of work.

You're going to have tolerance on the result anyway, so what's a little more error. :)

yes. and the next question is 'ok, how do we add'
Yes. I haven't yet read this paper to see what exactly it says is new, but I've definitely seen log-based representations under development before now. (More log-based than the regular floating-point exponent, that is. I don't actually know the argument behind the exponent-and-mantissa form that's been pretty much universal even before IEEE754, other than that it mimics decimal scientific notation.)
I guess that if the bulk of the computation goes into the multiplications, you can work in the log-space and simply sum, and when the time comes to actually do a sum on the original space you can go back and sum.
Not sure how well that would work if you're often adding bias after every layer