
by grandmczeb 2782 days ago

Here's the bottom line for anyone who doesn't want to read the whole article.

> Using a commercially available 28-nanometer ASIC process technology, we have profiled (8, 1, 5, 5, 7) log ELMA as 0.96x the power of int8/32 multiply-add for a standalone processing element (PE).

> Extended to 16 bits this method uses 0.59x the power and 0.68x the area of IEEE 754 half-precision FMA

In other words, interesting but not earth shattering. Great to see people working in this area though!

5 comments

jhj 2782 days ago

At least 69% more multiply-add flops at the same power iso-process is nothing to sneeze at (we're largely power/heat bound at this point), and unlike normal floating point (IEEE or posit or whatever), multiplication, division/inverse and square root are more or less free power, area and latency-wise. This is not a pure LNS or pure floating point because it is a hybrid of "linear" floating point (FP being itself hybrid log/linear, but the significand is linear) and LNS log representations for the summation.
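As a rough illustration of the two claims above (a sketch, not the paper's ELMA datapath): 0.59x power per operation works out to about 1.69x the operations at the same power, and in a log representation multiply, divide and square root reduce to an integer add, subtract and shift. `FRAC_BITS` and all function names are made up for this example, and only positive values are handled.

```python
import math

FRAC_BITS = 8  # hypothetical number of fractional bits in the fixed-point log


def to_log(x: float) -> int:
    """Encode a positive value as a fixed-point base-2 logarithm."""
    return round(math.log2(x) * (1 << FRAC_BITS))


def from_log(l: int) -> float:
    """Decode a fixed-point base-2 logarithm back to a linear value."""
    return 2.0 ** (l / (1 << FRAC_BITS))


def log_mul(a: int, b: int) -> int:
    return a + b  # multiplication is an integer add in the log domain


def log_div(a: int, b: int) -> int:
    return a - b  # division is an integer subtract


def log_sqrt(a: int) -> int:
    return a >> 1  # square root halves the log (arithmetic shift, rounds down)


def iso_power_throughput_gain(power_ratio: float) -> float:
    """Operations per unit power relative to the baseline."""
    return 1.0 / power_ratio


print(from_log(log_mul(to_log(3.0), to_log(5.0))))  # close to 15, up to quantization
print(iso_power_throughput_gain(0.59))              # roughly 1.69
```

The hard part in a pure LNS is addition, which is where the hybrid log/linear summation described above comes in; this sketch does not model that.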

Latency is also a lot less than IEEE or posit floating point FMA (not in the paper, but the results were only at 500 MHz because the float FMA couldn't meet timing closure at 750 MHz or higher in a single cycle, and the paper had to be pretty short with a deadline, so couldn't explore the whole frontier and show 1 cycle vs 2 cycle vs N cycle pipelined implementations).

The floating point tapering trick applied on top of this can help with the primary chip power problem, which is moving bits around, so you can solve more problems with a smaller word size because your encoding matches your data distribution better. Posits are a partial but not complete answer to this problem if you are willing to spend more area/energy on the encoding/decoding (I have a short mention about a learned encoding on this matter).

A floating point implementation that is more efficient than typical integer math but in which one can still do lots of interesting work is very useful too (providing an alternative for cases where you are tempted to use a wider bit width fixed point representation for dynamic range, or a 16+ bit floating point format).

grandmczeb 2782 days ago

The work is definitely great and I have no doubt we'll see new representations used in the future. But at least on the chip I work on, this would be a <5% power improvement in the very best case. For the risk/complexity involved, I would hope for a lot more.

TheRealPomax 2782 days ago

Wait, 0.59x isn't Earth shattering? That's almost half the power, and at only 2/3 the area. Those are _huge_ differences at data center scale!

grandmczeb 2782 days ago

Only a fraction of the total power goes to the actual ALUs, even on an ML chip, so the actual top-line impact is probably small. Not that it's bad, just that this is a fairly complex change for the amount of power saved. Plus, this requires (unproven) changes on the modeling side, which isn't desirable.
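The top-line argument is essentially Amdahl's law applied to power. A minimal sketch, assuming the 0.59x figure applies to all ALU power and nothing else changes; the 10% ALU share below is a made-up illustrative number, not a measurement from any chip, and `chip_power_saving` is a hypothetical helper.

```python
def chip_power_saving(alu_fraction: float, alu_power_ratio: float) -> float:
    """Fraction of total chip power saved when only the ALUs get cheaper.

    alu_fraction:    share of total chip power spent in the ALUs (assumed)
    alu_power_ratio: new ALU power relative to old, e.g. 0.59
    """
    return alu_fraction * (1.0 - alu_power_ratio)


# Hypothetical chip where ALUs draw 10% of total power:
print(chip_power_saving(0.10, 0.59))  # about 0.041, i.e. roughly a 4% saving

# Upper bound: even if the ALUs drew 100% of the power, the saving is 41%.
print(chip_power_saving(1.00, 0.59))
```

Under these assumptions the total saving stays below 5% whenever the ALUs account for less than about 12% of chip power (0.05 / 0.41).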

TheRealPomax 2781 days ago

Not sure I understand the "Plus [...]" part: this is new research, so obviously no one is going to implement this at scale until there's been some time for people to go over the approach and either confirm the maths is solid or find problems with it. But that is universally true for any new low-level design. I assume we all understand that "it's still in peer review" implies "so now it needs to be put to the test", not "and now we all use it without question" =)

grandmczeb 2781 days ago

Suppose you have three proposed optimizations, all of which save the same amount of power. Change A requires no modeling changes, change B requires modeling changes with well understood impacts, and change C requires modeling changes where the impact is unknown. If you could only implement one, your priority would probably be A > B > C.

Since it's not clear to me what the consequences are on the modeling here, I'd put this in category C. If lots of people start using it, it could move to B. The ideal would still be A though.

TheRealPomax 2780 days ago

False analogy: this is not in the set {A,B,C} yet. Instead, suppose we have one established way A of doing things, and two research papers B and C with alternatives that come with benefits but may require changing either software or hardware standards. You stick with A until B, C, or both have been proven to hold up, and someone has done the actual migration cost/benefit analysis before you even include them in any thoughts or opinions regarding optimization cycles, because until that work has been done by someone else they are not part of any real world solution.

jhj 2781 days ago

The Kulisch accumulator and entropy coding of the floating point words (tapering) address this particular issue.

They allow you to get away with much smaller word sizes, while preserving dynamic range (and precision!), than would otherwise be the case. This is what the "word size"/tapering discussion in the blog is about. This is the thing that makes 8-bit floating point work in this case with just a drop-in replacement via round-to-nearest-even. You have to change significantly more to get 8-bit FP to work without either the Kulisch accumulator or entropy coding, as you have to make much different tradeoffs between precision and dynamic range.
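For readers unfamiliar with Kulisch accumulation, the idea is to sum products exactly in a wide fixed-point register and round only once at the end. A minimal sketch, using Python's arbitrary-precision `int` and `fractions.Fraction` as a stand-in for the hardware register; `kulisch_dot`, `naive_dot` and the 64-bit fractional width are assumptions for illustration, not the paper's design.

```python
from fractions import Fraction


def naive_dot(xs, ys):
    """Ordinary floating-point dot product: rounds after every add."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc += x * y
    return acc


def kulisch_dot(xs, ys, frac_bits=64):
    """Dot product accumulated exactly in a wide fixed-point integer."""
    scale = 1 << frac_bits
    acc = 0  # Python int plays the role of the wide accumulator
    for x, y in zip(xs, ys):
        scaled = Fraction(x) * Fraction(y) * scale  # exact product, pre-scaled
        if scaled.denominator != 1:
            raise ValueError("accumulator has too few fractional bits")
        acc += scaled.numerator  # exact integer add, no rounding
    return acc / scale  # single rounding at the very end


xs = [2.0**30, 2.0**-30, -(2.0**30)]
ys = [1.0, 1.0, 1.0]
print(naive_dot(xs, ys))    # the small term is lost to rounding
print(kulisch_dot(xs, ys))  # the small term survives
```

Because no intermediate sum is rounded, the input word can spend fewer bits on precision without the accumulation itself adding error.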

"Users of floating point are seldom concerned simultaneously with loss of accuracy and with overflow" (or underflow for that matter) [1]

The paper and blog post consider 4-5 different things/techniques, not all of which need be combined and some of which can be considered completely independently. The paper is a little bit gimmicky in that I combine all of them together, but that need not be the case.

(log significand fraction map (LNS), posit/Huffman/other entropy encoding, Kulisch accumulation, ELMA hybrid log/linear multiply-add as a replacement for pure log domain)

[1] Morris, Tapered floating point: a new floating point representation (1971) https://ieeexplore.ieee.org/abstract/document/1671767

jacquesm 2782 days ago

You are wrong that this is not 'earth shattering'. A 40% efficiency increase is roughly what you'd get from a process node step; given that those are rather few and far between now, this is the equivalent of extending Moore's law by another 5 to 10 years.

dnautics 2782 days ago

That's for the actual number crunching, but the real power cost is often in bandwidth (as discussed earlier in the OP). If you can reliably use lower precision for training, you get 4x the flops for a halving of the bandwidth costs, since multiplier cost is O(n^2) in the bit width n.

_yosefk 2781 days ago

They don't show a comparison to bfloat16 PEs/FMA. IEEE half precision uses a larger mantissa than bfloat16, and the cost of multiplication is proportional to the square of the mantissa size. I'd expect much lower gains relative to bfloat16.
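To put a number on the mantissa argument: IEEE half precision has a 10-bit stored fraction (11 significand bits with the implicit leading one), while bfloat16 has a 7-bit stored fraction (8 significand bits). A back-of-the-envelope sketch, assuming multiplier cost scales purely with the square of the significand width and ignoring everything else in the FMA; `relative_multiplier_cost` is a hypothetical helper.

```python
# Significand widths including the implicit leading bit.
HALF_SIG_BITS = 11      # IEEE 754 binary16: 10 stored fraction bits + 1 implicit
BFLOAT16_SIG_BITS = 8   # bfloat16: 7 stored fraction bits + 1 implicit


def relative_multiplier_cost(sig_bits_a: int, sig_bits_b: int) -> float:
    """Cost of an a-bit significand multiplier relative to a b-bit one,
    under the assumed quadratic cost model."""
    return (sig_bits_a / sig_bits_b) ** 2


ratio = relative_multiplier_cost(HALF_SIG_BITS, BFLOAT16_SIG_BITS)
print(ratio)  # (11/8)^2, a bit under 2x
```

Under this simple model a half-precision multiplier costs almost twice as much as a bfloat16 one, so a bfloat16 baseline would already have a much cheaper multiplier to compare against.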
