Only a fraction of the total power goes to the actual ALUs, even on an ML chip, so the actual top-line impact is probably small. Not that it's bad, just that this is a fairly complex change for the amount of power saved. Plus this requires (unproven) changes on the modeling side, which isn't desirable.
Not sure I understand the "Plus [...]" part: this is new research, so obviously no one is going to implement this at scale until there's been some time for people to go over the approach and either confirm the maths is solid, or find problems with it. But that is universally true for any new low-level design. I assume we all understand that "it's still in peer review" implies "so now it needs to be put to the test", not "and now we all use it without question" =)
Suppose you have three proposed optimizations, all of which save the same amount of power. Change A requires no modeling changes, change B requires modeling changes with well understood impacts, and change C requires modeling changes where the impact is unknown. If you could only implement one, your priority would probably be A > B > C.
Since it's not clear to me what the consequences are on the modeling here, I'd put this in category C. If lots of people start using it, it could move to B. The ideal would still be A though.
False analogy: this is not in the set {A,B,C} yet. Instead, suppose we have one established way A of doing things, and two research papers B and C with alternatives that come with benefits but may require changing either software or hardware standards. You stick with A until B, C, or both have been proven to hold up and someone has done the actual migration cost/benefit analysis. Only then do you include them in any thoughts or opinions regarding optimization cycles, because until that work has been done by someone else they are not part of any real-world solution.
The Kulisch accumulator and entropy coding of the floating point words (tapering) address this particular issue.
They allow you to get away with much smaller word sizes than would otherwise be the case, while preserving dynamic range (and precision!). This is what the "word size"/tapering discussion in the blog is about. This is the thing that makes 8-bit floating point work in this case with just a drop-in replacement via round to nearest even. You have to change significantly more to get 8-bit FP to work without either the Kulisch accumulator or entropy coding, as you have to make much different tradeoffs between precision and dynamic range.
"Users of floating point are seldom concerned simultaneously with loss of accuracy and with overflow" (or underflow for that matter) [1]
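To make the Kulisch accumulator point concrete, here is a minimal Python sketch (not the paper's hardware design). It compares a dot product that rounds the running sum to a few significand bits after every step with one that accumulates every product exactly in a wide fixed-point register and rounds once at the end. The names `round_sig`, `dot_rounding_each_step`, `dot_kulisch` and the width `ACC_FRAC_BITS` are made up for illustration, and a Python `int` stands in for the wide register that real hardware would size from the format's exponent range.

```python
import math

ACC_FRAC_BITS = 32  # hypothetical number of fractional bits in the accumulator


def round_sig(x: float, bits: int) -> float:
    """Round x to `bits` significand bits (Python's round() is ties-to-even)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1
    return math.ldexp(round(m * (1 << bits)), e - bits)


def dot_rounding_each_step(a, b, bits):
    """Low-precision accumulation: the running sum is rounded after every add."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = round_sig(acc + x * y, bits)
    return acc


def dot_kulisch(a, b, bits):
    """Kulisch-style accumulation: exact fixed-point sum, one rounding at the end.

    Assumes every product is an exact multiple of 2**-ACC_FRAC_BITS, which
    holds for the small toy inputs used here.
    """
    acc = 0  # arbitrary-precision int plays the role of the wide register
    for x, y in zip(a, b):
        acc += int(math.ldexp(x * y, ACC_FRAC_BITS))
    return round_sig(math.ldexp(float(acc), -ACC_FRAC_BITS), bits)


a = [16.0, 0.25, 0.25, 0.25, 0.25, -16.0]
b = [1.0] * len(a)
print(dot_rounding_each_step(a, b, bits=4))  # small terms are absorbed by 16.0
print(dot_kulisch(a, b, bits=4))             # small terms survive until the end
```

With a 4-bit significand, each `0.25` added to `16.0` is rounded away in the first version, while the exact accumulator keeps them, so only the final result needs to fit the narrow format.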
The paper and blog post consider 4-5 different things/techniques, not all of which need be combined and some of which can be considered completely independently. The paper is a little bit gimmicky in that I combine all of them together, but that need not be the case.
(log significand fraction map (LNS), posit/Huffman/other entropy encoding, Kulisch accumulation, ELMA hybrid log/linear multiply-add as a replacement for pure log domain)
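As a rough illustration of the last item, here is a minimal Python sketch of a hybrid log/linear multiply-add as I understand the idea: the multiply is an integer add of log-domain values, the product is converted to linear fixed point through a small lookup table, and the sum is accumulated in the linear domain. It is not the paper's circuit. The names `encode`, `elma_dot`, `POW2_LUT` and the widths `LOG_FRAC_BITS`, `LIN_FRAC_BITS`, `ACC_SHIFT` are hypothetical, and zero handling is omitted.

```python
import math

LOG_FRAC_BITS = 3   # hypothetical fractional bits of the log2 representation
LIN_FRAC_BITS = 16  # hypothetical fractional bits of the lookup-table output
ACC_SHIFT = 16      # hypothetical extra accumulator bits for small products

# Lookup table: 2**(f / 2**LOG_FRAC_BITS) in fixed point, one entry per fraction.
POW2_LUT = [
    round(2 ** (f / (1 << LOG_FRAC_BITS)) * (1 << LIN_FRAC_BITS))
    for f in range(1 << LOG_FRAC_BITS)
]


def encode(x: float):
    """Quantize a nonzero x to (sign, fixed-point log2|x|)."""
    sign = -1 if x < 0 else 1
    return sign, round(math.log2(abs(x)) * (1 << LOG_FRAC_BITS))


def elma_dot(a, b) -> float:
    """Dot product of encoded values: log-domain multiply, linear-domain add."""
    acc = 0  # Python int stands in for a wide fixed-point accumulator
    for (sa, la), (sb, lb) in zip(a, b):
        log_prod = la + lb                     # multiply == add of logs
        int_part = log_prod >> LOG_FRAC_BITS   # floor of log2(product)
        frac = log_prod & ((1 << LOG_FRAC_BITS) - 1)
        shift = int_part + ACC_SHIFT
        assert shift >= 0, "product too small for this toy accumulator"
        acc += (sa * sb) * (POW2_LUT[frac] << shift)  # linear-domain add
    return acc / (1 << (LIN_FRAC_BITS + ACC_SHIFT))


a = [encode(v) for v in (2.0, 0.5, -4.0)]
b = [encode(v) for v in (8.0, 4.0, 0.25)]
print(elma_dot(a, b))  # 2*8 + 0.5*4 - 4*0.25
```

Powers of two encode exactly here, so this particular sum comes out exact. Other values pick up quantization error from the coarse log fraction, which is the precision side of the tradeoff being discussed.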