Hacker News new | ask | show | jobs
by djoldman 614 days ago
https://arxiv.org/abs/2410.00907

ABSTRACT

Large neural networks spend most computation on floating point tensor multiplications. In this work, we find that a floating point multiplier can be approximated by one integer adder with high precision. We propose the linear-complexity multiplication (L-Mul) algorithm that approximates floating point number multiplication with integer addition operations. The new algorithm costs significantly less computation resource than 8-bit floating point multiplication but achieves higher precision. Compared to 8-bit floating point multiplications, the proposed method achieves higher precision but consumes significantly less bit-level computation. Since multiplying floating point numbers requires substantially higher energy compared to integer addition operations, applying the L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products. We calculated the theoretical error expectation of L-Mul, and evaluated the algorithm on a wide range of textual, visual, and symbolic tasks, including natural language understanding, structural reasoning, mathematics, and commonsense question answering. Our numerical analysis experiments agree with the theoretical error estimation, which indicates that L-Mul with 4-bit mantissa achieves comparable precision as float8 e4m3 multiplications, and L-Mul with 3-bit mantissa outperforms float8 e5m2. Evaluation results on popular benchmarks show that directly applying L-Mul to the attention mechanism is almost lossless. We further show that replacing all floating point multiplications with 3-bit mantissa L-Mul in a transformer model achieves equivalent precision as using float8 e4m3 as accumulation precision in both fine-tuning and inference.

4 comments

Does this mean you can train efficiently without GPUs?

Presumably there will be a lot of interest.

No. But it does potentially mean that either current or future-tweaked GPUs could run a lot more efficiently -- meaning much faster or with much less energy consumption.

You still need the GPU parallelism though.

This is still amazing work, imagine running chungus models on a single 3090.
The bottleneck on a consumer-grade GPU like a 3090 isn't the processing power, it's the lack of RAM. The PCI-Express bus ends up being your bottleneck from having to swap in parts of the model.

Even with PCIe 5.0 and 16 lanes, you only get 64 GB/s of bandwidth. If you're trying to run a model too big for your GPU, then for every token, it has to reload the entire model. With a 70B parameter model, 8 bit quantization, you're looking at just under 1 token/sec just from having to transfer parts of the model in constantly. Making the actual computation faster won't make it any faster.

OTOH, doesn't it also mean that (given appropriate software framework support) iGPUs with less processing capacity and slower-but-more RAM available (because system RAM is comparatively cheap and plentiful compared to VRAM) without swapping anything are more competitive against consumer dGPUs with fast-but-small RAM for both inference and training with larger models?
System memory isn't that fast, either. Even with DDR5-8400, the fastest memory you can get right now, you're only looking at a memory transfer speed of 67.2 GB/s, barely faster than the PCI-E bus. So even if you could store that entire 70B model in RAM, you're still getting just under 1 token/sec, and that's assuming your CPU doesn't become a bottleneck.

Your best bet would likely be a laptop that has integrated system RAM with VRAM, but I don't think any of those offer enough RAM to store an entire 70B model. A 7B parameter model would work fine, but you could do those on a consumer-grade GPU anyways.

I had a feeling it had to be something like massive waste due to a misguided feature of the algorithms that shouldn't have been there in the first place.

Once the "math is done" quite likely it would have paid off better than most investments for the top people to have spent a few short years working with grossly underpowered hardware until they could come up with amazing results there before scaling up. Rather than grossly overpowered hardware before there was even deep understanding of the underlying processes.

When you think about it, what we have seen from the latest ultra-high-powered "thinking" machines is truly so impressive. But if you are trying to fool somebody into believing that it's a real person it's still not "quite" there.

Maybe a good benchmark would be to take a regular PC, and without reliance on AI just pull out all the stops and put all the effort into fakery itself. No holds barred, any trick you can think of. See what the electronics is capable of this way. There are some smart engineers, this would only take a few years but looks like it would have been a lot more affordable.

Then with the same hardware if an AI alternative is not as convincing, something has got to be wrong.

It's good to find out this type of thing before you go overboard.

Regardless of speed or power, I never could have gotten an 8-bit computer to match the output of a 32-bit floating-point algorithm by using floating-point myself. Integers all the way and place the decimal where it's supposed to be when you're done.

Once it's really figured out, how do you think it would feel being the one paying the electric bills up until now?

Faster progress was absolutely worth it. Spending years agonizing over theory to save a bit of electric would have been a massive disservice to the world.
You're sort of presuming that LLMs are going to be a massive service to the world there, aren't you? I think the jury is still out on that one.
They already have been. Even just in programming, even just Copilot has been a life changing productivity booster.
“A bit”?
Yes, a large amount for - in the grand scheme of things - a short period of time (i.e., a quantity of energy usage in an intense spike that will be dwarfed by energy usage over time) can accurately be described as “a bit”.

Of course, the impact is that AI will continue to become cheaper to use, and induced demand will continue the feedback loop driving the market as a result.

This comment lives in a fictional world where there is a singular group that could have collectively acted counterfactually. In the real world any actor that individually went this route would have gone bankrupt while the others collected money by showing actual results even if ineffeciently earned.
Also it is likely that the rise of LLMs gave many researchers in allied fields the impetus to tackle with the problems that are relevant to making it more efficient and people stumbled upon a solution hiding there.

The momentum with LLMs and allied technology may last till it keeps on improving even by a few percentage points and keeps shattering human created new benchmarks every few months

This is a bit like recommending to skip vacuum tubes, think hard and invent transistors.
This is kind of thought-provoking.

That is a good correlation when you think about how much more energy-efficient transistors are than vacuum tubes.

Vacuum tube computers were a thing for a while, but it was more out of desperation than systematic intellectual progress.

OTOH you could look at the present accomplishments like it was throwing more vacuum tubes at a problem that can not be adequately addressed that way.

What turned out to be a solid-state solution was a completely different approach from the ground up.

To the extent a more power-saving technique using the same hardware is only a matter of different software approaches, that would be something that realistically could have been accomplished before so much energy was expended.

Even though I've always thought application-specific circuits would be what really helps ML and AI a lot, and that would end up not being the exact same hardware at all.

If power is truly being wasted enough to start rearing its ugly head, somebody should be able to figure out how to fix it before it gets out-of-hand.

Ironically enough with my experience using vacuum tubes, I've felt that there were some serious losses in technology when the research momentum involved was so rapidly abandoned in favor of "solid-state everything" at any cost.

Maybe it is a good idea to abandon the energy-intensive approaches, as soon as anything completely different that's the least bit promising can barely be seen by a gifted visionary to have a glimmer of potential.

That's just not how progress works.

Its iteritive, there are plenty of cul-de-sacs and failures. You can't really optimise until you have something that works and its a messy process that is inefficient.

You're looking at this with hindsight.

Isn’t this paper pretty much about spending a few short years to improve the performance? Or are you arguing that the same people who made breakthroughs over the last few years should have also done the optimization work?
>the same people who made breakthroughs over the last few years should have also done the optimization work

I never thought it would be ideal if it was otherwise, so I guess so.

When I first considered neural nets from state-of-the art vendors to assist with some non-linguistic situations over 30 years ago, it wasn't quite ready for prime time and I could accept that.

I just don't have generic situations all the time which would benefit me, so it's clearly my problems that have the deficiencies ;\

What's being done now with all the resources being thrown at it is highly impressive, and gaining all the time, no doubt about it. It's nice to know there are people that can afford it.

I truly look forward to more progress, and this may be the previously unreached milestone I have been detecting that might be a big one.

Still not good enough for what I need yet so far though. And I can accept that as easily as ever.

That's why I put up my estimation that not all of those 30+ years has been spent without agonizing over something ;)

The GPU main advantage is its parallelism - thousands of cores compared handful of cores in CPUs.

If you're training models with billions of parameters, you're still gonna need that.

I feel like I have seen this idea a few times but don't recall where but stuff posted via HN.

Here https://news.ycombinator.com/item?id=41784591 but even before that. It is possibly one of those obvious ideas to people steeped in this.

To me intuitively using floats to make ultimatelty boolean like decisions seems wasteful but that seemed like the way it had to be to have diffetentiable algorithms.

we used to use Fixed point multiplications (Q Format) in DSP algorithms on different DSP architectures. https://en.wikipedia.org/wiki/Q_(number_format). They used to be so fast and near accurate to floating point multiplications. Probably we need to use those DSPs blocks as part of Tensors/GPUs to realise both fast multiplications & parallelisms.
Is this effectively quantizing without actually quantizing?