Hacker News
by shrubble 748 days ago
The claim that the next generation would be 35x faster felt like an "Osborne moment" to me, but if demand is robust enough...
3 comments

In AI, that doesn't sound too surprising to me right now.

I just experiment with some local LLMs, but the differences are pretty huge:

Llama 3 8B, Raspberry Pi 5: 2-3 Tokens/second (but it works!)

Llama 3 8B, RTX 4080: ~60 Tokens/second

Llama 3 8B, groq.com LPU, ~1300 Tokens/second

Llama 3 70B, AMD 7800X3D: 1-2 Tokens/second

Llama 3 70B, groq.com LPU, ~330 Tokens/second

There seem to be huge gaps between CPU, GPU and specialized inference ASICs. I'm guessing that right now there aren't many genius-level architecture breakthroughs, and that it's more about how much memory and silicon real estate you're willing to dedicate to AI inference.
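To put rough numbers on those gaps, here is a minimal sketch that computes speedup ratios from the figures quoted above. Where a range was given (2-3, 1-2), the midpoint is used, which is an assumption; the dictionary layout and the `speedup` helper are just illustrative names.

```python
# Throughput figures quoted in the comment above, in tokens/second.
# Ranges are replaced by their midpoints (an assumption, not a measurement).
throughput = {
    ("Llama 3 8B", "Raspberry Pi 5"): 2.5,
    ("Llama 3 8B", "RTX 4080"): 60.0,
    ("Llama 3 8B", "Groq LPU"): 1300.0,
    ("Llama 3 70B", "AMD 7800X3D"): 1.5,
    ("Llama 3 70B", "Groq LPU"): 330.0,
}


def speedup(model: str, slow: str, fast: str) -> float:
    """Ratio of tokens/second between two platforms for the same model."""
    return throughput[(model, fast)] / throughput[(model, slow)]


if __name__ == "__main__":
    pairs = [
        ("Llama 3 8B", "Raspberry Pi 5", "RTX 4080"),
        ("Llama 3 8B", "RTX 4080", "Groq LPU"),
        ("Llama 3 70B", "AMD 7800X3D", "Groq LPU"),
    ]
    for model, slow, fast in pairs:
        print(f"{model}: {fast} vs {slow}: {speedup(model, slow, fast):.0f}x")
```

By these numbers each step (CPU to GPU, GPU to LPU) is roughly 20x for the 8B model, and the 70B CPU-to-LPU gap is around 200x. The runs likely used different quantization levels, so treat the ratios as order-of-magnitude only.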

What quantization levels did you use?

I think groq doesn't use quantization, so the gap between your hardware and groq would be even wider.

> I think groq doesn't use quantization, so the gap between your hardware and groq would be even wider.

To my knowledge this isn't (absolutely) publicly known, but users on /r/LocalLLaMA and elsewhere have provided some pretty clear examples suggesting that Groq is almost certainly serving quantized models, which makes sense considering their memory situation...

An entire GroqRack (42U cabinet) has 14 GB of RAM, which means it likely can't even reasonably run llama3 8b in BF16/FP16, let alone 70b, Mixtral, etc.
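The arithmetic behind that, as a rough sketch: it counts weight storage only (no KV cache, activations, or runtime overhead), uses nominal parameter counts of 8B and 70B, and takes the 14 GB per rack figure from the comment above. `weight_gb` and `racks_needed` are hypothetical helper names.

```python
import math

RACK_GB = 14  # per-GroqRack memory figure quoted in the comment above


def weight_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params_billion * bits_per_weight / 8


def racks_needed(params_billion: float, bits_per_weight: float) -> int:
    """Minimum number of racks whose combined memory holds the weights."""
    return math.ceil(weight_gb(params_billion, bits_per_weight) / RACK_GB)


if __name__ == "__main__":
    for name, params in [("llama3 8b", 8), ("llama3 70b", 70)]:
        for label, bits in [("FP16/BF16", 16), ("int8", 8), ("int4", 4)]:
            print(f"{name} {label}: {weight_gb(params, bits):.0f} GB, "
                  f"at least {racks_needed(params, bits)} rack(s)")
```

So 8B at 16 bits is about 16 GB, already over a single rack's 14 GB, while 70B needs at least 3 racks at int4 and 10 at FP16, before counting anything besides the weights.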

The amount of hardware required to run their public-facing hosted product likely takes up an obscene amount of floor space, even in int4. Their docs for GroqFlow describe int8 quantization, but their toolkit is heavily dependent on ONNX, which has recently seen a tremendous amount of work on different post-training quantization strategies and precisions.

However, the power efficiency vs performance is very good, potentially to the point of being able to use very cheap datacenter/co-location space that isn't capable of meeting the power and (air) cooling densities of datacenter AMD and Nvidia GPU products.

Interestingly I have access to a GroqRack system that I'm hoping to be able to spend some time on this week.

Ah TIL, thanks for the insights!
I don't remember exactly, whatever came out first on Hugging Face I guess. Some Q4 variant probably.
> Llama 3 70B, AMD 7800X3D: 1-2 Tokens/second

How much RAM is required for this result? It's quite impressive that it even works as well as it does.

I have 64 GB, but it really depends on the quantization. Looking at LM Studio I see versions ranging from 15 GB to 49 GB, and that's roughly how much RAM they will require.
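Going the other way, the file sizes give a rough idea of how aggressive each quantization is. This sketch assumes a nominal 70B parameters and treats the whole file as weights, so the result is only approximate; `bits_per_weight` is a hypothetical helper name.

```python
PARAMS_BILLION = 70  # nominal parameter count for Llama 3 70B
SYSTEM_RAM_GB = 64   # RAM figure from the comment above


def bits_per_weight(file_gb: float, params_billion: float = PARAMS_BILLION) -> float:
    """Effective bits per weight implied by a model file size in GB (10^9 bytes)."""
    return file_gb * 8 / params_billion


if __name__ == "__main__":
    for size_gb in (15, 49):  # smallest and largest versions mentioned above
        fits = "fits" if size_gb < SYSTEM_RAM_GB else "does not fit"
        print(f"{size_gb} GB file: ~{bits_per_weight(size_gb):.1f} bits/weight, "
              f"{fits} in {SYSTEM_RAM_GB} GB")
```

That works out to somewhere under 2 bits per weight for the 15 GB version and between 5 and 6 for the 49 GB one. Both fit in 64 GB, though the larger one leaves less headroom for context and the rest of the system.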

LM Studio will also let you do partial GPU offloads, but I've only started experimenting with that. The 1-2 Tokens/second value is what I got using GPT4All.

Nvidia is doing the same thing. They announced B100 before H200 shipped and a few hours ago they started talking about R100 before B100 shipped.
(Re: Osborne effect) It's going to be released in 2 years. Businesses can rarely wait that long; they're going to be ordering the MI300 now.
Or they're trying to distract attention from the fact that they've already sold out 100% of the fab capacity available to produce these chips for the next two years.

So really, they lose nothing. They've already booked sales of everything there is to sell. So might as well now turn attention to those who might be customers two years from now, and make them feel like the wait will be worth it.

[deleted] See below, I did not understand the Osborne effect comment.
You're going to wait for the MI350 and not order any more MI300s?
Weird that I got downvoted on the above. I'm buying and deploying MI300x's today and will buy whatever AMD comes out with next.
You were probably downvoted because you were shilling your company, and you misunderstood the comment.

"The Osborne effect is a social phenomenon of customers canceling or deferring orders for the current, soon-to-be-obsolete product as an unexpected drawback of a company's announcing a future product prematurely. It is an example of cannibalization."

Shilling is ok on a topic directly related to my business.

You're right on the Osborne effect though! Thanks for that. We are definitely not doing that.

To clarify: When we started, MI300x was not officially announced yet, so we were planning on buying MI250's. Due to everything taking longer than expected around starting the business and receiving funding, by the time we had money in the bank, it was time to buy MI300x. Going forward, we are buying MI300x today and will continue to buy AMD MI series as they are released in the future.