In AI, that doesn't sound too surprising to me right now.
I've just been experimenting with some local LLMs, but the differences are pretty huge:
Llama 3 8B, Raspberry Pi 5: 2-3 Tokens/second (but it works!)
Llama 3 8B, RTX 4080: ~60 Tokens/second
Llama 3 8B, groq.com LPU: ~1300 Tokens/second
Llama 3 70B, AMD 7800X3D: 1-2 Tokens/second
Llama 3 70B, groq.com LPU: ~330 Tokens/second
There seem to be huge gaps between CPU, GPU and specialized inference ASICs. I'm guessing that right now there aren't many genius-level architecture breakthroughs, and that it's more about how much memory and silicon real estate you're willing to dedicate to AI inference.
> I think groq doesn't use quantization, so the gap between your hardware and groq would be even further apart.
To my knowledge this isn't (absolutely) publicly known but users on /r/LocalLLaMA and elsewhere have provided some pretty clear examples that Groq is almost certainly quantized. Which makes sense considering their memory situation...
An entire GroqRack (42U cabinet) has 14GB of RAM which means it likely can't even reasonably run llama3 8b in BF16/FP16. Let alone 70b, Mixtral, etc.
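For anyone who wants to check the arithmetic, here's a rough sketch in Python. It takes the 14GB-per-rack figure above at face value, uses nominal parameter counts (8 and 70 billion), and counts weights only, so KV cache and activations would push real requirements higher. The function names are just mine for illustration.

```python
import math

# Bytes per weight for common precisions. Weights only: KV cache and
# activations are ignored, so real memory needs are higher than this.
BYTES_PER_WEIGHT = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

RACK_MEMORY_GB = 14  # per-rack figure quoted above, taken at face value


def weight_gb(params_billion: float, bytes_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * bytes_per_weight


def racks_needed(params_billion: float, bytes_per_weight: float) -> int:
    """Minimum number of racks whose combined memory holds the weights."""
    return math.ceil(weight_gb(params_billion, bytes_per_weight) / RACK_MEMORY_GB)


# Nominal parameter counts in billions (an approximation).
for name, params in [("llama3 8b", 8), ("llama3 70b", 70)]:
    for precision, bpw in BYTES_PER_WEIGHT.items():
        print(f"{name:11s} {precision:9s} "
              f"{weight_gb(params, bpw):6.1f} GB -> "
              f"{racks_needed(params, bpw)} rack(s)")
```

Under those assumptions, 8b at 16-bit comes to 16 GB of weights, which is already more than one rack's 14GB.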
The amount of hardware required to run their public-facing hosted product likely takes up an obscene amount of floor space, even in int4. Their docs for GroqFlow describe int8 quantization, but their toolkit is heavily dependent on ONNX, which has recently seen tremendous work on different post-training quantization strategies and precisions.
However, the power efficiency vs performance is very good, potentially to the point of being able to use very cheap datacenter/co-location space that isn't capable of meeting the power and (air) cooling densities of datacenter AMD and Nvidia GPU products.
Interestingly I have access to a GroqRack system that I'm hoping to be able to spend some time on this week.
I have 64 GB, but it really depends on the quantization. Looking at LM Studio I see versions ranging from 15 GB to 49 GB, and that's roughly how much RAM they will require.
LM Studio will also let you do partial GPU offloads, but I've only started experimenting with that. The 1-2 Tokens/second value is what I got using GPT4All.
Or they're trying to distract attention from the fact that they've already sold out 100% of the fab capacity available to produce these chips for the next two years.
So really, they lose nothing. They've already booked sales of everything there is to sell. So might as well now turn attention to those who might be customers two years from now, and make them feel like the wait will be worth it.
You were probably downvoted because you were shilling your company, and you misunderstood the comment.
"The Osborne effect is a social phenomenon of customers canceling or deferring orders for the current, soon-to-be-obsolete product as an unexpected drawback of a company's announcing a future product prematurely. It is an example of cannibalization."
I think the limit is close, considering most of these gains are coming from the reduced memory bandwidth consumption that comes with the smaller data types. This would line up with Nvidia’s crazy graph from yesterday, where data types were specified.
How much lower can these go though? 2bit? 1.58bit? 1bit? It seems these massive gains have a very hard stop, and AMD and Nvidia will use them to raise their stock price before it all comes to a sudden end.
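To put rough numbers on that ceiling, here's a back-of-the-envelope sketch in Python. It assumes single-stream decoding is purely memory-bandwidth bound, i.e. every generated token has to read all the weights once, and ignores compute, KV cache traffic, and batching. The 1000 GB/s bandwidth is a made-up round number, not a spec for any real part, and the function name is mine.

```python
def max_tokens_per_second(params_billion: float,
                          bits_per_weight: float,
                          bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed if each token reads all weights once."""
    weights_gb = params_billion * bits_per_weight / 8  # bits -> bytes
    return bandwidth_gb_s / weights_gb


BANDWIDTH_GB_S = 1000.0  # hypothetical memory bandwidth, for illustration only
PARAMS_BILLION = 8       # nominal 8B-parameter model

baseline = max_tokens_per_second(PARAMS_BILLION, 16, BANDWIDTH_GB_S)
for bits in (16, 8, 4, 2, 1.58, 1):
    tps = max_tokens_per_second(PARAMS_BILLION, bits, BANDWIDTH_GB_S)
    print(f"{bits:>5} bits/weight: {tps:7.1f} tokens/s "
          f"({tps / baseline:4.1f}x vs 16-bit)")
```

In this simple model each halving of the data type doubles the ceiling, so going from 16-bit all the way to 1-bit is worth at most 16x, and after that the halving trick has nowhere left to go.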
Such a weird & cruel modernity, where these releases are purely in the abstract. No, you still won't be able to buy an MI300X in Q4 2024. The enhanced edition will absolutely not be available.
(I miss the old PC era where the world at large was benefiting in tandem from new things happening (or falling behind from not adapting)).
I think that's where short-sighted financial gain leads AMD. Where's the money? -- datacenter. So let's focus the good stuff on datacenter exclusively. What about "the rest" (gamers, hobbyists, students)? There's no money there, let's give them crap RDNA that we make sure can't be used for any real work; just pretend we're catering to their needs.
I think their "consumer GPU" did so badly recently that AMD could just as well, you know, simply liquidate the "consumer GPU" division and stop pretending.
I'm in the "consumer GPU" market myself; what AMD GPU do I buy today? -- Radeon Pro VII, launched in 2020 and the best AMD consumer GPU I can find today.
It's such a divide. I could optimize my software for such powerful GPUs as the MI300 line, but why do that, given that I probably won't even see one such GPU in my lifetime?
The RX 7900s are pretty good. You get 24GB of RAM in a consumer GPU. If you're interested in GenAI that's a good offering for your "gamers, students, hobbyists" category.
They are not sold out. It is just a lot more work to support retail on a new product, so they are focused on hyperscalers and CSPs. Don't forget that high-end GPUs are US export controlled as well. They are considered weapons by the government. [See 88 Fed. Reg. 73458 (Oct. 25, 2023) and the Export Administration Regulations (EAR)].