| HN Mirror



	by geuis 836 days ago
	Want to reference Groq.com. They are developing their own inference hardware, called an LPU: https://wow.groq.com/lpu-inference-engine/ They also released their API a week or two ago. It's significantly faster than anything from OpenAI right now: Mixtral 8x7B runs at around 500 tokens per second. https://groq.com/


moffkalast 836 days ago

It's not so much an accelerator as it is addressing the main inference bottleneck (i.e. memory latency) with sheer brute force by throwing money at the problem. They've made accelerators out of pure L3 cache with a whopping 230 MB per card. They cited something like 500 cards to load one single Mixtral instance, which probably cost over $10M to build. It's a supercomputer essentially.

jiggawatts 836 days ago

Or to put it another way: they’ve made a compute substrate with the correct ratios of processing power to memory capacity.

NVIDIA GPUs were optimised for different workloads, such as 3D rendering, that have different optimal ratios.

This “supercomputer” isn’t brute force or wasteful because it allows more requests per second. By having each response get processed faster it can pipeline more of them through per unit time and unit silicon area.

cavisne 836 days ago

A recent presentation on the architecture

https://youtu.be/WQDMKTEgQnY?si=W0E9Kq6P280l3Wcl

IMO we still need an MLPerf submission or similar to really understand whether this is more efficient in general, or only more efficient when you also want to minimize latency.

Nvidia has pulled enough rabbits out of the hat when it comes to MLPerf that I’m still not convinced they can’t work some CUDA magic and undercut them on efficiency.

wmf 836 days ago

The correct ratio for one workload (production inference).

ben-schaaf 836 days ago

> pure L3 cache with a whopping 230 MB per card

Just to put these numbers in perspective, a desktop 8-core 7800X3D has 96 MB of L3 cache, and the top-end 96-core Epyc 9684X has 1.15 GB of L3.

LoganDark 836 days ago

They need 568 LPUs to load both Mixtral 8x7B and LLaMA 70B, because they need both those models available for the demo.

I imagine Mixtral by itself would only take something like 200-300 LPUs

wmf 836 days ago

Only $5M then.

LoganDark 835 days ago

I'm pretty sure $20,000 per LPU isn't actually the cost of these LPUs. I saw someone else on HN asking if $20,000 could get them something and an employee said to reach out. Which makes me think $20,000 is enough to get some sort of model running at least, even if it's not necessarily an LLM.
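For reference, the dollar figures in this subthread follow from simple multiplication. A minimal sketch, assuming the $20,000-per-LPU price that the comment above doubts, and using the card counts quoted earlier (568 for the demo, 200-300 for Mixtral alone); the function name is illustrative only:

```python
# Assumption: $20,000 per LPU, the figure the thread uses and also questions.
PRICE_PER_LPU_USD = 20_000


def cluster_cost(num_lpus: int, price_per_lpu: int = PRICE_PER_LPU_USD) -> int:
    """Upfront card cost only; ignores hosts, power, and cooling."""
    return num_lpus * price_per_lpu


# Card counts quoted in the thread.
demo_cost = cluster_cost(568)          # Mixtral 8x7B + LLaMA 70B demo
mixtral_low = cluster_cost(200)        # low guess for Mixtral alone
mixtral_high = cluster_cost(300)       # high guess for Mixtral alone
mixtral_mid = cluster_cost(250)        # midpoint, matching "Only $5M"

print(f"demo: ${demo_cost:,}")
print(f"Mixtral alone: ${mixtral_low:,} to ${mixtral_high:,} (midpoint ${mixtral_mid:,})")
```

If the real per-card price is lower, as suggested above, every figure scales down proportionally.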

int_19h 836 days ago

$5M once, upfront. But given the significantly increased throughput, how fast does that pay for itself?

fzzzy 835 days ago

You need computers for all of them and megawatts of power, power supplies, cooling, and power distribution.

int_19h 835 days ago

Naturally, but you need that for GPUs as well, no? What is the actual difference when running, when measured per token generated?

gessha 836 days ago

Depends on power usage. I’m curious how power hungry those are compared to server/workstation cards.

hackerlight 836 days ago

What's the cost per inference relative to H100? Isn't that the number to care about?

hobofan 836 days ago

Based on some rough, conservative ballpark estimates (one server with 2 A100s at $50,000; 50 tokens/s on one of those servers; so 10 of those servers to reach roughly 500 tokens/s), the upfront cost with consumer hardware seems to be 1/10 to 1/20 of what the Groq hardware costs. I would guess that, realistically, cloud providers can probably achieve half to 1/3 of that price.

So unless you need the low latency of Groq, consumer hardware seems to be a lot cheaper for the same throughput.
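The estimate above can be reproduced in a few lines. This is only a sketch of the thread's own ballpark numbers: $50,000 and 50 tokens/s per 2x A100 server, a 500 tokens/s target taken from the original post, and the $5M to $10M guesses for the Groq deployment made elsewhere in the thread. None of these are measured figures, and the names are illustrative.

```python
import math

# All inputs are rough guesses quoted in the thread, not measurements.
SERVER_COST_USD = 50_000        # one server with 2 A100s
SERVER_TOKENS_PER_S = 50        # assumed throughput of one such server
TARGET_TOKENS_PER_S = 500       # Groq's quoted Mixtral 8x7B speed
GROQ_COST_GUESSES_USD = (5_000_000, 10_000_000)


def servers_needed(target_tps: int, tps_per_server: int) -> int:
    """Servers required to match an aggregate throughput target."""
    return math.ceil(target_tps / tps_per_server)


num_servers = servers_needed(TARGET_TOKENS_PER_S, SERVER_TOKENS_PER_S)
gpu_cost = num_servers * SERVER_COST_USD

print(f"{num_servers} servers, ${gpu_cost:,} upfront")
for groq_cost in GROQ_COST_GUESSES_USD:
    # How many times more the Groq guess costs than the GPU servers.
    print(f"Groq at ${groq_cost:,}: {groq_cost // gpu_cost}x the GPU cost")
```

This compares aggregate throughput and upfront hardware cost only. It ignores per-request latency, power, and cooling, which other comments in the thread raise.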

542458 836 days ago

If you believe the marketing material it’s lower. Their API is the cheapest around, so either it’s true or they’re subsidizing.

hackerlight 836 days ago

Another consideration: Even if it's slightly more expensive, that can be OK if you care about inference speed. I'd pay 50% more for GPT-4 if it could deliver results that quick.

imtringued 836 days ago

Grayskull has 96 MB SRAM and people call it overpriced at $600 to $800. It is far more plausible that their chip costs are somewhere around $500.

moralestapia 836 days ago

Nothing wrong with that, though.