by ryao 390 days ago
> At over 2,500 t/s, Cerebras has set a world record for LLM inference speed on the 400B parameter Llama 4 Maverick model, the largest and most powerful in the Llama 4 family.

This is incorrect. The unreleased Llama 4 Behemoth is the largest and most powerful in the Llama 4 family.

As for the speed record, it seems important to keep it in context. That comparison is only for performance on 1 query, but it is well known that people run potentially hundreds of queries in parallel to get their money out of the hardware. If you aggregate the tokens per second across all simultaneous queries to get the total throughput for comparison, I wonder if it will still look so competitive in absolute performance.
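To make the single-query versus aggregate distinction concrete, here is a minimal sketch. The 2,500 t/s figure is from the press release; the batched numbers are hypothetical and only there to show how a slower per-query system can still post a higher total.

```python
def aggregate_throughput(per_query_tps: float, concurrent_queries: int) -> float:
    """Total tokens/s summed across all queries running at the same time."""
    return per_query_tps * concurrent_queries


# 2,500 t/s on a single query is the figure quoted in the press release.
single_stream = aggregate_throughput(2500, 1)

# Hypothetical batched system: NOT a measurement, chosen only to illustrate
# that 100 slower streams can add up to more total tokens per second.
batched = aggregate_throughput(100, 100)

print(f"single query: {single_stream} t/s, batched total: {batched} t/s")
```

Whether real hardware lands on either side of this comparison depends on numbers that are not in the press release.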

Also, Cerebras is the company that not only was saying that their hardware was not useful for inference until some time last year, but even partnered with Qualcomm with the claim that Qualcomm’s accelerators had a 10x price performance improvement over their things:

https://www.cerebras.ai/press-release/cerebras-qualcomm-anno...

Their hardware does inference with FP16, so they need ~20 of their CSE-3 chips to run this model. Each one costs ~$2 million, so that is $40 million. The DGX B200 that they used for their comparison costs ~$500,000:

https://wccftech.com/nvidia-blackwell-dgx-b200-price-half-a-...

You only need 1 DGX B200 to run Llama 4 Maverick. You could buy ~80 of them for the price it costs to buy enough Cerebras hardware to run Llama 4 Maverick.
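The arithmetic behind that comparison, as a quick sketch. All inputs are the approximate figures quoted above, not verified list prices.

```python
# Approximate figures as quoted in this comment.
CEREBRAS_CHIPS_NEEDED = 20          # ~20 chips to hold the model in FP16
CEREBRAS_PRICE_USD = 2_000_000      # ~$2 million each
DGX_B200_PRICE_USD = 500_000        # ~$500,000 per DGX B200

# Total spend on Cerebras hardware to run the model.
cerebras_total = CEREBRAS_CHIPS_NEEDED * CEREBRAS_PRICE_USD

# How many DGX B200 systems the same money would buy.
dgx_equivalent = cerebras_total // DGX_B200_PRICE_USD

print(f"Cerebras total: ${cerebras_total:,}")
print(f"DGX B200 systems for the same money: {dgx_equivalent}")
```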

Their latencies are impressive, but beyond a certain point, throughput is what counts and they don’t really talk about their throughput numbers. I suspect the cost to performance ratio is terrible for throughput numbers. It certainly is terrible for latency numbers. That is what they are not telling people.

Finally, I have trouble getting excited about Cerebras. SRAM scaling is dead, so short of figuring out how to 3D stack their wafer scale chips during fabrication at TSMC, or designing round chips, they have a dead end product, since it relies on using an entire wafer to be able to throw SRAM at problems. Nvidia, using DRAM, is far less reliant on SRAM and can use more silicon for compute, which is still shrinking.

8 comments

>Each one costs ~$2 million, so that is $40 million.

Pricing for exotic hardware that is not manufactured at scale is quite meaningless. They are selling tokens over an API. The token pricing is competitive with other token APIs.

Last year, I took the time to read through public documents and estimated that their annual production was limited to ~300 wafers per year from TSMC. That is not Nvidia level scale, but it is scale.

There are many companies that sell tokens from an API and many more that need hardware to compute tokens. Cerebras posted a comparison of hardware options for these companies, so evaluating it as such is meaningful. It is perhaps less meaningful to the average person who cannot afford this hardware, but there are plenty of people curious what the options are for the companies that sell tokens through APIs, as those impact available capacity.

> There are many companies that sell tokens from an API

I was just at Dell Tech World and they proudly displayed a slide during the CTO keynote that said:

"Cost per token decreased 4 orders of magnitude"

Personally speaking, not a business I'd want to get into.

Some context is needed for this. The only way to get a 4 orders of magnitude difference would be to compare incomparable things, like OpenAI’s most expensive model versus llama 3.1 8B.
I agree on the first. On the second: I would bet a lot of money that they aren't actually breaking even on their API (or even close to it). They don't have a "pay as you go" per-token tier; it's all geared up to demonstrate use of their API as a novelty. They're probably burning cash on every single token. But their valuation and hype have surely gone way up since they got onto LLMs.
They seem to have dev tier pricing (https://inference-docs.cerebras.ai/support/pricing). It's likely that they don't make much money on this and only make money on large enterprise contracts.
> This is incorrect. The unreleased Llama 4 Behemoth is the largest and most powerful in the Llama 4 family.

Emphasis mine.

Behemoth may become the largest and most powerful Llama model, but right now it's nothing but vaporware. Maverick is the largest and most powerful Llama model today (and if I had to bet, my money would be on Meta eventually discarding Llama 4 Behemoth entirely without having released it, and moving on to the next version number).

> Also, Cerebras is the company that not only was saying that their hardware was not useful for inference until some time last year, but even partnered with Qualcomm with the claim that Qualcomm’s accelerators had a 10x price performance improvement over their things

Mistral says they run Le Chat on Cerebras

How is that related to the claim that Cerebras themselves made about their hardware’s price performance ratio?

https://www.cerebras.ai/press-release/cerebras-qualcomm-anno...

Also perplexity
> SRAM scaling is dead

I'm /way/ outside my expertise here, so possibly-silly question. My understanding (any of which can be wrong, please correct me!) is that (a) the memory used for LLMs is dominantly parameters, which are read-only during inference; (b) SRAM scaling may be dead, but NVM scaling doesn't seem to be; (c) NVM read bandwidth scales well locally, within an order of magnitude or two of SRAM bandwidth, for wide reads; (d) although NVM isn't currently on leading-edge processes, market forces are generally pushing NVM to smaller and smaller processes for the usual cost/density/performance reasons.

Assuming that cluster of assumptions is true, does that suggest that there's a time down the road where something like a chip-scale-integrated inference chip using NVM for parameter storage solves this?

The processes used for logic chips, and the processes used for NVM are typically different. The only case I know of the industry combining them onto a single chip would be Texas Instruments’ MSP430 microcontrollers with FeRAM, but the quantities of FeRAM are incredibly small there and the process technology is ancient. It seems unlikely to me that the rest of the industry will combine the processes such that you can have both on a single wafer, but you would have better luck asking a chip designer.

That said, NVM often has a wear-out problem. This is a major disincentive for using it in place of SRAM, which is frequently written. Different types of NVM have different endurance limits, but if they did build such a chip, it is only a matter of time before it stops working.

> The only case I know of the industry combining them onto a single chip would be Texas Instruments’ MSP430 microcontrollers with FeRAM

Every microcontroller with on-chip NVM would count. Down to 45 nm, this is mostly Flash, with the exception of the MSP430's FeRAM. Below that... we have TI pushing Flash, ST pushing PCM, NXP pushing MRAM, and Infineon pushing (TSMC's) RRAM. All on processes in the 22 nm (planar) range, either today or in the near future.

> This is a major disincentive for using it in place of SRAM, which is frequently written.

But isn't parameter memory written once per model update, for silicon used for inferencing on a specific model? Even with daily writes the typical 10k - 1M allowable writes for most of the technologies above would last decades.
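A rough check of the "decades" claim, assuming one full rewrite of parameter memory per day and the 10k to 1M endurance range quoted above. It only models parameter updates, not any other writes.

```python
DAYS_PER_YEAR = 365


def years_until_wearout(endurance_writes: int, writes_per_day: float = 1.0) -> float:
    """Years until a cell hits its write endurance limit at a fixed write rate."""
    return endurance_writes / (writes_per_day * DAYS_PER_YEAR)


# Endurance range quoted in the comment: 10k to 1M allowable writes.
low_endurance_years = years_until_wearout(10_000)
high_endurance_years = years_until_wearout(1_000_000)

print(f"10k writes, daily updates: {low_endurance_years:.1f} years")
print(f"1M writes, daily updates:  {high_endurance_years:.1f} years")
```

At the low end that is a bit over 27 years of daily model updates, so the estimate holds as long as parameter memory is the only thing being written.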

I had been unaware of the others. Anyway, you need writes to the KV cache for every token generated. You are going to hit that fast.
> I have trouble getting excited about Cerebras. SRAM scaling is dead, so short of figuring out how to 3D stack their wafer scale chips

AMD and TSMC are stacking SRAM on the chip scale. I imagine they could accomplish it at the wafer scale. It'll be neat if we can get hundreds of layers in time, like flash.

Your analysis seems spot on to me.

More on the CPU side than the GPU side. GPU is still dominated by HBM.
Assume you meant Intel, rather than AMD?
Yes, and it's TSMC enabling this. Lots of TSMC's customers going this route, not just AMD. Seemed odd to call out AMD as if they've got any special sauce here.
My choices can seem odd to you, that's fine. Have a nice day!
Performance per watt is better than H100 and B200, performance per watt per $ is worse than B200, and it does FP8 just fine:

https://arxiv.org/pdf/2503.11698

One caveat is that this paper only covers training, which can be done on a single CS-3 using external memory (swapping weights in and out of SRAM). There is no way that a single CS-3 will hit this record inference performance with external memory so this was likely done with 10-20 CS-3 chips and the full model in SRAM. Definitely can’t compare token/$ with that kind of setup vs a DGX.
Thanks for the correction. They are currently using FP16 for inference according to OpenRouter. I had thought that implied that they could not use FP8 given the pressure that they have to use as little memory as possible from being solely reliant on SRAM. I wonder why they opted to use FP16 instead of FP8.
Performance per watt per dollar is a useless metric as calculated. You can't spend more money on B200s to get more performance per watt.
Pretty much no disagreements IMO.

By the time the CSE-5 is rolled out, it *needs* at least 500GB of SRAM to make it worthwhile. Multi-layer wafer stacking's the only path to advance this chip.

>Their hardware does inference with FP16, so they need ~20 of their CSE-3 chips to run this model.

Care to explain? I don't see it.

CSE-3 chip has 44GB, which can hold 22B parameters in FP16.

400B parameters would need ~18 chips. Then you need a bit more RAM for other stuff.
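Spelled out as a sketch, using only the figures in this comment (44GB of SRAM per chip, 2 bytes per FP16 parameter, 400B parameters). Rounding up to whole chips is my addition; the weights alone come to about 18.2 chips' worth.

```python
import math

SRAM_BYTES_PER_CHIP = 44 * 10**9     # 44GB of on-chip SRAM, as quoted
BYTES_PER_PARAM_FP16 = 2             # FP16 = 16 bits = 2 bytes
MODEL_PARAMS = 400 * 10**9           # 400B parameters

# Parameters that fit in one chip's SRAM.
params_per_chip = SRAM_BYTES_PER_CHIP // BYTES_PER_PARAM_FP16

# Chips needed for the weights alone, as a fraction and rounded up.
chips_exact = MODEL_PARAMS / params_per_chip
chips_for_weights = math.ceil(chips_exact)

print(f"params per chip: {params_per_chip / 1e9:.0f}B")
print(f"chips for weights only: {chips_exact:.1f} (round up to {chips_for_weights})")
```

That is before the KV cache, activations and anything else that has to live in SRAM, which is how you end up at ~20.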

That's on-chip SRAM, comparable to a GPU's L1 cache, of which it typically has megabytes.

CSE systems also come with off-chip memory, comparable to a GPU's memory, but usually in the TB range.

The memory bandwidth for that is 150GB/sec. Inference speed is memory bandwidth bound, so that memory is useless for inference. Discrete GPUs will run circles around the CSE-3 at inference if they tried using the external DRAM.
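A back-of-the-envelope version of the bandwidth-bound argument, for a single query with no batching. The 150GB/sec figure is from this comment; the 17B active-parameter count for Maverick (it is a mixture-of-experts model) is not from this thread, and the model ignores KV cache traffic, so treat the result as a rough upper bound.

```python
def max_tokens_per_second(bandwidth_bytes_per_s: float,
                          params_read_per_token: float,
                          bytes_per_param: int) -> float:
    """Upper bound on single-query decode speed when weight reads dominate."""
    bytes_per_token = params_read_per_token * bytes_per_param
    return bandwidth_bytes_per_s / bytes_per_token


EXTERNAL_BANDWIDTH = 150e9   # 150GB/sec external memory link, as quoted
ACTIVE_PARAMS = 17e9         # assumption: active parameters per token for Maverick
TOTAL_PARAMS = 400e9         # worst case: every parameter read per token
FP16_BYTES = 2

moe_bound = max_tokens_per_second(EXTERNAL_BANDWIDTH, ACTIVE_PARAMS, FP16_BYTES)
dense_bound = max_tokens_per_second(EXTERNAL_BANDWIDTH, TOTAL_PARAMS, FP16_BYTES)

print(f"reading only active experts: ~{moe_bound:.1f} t/s")
print(f"reading all weights:         ~{dense_bound:.2f} t/s")
```

Either way the bound is a handful of tokens per second at best, nowhere near 2,500 t/s.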
Where do you get those 150GB/sec from?

Here [1] they imply they can reach 1.2Tbps (allegedly, I know), and that's the previous generation ...

1: https://f.hubspotusercontent30.net/hubfs/8968533/Virtual%20B...

The other comment already clarified that 150GB/sec = 1.2Tbps. That said, the CSE-3 did not change this figure. It is buried in their specification sheets somewhere if you care to search for it. I did last year, which is how I know.
Doesn't 1.2Tbps / 8 = 150GB/sec, since 8 bits = 1 byte?
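The unit conversion as a two-line check (decimal prefixes assumed, so 1 Tb = 1000 Gb):

```python
BITS_PER_BYTE = 8
GIGA_PER_TERA = 1000  # decimal prefixes: 1 Tb = 1000 Gb

link_terabits_per_s = 1.2

# Terabits/s -> gigabits/s -> gigabytes/s.
link_gigabytes_per_s = link_terabits_per_s * GIGA_PER_TERA / BITS_PER_BYTE

print(f"{link_terabits_per_s} Tbps = {link_gigabytes_per_s:.0f} GB/sec")
```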
If you want the titled 2500 tokens/second, you need to use the on-chip SRAM
What?

Of course they're using the on-chip SRAM, why wouldn't they?

This is a press release from Cerebras about a Cerebras chip, ... of course they are using a Cerebras chip!

Is that not obvious?

They also support external DRAM over their 150GB/sec system IO link. They call it MemoryX and talk about it on these blog posts:

https://www.cerebras.ai/blog/cerebras-cs-3-vs-nvidia-b200-20...

https://www.cerebras.ai/blog/announcing-the-cerebras-archite...

It is useless for inference, but it is great for training. It used to be more prominent on their website, but it is harder to find references to it now that they are mimicking Groq’s business model.