by ryao 384 days ago
You replied really quickly when I had thought I could sneak in a revision, which dropped the estimates for production numbers. In any case, the Cerebras CS-3 is extremely inefficient for what it does. Inference is memory bandwidth bound, such that peak performance for a single query should be close to the memory bandwidth divided by the size of the weights. Despite having 2600x the memory bandwidth, they can only perform 2.5 times faster. Roughly 1000x of their supposed memory bandwidth is wasted. There are extreme inefficiencies in their architecture. Meanwhile, Nvidia is often above 80% of what memory bandwidth divided by weights predicts their hardware can do.
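For anyone who wants to check the arithmetic, here is a minimal sketch of that estimate. The function names and the 1 TB/s and 100 GB example figures are hypothetical, chosen only for illustration; the 2600x, 2.5x, and 80% figures are the ones quoted in this comment.

```python
def peak_tokens_per_sec(bandwidth_bytes_per_sec: float, weight_bytes: float) -> float:
    """Upper bound for a single query: the weights are read once per token."""
    return bandwidth_bytes_per_sec / weight_bytes


def relative_utilization(bandwidth_ratio: float, speedup_ratio: float) -> float:
    """Fraction of the bandwidth advantage that shows up as an actual speedup."""
    return speedup_ratio / bandwidth_ratio


# Hypothetical hardware: 1 TB/s of bandwidth reading 100 GB of weights per token.
example_bound = peak_tokens_per_sec(1e12, 100e9)
print(f"bandwidth-bound estimate: {example_bound:.1f} tokens/sec")

# Figures from the comment: 2600x the bandwidth, only 2.5x the speed.
relative = relative_utilization(bandwidth_ratio=2600, speedup_ratio=2.5)
wasted_factor = 1 / relative

# If the Nvidia baseline reaches about 80% of its own bound, the absolute
# utilization on the other side is roughly 80% of the relative figure.
absolute = 0.80 * relative

print(f"relative utilization: {relative:.5f} (about 1/{wasted_factor:.0f})")
print(f"implied absolute utilization: {absolute:.3%}")
```

Under those assumptions the bandwidth advantage is diluted by a factor of about 1040, which is where the "1000x" comes from.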

Mistral is a small fish in the grander scheme of things. I would assume that using Cerebras is a way to try to differentiate themselves in a market where they are largely ignored, which is the reason Mistral is small enough to be able to have their needs handled by Cerebras. If they grow to OpenAI levels, there is no chance of Cerebras being able to handle the demand for them.

Finally, I had researched this out of curiosity last year. I am posting remarks based on that.


Inference is memory bandwidth bound on a GPU, which has very little on-chip memory.

On WSE-3s however, there's enough memory that the model can actually be stored on-chip provided that you have a sufficient number of them. 20 are enough for some of the largest open models.
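A rough sketch of that sizing, assuming 44 GB of on-chip memory per system (the 880 GB quoted elsewhere in this thread for a 20-system cluster, divided by 20). The 800 GB model size is a hypothetical figure for illustration, and the helper ignores any space needed for activations or the KV cache.

```python
def systems_needed(model_gb: int, on_chip_gb_per_system: int) -> int:
    """Smallest number of systems whose combined on-chip memory holds the weights."""
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-model_gb // on_chip_gb_per_system)


# Assumption: 880 GB across 20 systems, as quoted elsewhere in the thread.
ON_CHIP_GB_PER_SYSTEM = 880 // 20

full_cluster = systems_needed(880, ON_CHIP_GB_PER_SYSTEM)
hypothetical_model = systems_needed(800, ON_CHIP_GB_PER_SYSTEM)

print(f"on-chip memory per system: {ON_CHIP_GB_PER_SYSTEM} GB")
print(f"systems for 880 GB of weights: {full_cluster}")
print(f"systems for a hypothetical 800 GB model: {hypothetical_model}")
```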

This, depending on how it's set up, allows more efficient use of what logic is available, for actually doing computations instead of just loading and unloading the weights. This can potentially make a system like this much more efficient than a GPU.

It doesn't matter whether Mistral are small fish or not. I don't agree that they are small fish, but whether or not they are, they are experts. They are very capable people. They haven't chosen Cerebras to be different; they've chosen it because they believe it's the best way to do inference.

Your “more efficient” remarks are nonsensical to me. Your “loading and unloading weights” remark would be slightly less nonsensical if you called it the von Neumann bottleneck, but unfortunately for you, their hardware is so bottlenecked internally that they are getting less than 0.1% of the performance that their supposedly high memory bandwidth can give them. Nvidia on the other hand routinely gets 80% or higher. Calling less than 0.1% of theoretical performance efficient is not only strange, but outright wrong. That said, efficiency usually considers other metrics such as costs, power consumption and throughput.

If you do the math you will find that Cerebras loses in all of them. They need 460 kW from 20x CS-3 nodes to do inference for Llama 4 Maverick. A single DGX B200 node only needs 14.4 kW. If you buy 32 nodes so that power consumption is the same and naively give each a full copy of the model, you would get 32,000 T/sec aggregate from a batch size of 1 while the 20 CS-3 node cluster only gets 2,500 T/sec aggregate from a batch size of 1. This is having spent only $16 million for the 32 DGX B200 nodes versus the $40 million for the 20 CS-3 nodes. Each DGX B200 node has 1.4 TB of memory, while the CS-3 cluster has only 880 GB of memory. The CS-3 cluster will run out of memory as you scale the batch size and context length. Now, if you buy another 15 CS-3 nodes, you could match the memory of a single DGX B200, but then you could just store partial models on each DGX B200 like how Cerebras stored partial models on each CS-3, and suddenly, you have more memory to scale to higher batch sizes on the Nvidia hardware. At some point, you will likely become compute bound and cannot keep scaling up the batch size, but that is hard to predict without actually testing for it. The prediction for what the CS-3 could do based on advertised memory bandwidth was off by a factor of >1000 when given real data. It seems reasonable to think that what it can do as far as compute will similarly be limited to well below the theoretical capability.
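The comparison above can be laid out as a short calculation. This is only a sketch using the figures quoted in this comment; the `Cluster` class and its field names are made up for illustration, and the 23 kW per CS-3 node is simply 460 kW divided by 20.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    nodes: int
    kw_per_node: float
    tokens_per_sec: float  # aggregate, batch size 1 per model copy
    cost_musd: float       # purchase price in millions of USD
    memory_gb: float       # total memory across the cluster

    @property
    def power_kw(self) -> float:
        return self.nodes * self.kw_per_node

    @property
    def tokens_per_sec_per_kw(self) -> float:
        return self.tokens_per_sec / self.power_kw

    @property
    def tokens_per_sec_per_musd(self) -> float:
        return self.tokens_per_sec / self.cost_musd


# All numbers below are the ones quoted in the comment, not measurements.
nvidia = Cluster("32x DGX B200", 32, 14.4, 32_000, 16, 32 * 1400)
cerebras = Cluster("20x CS-3", 20, 23.0, 2_500, 40, 880)

for c in (nvidia, cerebras):
    print(
        f"{c.name}: {c.power_kw:.1f} kW, "
        f"{c.tokens_per_sec_per_kw:.2f} T/sec per kW, "
        f"{c.tokens_per_sec_per_musd:.1f} T/sec per $1M, "
        f"{c.memory_gb:.0f} GB"
    )

power_advantage = nvidia.tokens_per_sec_per_kw / cerebras.tokens_per_sec_per_kw
cost_advantage = nvidia.tokens_per_sec_per_musd / cerebras.tokens_per_sec_per_musd
print(f"per-kW advantage: {power_advantage:.1f}x, per-dollar advantage: {cost_advantage:.1f}x")
```

With the quoted figures, the per-dollar gap works out to 32x and the per-kW gap to a bit under 13x, at essentially equal total power draw.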

Note that my numbers for power consumption were from Cerebras:

https://www.cerebras.ai/blog/cerebras-cs-3-vs-nvidia-b200-20...

Interestingly, the peak number for the DGX B200 is based on the power supplies for the DGX B200 and is actually 0.1 kW higher than Nvidia’s specification that puts it at 14.3kW:

https://docs.nvidia.com/dgx/dgxb200-user-guide/introduction-...

PSU peak output is always in excess of the maximum power usage capability of the hardware, but I did not know how Cerebras determined their 23kW figure, so I went with the Cerebras figure for Nvidia, even though I know it is unrealistically high. This likely gave Cerebras the benefit of a handicap on Nvidia’s hardware in the comparison, such that reality is even more in favor of Nvidia.

Calling Cerebras’ hardware the best way of doing inference is ridiculous. We are talking about doing mostly linear algebra. There is no best way of doing it. Pointing at Mistral to say that Cerebras has the best way is an absurd appeal to authority. None of the major players are using them, since they are incapable of handling their needs. The instant responses are nice and are a way for Mistral to differentiate itself, but their models are not as good as those from others and few people use them, which is why Cerebras has the capacity to handle their needs.

From a historical standpoint, Cerebras is very similar to Thinking Machines Corporation, which went out of business after 11 years when there was a market downturn because they could not secure business. Cerebras is hemorrhaging money and is only in business because they found some investors willing to cover their losses. Once they run out of people willing to give them money (likely during the next AI winter), they will become insolvent, no matter how good their technology is. When the next AI winter hits, Mistral will likely become insolvent too, since they similarly are hemorrhaging money and are only in business because they found some investors willing to cover their losses.

By the way, you are lecturing someone who actually has worked on code for doing inference:

https://github.com/ryao/llama3.c

You are clearly on some sort of bender against Cerebras. I can tell from your comments that you are the same guy with the same objections from Twitter, LinkedIn, Reddit. Why are you obsessed with them? I mean, sure, you seem to know your stuff, but some of your assumptions as to why they aren't viable are clearly stretches on the negative side (not that they are impossible, it is just that you don't have the info, and the company, being in a cutthroat competitive business, has no obligation to share their proprietary business information). And being so well informed, you ought to know these are stretches, except you are blinded by some emotion for some reason. I mean, sure, their solution has downsides (what implementation is perfect?), but they will have opportunities to improve in future iterations as they adapt to what the market actually wants rather than what they projected years ago before there was a clear signal. For now, they are a startup with an interesting solution that has some momentum in the marketplace. It remains to be seen how they fare, but your certainty that they won't succeed is certainly not warranted by the data. And oh, their customers include Mistral, Perplexity, Meta, and IBM. All of those know that the CEO pleaded guilty to accounting charges 18 years ago, after which he has worked continuously in the tech industry, including at AMD. A bunch of blue-chip tech investors from OpenAI, AMD, the present CEO of Intel, etc. invested with him knowing this. Please give it a rest.
Yes, I don't optimize inference at all myself.

I will have to think through your comment, but won't be able to do so properly this month.