| HN Mirror


by impossiblefork 385 days ago

While the CEO stuff is a problem, I don't think the other stuff matters.

Per unit of chip area, the WSE-3 is only a little bit more expensive than the H200. While you may need several WSE-3s to load the model, if you have enough demand that you are running the WSE-3 at full speed you will not be using more area in the WSE-3. In fact, the WSE-3 may be more efficient, since it won't be loading and unloading things from large memories.

The only effect is that the WSE-3s will have a minimum demand before they make sense, whereas an H200 will make sense even with little demand.


ryao 385 days ago

I did the math last year to estimate how many wafers per year Nvidia had, and from my recollection it was >50,000. Cerebras with their ~300 per year is not able to handle the inference needs of the market. It does not help that all of their memory must be inside the wafer, which limits the amount of die area they have for actual logic. They have no prospect for growth unless TSMC decides to bless them or they switch to another foundry.

> While you may need several WSE-3s to load the model, if you have enough demand that you are running the WSE-3 at full speed you will not be using more area in the WSE-3.

You need ~20 wafers to run the Llama 4 Maverick model on Cerebras hardware. This is close to a million mm^2. The Nvidia hardware that they used in their comparison should have less than 10,000 mm^2 of die area, yet can run it fine thanks to the external DRAM. How is the WSE-3 not using more die area?
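To make the area comparison concrete, here is a rough back-of-the-envelope sketch. The wafer count and the 10,000 mm^2 bound are the figures from this comment; the 46,225 mm^2 per-wafer die size is an assumption taken from Cerebras's published WSE-3 specification, not something stated in this thread.

```python
# Rough die-area comparison using the figures quoted in this comment.
# WAFER_AREA_MM2 is an assumption (Cerebras's published WSE-3 die size).
WAFER_AREA_MM2 = 46_225
WAFERS_NEEDED = 20           # "~20 wafers" from the comment
GPU_SIDE_AREA_MM2 = 10_000   # upper bound from the comment ("less than 10,000 mm^2")


def total_wafer_area(wafers: int, area_per_wafer: float = WAFER_AREA_MM2) -> float:
    """Total silicon area in mm^2 for a given number of wafer-scale chips."""
    return wafers * area_per_wafer


cerebras_area = total_wafer_area(WAFERS_NEEDED)
# Because GPU_SIDE_AREA_MM2 is an upper bound, this ratio is a lower bound.
ratio = cerebras_area / GPU_SIDE_AREA_MM2

print(f"Cerebras side: {cerebras_area:,.0f} mm^2")
print(f"Ratio vs. GPU side: at least {ratio:.0f}x")
```

Under these assumptions the Cerebras side comes out a bit under a million mm^2, which matches the "close to a million" figure above.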

> In fact, the WSE-3 may be more efficient, since it won't be loading and unloading things from large memories.

This makes no sense to me. Inference software loads the model once and then uses it multiple times. This should be the same for both Nvidia and Cerebras.


impossiblefork 385 days ago

Yes, on an ordinary GPU it loads the weights to GPU memory, but then these weights must be moved from GPU memory onto the chip. But on these the weights can presumably be kept on chip entirely-- that's basically their whole point, so with the Cerebras there's no need to ever move weights to the chip.

Of course these guys depend on getting chips, but so does everybody. I don't know how difficult it is, but all sorts of entities get TSMC 5nm. Maybe they'll get TSMC 3nm and 2nm later than NVIDIA, but it's also possible that they don't.


ryao 385 days ago

The WSE-3 is divided into 900,000 PEs, which each have only 48kB of RAM:

https://hc2024.hotchips.org/assets/program/conference/day2/7...

Similarly, the SMs in Blackwell have up to 228kB of RAM:

https://docs.nvidia.com/cuda/archive/12.8.0/pdf/Blackwell_Tu...

If you need anything else, you need to load it from elsewhere. In the WSE-3, that would be from other PEs. In Blackwell, that would be from on-package DRAM. Idle time in Blackwell can be mitigated by parallelism, since each SM has SRAM for multiple kernels to run in parallel. I believe the WSE-3 is quick enough that they do not need that trick.
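As a rough sketch of what those per-PE numbers imply for whole-wafer capacity: the PE count and 48kB figure are from this comment, while the model size (400 billion parameters at 2 bytes each) is a hypothetical example, and "kB" is treated as 1000 bytes. The estimate ignores activations, KV cache, and any other per-wafer overhead, so a real deployment would likely need somewhat more.

```python
import math

PES_PER_WAFER = 900_000         # from the comment
SRAM_PER_PE_BYTES = 48 * 1000   # "48kB", treated as decimal kB (assumption)


def wafer_sram_bytes() -> int:
    """Total on-wafer SRAM if every PE contributes its full local memory."""
    return PES_PER_WAFER * SRAM_PER_PE_BYTES


def wafers_for_model(params: float, bytes_per_param: float) -> int:
    """Minimum wafers needed to hold the weights alone, rounded up."""
    return math.ceil(params * bytes_per_param / wafer_sram_bytes())


print(f"SRAM per wafer: {wafer_sram_bytes() / 1e9:.1f} GB")
# Hypothetical model: 400e9 parameters stored at 2 bytes per parameter.
print(f"Wafers for weights only: {wafers_for_model(400e9, 2)}")
```

With these inputs a single wafer holds about 43 GB, so weights alone for a model of that hypothetical size already take on the order of twenty wafers.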

The other guy said “you will not be using more area in the WSE-3”. I do not see this die area efficiency. You need many full wafers (around 20 with Llama 4 Maverick) to do the same thing with the WSE-3 that can be done with a fraction of a wafer with Blackwell. Even if you include the DRAM’s die area, Nvidia’s hardware is still orders of magnitude more efficient in terms of die area.

The only advantage Cerebras has as far as I can see is that they are fast on single queries, but they do not dare advertise figures for their total throughput, while Nvidia will happily advertise those. If they were better than Nvidia at throughput numbers, Cerebras would advertise them, since that is what matters for having mass market appeal, yet they avoid publishing those figures. That is likely because in reality, they are not competitive in throughput.

To give an example of Nvidia advertising throughput numbers:

> In a 1-megawatt AI factory, NVIDIA Hopper generates 180,000 tokens per second (TPS) at max volume, or 225 TPS for one user at the fastest.

https://blogs.nvidia.com/blog/ai-factory-inference-optimizat...

Cerebras strikes me as being like Bugatti, which designs cars that go from start to finish very fast at a price that could buy dozens of conventional vehicles, while Nvidia strikes me as being like Toyota, which designs far slower vehicles, but can manufacture them in a volume that is able to handle a large amount of the world’s demand for transport. Bugatti cannot make enough vehicles to bring a significant proportion of the world from A to B regularly, while Toyota can. Similarly, Cerebras cannot make enough chips to handle any significant proportion of the world’s demand for inference, while Nvidia can.


impossiblefork 385 days ago

I don't really see how NVIDIA shipping so many chips matters. If more people want Cerebras chips they will presumably be manufactured.

I agree that Cerebras manufacture <300 wafers per year. Probably around 250-300, calculated from $1.6-2 million per unit and their 2024 revenue.

I don't really see how that matters though. I don't see how core counts matter, but I assume that Cerebras is some kind of giant VLIW-y thing where you can give different instructions to different subprocessors.

I imagine that the model weights would be stored in little bits on each processor and that it does some calculation and hands it on.

Then you never need to load the weights; the only thing you're passing around is activations, with them going from wafer 1, to wafer 2, etc. to wafer 20. When this is running at full speed, I believe that this can be very efficient, better than a small GPU like those made by NVIDIA.

Yes, a lot of the area will be on-chip memory/SRAM, but a lot of it will also be logic and that logic will be computing things instead of being used to move things from RAM to on-chip memory.

I don't have any deep knowledge of this system, really, nothing beyond what I've explained here, but I believe that Mistral are using these systems because they're completely superb and superior to GPUs for their purposes, and they will have made a carefully weighed decision based on actual performance and actual cost.


ryao 385 days ago

You replied really quickly when I had thought I could sneak in a revision, which dropped the estimates for production numbers. In any case, the Cerebras WSE-3 is extremely inefficient for what it does. Inference is memory bandwidth bound, such that peak performance for a single query should be close to the memory bandwidth divided by the size of the weights. Despite having 2600x the memory bandwidth, they can only perform 2.5 times faster. Roughly 1000x of their supposed memory bandwidth is wasted. There are extreme inefficiencies in their architecture. Meanwhile, Nvidia is often above 80% of what memory bandwidth divided by weight size predicts their hardware can do.
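The bandwidth-bound rule of thumb above can be sketched in a few lines. The 2600x and 2.5x ratios are the ones from this comment; the absolute bandwidth and weight size in the first call are hypothetical placeholders chosen only to show the units, and the model assumes every weight byte is read once per generated token.

```python
def bandwidth_bound_tokens_per_s(mem_bandwidth_bytes_per_s: float,
                                 weight_bytes: float) -> float:
    """Upper bound on single-query decode speed if each token reads all weights once."""
    return mem_bandwidth_bytes_per_s / weight_bytes


def unexploited_factor(bandwidth_ratio: float, speedup_ratio: float) -> float:
    """How much of a bandwidth advantage does not show up as observed speedup."""
    return bandwidth_ratio / speedup_ratio


# Hypothetical device: 8e12 bytes/s of memory bandwidth, 800e9 bytes of weights.
print(bandwidth_bound_tokens_per_s(8e12, 800e9), "tokens/s upper bound")

# Ratios from the comment: 2600x the bandwidth, 2.5x the observed speed.
gap = unexploited_factor(2600, 2.5)
print(f"Bandwidth advantage not reflected in speed: ~{gap:.0f}x")
```

This is only the simple bound described in the comment; it says nothing about why the gap exists, and other limits (interconnect between wafers, compute, scheduling) are outside this model.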

Mistral is a small fish in the grander scheme of things. I would assume that using Cerebras is a way to try to differentiate themselves in a market where they are largely ignored, which is the reason Mistral is small enough to be able to have their needs handled by Cerebras. If they grow to OpenAI levels, there is no chance of Cerebras being able to handle the demand for them.

Finally, I had researched this out of curiosity last year. I am posting remarks based on that.


impossiblefork 385 days ago

Inference is memory bandwidth bound on a GPU, which has very little on-chip memory.

On WSE-3s however, there's enough memory that the model can actually be stored on-chip provided that you have a sufficient number of them. 20 are enough for some of the largest open models.

This, depending on how it's set up, allows more efficient use of what logic is available, for actually doing computations instead of just loading and unloading the weights. This can potentially make a system like this much more efficient than a GPU.

It doesn't matter whether Mistral are small fish or not. I don't agree that they are small fish, but whether or not they are, they are experts. They are very capable people. They haven't chosen Cerebras to be different; they've chosen it because they believe it's the best way to do inference.
