by impossiblefork
385 days ago
While the CEO stuff is a problem, I don't think the other stuff matters. Per unit of chip area, the WSE-3 is only a little more expensive than the H200. While you may need several WSE-3s to load the model, if you have enough demand to keep the WSE-3s running at full speed, you will not be using more area with the WSE-3s. In fact, the WSE-3 may be more efficient, since it won't be loading and unloading things from large memories. The only effect is that the WSE-3s need a minimum level of demand before they make sense, whereas an H200 makes sense even with little demand.
|
> While you may need several WSE-3s to load the model, if you have enough demand that you are running the WSE-3 at full speed you will not be using more area in the WSE-3.
You need ~20 wafers to run the Llama 4 Behemoth model on Cerebras hardware. This is close to a million mm^2. The Nvidia hardware that they used in their comparison should have less than 10,000 mm^2 of die area, yet can run it fine thanks to the external DRAM. How is the WSE-3 not using more die area?
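For anyone who wants to check the arithmetic, here is a rough sketch. The 46,225 mm^2 figure is the WSE-3 die area as published by Cerebras. The Nvidia side is my assumption, since the exact configuration isn't given here: one node of 8 GPUs at 814 mm^2 each (the GH100 die used in the H100/H200).

```python
# Back-of-the-envelope die-area comparison for the figures quoted above.
WSE3_AREA_MM2 = 46_225  # WSE-3 die area as published by Cerebras
WAFERS = 20             # "~20 wafers" from the comment above

GPU_DIE_MM2 = 814       # ASSUMPTION: GH100 die (H100/H200)
GPUS = 8                # ASSUMPTION: a single 8-GPU node


def total_area(unit_area_mm2: int, count: int) -> int:
    """Total silicon area in mm^2 for `count` identical dies."""
    return unit_area_mm2 * count


cerebras_area = total_area(WSE3_AREA_MM2, WAFERS)
nvidia_area = total_area(GPU_DIE_MM2, GPUS)
ratio = cerebras_area / nvidia_area

print(f"Cerebras: {cerebras_area:,} mm^2")
print(f"Nvidia:   {nvidia_area:,} mm^2")
print(f"Ratio:    {ratio:.0f}x")
```

Under those assumptions the Cerebras setup comes to 924,500 mm^2 against 6,512 mm^2, so on the order of 140x the silicon. A different GPU count changes the ratio but not the order of magnitude.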
> In fact, the WSE-3 may be more efficient, since it won't be loading and unloading things from large memories.
This makes no sense to me. Inference software loads the model once and then uses it multiple times. This should be the same for both Nvidia and Cerebras.