by ryao
384 days ago
You replied really quickly when I had thought I could sneak in a revision, which dropped the estimates for production numbers.
In any case, the Cerebras WSE-3 is extremely inefficient for what it does. Inference is memory bandwidth bound, such that peak performance for a single query should be close to the memory bandwidth divided by the size of the weights. Despite having 2600x the memory bandwidth, they can only perform 2.5 times faster. Roughly 1000x of their supposed memory bandwidth is wasted. There are extreme inefficiencies in their architecture. Meanwhile, Nvidia often reaches more than 80% of what memory bandwidth divided by the size of the weights predicts their hardware can do.
Mistral is a small fish in the grander scheme of things. I would assume that using Cerebras is a way to try to differentiate themselves in a market where they are largely ignored, and that Mistral is small enough for Cerebras to handle their needs. If they grow to OpenAI levels, there is no chance of Cerebras being able to handle the demand for them.
Finally, I had researched this out of curiosity last year. I am posting remarks based on that.
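For anyone who wants to check the arithmetic, here is a rough Python sketch of the back-of-the-envelope estimate. It assumes a single query (batch size 1) on a dense model where every weight is read once per generated token. The function names and the example bandwidth and model size are made up for illustration; only the 2600x and 2.5x ratios come from the comment above.

```python
def bandwidth_bound_tokens_per_s(mem_bandwidth_bytes_per_s: float,
                                 weight_bytes: float) -> float:
    """Upper bound on single-query decode speed if each token needs one
    full pass over the weights (an assumption, not a measured figure)."""
    return mem_bandwidth_bytes_per_s / weight_bytes


def bandwidth_utilization(observed_tokens_per_s: float,
                          mem_bandwidth_bytes_per_s: float,
                          weight_bytes: float) -> float:
    """Fraction of the bandwidth-bound estimate that is actually achieved."""
    bound = bandwidth_bound_tokens_per_s(mem_bandwidth_bytes_per_s, weight_bytes)
    return observed_tokens_per_s / bound


# Hypothetical device: 1 TB/s of memory bandwidth, 100 GB of weights.
example_bound = bandwidth_bound_tokens_per_s(1e12, 1e11)   # 10 tokens/s
example_util = bandwidth_utilization(8.0, 1e12, 1e11)      # 0.8, i.e. 80%

# Ratios quoted in the comment: 2600x the bandwidth, 2.5x the speed.
bandwidth_ratio = 2600.0
speedup_ratio = 2.5
unused_factor = bandwidth_ratio / speedup_ratio            # about 1000x
```

Dividing the bandwidth ratio by the speedup ratio is where the "roughly 1000x" figure comes from.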
|
On WSE-3s, however, there's enough memory that the model can actually be stored on-chip, provided that you have a sufficient number of them. 20 are enough for some of the largest open models.
Depending on how it's set up, this allows more efficient use of the available logic for actually doing computations instead of just loading and unloading the weights. This can potentially make a system like this much more efficient than a GPU.
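As a rough sanity check on the chip count, the number of chips needed just to hold the weights on-chip can be sketched like this. The 44 GB of on-chip memory per WSE-3 and the 405B-parameter, 16-bit model are assumptions I'm plugging in for illustration, and the estimate ignores KV cache and activations, so a real deployment would need some headroom.

```python
import math


def chips_needed(model_bytes: float, on_chip_bytes_per_chip: float) -> int:
    """Minimum number of chips whose combined on-chip memory holds the weights.
    Ignores KV cache, activations, and any replication (a simplification)."""
    return math.ceil(model_bytes / on_chip_bytes_per_chip)


# Assumed figures, not taken from the thread:
params = 405e9                 # hypothetical large open model
bytes_per_param = 2            # 16-bit weights
on_chip_bytes = 44e9           # assumed on-chip memory per WSE-3

n_chips = chips_needed(params * bytes_per_param, on_chip_bytes)
print(n_chips)
```

Under those assumptions the result comes out just under the 20 mentioned above.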
It doesn't matter whether Mistral are small fish or not. I don't agree that they are small fish, but whether or not they are, they are experts. They are very capable people. They haven't chosen Cerebras to be different; they've chosen it because they believe it's the best way to do inference.