by knowitnone 621 days ago
NVIDIA is pretty established but there's also Intel, AMD, Google to contend with. Sure Cerebras is unique in that they make one large chip out of the entire wafer but nothing prevents these other companies from doing the same thing. Currently they are choosing not to because of wafer economics but if they chose to, Cerebras would pretty much lose their advantage. https://www.servethehome.com/cerebras-wse-3-ai-chip-launched... 56x the size of H100 but only 8x the performance improvement isn't something I would brag about. I expected much higher performance since all processing is on one wafer. Something doesn't add up (I'm no system designer). Also, at $3.13 million per node, one could buy 100 H100s at $30k each (not including system, cooling, cluster, etc). Based on price/performance Cerebras loses IMO.
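To make the price comparison above concrete, here is a minimal Python sketch of the arithmetic. It uses only the figures quoted in the post ($3.13 million per node, $30k per H100, a claimed 8x speedup over one H100). The function names are hypothetical, and the GPU price excludes system, cooling, and cluster costs, as the post itself notes.

```python
NODE_PRICE_USD = 3_130_000  # per-node price quoted in the post
H100_PRICE_USD = 30_000     # per-GPU price quoted in the post (bare GPU only)


def h100s_for_node_price(node_price=NODE_PRICE_USD, gpu_price=H100_PRICE_USD):
    """Number of bare H100s that the node price would buy."""
    return node_price / gpu_price


def cost_per_h100_equivalent(speedup_vs_one_h100, node_price=NODE_PRICE_USD):
    """Node price divided by its claimed speedup over a single H100."""
    return node_price / speedup_vs_one_h100


# The 8x speedup is the post's own figure, not a measured result.
print(f"H100s for one node price: {h100s_for_node_price():.1f}")
print(f"Cost per H100-equivalent at 8x: ${cost_per_h100_equivalent(8):,.0f}")
```

The result is very sensitive to the speedup figure, so it is worth rerunning `cost_per_h100_equivalent` with whatever multiple you believe is correct.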
8 comments

I think the wafer itself isn’t the whole deal. If you watch their videos and read the link you posted the wafer size allows them to stack them in a block with integrated power and cooling at a higher density than blades and attach enormous amounts of memory. Not including the system, cooling, cluster, etc seems like a relatively unfair comparison too given the node includes all of those things - which are very expensive when considering enterprise grade data center hardware.

I don’t think their value add is simply “single wafer” with all other variables the same. In fact, I think the block and system that get the most out of that form factor are the secret sauce and not as easily replicated, especially since the innovations are almost certainly protected by an enormous moat of patents and guarded by a legion of lawyers.

At the end of the day, Cerebras has not submitted any MLPerf results (of which I am aware). That means they are hiding something. Something not very competitive.

So, performance is iffy. Density for density's sake doesn’t matter since clusters are power limited.

Nothing for the training part of MLPerf's benchmark. If they're competing just on inference, then they have stiff competition from specialized NPU-for-inference makers like Hailo (see: it's even part of the official Raspberry Pi AI kit), Qualcomm, tons of other players, and also some players using optics instead of electrons for inference such as Lightmatter, and also SIMD on highly abundant CPU servers which are never in shortage unlike GPUs (and have recently gotten support for specialized inference ops besides simply SIMD ones).
This isn't a benchmark, it's a press release. MLPerf has an inference component so they could have released numbers, but they chose not to.

At the end of the day it's all about performance per dollar/TCO, too, not just raw perf. A standardized benchmark helps to evaluate that.

My guess is that they neglected the software component (hardware guys always disdain software) and have to bend over backwards to get their hardware to run specific models (and only those specific models) well. Or potentially common models don't run well because their cross-chip interconnect is too slow.

MLPerf brings in exactly zero revenue. If they have sold every chip they can make for the next 2+ years, why would they be diverting resources to MLPerf benchmarking?

Artificial Analysis does good API provider inference benchmarking and has evaluated Cerebras, Groq, SambaNova, the many Nvidia-based solutions, etc. IMO it makes way more sense to benchmark actual usable endpoints rather than submit closed and modified implementations to MLCommons. Graphcore had the fastest BERT submission at one point (when BERT was relevant lol) and it didn't really move the needle at all.

With Artificial Analysis I wonder if model tweaks are detectable. That’s the benefit of a standardized benchmark: you’re testing the hardware. If some inference vendor changes Llama under the hood, the changes are known. And of course if you don’t include precise repro instructions in your standardized benchmark, nobody can tell how much money you’re losing (that is, how many chips are serving your requests).
I guess it's a software problem.

Without optimized implementations their performance will look like shit, even if their chip were years ahead of the competition.

Building efficient implementations with an immature ecosystem and toolchain doesn't sound like a good time. But yeah, huge red flag. If they can't get their chip to perform there's no hope for customers.

This hypothesis is an eerily exact instance of the tinygrad (tinycorp) thesis, along the lines of

“nvidia’s chip is better than yours. If you can’t make your software run well on nvidia’s chip, you have no hope of making it run well on your chip, least of all the first version of your chip.”

That’s why tinycorp is betting on a simple ML framework (tinygrad, which they develop and make available open source) whose promise is, due to the few operations needed by the framework: it’ll be very easy to get this software to run on a (eg your) new chip and then you can run ML workloads.

I’m not a (real) expert in the field but find the reasoning compelling. And it might be a good explanation for the competition for nvidia existing in hardware, but seemingly not in reality (ie including software that does something with it).

> That’s why tinycorp is betting on a simple ML framework (tinygrad, which they develop and make available open source) whose promise is, due to the few operations needed by the framework: it’ll be very easy to get this software to run on a (eg your) new chip and then you can run ML workloads.

This sounds easy in theory, but in reality, based on current models, the implementations are often tuned to make them work fast on the chip. As an engineer in the ML compiler space, I think this idea of just using small primitives, which comes from the compiler / bytecode world, is not going to yield acceptable performance.

Often enough, hardware-specific optimizations can be performed automatically by the compiler. On the flip side, depending on a small set of general-purpose primitives makes it easier to apply hardware-agnostic optimization passes to the model architecture. There are many efforts that are ultimately going in this direction, from Google's Tensorflow to the community project Aesara/PyTensor (née Theano) to the MLIR intermediate representation from the LLVM folks.
Yes, sure. I'm occasionally reading up on what George Hotz is doing with tinygrad and him ranting about AMD hardware certainly has influenced my opinion on non-Nvidia hardware to some degree - even though I take his opinion with a grain of salt, he and his team are clearly encountering some non-trivial issues.

I would love to try some of the stuff I do with CUDA on AMD hardware to get some first-hand experience, but it's a tough sell: they are not as widely available to rent, and telling my boss to order a few GPUs so we can inspect that potential mess for ourselves is not convincing either.

Can their system attach memory? From what I read, it doesn't seem to be able to: https://www.reddit.com/r/mlscaling/comments/1csquky/with_waf...
I think they do have external memory that they use for training.
Former Cerebras engineer. At the time I was there, it could not.
Surprising. DRAM (and more importantly high-bandwidth DRAM) seems to be scaling significantly better than SRAM -- and I'm not sure if that could be seriously expected to shift.
Correction: it's 8x the TFLOPS of a DGX (8 H100s), not 1 H100. But it's true that if it stays at $3M it's probably too much, and I don't think the memory bottleneck on GPUs is large enough to justify this price/performance.
So, the corrected statement is:

"56x the size of H100 but only 64x the performance improvement"

Doesn't sound too shabby.
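The corrected figure can be checked with a few lines of Python. This is a rough sketch that takes the thread's numbers at face value (56x the area of one H100, 8x the TFLOPS of an 8-GPU DGX). The function names are hypothetical, and TFLOPS is only a proxy for delivered performance.

```python
AREA_RATIO = 56  # wafer-scale chip area relative to one H100, from the thread


def h100_equivalents(dgx_multiple=8, gpus_per_dgx=8):
    """Convert 'N times a DGX' into 'N times a single H100'."""
    return dgx_multiple * gpus_per_dgx


def perf_per_area(speedup_vs_one_h100, area_ratio=AREA_RATIO):
    """Speedup per unit of silicon area, relative to one H100 (1.0 = parity)."""
    return speedup_vs_one_h100 / area_ratio


original_claim = perf_per_area(8)                    # "8x one H100"
corrected_claim = perf_per_area(h100_equivalents())  # "8x a DGX of 8 H100s"
print(f"Original reading:  {original_claim:.2f}x per unit area")
print(f"Corrected reading: {corrected_claim:.2f}x per unit area")
```

Under the corrected reading the ratio lands slightly above 1.0, i.e. roughly on par with the H100 per unit of silicon rather than far below it.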

The company started in 2015 so I think they are (were?) banking on SRAM scaling better than it has in recent years.
If you have a problem that you can’t easily split up into 64 chunks, I guess it makes more sense, right?
> 56x the size of H100 but only 8x the performance improvement isn't something I would brag about.

It doesn't sound like it's too bad for a 9 year old company. Nvidia had a 20-year head start. I would expect that they will continue to shrink it and increase performance. At some point, that might become compelling?

Nvidia is also going to keep improving, so it will be a moving target.
That's true, but the advantage of having a head start does eventually diminish. They won't catch up to Nvidia in the next couple of years, but they could eventually be a real competitor.
Comparing a WSE-3 to an H100 without considering the systems they go in, or the cooling, networking, etc. that support them, means little when doing cost analysis, be it CapEx or TCO. A better (but still flawed) comparison would be a DGX H200 (a cluster of H100s and their essential supporting infra) to a CS-3 (a cluster of WSE-3s and their essential supporting infra in a form factor/volume similar to a DGX H200).

Now, is Cerebras going to eventually beat Nvidia, or at least compete healthily with Nvidia and other tech titans in the general market or a given lucrative niche of it? No idea. That'd be a cool plot twist, but hard to say. But it's worth acknowledging that investing in a company and buying their products are two entirely separate decisions. Many of Silicon Valley's success stories are a result of people investing in the potential of what a company could become, not because it was already the best on the market, and if nothing else, Cerebras's approach is certainly novel and promising.

> wafer economics

What are they?

Is this related to defects? Can't they disable parts of defective chip just like other CPUs do? Sounds cheaper than cutting up and packaging chips individually!

Process development, feature size, and ultimate yield are probably what they're after. Yes, for the past 30+ years everyone has disabled (“fused”) unused or unreliable logic on the die. In addition, everyone also “bins” the chips from the same wafer into different SKUs based on stable clock speed, available/fused components, test results, etc. This can be very effective in increasing yield and salable parts.

My recollection is that there's speculation Cerebras is building in significant duplicate features to account for defects. They can't “bin” their wafers in the same way as packaged chips. That will reduce total yield/utilization of the surface area.

The actual packaging steps are relatively low tech/cost compared to the semiconductor manufacturing. They're commonly outsourced somewhere like Malaysia or Thailand.
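For a feel of why duplicate features would matter at wafer scale, here is a minimal sketch using the textbook Poisson yield model (yield = exp(-area × defect density)). Everything numeric except the 56x area ratio is an assumption for illustration: the H100 die area (roughly 814 mm²), the defect density, and the block and spare counts are hypothetical and say nothing about Cerebras's actual design or any foundry's real numbers.

```python
import math

H100_DIE_AREA_CM2 = 8.14  # assumption: roughly 814 mm^2, not from the thread
AREA_RATIO = 56           # wafer-scale chip vs. one H100, from the thread
DEFECTS_PER_CM2 = 0.1     # hypothetical defect density, illustration only


def poisson_yield(area_cm2, defects_per_cm2):
    """Probability that a region of the given area has zero defects."""
    return math.exp(-area_cm2 * defects_per_cm2)


def binom_pmf(n, k, p):
    """Probability of exactly k bad blocks out of n, computed in log space."""
    log_comb = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return math.exp(log_comb + k * math.log(p) + (n - k) * math.log1p(-p))


def yield_with_spares(total_area_cm2, n_blocks, n_spares, defects_per_cm2):
    """Probability that at most n_spares of n_blocks equal blocks are bad."""
    p_bad = 1.0 - poisson_yield(total_area_cm2 / n_blocks, defects_per_cm2)
    total = sum(binom_pmf(n_blocks, k, p_bad) for k in range(n_spares + 1))
    return min(1.0, total)  # guard against floating-point overshoot


wafer_area = AREA_RATIO * H100_DIE_AREA_CM2
print("single die, no redundancy:", poisson_yield(H100_DIE_AREA_CM2, DEFECTS_PER_CM2))
print("whole wafer, no redundancy:", poisson_yield(wafer_area, DEFECTS_PER_CM2))
print("whole wafer, 10,000 blocks with 100 spares:",
      yield_with_spares(wafer_area, 10_000, 100, DEFECTS_PER_CM2))
```

Under these made-up numbers, a defect-free wafer is essentially impossible, while a wafer that tolerates a small fraction of bad blocks almost always works. The cost is the area spent on spares, which is the utilization loss mentioned above.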

Agreed, it just seems like Nvidia chips are going to be easier to produce at scale. Cerebras will be limited to a few niche use-cases, like HFT where hedge funds are using LLMs to analyze SEC filings as fast as possible.
Where/how did you learn of the hedge fund usages?
If the poster never comes back, I think it is fair to assume it is just a reasonable guess, right?
they don’t need an advantage, they just need orders and inventory

get extorted by nvidia sales people for a 2026 delivery date that gets pushed out if you say anything about it or decline cloud services

or another provider delivering earlier

that's what the market wants, and even then, who cares? this company is trying to IPO at what valuation? this article didn't say but the last valuation was like $1.5bn? so you mean a 300x delta between this and Nvidia’s valuation if these guys get a handful of orders? ok

At the end of the day it's all made in the same factory. If nVidia have problems delivering then so do Cerebras.
> Sure Cerebras is unique in that they make one large chip out of the entire wafer

I'm sure they test it thoroughly. /s