by fnordpiglet 621 days ago
I think the wafer itself isn’t the whole deal. If you watch their videos and read the link you posted, the wafer size allows them to stack them in a block with integrated power and cooling at a higher density than blades, and to attach enormous amounts of memory. Not including the system, cooling, cluster, etc. seems like a relatively unfair comparison too, given the node includes all of those things, which are very expensive when considering enterprise-grade data center hardware.

I don’t think their value add is simply “single wafer” with all other variables the same. In fact, I think the block and system that gets the most out of that form factor is the secret sauce and is not as easily replicated, especially since the innovations are almost certainly protected by an enormous moat of patents and guarded by a legion of lawyers.


At the end of the day, Cerebras has not submitted any MLPerf results (of which I am aware). That means they are hiding something. Something not very competitive.

So, performance is iffy. Density for density’s sake doesn’t matter, since clusters are power limited.

Nothing for the training part of MLPerf's benchmark. If they're competing just on inference, then they have stiff competition: specialized NPU-for-inference makers like Hailo (whose chip is even part of the official Raspberry Pi AI Kit), Qualcomm, and tons of other players; some players using optics instead of electrons for inference, such as Lightmatter; and SIMD on highly abundant CPU servers, which are never in shortage unlike GPUs (and which have recently gotten support for specialized inference ops besides plain SIMD ones).
This isn't a benchmark, it's a press release. MLPerf has an inference component so they could have released numbers, but they chose not to.

At the end of the day it's all about performance per dollar/TCO, too, not just raw perf. A standardized benchmark helps to evaluate that.
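
To make the performance per dollar/TCO point concrete, here is a minimal sketch of how a standardized benchmark score could be folded into such a comparison. Everything in it is an assumption for illustration: the `System` fields, the three-year horizon, the electricity price, and the PUE overhead factor are hypothetical, and a real TCO model would also include hosting, networking, staffing, and depreciation.

```python
from dataclasses import dataclass


@dataclass
class System:
    """Hypothetical description of one accelerator system."""
    name: str
    throughput: float      # benchmark score, e.g. samples per second
    purchase_cost: float   # up-front cost in USD
    power_kw: float        # average power draw under load, in kW


def tco(system: System, years: float = 3.0,
        usd_per_kwh: float = 0.10, pue: float = 1.5) -> float:
    """Purchase cost plus energy cost over the given horizon.

    pue is the data center overhead factor (cooling, power delivery).
    """
    hours = years * 365 * 24
    energy_cost = system.power_kw * pue * hours * usd_per_kwh
    return system.purchase_cost + energy_cost


def perf_per_dollar(system: System, **kwargs) -> float:
    """Benchmark throughput divided by total cost of ownership."""
    return system.throughput / tco(system, **kwargs)


def rank(systems: list[System], **kwargs) -> list[System]:
    """Sort systems from best to worst performance per TCO dollar."""
    return sorted(systems, key=lambda s: perf_per_dollar(s, **kwargs),
                  reverse=True)
```

The ranking only means something if `throughput` comes from the same workload measured under the same rules for every system, which is the role a standardized benchmark plays here.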

My guess is that they neglected the software component (hardware guys always disdain software) and have to bend over backwards to get their hardware to run specific models (and only those specific models) well. Or potentially common models don't run well because their cross-chip interconnect is too slow.

MLPerf brings in exactly zero revenue. If they have sold every chip they can make for the next 2+ years, why would they be diverting resources to MLPerf benchmarking?

Artificial Analysis does good API provider inference benchmarking and has evaluated Cerebras, Groq, SambaNova, the many Nvidia-based solutions, etc. IMO it makes way more sense to benchmark actual usable endpoints rather than submit closed and modified implementations to MLCommons. Graphcore had the fastest BERT submission at one point (when BERT was relevant lol) and it didn't really move the needle at all.

With Artificial Analysis I wonder if model tweaks are detectable. That’s the benefit of a standardized benchmark: you’re testing the hardware. If some inference vendor changes Llama under the hood, the changes are known. And of course if you don’t include precise repro instructions in your standardized benchmark, nobody can tell how much money you’re losing (that is, how many chips are serving your requests).
I guess it's a software problem.

Without optimized implementations their performance will look like shit, even if their chip were years ahead of the competition.

Building efficient implementations with an immature ecosystem and toolchain doesn't sound like a good time. But yeah, huge red flag. If they can't get their chip to perform there's no hope for customers.

This hypothesis is an eerily exact instance of the tinygrad (tinycorp) thesis, along the lines of

“nvidia’s chip is better than yours. If you can’t make your software run well on nvidia’s chip, you have no hope of making it run well on your chip, least of all the first version of your chip.”

That’s why tinycorp is betting on a simple ML framework (tinygrad, which they develop and make available open source) whose promise is, due to the few operations needed by the framework: it’ll be very easy to get this software to run on a (eg your) new chip and then you can run ML workloads.
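
A toy sketch of the "few operations" idea, to show why a small primitive set makes porting look easy. This is not tinygrad's actual API or op set; the `ReferenceBackend` class, the op names, and the three-method interface are all hypothetical. The point is only that a higher-level op such as softmax can be written once against a handful of primitives, so a new chip would need to supply just those primitives.

```python
import math

# Hypothetical primitive tables. A backend for a new chip would provide its
# own implementations of these few ops and nothing else.
UNARY = {"exp": math.exp, "neg": lambda x: -x}
BINARY = {
    "add": lambda x, y: x + y,
    "sub": lambda x, y: x - y,
    "mul": lambda x, y: x * y,
    "div": lambda x, y: x / y,
}
REDUCE = {
    "sum": (lambda acc, x: acc + x, 0.0),
    "max": (max, -math.inf),
}


class ReferenceBackend:
    """Pure-Python backend: three methods cover every primitive."""

    def unary(self, op, xs):
        return [UNARY[op](x) for x in xs]

    def binary(self, op, xs, ys):
        # A scalar right-hand side is broadcast across xs.
        if not isinstance(ys, list):
            ys = [ys] * len(xs)
        return [BINARY[op](x, y) for x, y in zip(xs, ys)]

    def reduce(self, op, xs):
        fn, acc = REDUCE[op]
        for x in xs:
            acc = fn(acc, x)
        return acc


def softmax(backend, xs):
    """Softmax written only in terms of the backend primitives."""
    m = backend.reduce("max", xs)            # subtract max for stability
    exps = backend.unary("exp", backend.binary("sub", xs, m))
    total = backend.reduce("sum", exps)
    return backend.binary("div", exps, total)
```

Calling `softmax(ReferenceBackend(), [1.0, 2.0, 3.0])` runs entirely through the three backend methods. Whether code structured this way can also run fast on a given chip is a separate question.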

I’m not a (real) expert in the field but find the reasoning compelling. And it might be a good explanation for the competition for nvidia existing in hardware, but seemingly not in reality (ie including software that does something with it).

> That’s why tinycorp is betting on a simple ML framework (tinygrad, which they develop and make available open source) whose promise is, due to the few operations needed by the framework: it’ll be very easy to get this software to run on a (eg your) new chip and then you can run ML workloads.

This sounds easy in theory, but in reality, based on current models, the implementations are often tuned to make them work fast on the chip. As an engineer in the ML compiler space, I think this idea of just using small primitives, which comes from the compiler / bytecode world, is not going to yield acceptable performance.

Often enough, hardware-specific optimizations can be performed automatically by the compiler. On the flip side, depending on a small set of general-purpose primitives makes it easier to apply hardware-agnostic optimization passes to the model architecture. There are many efforts that are ultimately going in this direction, from Google's TensorFlow to the community project Aesara/PyTensor (née Theano) to the MLIR intermediate representation from the LLVM folks.
I'm a compiler engineer at a GPU company, and while tinygrad kernels might be made more performant by the JIT compiler underlying every GPU chip's stack, oftentimes a much bigger picture is needed to properly optimize all the chip's resources. The direction that companies like NVIDIA et al. are going in involves whole-model optimization, so I really don't see how tinygrad can be competitive here. I see it as most useful in embedded, but Hotz is trying to make it a thing for training. Good luck.
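
A small illustration of why a view wider than one kernel matters, using operator fusion as the example. This is a hedged sketch in plain Python, not how any vendor's compiler works; the function names are hypothetical. On real accelerators the benefit comes from avoiding round trips to memory for intermediates, which plain Python lists only imitate.

```python
def scale_shift_relu_unfused(xs, scale, shift):
    """Three separate elementwise 'kernels', one per primitive op.

    Each step reads a full buffer and writes a full buffer, so two
    intermediate buffers are materialized.
    """
    scaled = [x * scale for x in xs]          # pass 1, intermediate 1
    shifted = [v + shift for v in scaled]     # pass 2, intermediate 2
    return [max(v, 0.0) for v in shifted]     # pass 3, final output


def scale_shift_relu_fused(xs, scale, shift):
    """One fused 'kernel': a single pass with no intermediate buffers.

    A compiler can only produce this if it sees all three ops together.
    """
    return [max(x * scale + shift, 0.0) for x in xs]
```

Both functions compute the same values; the difference is how many times the data moves. Whole-model optimization extends the same reasoning across layers, memory placement, and scheduling rather than three adjacent elementwise ops.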

> There are many efforts that are ultimately going in this direction, from Google's TensorFlow to the community project Aesara/PyTensor (née Theano) to the MLIR intermediate representation from the LLVM folks.

The various GPU companies (AMD, NVIDIA, Intel) are some of the largest contributors to MLIR, so saying that they're going in the direction of standardization is not wholly true. They're using MLIR as a way to share optimizations (really to stay at the cutting edge), but, unlike tinygrad, MLIR has a much higher-level view of the whole computation, and each company's backend will thus be able to optimize over the whole model.

If tinygrad were focused on MLIR's ecosystem I'd say they had a fighting chance of getting NVIDIA-like performance, but they're off doing their own thing.

Yes, sure. I'm occasionally reading up on what George Hotz is doing with tinygrad and him ranting about AMD hardware certainly has influenced my opinion on non-Nvidia hardware to some degree - even though I take his opinion with a grain of salt, he and his team are clearly encountering some non-trivial issues.

I would love to try some of the stuff I do with CUDA on AMD hardware to get some first-hand experience, but it's a tough sell: they are not as widely available to rent, and telling my boss to order a few GPUs so we can inspect that potential mess for ourselves is not convincing either.

Can their system attach memory? From what I read, it doesn't seem to be able to: https://www.reddit.com/r/mlscaling/comments/1csquky/with_waf...
I think they do have external memory that they use for training.
Former Cerebras engineer. At the time I was there, it could not.
Surprising. DRAM (and more importantly high-bandwidth DRAM) seems to be scaling significantly better than SRAM -- and I'm not sure if that could be seriously expected to shift.