by recursivecaveat 2361 days ago
As someone who works for another startup in this area, building the chip is only half the battle. The other half is tooling for compiling benchmark networks onto the chip in a performant manner. With 400k cores and their 'duplicate and re-route' defect strategy, this might literally be the most challenging compilation target ever made. It probably stacks up absolutely terribly in every metric right now. That's not to say it will necessarily get better; most of the people I've talked to don't think the megachip will ultimately amount to much more than a clever marketing ploy.

A bit baffled by this because on every axis I look this seems like a dream of a compilation target.

* No DRAM or caches, everything is in SRAM, and all local SRAM loads are 1 cycle.

* Model parallel alone is full performance, no need for data parallel if you size to fit.

* Defects are handled in hardware; any latency differences are hidden & not in load path anyway.

* Fully asynchronous/dataflow by default, only need minimal synchronization between forward/backward passes.

I genuinely don't know how you'd build a simpler system than this.

Having worked on compilers for pretty weird architectures, it's generally the case that the less like a regular CPU your architecture is, the more difficult it is to compile.

In particular, when you change the system from having to worry about how to optimally schedule a single state machine to having to place operations on a fixed routing grid (à la FPGA), the problem becomes radically different, and any looping control flow becomes an absolute nail-biter of an issue.

Remember that you aren't compiling arbitrary programs. Neural nets don't really have any local looping control flow, in the sense that data goes in one end and comes out the other. You'll have large-scale loops over the whole network, and each core might have a loop over small, local arrays of data, but you shouldn't have any sort of internal looping involving different parts of the model.

It's pretty common to have neural networks that have both recurrent nets processing text input and convolutional layers. A classic example would be visual question answering (is there a duck in this picture?). That would be a simple example involving looping over one part of the model. Ideally you want that looping to be done as locally as possible to avoid wasting time having a program on a CPU dispatching, waiting for results and controlling data flow.

Having talked to someone at Cerebras, I also know that they don't just want to do inference with this, they want to accelerate training as well. That can involve much more complex control flow than you think. Start reading about automatic differentiation and you will soon realize that it's complex enough to basically be its own subfield of compiler design. There have been multiple entire books written on the topic, and I can guarantee you there can be control-flow driven optimizations in there (e.g. if x == 0 then don't compute this large subgraph).
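
To make the "if x == 0 then don't compute this large subgraph" idea concrete, here is a minimal sketch of a scalar reverse-mode autodiff in Python. It is illustrative only and not how Cerebras or any real framework implements it; `Var`, `backward` and `demo` are hypothetical names. The backward pass skips any edge whose local gradient is exactly zero and any node whose accumulated gradient is zero, so the work done depends on the data.

```python
class Var:
    """Scalar node in a tiny reverse-mode autodiff graph (illustrative only)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # tuple of (parent Var, local gradient)
        self.grad = 0.0

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))


def backward(out):
    """Accumulate gradients from `out`; return how many edges were processed."""
    order, seen = [], set()

    def visit(node):
        # Post-order DFS gives a topological order (parents before children).
        if id(node) in seen:
            return
        seen.add(id(node))
        for parent, _ in node.parents:
            visit(parent)
        order.append(node)

    visit(out)
    out.grad = 1.0
    edges = 0
    for node in reversed(order):
        if node.grad == 0.0:
            continue  # nothing flows into this subgraph, so skip it
        for parent, local in node.parents:
            if local == 0.0:
                continue  # data-dependent pruning of a dead edge
            parent.grad += local * node.grad
            edges += 1
    return edges


def demo(gate_value):
    """y = gate * (x * w) + x; returns (dy/dx, dy/dw, edges processed)."""
    gate, x, w = Var(gate_value), Var(3.0), Var(2.0)
    big = x * w  # stands in for a large subgraph
    y = gate * big + x
    edges = backward(y)
    return x.grad, w.grad, edges
```

With `demo(0.0)` the gradient never enters the `x * w` subgraph, while `demo(1.0)` walks every edge. On a fixed routing grid, that kind of data-dependent branching is what the compiler would have to plan for ahead of time.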

I would be surprised if Cerebras was trying to handle any recurrence inside the overall forward/backward passes. It seems like a lot of difficulty (as mentioned) for peanuts.

I don't get your point about training. Yes, it's backwards rather than forwards, and yes it often has fancy stuff intermixed (dropout, Adam, ...), but these are CPUs, they can do that as long as it fits the memory model.

I'm afraid recursivecaveat is right. This is an insanely difficult compilation target. I think you're possibly talking about a different kind of "compilation" - i.e. the Clang/GCC bit that converts C++ to machine code. That is indeed trivial. But "compilation" for these chips includes much more than that.

The really complicated bit is converting the TensorFlow model to some kind of computation plan. Where do you put all the tensor data? How do you move it around the chip? It's insanely complicated. If anything kills Cerebras it will be the software.

It's model parallel, so the first thing you do is lay out your floorplan for the model, which looks like this.

https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpuplo...

Then you put your data next to the core that uses it. Simples.

(Optimal placement is tricky, but approximate techniques work fine.)
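
As a rough illustration of what an "approximate technique" could look like, the sketch below splits a one-dimensional strip of core columns among layers in proportion to their compute cost, then assigns each layer a contiguous span so neighbouring layers sit next to each other. This is an assumption-laden toy, not Cerebras' placer: `allocate_columns`, `floorplan` and the cost numbers are hypothetical, and a real floorplan is two-dimensional and also has to account for memory and routing.

```python
def allocate_columns(layer_costs, total_columns):
    """Give each layer a share of columns roughly proportional to its cost."""
    if len(layer_costs) > total_columns:
        raise ValueError("more layers than columns")
    total = sum(layer_costs)
    # Integer proportional share, but every layer gets at least one column.
    cols = [max(1, c * total_columns // total) for c in layer_costs]
    # Hand out leftover columns to the layer with the most work per column.
    while sum(cols) < total_columns:
        i = max(range(len(cols)), key=lambda i: layer_costs[i] / cols[i])
        cols[i] += 1
    # If the one-column minimum overshot, take from the least loaded layer.
    while sum(cols) > total_columns:
        candidates = [i for i in range(len(cols)) if cols[i] > 1]
        i = min(candidates, key=lambda i: layer_costs[i] / cols[i])
        cols[i] -= 1
    return cols


def floorplan(cols):
    """Turn column counts into contiguous (start, end) spans, in layer order."""
    spans, start = [], 0
    for width in cols:
        spans.append((start, start + width))
        start += width
    return spans
```

For example, `floorplan(allocate_columns([100, 300, 500, 100], 20))` gives the heaviest layer half of the strip. The greedy fix-up loops are the "approximate" part: they balance work per column without searching for an optimal layout.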

When you consider the things that that diagram doesn't show, it doesn't look at all simple. Does that graph even have training? It'll have to be pipelined too. Probably will have to use recomputation due to the shortage of memory. What about within the boxes? You can't nicely separate a matmul into pieces like that.

I work on something similar but less ambitious, trust me it is crazy complicated.

Could you be more explicit? What about the naïve approach to training (same graph but backwards, computing gradients) is going to fail?

Wrt. matmul, if you couldn't split them up, today's AI accelerators wouldn't work full stop. But regardless, even if it was much more complex on CS-1 than on all the other sea-of-multipliers accelerators, it's obviously a problem they've solved and so irrelevant to the compilation issue.
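
For readers who haven't seen it, splitting a matmul into pieces is usually done by tiling. The sketch below is plain Python, assuming nothing about the CS-1: each `(i0, j0, k0)` block is the kind of unit one hypothetical core could compute from its local memory, and the partial products over `k0` are what would have to be streamed and accumulated between cores. `matmul` and `tiled_matmul` are illustrative names.

```python
def matmul(a, b):
    """Reference dense matmul on lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def tiled_matmul(a, b, tile):
    """Same result as matmul, computed block by block."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # One tile: touches only a tile-sized block of a, b and c.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = 0
                        for kk in range(k0, min(k0 + tile, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] += acc  # accumulate partial sums across k0
    return c
```

The arithmetic splits cleanly; the scheduling question raised elsewhere in the thread is which core holds which block and in what order the partial sums move.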

It's not like there is one SRAM, there are many SRAMs, so you get the same problem as NUMA but a thousandfold. Some computations you can map to a regular grid/hypercube/whatever quite easily, but it is unclear what the interconnect between the PEs is here, or what this thing has for a NOC or NOCs, how routing is handled, etc., and further complicating the issue is compensating for any damaged PEs or damaged routes.

No, you don't have all the issues with traditional NUMA because you aren't doing the same sort of heterogeneous workloads. You're always working on local data, and streaming your outputs to the next layer. This isn't a request-response architecture; such a thing wouldn't scale.

It is more or less the same; it's just that in NUMA you have a limited number of localities, whereas here they number in the thousands. The issue is one of scheduling that locality. Some process still needs to determine what data is actually local and where it should "flow". Because it can't all fit in one place, the computation needs to be tiled (potentially in multiple ways) and the tiles need to be scheduled to move around in an efficient manner.

Is it not the case that the defect identification and rerouting happen at the hardware level in a post-production QA phase? If not, I'm even less bullish on Cerebras.

Yes, that's what their web site says.

With 400k cores and their 'duplicate and re-route' defect strategy, this might literally be the most challenging compilation target ever made.

While I'd be generally skeptical, it seems like the compilation for the rerouting could be done on a single low level, below whatever their assembler is, and so the chip could just look like a regular array of cores - just a single array that translates from i to the ith "real" core and similar structures seems like it could be enough.
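
The "array that translates from i to the ith real core" can be sketched in a few lines of Python. This is only the software analogy described above, with hypothetical names (`build_core_map`, `defective`); per the rest of the thread, the actual rerouting is reportedly done in hardware.

```python
def build_core_map(physical_cores, defective, logical_count):
    """Return a list where entry i is the physical index of logical core i."""
    # Keep working cores in physical order so neighbours stay roughly adjacent.
    working = [p for p in range(physical_cores) if p not in defective]
    if len(working) < logical_count:
        raise ValueError("not enough working cores for the requested array")
    return working[:logical_count]


# Example: 10 physical cores, cores 2 and 5 failed test, expose 8 logical cores.
core_map = build_core_map(10, {2, 5}, 8)
```

Everything above this layer would address cores `0..7` and never see the gaps. What the table does not capture is that skipping a core changes the physical distance between logical neighbours.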

Edit: I mean, if they're smart, it seems like they'd make the thing look as much as possible like a generic GPU capable of OpenCL. I have no idea if they'll do that, but since they have size, they won't have to sell their stuff on an otherwise custom approach.