by Veedrac 2364 days ago

I'm a bit baffled by this, because on every axis I look at, this seems like a dream of a compilation target.

* No DRAM or caches, everything is in SRAM, and all local SRAM loads are 1 cycle.

* Model parallel alone is full performance, no need for data parallel if you size to fit.

* Defects are handled in hardware; any latency differences are hidden & not in load path anyway.

* Fully asynchronous/dataflow by default, only need minimal synchronization between forward/backward passes.

I genuinely don't know how you'd build a simpler system than this.

3 comments

jcranmer 2364 days ago

Having worked on compilers for pretty weird architectures, it's generally the case that the less like a regular CPU your architecture is, the more difficult it is to compile.

In particular, when you change the system from having to worry about how to optimally schedule a single state machine to having to place operations on a fixed routing grid (à la FPGA), the problem becomes radically different, and any looping control flow becomes an absolute nail-biter of an issue.

Veedrac 2364 days ago

Remember that you aren't compiling arbitrary programs. Neural nets don't really have any local looping control flow, in the sense that data goes in one end and comes out the other. You'll have large-scale loops over the whole network, and each core might have a loop over small, local arrays of data, but you shouldn't have any sort of internal looping involving different parts of the model.

tachyonbeam 2364 days ago

It's pretty common to have neural networks that have both recurrent nets processing text input and convolutional layers. A classic example would be visual question answering (is there a duck in this picture?). That would be a simple example involving looping over one part of the model. Ideally you want that looping to be done as locally as possible to avoid wasting time having a program on a CPU dispatching, waiting for results and controlling data flow.

Having talked to someone at Cerebras, I also know that they don't just want to do inference with this; they want to accelerate training as well. That can involve much more complex control flow than you think. Start reading about automatic differentiation and you will soon realize that it's complex enough to basically be its own subfield of compiler design. There have been multiple entire books written on the topic, and I can guarantee you there can be control-flow-driven optimizations in there (e.g., if x == 0, then don't compute this large subgraph).

Veedrac 2364 days ago

I would be surprised if Cerebras was trying to handle any recurrence inside the overall forward/backward passes. It seems like a lot of difficulty (as mentioned) for peanuts.

I don't get your point about training. Yes, it's backwards rather than forwards, and yes it often has fancy stuff intermixed (dropout, Adam, ...), but these are CPUs, they can do that as long as it fits the memory model.

IshKebab 2364 days ago

I'm afraid recursivecaveat is right. This is an insanely difficult compilation target. I think you're possibly talking about a different kind of "compilation" - i.e. the Clang/GCC bit that converts C++ to machine code. That is indeed trivial. But "compilation" for these chips includes much more than that.

The really complicated bit is converting the TensorFlow model to some kind of computation plan. Where do you put all the tensor data? How do you move it around the chip? It's insanely complicated. If anything kills Cerebras, it will be the software.

Veedrac 2364 days ago

It's model parallel, so the first thing you do is lay out your floorplan for the model, which looks like this.

https://secureservercdn.net/198.12.145.239/a7b.fcb.myftpuplo...

Then you put your data next to the core that uses it. Simples.

(Optimal placement is tricky, but approximate techniques work fine.)
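To make the "approximate techniques" remark concrete, here is a minimal sketch of one naive approach: give each layer a vertical strip of the core grid, with a width roughly proportional to its share of the compute. The layer names, FLOP counts, grid width, and function names are all made up for illustration; nothing here is claimed to be Cerebras's actual placement algorithm.

```python
def allocate_strips(layer_flops, grid_width):
    """Split grid_width columns among layers in proportion to their FLOPs.

    Every layer gets at least one column; largest-remainder rounding makes
    the widths sum exactly to grid_width.
    """
    if grid_width < len(layer_flops):
        raise ValueError("need at least one column per layer")
    total = sum(layer_flops.values())
    # Reserve one column per layer, then share the rest proportionally.
    spare = grid_width - len(layer_flops)
    exact = {name: spare * f / total for name, f in layer_flops.items()}
    widths = {name: 1 + int(share) for name, share in exact.items()}
    leftover = grid_width - sum(widths.values())
    # Hand remaining columns to the layers with the largest fractional parts.
    by_fraction = sorted(exact, key=lambda n: exact[n] - int(exact[n]), reverse=True)
    for name in by_fraction[:leftover]:
        widths[name] += 1
    return widths


def strip_ranges(widths):
    """Turn widths into half-open (start, end) column ranges, in layer order."""
    ranges, start = {}, 0
    for name, w in widths.items():
        ranges[name] = (start, start + w)
        start += w
    return ranges


# Hypothetical three-layer model on a 13-column grid.
widths = allocate_strips({"conv1": 100, "conv2": 300, "fc": 100}, 13)
print(strip_ranges(widths))  # {'conv1': (0, 3), 'conv2': (3, 10), 'fc': (10, 13)}
```

A real placer would also have to respect per-core memory limits and 2D shapes, which is where the "tricky" part comes in.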

IshKebab 2364 days ago

When you consider the things that that diagram doesn't show, it doesn't look at all simple. Does that graph even have training? It'll have to be pipelined too. Probably will have to use recomputation due to the shortage of memory. What about within the boxes? You can't nicely separate a matmul into pieces like that.

I work on something similar but less ambitious. Trust me, it is crazy complicated.
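For readers unfamiliar with the recomputation mentioned above: the idea is to store only some activations on the forward pass and rebuild the rest during the backward pass, trading compute for memory. The sketch below shows it on a toy chain of scalar functions. The layer list, the `stride` parameter, and the function names are illustrative assumptions, not anything described in the thread.

```python
import math

# Each toy "layer" is (f, df): a scalar function and its derivative.
LAYERS = [
    (lambda x: 2.0 * x, lambda x: 2.0),
    (math.tanh, lambda x: 1.0 - math.tanh(x) ** 2),
    (lambda x: x * x, lambda x: 2.0 * x),
    (math.sin, math.cos),
]


def grad_full(layers, x):
    """Backprop that stores every layer input (maximum memory)."""
    inputs = []
    for f, _ in layers:
        inputs.append(x)
        x = f(x)
    g = 1.0
    for (_, df), xin in zip(reversed(layers), reversed(inputs)):
        g *= df(xin)
    return g


def grad_checkpointed(layers, x, stride):
    """Backprop that stores only every `stride`-th input and recomputes the rest."""
    checkpoints = {}
    for i, (f, _) in enumerate(layers):
        if i % stride == 0:
            checkpoints[i] = x
        x = f(x)
    g = 1.0
    # Walk segments from last to first, rebuilding each segment's inputs
    # from its checkpoint before applying the chain rule.
    for start in sorted(checkpoints, reverse=True):
        segment = layers[start:start + stride]
        xin, seg_inputs = checkpoints[start], []
        for f, _ in segment:
            seg_inputs.append(xin)
            xin = f(xin)
        for (_, df), v in zip(reversed(segment), reversed(seg_inputs)):
            g *= df(v)
    return g
```

With `stride=2` this keeps two stored inputs instead of four and gives the same gradient as `grad_full`, at the cost of running each layer's forward function a second time.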

Veedrac 2363 days ago

Could you be more explicit? What about the naïve approach to training (same graph but backwards, computing gradients) is going to fail?

Wrt. matmul, if you couldn't split them up, today's AI accelerators wouldn't work, full stop. But regardless, even if it were much more complex on CS-1 than on all the other sea-of-multipliers accelerators, it's obviously a problem they've solved, and so irrelevant to the compilation issue.
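On splitting a matmul, the textbook technique is blocking: each output tile depends only on a row band of A and a column band of B, so tiles can be computed independently while partial products are accumulated. A minimal pure-Python sketch follows. The tile size and function names are illustrative, and this says nothing about how CS-1 actually partitions the work.

```python
def matmul(a, b):
    """Reference dense matmul on lists of lists."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def tiled_matmul(a, b, tile):
    """Compute a @ b one (tile x tile) output block at a time.

    Each (i0, j0) block could in principle live on its own core; the p0
    loop is the stream of partial products that block accumulates.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for p0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = 0
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] += acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(tiled_matmul(a, b, 2))  # [[58, 64], [139, 154]]
```

The arithmetic is the easy part. Deciding which core holds which tile, and when the bands of A and B get streamed past it, is the part the two sides of this thread disagree about.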

jhj 2364 days ago

It's not like there is one SRAM; there are many SRAMs, so you get the same problem as NUMA but a thousandfold. Some computations you can map to a regular grid/hypercube/whatever quite easily, but it is unclear what the interconnect between the PEs is here, or what this thing has for a NoC or NoCs, how routing is handled, etc., and further complicating the issue is compensating for any damaged PEs or damaged routes.

Veedrac 2364 days ago

No, you don't have all the issues with traditional NUMA because you aren't doing the same sort of heterogeneous workloads. You're always working on local data, and streaming your outputs to the next layer. This isn't a request-response architecture; such a thing wouldn't scale.

jhj 2356 days ago

It is more or less the same; it's just that in NUMA you have a limited number of localities, whereas here they number in the thousands. The issue is one of scheduling that locality. Some process still needs to determine what data is actually local and where it should "flow". Because it can't all fit in one place, the computation needs to be tiled (potentially in multiple ways) and the tiles need to be scheduled to move around in an efficient manner.
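One rough way to see why scheduling locality matters: on a 2D mesh, the cost of streaming a tensor between two tiles grows with the hop count between the cores holding them. The sketch below scores a placement as bytes times Manhattan distance. The mesh model, the cost function, and the example numbers are assumptions for illustration only, since the thread itself notes the real interconnect is unclear.

```python
def manhattan(p, q):
    """Hop count between two (row, col) cores on a 2D mesh."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def traffic_cost(placement, edges):
    """Sum of bytes * hops over all producer -> consumer edges."""
    return sum(nbytes * manhattan(placement[src], placement[dst])
               for src, dst, nbytes in edges)


# Hypothetical dataflow: a -> b carries 100 bytes, b -> c carries 50.
edges = [("a", "b", 100), ("b", "c", 50)]
neighbours = {"a": (0, 0), "b": (0, 1), "c": (0, 2)}
scattered = {"a": (0, 0), "b": (3, 3), "c": (0, 1)}

print(traffic_cost(neighbours, edges))  # 150
print(traffic_cost(scattered, edges))   # 850
```

With thousands of cores and tiles, searching for a placement that keeps this kind of cost low while still fitting every tile in local SRAM is the scheduling problem being described.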
