
by keveman 2539 days ago

> How is this programmed?

Full disclosure: I am a Cerebras employee.

There is extensive support for TensorFlow. A wide range of models expressed in TensorFlow will be accelerated transparently.

5 comments

paulsutter 2539 days ago

He was asking about the implications for yields. Do you route around bad dies/cores, and what are the implications for programming and performance?

For everyone else: normally a wafer is divided into dies, each of which is (loosely) a chip. Yield is the percentage of good parts, and it's very unlikely that an entire wafer is good. Gene Amdahl estimated that 99.99% yield is needed for successful wafer scale integration:

https://en.wikipedia.org/wiki/Wafer-scale_integration

Veedrac 2539 days ago

> For example, the typical 300mm wafer from TSMC may contain “a modest hundred number of flaws,” said Feldman. Cerebras gave its Swarm interconnect redundant links to route around defective tiles and allocated “a little over 1% [of the tiles] as spares.”

https://www.eetimes.com/document.asp?doc_id=1335043&page_num...
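
A rough back-of-the-envelope sketch of the numbers quoted above. It assumes a simple Poisson defect model (a textbook approximation, not anything Cerebras has published), treats the "hundred flaws" as the expected defect count per wafer, and borrows the 400,000-core figure mentioned elsewhere in this thread as a stand-in for the tile count. All names are illustrative.

```python
import math

# Assumptions (not from Cerebras): defects land independently (Poisson model),
# each defect disables exactly one tile, and the tile count equals the
# 400,000-core figure mentioned elsewhere in this thread.
EXPECTED_DEFECTS_PER_WAFER = 100   # "a modest hundred number of flaws"
TILES = 400_000                    # hypothetical stand-in for the tile count
SPARE_TILES = TILES // 100         # "a little over 1%" spares, rounded down to 1%

# Poisson probability that a whole wafer has zero defects: exp(-expected defects).
p_defect_free_wafer = math.exp(-EXPECTED_DEFECTS_PER_WAFER)

# Worst case under the assumption above: every defect kills a distinct tile.
spares_left = SPARE_TILES - EXPECTED_DEFECTS_PER_WAFER

print(f"P(defect-free wafer) = {p_defect_free_wafer:.3g}")
print(f"spare tiles: {SPARE_TILES}, left after {EXPECTED_DEFECTS_PER_WAFER} defects: {spares_left}")
```

Under these assumptions a flawless wafer is essentially impossible, while roughly 1% spares would cover a hundred defects many times over. That is the case for routing around bad tiles rather than hoping for a perfect wafer.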

gwern 2539 days ago

Looking at the whitepaper, I'm a little surprised how little RAM there is for such an enormous chip. Is the overall paradigm here that you still have relatively small minibatches during training, but each minibatch is now vastly faster?

ivalm 2539 days ago

IIRC they use batch size = 1 and each core only knows about one layer. Which is to say this thing has to be trained very differently from normal SGD (but requires very little memory). There is also the issue that they rely on sparsity, which you get with ReLU activations, but if, for example, language models move to GELU activations they will be somewhat screwed.
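
To illustrate the sparsity point, here is a toy sketch (not a model of the Cerebras hardware). It assumes standard-normal pre-activations and counts how many outputs are exactly zero, which is what sparsity-aware hardware could skip. The function names and the input distribution are assumptions chosen for illustration.

```python
import math
import random

def relu(x: float) -> float:
    # ReLU clamps every negative input to exactly zero.
    return max(0.0, x)

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def zero_fraction(values):
    # Fraction of entries that are exactly zero.
    return sum(1 for v in values if v == 0.0) / len(values)

random.seed(0)
# Assumed input distribution: standard normal pre-activations.
pre_activations = [random.gauss(0.0, 1.0) for _ in range(10_000)]

relu_zero_fraction = zero_fraction([relu(x) for x in pre_activations])
gelu_zero_fraction = zero_fraction([gelu(x) for x in pre_activations])

print(f"exact zeros after ReLU: {relu_zero_fraction:.1%}")
print(f"exact zeros after GELU: {gelu_zero_fraction:.1%}")
```

With ReLU about half of the outputs are exact zeros, whereas GELU maps negative inputs to small nonzero values, so there is far less to skip unless small values are thresholded away.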

IshKebab 2539 days ago

It's because it's SRAM, not DRAM. Think how much L3 cache your processor has. A few MB probably. That's what this chip's memory is equivalent to.

morphle 2539 days ago

We have up to 160 GB SRAM on our WSI. The rest of the transistors can be a few million cores or reconfigurable Morphle Logic (an open hardware kind of FPGA)

Our startup has been working on a full Wafer Scale Integration since 2008. We are searching for cofounders. Merik at metamorphresearch dot org

Veedrac 2539 days ago

“full utilization at any batch size, including batch size 1”

https://www.cerebras.net/

gwern 2539 days ago

That doesn't really mean anything. It (and any other chip) had better be able to run at least batch size 1, and lots of people claim to have great utilization... It doesn't tell me if the limited memory is part of a deliberate tradeoff akin to a throughput/latency tradeoff, or some intrinsic problem with the speedups coming from other design decisions like the sparsity multipliers, or what.

Veedrac 2539 days ago

Most of the chip is already SRAM, I'm not really sure what else you would expect?

18 GiB × 6 transistors/bit ≈ 0.93 trillion transistors
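
The arithmetic behind that estimate, spelled out as a small sketch. It assumes the classic six-transistor (6T) SRAM cell used in the comment and counts storage cells only, ignoring decoders, sense amplifiers, and other peripheral logic.

```python
GIB = 2**30                      # bytes per GiB
TRANSISTORS_PER_BIT = 6          # assumed 6T SRAM cell, as in the comment

sram_bytes = 18 * GIB            # 18 GiB of on-chip SRAM
sram_bits = sram_bytes * 8
sram_transistors = sram_bits * TRANSISTORS_PER_BIT

print(f"{sram_transistors / 1e12:.2f} trillion transistors in SRAM cells")
```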

gwern 2539 days ago

Well, it could be... not SRAM? It's not the only kind of RAM, and the choice to use SRAM is certainly not an obvious one. It could make sense as part of a specific paradigm, but that is not explained, which is why I am asking. It may be perfectly obvious to you, but it's not to me.

Veedrac 2539 days ago

You basically have the option between SRAM, HBM (DRAM), and something new. You can imagine the risks with using new memory tech on a chip like this.

The issue with HBM is that it's much slower, much more power hungry (per access, not per byte), and not local (so there are routing problems). You can't scale that to this much compute.

zackmorris 2539 days ago

Do you support MATLAB or GNU Octave? I'm looking for the level of abstraction below TensorFlow because I find pure matrix math to be more approachable. Admittedly, I'm not super experienced with TF, so maybe it can encapsulate them.

Also, do you have a runtime to run the chip as a single 400,000 core CPU with some kind of memory mapped I/O so that a single 32 or 64 bit address space writes through to the RAM router through virtual memory? I'm hoping to build a powerful Erlang/Elixir or Go machine so I can experiment with other learning algorithms in realtime, outside the constraints of SIMD-optimized approaches like neural nets. Another option would be 400,000 virtual machines in a cluster, each running a lightweight Unix/Linux (maybe Debian or something like that). Here is some background on what I'm hoping for:

https://news.ycombinator.com/item?id=20601699

See my other comments for more. I've been looking for a parallel machine like this since I learned about FPGAs in the late 90s, but so far have not had much success finding any.

streetcat1 2539 days ago

So why are you not publishing benchmarks against nvidia?

sanxiyn 2539 days ago

Cerebras is an MLPerf member, so they will publish MLPerf numbers some day and then we will talk.

streetcat1 2539 days ago

They probably ran the benchmark (many times, I'd guess, and not only against nvidia). Yet it is not in the white paper.

I was an SE at a hardware company, and it is the first thing that you do as a product manager.

McP 2539 days ago

What is an SE?

streetcat1 2539 days ago

Software engineer.

The_rationalist 2539 days ago

How do you achieve this? TensorFlow does not support OpenCL.

rrss 2539 days ago

I'm sure they wrote a new backend for TensorFlow that targets their API. Since the hardware is only for ML, it wouldn't make sense for them to bother trying to implement OpenCL.