by TomVDB 1381 days ago
With no virtual memory, no caches, and interface processors instead of direct access to external DRAM, this thing must be a programming nightmare?

Having tons of small CPUs with fast local SRAM is of course not a new idea. Back in 1998, I talked to a startup that believed it could replace standard cell ASIC design with tiny CPUs that had custom instruction sets. (I didn't believe it could: it's extremely area inefficient and way too power hungry for that kind of application. The startup went nowhere.) And the IBM Cell is indeed an obvious inspiration.

But AFAIK, the IBM Cell was hard to program. I've seen PS3 presentations where it was primarily used as a software defined GPU, because it was just too difficult to use as a general purpose processor.

Now NOT being a general purpose processor is the whole point of Dojo, so maybe they can make it work. But from my limited experience with CUDA, virtual memory and direct access to DRAM are a major plus, even if the high performance compute routines make intensive use of shared memory. The fact that an interface processor is involved (how?) in managing your local SRAM must make synchronization much more complex than with CUDA, where everything is handled by the same SM that manages the calculations: your warp issues a load, it waits on a barrier, the calculations happen, sometimes in a side unit in which case you again wait on a barrier, you offload the data and wait on a barrier. And while one warp waits on a barrier, another warp can take over. It's pretty straightforward.

The Dojo model suggests that "wait on a barrier" becomes "wait on the interface processor".

6 comments

If it only ever runs one program, and that program is an implementation of vanilla Transformers, that might be all it needs to be useful. Sufficiently large Transformers can do an incredible variety of tasks. If someone invents something better than vanilla Transformers, then they can write a second program for that.
Also investing in a branch predictor when the intended workload doesn't seem at all scalar is a confusing choice to me. Also the 362 F16 TFLOPs sounds super impressive, except the memory bandwidth is I think 800 GB/s (or is it 5 times that? Or effectively less than that if data has to be passed along multiple hops? I'm a bit confused), which means having to do 1000 ops (or 200? or more?) on each 16 bit value loaded in. Maybe you could do that, but it feels like you'd probably end up bandwidth bound most of the time.
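To put numbers on the "1000 ops (or 200?)" estimate above, here is a quick back-of-the-envelope sketch in Python. It only uses the figures quoted in the comment (362 F16 TFLOPs, 800 GB/s or 5 times that) and assumes 2 bytes per 16-bit value; the function and constant names are made up for illustration, and none of these are verified Dojo specs.

```python
def ops_per_value(peak_flops, bandwidth_bytes_per_s, bytes_per_value=2):
    """FLOPs that must be done per loaded value to keep the compute units busy."""
    values_per_s = bandwidth_bytes_per_s / bytes_per_value
    return peak_flops / values_per_s

PEAK_F16_FLOPS = 362e12   # 362 TFLOPs, as quoted in the comment
BW_LOW = 800e9            # 800 GB/s, the commenter's first guess
BW_HIGH = 5 * BW_LOW      # "5 times that", the commenter's alternative

low = ops_per_value(PEAK_F16_FLOPS, BW_LOW)
high = ops_per_value(PEAK_F16_FLOPS, BW_HIGH)
print(f"{low:.0f} ops per 16-bit value at 800 GB/s")
print(f"{high:.0f} ops per 16-bit value at 4 TB/s")
```

Under those assumptions the break-even point is roughly 905 ops per value at 800 GB/s and 181 at 4 TB/s, which matches the rough "1000 or 200" in the comment. Multi-hop transfers would only push the required reuse higher.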
My understanding is that they occasionally load weights into SRAM, then pump in training data on the sides of the die and have multiple cores operate on a wavefront of data. So the cores don't compete for host memory bandwidth, because the same data flows (transformed) through multiple cores.
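A toy Python model of that data-reuse argument, as a sketch only: each hypothetical `Core` keeps one weight resident (standing in for weights held in local SRAM), and every sample is loaded once at the edge and then handed from core to core. The class and function names are invented for illustration, and the loop is sequential, so it models how much work is done per edge load, not the timing of a real wavefront.

```python
class Core:
    """Hypothetical core: one resident weight, counts the ops it performs."""

    def __init__(self, weight):
        self.weight = weight  # stays in local memory across all samples
        self.ops = 0

    def step(self, x):
        self.ops += 1
        return x * self.weight


def run_chain(weights, samples):
    """Push each sample through every core; count edge loads and total ops."""
    cores = [Core(w) for w in weights]
    edge_loads = 0
    outputs = []
    for x in samples:
        edge_loads += 1          # the only access to off-die memory
        for core in cores:       # data is passed core to core afterwards
            x = core.step(x)
        outputs.append(x)
    total_ops = sum(core.ops for core in cores)
    return outputs, edge_loads, total_ops


outputs, edge_loads, total_ops = run_chain([2, 3, 4], [1, 2])
print(outputs, edge_loads, total_ops, total_ops / edge_loads)
```

In this toy model the ops done per value loaded at the edge grow with the number of cores the data passes through, which is the reason the commenter gives for the cores not competing for host bandwidth.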
You are right that this won't work well with any language that assumes a "normal" processor. But a small language that is written for it could be fine.
From my understanding the Cell was meant to be the GPU for the PS3, but Sony instead found the same issues and could not program a reasonable performing SDK using it within the time limits (MS Xbox 360) and added in an Nvidia RSX GPU.

Another oddball architecture that went nowhere.

> could not program a reasonable performing SDK using it within the time limits

It feels like "within the time limits" has always been the problem of difficult-to-program, software-dependent architectures: time vs competitors.

E.g. in the time it takes to write an intelligent compiler (IA-64), your better-resourced competitor (better-resourced because they're getting revenue from the current market) has surpassed your performance through brute-force evolution.

There are use cases out there (early supercomputing, NVIDIA) where radical, development-heavy architectures have been successful, but they generally either lack competitors (the former) or iterate ruthlessly themselves (the latter).

"radical, development-heavy architectures" = niche use case

Connection Machine = only had one customer afaik (NSA)

Transmeta - interesting technology but nobody in that market wanted to run anything besides Windows+x86.

Sounds to me like a programming dream. The usual way of things these days is 'don't waste your employer's time trying to optimize; everything that can profitably be done has already been done by other people; you just have to accept that particular part of your skill set is useless'. Dojo would let you actually use a lot more of your skills.
What programming is there when it's a model being run?