|
With no virtual memory, no caches, and interface processors instead of direct access to external DRAM, this thing must be a programming nightmare? Having tons of small CPUs with fast local SRAM is of course not a new idea. Back in 1998, I talked to a startup that believed it could replace standard cell ASIC design with tiny CPUs that had custom instruction sets. (I didn't believe it could: it's extremely area inefficient and way too power hungry for that kind of application. The startup went nowhere.) And the IBM Cell is indeed an obvious inspiration. But AFAIK, the IBM Cell was hard to program. I've seen PS3 presentations where it was primarily used as a software-defined GPU, because it was just too difficult to use as a general purpose processor. Now NOT being a general purpose processor is the whole point of Dojo, so maybe they can make it work. But from my limited experience with CUDA, virtual memory and direct access to DRAM are a major plus, even if the high performance compute routines make intensive use of shared memory. The fact that an interface processor is involved (how?) in managing your local SRAM must make synchronization much more complex than with CUDA, where everything is handled by the same SM that manages the calculations: your warp issues a load, it waits on a barrier, the calculation happens (sometimes in a side unit, in which case you again wait on a barrier), then you offload the data and wait on a barrier. And while one warp waits on a barrier, another warp can take over. It's pretty straightforward. The Dojo model suggests that "wait on a barrier" becomes "wait on the interface processor". |
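To make the CUDA side of that comparison concrete, here is a minimal sketch of the load / barrier / compute / barrier / store flow, written as a per-block sum in shared memory. The names `block_sum`, `device_sum` and `kBlock` are made up for illustration, error checking is left out, and this only shows the CUDA pattern; I have no idea what the equivalent looks like on Dojo.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kBlock = 256;  // threads per block (hypothetical; must be a power of two)

// Each block sums kBlock inputs using shared memory.
// Loads, barriers and arithmetic are all issued by the same SM.
__global__ void block_sum(const float* in, float* partial, int n) {
    __shared__ float tile[kBlock];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // 1. Load: global DRAM -> shared memory (pad the tail with zeros).
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;

    // 2. Barrier: wait until the whole tile is in shared memory.
    __syncthreads();

    // 3. Compute: tree reduction, with a barrier after every step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }

    // 4. Store: one thread writes the block's result back to DRAM.
    if (threadIdx.x == 0) {
        partial[blockIdx.x] = tile[0];
    }
}

// Host helper: copies data in, launches the kernel, adds up the per-block sums.
// Error checking omitted to keep the sketch short.
float device_sum(const std::vector<float>& host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kBlock - 1) / kBlock;

    float* d_in = nullptr;
    float* d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, kBlock>>>(d_in, d_partial, n);

    // This blocking copy on the default stream also waits for the kernel.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_partial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Every `__syncthreads()` here is a point where the hardware can switch to other warps that are ready to run, which is the "another warp can take over" part. The open question for Dojo is what replaces those barriers when a separate interface processor is the one moving data into your SRAM.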