| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by elseless 371 days ago

Sure. You can think of a (simple) traditional CPU as executing instructions in time, one-at-a-time[1] — it fetches an instruction, decodes it, performs an arithmetic/logical operation, or maybe a memory operation, and then the instruction is considered to be complete.

The Efficient architecture is a CGRA (coarse-grained reconfigurable array), which means that it executes instructions in space instead of time. At compile time, the Efficient compiler looks at a graph made up of all the “unrolled” instructions (and data) in the program, and decides how to map it all spatially onto the hardware units. Of course, the graph may not all fit onto the hardware at once, in which case it must also be split up to run in batches over time. But the key difference is that there’s this sort of spatial unrolling that goes on.

This means that a lot of the work of fetching and decoding instructions and data can be eliminated, which is good. However, it also means that the program must be mostly, if not completely, static, meaning there’s a very limited ability for data-dependent branching, looping, etc. to occur compared to a CPU. So even if the compiler claims to support C++/Rust/etc., it probably does not support, e.g., pointers or dynamically-allocated objects as we usually think of them.

[1] Most modern CPUs don’t actually execute instructions one-at-a-time — that’s just an abstraction to make programming them easier. Under the hood, even in a single-core CPU, there is all sorts of reordering and concurrent execution going on, mostly to hide the fact that memory is much slower to access than on-chip registers and caches.

5 comments

pclmulqdq 371 days ago

Pointers and dynamic objects are probably fine given the ability to do indirect loads, which I assume they have (Side note: I have built b-trees on FPGAs before, and these kinds of data structures are smaller than you think). It's actually pure code size that is the problem here rather than specific capabilities, as long as the hardware supports those instructions.

Instead of assembly instructions taking time in these architectures, they take space. You will have a capacity of 1000-100000 instructions (including all the branches you might take), and then the chip is full. To get past that limit, you have to store state to RAM and then reconfigure the array to continue computing.

link

elseless 371 days ago

Agree that code size is a significant potential issue, and that going out to memory to reprogram the fabric will be costly.

Re: pointers, I should clarify that it’s not the indirection per se that causes problems — it’s the fact that, with (traditional) dynamic memory allocation, the data’s physical location isn’t known ahead of time. It could be cached nearby, or way off in main memory. That makes dataflow operator latencies unpredictable, so you either have to 1. leave a lot more slack in your schedule to tolerate misses, or 2. build some more-complicated logic into each CGRA core to handle the asynchronicity. And with 2., you run the risk that the small, lightweight CGRA slices will effectively just turn into CPU cores.

link

pclmulqdq 371 days ago

Oh, many embedded architectures don't have a cache hierarchy and instead place dynamic objects in one SRAM. Access latency is constant anywhere you go.

link

kannanvijayan 371 days ago

Hmm. You'd be able to trade off time for that space by using more general configurations that you can dynamically map instruction-sequences onto, no?

The mapping wouldn't be as efficient as a bespoke compilation, but it should be able to avoid the configuration swap-outs.

Basically a set of configurations that can be used as an interpreter.

link

markhahn 371 days ago

I think that footnote is close to the heart of it: on a modern OoO superscalar processor, there are hundreds of instructions in-flight. that means a lot of work done to maintain their state and ensure that they "fire" when their operands are satisfied. I think that's what this new system is about: a distributed, scalable dataflow-orchestration engine.

I think this still depends very much on the compiler: whether it can assemble "patches" of direct dependencies to put into each of the little processing units. the edges between patches are either high-latency operations (memory) or inter-patch links resulting from partitioning the overall dataflow graph. I suspect it's the NOC addressing that will be most interesting.

link

esperent 371 days ago

> it executes instructions in space instead of time. At compile time, the Efficient compiler looks at a graph made up of all the “unrolled” instructions (and data) in the program, and decides how to map it all spatially onto the hardware units.

Naively that sounds similar to a GPU. Is it?

link

3836293648 370 days ago

No? GPUs are just extremely parallel much wider SIMD cores

link

drcongo 370 days ago

You managed to explain that in a way that even I could understand. Magnificent, thank you.

link

majkinetor 370 days ago

> meaning there’s a very limited ability for data-dependent branching, looping, etc. to occur compared to a CPU

Not very useful then if I can't do this very basic thing?

link