| Minimum energy usage depends heavily on not activating more circuits than strictly required for a given computation. However, a conventional processor pipeline will usually fetch instructions and begin processing them, only to realize later that they were not necessary. This happens on mispredicted branches, cache misses, exceptions, etc. In each case circuits get activated and spend energy, only to throw away their results because the instruction's effects must be discarded. In contrast, in a dataflow processor each instruction indicates explicitly which other instruction(s) will produce its inputs or, conversely, which other instruction(s) get activated when it completes execution. This way, instructions only enter the pipeline when their operands are ready, and speculation never occurs. So no more energy is spent than strictly necessary to do the work (the instructions). Now, the reason we use these forms of speculation is that they are the only way to make the pipeline fast when the instruction stream (the program) carries no information about the dataflow dependencies between instructions. Because it does not know better, the scheduler has to either: 1) try all instructions in program order, start doing work as early as possible, and sometimes discard work already started because an earlier instruction has decided a branch / fault / etc., or 2) rediscover the dataflow links by analyzing the instructions as they enter the processor, but then the silicon logic that implements these tricks also costs energy. The funny thing is, all compilers know about the dataflow dependencies between instructions, but they throw the information away because existing instruction sets cannot encode it. So really the situation should be simple: make new processors that support dataflow annotations, extend the compilers to encode this information (which they already have anyway), and off we go.
However, as others have highlighted, making new instruction sets is like a "moonshot" because you have to involve a lot of people: compiler implementers, but also OS devs and everyone who will need to port their code to the new ISA. Besides, dataflow processors have a gorilla in the kitchen too. In a "pure" dataflow scheduler, all instruction order is destroyed and, as a result, cache locality is broken. So the flip side of the coin is 1) bad memory performance and 2) extra energy spent in the memory system to deal with cache misses. Now, there are ways to get the best of both worlds. One is to destroy the ordering of instructions only partially, by applying dataflow scheduling only within a window (e.g. the next 20 instructions). This is more or less what modern out-of-order processors do, although they still waste energy rediscovering dataflow links at run time. The other technique is where many of us are going right now: interleave multiple hardware threads, keep the instruction order within each thread to exploit whatever locality the programmer encoded, and apply dataflow scheduling techniques across instructions from separate threads, i.e. exploit maximum instruction concurrency between independent threads. Sun/Oracle started it with Niagara, and now ARM is going there too. This approach works very well in terms of operations per watt; however, it requires software to use threads in the first place, and not much software does that (yet). Also, there is still a lot of ongoing research. |
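To make the windowed idea from the quoted post concrete, here is a minimal sketch in Python, not a model of any real processor. It assumes each instruction names the instructions that produce its inputs, that dependencies only point backward in program order, and that any number of ready instructions can issue per cycle. The `Instr` class, the `schedule` function, the example program and its latencies are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    name: str
    deps: tuple = ()   # names of the instructions producing this one's inputs
    latency: int = 1   # cycles until the result is available (hypothetical)


def schedule(program, window):
    """Issue instructions as their operands become ready.

    Only the oldest `window` unissued instructions are candidates each
    cycle. Returns one list of issued names per cycle. Nothing is issued
    speculatively: an instruction waits until all its producers are done.
    Assumes every dependency refers to an earlier instruction.
    """
    done_at = {}             # name -> cycle at which its result is available
    pending = list(program)  # unissued instructions, in program order
    trace = []
    cycle = 0
    while pending:
        candidates = pending[:window]
        ready = [
            i for i in candidates
            if all(d in done_at and done_at[d] <= cycle for d in i.deps)
        ]
        for i in ready:
            done_at[i.name] = cycle + i.latency
            pending.remove(i)
        trace.append([i.name for i in ready])
        cycle += 1
    return trace


# Hypothetical program: two independent load/add chains feeding a multiply.
program = [
    Instr("load_a", latency=3),
    Instr("add1", deps=("load_a",)),
    Instr("load_b", latency=3),
    Instr("add2", deps=("load_b",)),
    Instr("mul", deps=("add1", "add2")),
]

in_order = schedule(program, window=1)
windowed = schedule(program, window=4)
```

With `window=1` the sketch degenerates to strict program order and takes 9 cycles, because the second load cannot start until the first chain has finished. With `window=4` both loads issue in cycle 0 and the whole program takes 5 cycles, while instructions still only start once their operands exist.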
For many applications, pure power consumption isn't even the best metric anymore. Due to advances in on-chip power and clock distribution, the energy × delay product and overall silicon efficiency have gained importance in recent years.
Obviously, dataflow processors excel in these metrics. And VLIW processors fall behind, which IMHO is the primary reason for their demise.
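For readers who haven't met the metric: the energy × delay product multiplies the energy a task consumes by the time it takes, so a lower value is better. A tiny sketch follows; the two designs and their numbers are entirely hypothetical and only show how the ranking can differ from a pure energy comparison.

```python
def energy_delay_product(energy_joules, delay_seconds):
    """Energy x delay product for one task; lower is better."""
    return energy_joules * delay_seconds


# Hypothetical designs running the same task (made-up numbers).
slow_frugal = energy_delay_product(energy_joules=1.0, delay_seconds=2.0)
fast_hungry = energy_delay_product(energy_joules=1.5, delay_seconds=1.0)
```

Here the second design uses 50% more energy but still comes out ahead on energy × delay (1.5 versus 2.0), because it finishes in half the time.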
I agree with you that a practical dataflow architecture needs to be hierarchical. Not just for cache locality, but to reduce wiring overhead and debugging complexity, too.