How timely that the Adapteva Parallella computer board was just made available as well. Perhaps something interesting can happen at the confluence of that hardware and these compiler thoughts.
This is where it needs to happen. The problem is that co-designing a new hardware/software stack from scratch is the computer science equivalent of a moonshot.
The word "moonshot" is close to the truth, yet there are strategies to overcome the adoption threshold. That was precisely the topic of my PhD dissertation, and I concur that the Adapteva guys have chosen a sound approach.
But Epiphany is not a dataflow architecture, and suffers from execution efficiency problems in individual cores. Dataflow is really the way to go to lower energy consumption dramatically.
Funny this topic comes up. It just happens the EU has recently invested in such a project to co-design a dataflow processor and software stack. It was north of a million euros for the initial proof-of-concept research, and the results are slowly starting to trickle through.
Minimum energy usage is very dependent on not activating more circuits than strictly required for a given computation.
However, a conventional processor pipeline will usually fetch instructions and begin processing them, only to realize later on that they were not necessary. This happens upon mispredicted branches, cache misses, exceptions, etc. In each case, circuits get activated and spend energy, only for their results to be thrown away because the instruction's effects must be discarded.
In contrast, in a dataflow processor, each instruction indicates explicitly which other instruction(s) will produce its input. Or conversely, which other instruction(s) get activated as the result of one instruction completing execution. This way, instructions only enter the pipeline when their operands are ready, and speculation never occurs. So there is no more energy spent than strictly necessary to do the work (instructions).
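To make the firing rule concrete, here is a minimal software sketch of it: an instruction only runs once every operand it names has been produced. This is a toy simulation, not a model of any real dataflow processor; the function name `run_dataflow` and the example program are made up for illustration.

```python
from collections import deque

def run_dataflow(program, inputs):
    """Toy dataflow interpreter (illustrative only).

    program: dict mapping instruction name -> (function, [operand names])
    inputs:  dict mapping input name -> value
    Returns (values, firing_order).
    """
    values = dict(inputs)
    consumers = {}   # operand name -> instructions waiting on it
    missing = {}     # instruction name -> number of operands not yet produced
    for name, (_, operands) in program.items():
        pending = {op for op in operands if op not in values}
        missing[name] = len(pending)
        for op in pending:
            consumers.setdefault(op, []).append(name)

    # Only instructions whose operands are all available may fire.
    ready = deque(n for n, m in missing.items() if m == 0)
    order = []
    while ready:
        name = ready.popleft()
        fn, operands = program[name]
        values[name] = fn(*(values[op] for op in operands))
        order.append(name)
        # Completing an instruction wakes up the instructions it feeds.
        for consumer in consumers.get(name, []):
            missing[consumer] -= 1
            if missing[consumer] == 0:
                ready.append(consumer)
    return values, order

# Hypothetical example program: out = (x + y) * z + (x - z)
program = {
    "sum":  (lambda a, b: a + b, ["x", "y"]),
    "prod": (lambda s, z: s * z, ["sum", "z"]),
    "diff": (lambda x, z: x - z, ["x", "z"]),
    "out":  (lambda p, d: p + d, ["prod", "diff"]),
}
values, order = run_dataflow(program, {"x": 2, "y": 3, "z": 4})
```

Nothing is ever started and then discarded here: each instruction fires exactly once, and only after its producers have completed.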
Now, the reason we use these forms of speculation is that they are the only way to make the pipeline fast when the instruction stream (program) carries no information about the dataflow dependencies between instructions. Because it does not know better, the scheduler has to either 1) try all instructions in program order, start work as early as possible, and sometimes discard work already started because an earlier instruction has resolved a branch, raised a fault, etc., or 2) rediscover the dataflow links by analyzing the instructions as they enter the processor, but then the silicon logic implementing these tricks also costs energy.
The funny thing is, all compilers know about dataflow dependencies between instructions, but they throw the information away because the existing instruction sets cannot encode it.
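As a rough illustration of the information a compiler already holds, the sketch below recovers producer/consumer links from a straight-line sequence of three-address instructions. It is a simplification: it only tracks true (read-after-write) dependencies on named temporaries, ignores memory and control flow, and the tuple format and the name `dataflow_edges` are assumptions made for this example.

```python
def dataflow_edges(instructions):
    """Return (producer_index, consumer_index) pairs for straight-line code.

    instructions: list of (dest, opcode, src1, src2) tuples.
    Only read-after-write dependencies through named values are tracked.
    """
    last_writer = {}   # value name -> index of the instruction that last wrote it
    edges = []
    for i, (dest, _opcode, *sources) in enumerate(instructions):
        for src in sources:
            if src in last_writer:
                edges.append((last_writer[src], i))
        last_writer[dest] = i
    return edges

# Hypothetical three-address code for t4 = (a + b) * c + (a - c)
code = [
    ("t1", "add", "a", "b"),
    ("t2", "mul", "t1", "c"),
    ("t3", "sub", "a", "c"),
    ("t4", "add", "t2", "t3"),
]
edges = dataflow_edges(code)
```

These edges are exactly what a conventional ISA has no place to store, so the hardware has to rediscover them at run time.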
So really the situation should be simple: make new processors that support dataflow annotations, extend the compilers to encode this information (which they already have anyways), and off we go.
However, as others have highlighted, making new instruction sets is like a "moonshot" because you have to involve a lot of people: compiler implementers, but also OS devs and everyone who will need to port their code to the new ISA.
Besides, dataflow processors have a gorilla in the kitchen too. In a "pure" dataflow scheduler, all the instruction order is destroyed and, as a result, cache locality is broken. So the flip side of the coin becomes 1) bad memory performance and 2) extra energy expenditure on the memory system to deal with cache misses.
Now there are ways to get the best of both worlds.
One is to destroy the ordering of instructions only partially, by applying dataflow scheduling only within a window (e.g. the next 20 instructions). This is more or less what modern out-of-order processors do, although they still waste energy rediscovering dataflow links at run time.
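A small simulation can show how the window size trades reordering against program order. This is only a sketch under simplifying assumptions that are mine, not from any real design: every instruction takes one cycle, any number of ready instructions may issue per cycle, and dependencies are given explicitly as sets of earlier instruction indices.

```python
def windowed_schedule(deps, window):
    """Issue instructions dataflow-style, but only within a sliding window.

    deps[i]: set of earlier instruction indices that instruction i needs.
    window:  how many of the oldest pending instructions the scheduler sees.
    Returns a list of cycles, each a list of the indices issued that cycle.
    """
    done = set()
    pending = list(range(len(deps)))   # kept in program order
    cycles = []
    while pending:
        visible = pending[:window]
        # An instruction issues once all its producers finished in earlier cycles.
        issued = [i for i in visible if deps[i] <= done]
        cycles.append(issued)
        done.update(issued)
        pending = [i for i in pending if i not in done]
    return cycles

# Two independent chains of three instructions each: 0->1->2 and 3->4->5.
deps = [set(), {0}, {1}, set(), {3}, {4}]
in_order = windowed_schedule(deps, window=1)
small_window = windowed_schedule(deps, window=2)
full_dataflow = windowed_schedule(deps, window=6)
```

With a window of 1 the schedule is plain program order; widening the window lets the second chain overlap with the first, at the cost of moving further away from the order the programmer wrote.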
The other technique is where many of us are going right now: use multiple interleaved hardware threads; keep the instruction order within threads to exploit whichever locality is encoded by the programmer, and apply dataflow scheduling techniques across instructions from separate threads, i.e. exploit maximum instruction concurrency between independent threads. Sun/Oracle started it with Niagara, now ARM is going there too. This approach really works very well in terms of operations per watt; however, it requires software to use threads in the first place, and not much software does that (yet).
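The effect of interleaving can be sketched with a toy single-issue core: each thread stays strictly in order, but whenever one thread is waiting on a long-latency instruction, another ready thread issues instead. The latencies, the round-robin policy and the name `run_interleaved` are assumptions chosen for illustration; this is not a model of Niagara or any ARM core.

```python
def run_interleaved(threads):
    """Toy single-issue core with interleaved hardware threads.

    threads: list of threads, each a list of instruction latencies in cycles.
    A thread issues its next instruction only after its previous one completes
    (in-order within a thread); at most one instruction issues per cycle.
    Returns the cycle at which the last instruction completes.
    """
    count = len(threads)
    pc = [0] * count         # next instruction index per thread
    ready_at = [0] * count   # cycle at which each thread may issue again
    cycle = 0
    next_thread = 0          # round-robin starting point
    finish = 0
    while any(pc[t] < len(threads[t]) for t in range(count)):
        for k in range(count):
            t = (next_thread + k) % count
            if pc[t] < len(threads[t]) and ready_at[t] <= cycle:
                ready_at[t] = cycle + threads[t][pc[t]]
                finish = max(finish, ready_at[t])
                pc[t] += 1
                next_thread = (t + 1) % count
                break
        cycle += 1
    return finish

# Hypothetical thread: ALU op (1 cycle), load that misses (10 cycles), ALU op.
work = [1, 10, 1]
one_thread = run_interleaved([work])
four_threads = run_interleaved([work, work, work, work])
```

In this toy setup four threads finish in far fewer cycles than four times the single-thread time, because the miss latency of one thread is overlapped with useful work from the others, and no instruction is ever issued speculatively.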
For many applications, pure power consumption isn't even the best metric anymore. Due to advances in on-chip power and clock distribution, the energy x delay product and overall silicon efficiency have gained more importance in past years.
Obviously, dataflow processors excel in these metrics. And VLIW processors fall behind, which IMHO is the primary reason for their demise.
I agree with you that a practical dataflow architecture needs to be hierarchical. Not just for cache locality, but to reduce wiring overhead and debugging complexity, too.
If compilers already have the dependency information and could provide it in instruction annotations, then why hasn't Intel done anything with this? Intel has its own C/C++ compiler, so it could extend the x86 instruction set with new instructions that contain the necessary annotations, and add support for these annotations in its compiler.
Except that would not show lower energy usage. To preserve backward compatibility, the cores would need to keep the logic that analyses data dependencies to continue delivering good performance to legacy code. To make any difference they would need to both do what you say, and also define some protocol to instruct the processor to disable the dataflow analysis unit entirely (to save energy). But that protocol would be invasive, because you need to re-activate the unit at the first instruction that is not annotated, and upon faults, branches, etc. The logic to coordinate this protocol becomes a new energy expenditure on its own!
Really the way forward would be to extend x86 completely, with a "mode" where all instructions are annotated and go through a different pipeline front-end than legacy x86 code. But Intel already tried that with IA64, and it burned them very hard. I am not sure they are willing to do it again.
In your research, are you using a new programming language to take advantage of dataflow scheduling techniques, or are you working with one or more existing languages? If the latter, do you have any data or opinions on which languages or language features are most amenable to an effective dataflow-based architecture?
We use just C extensions for now, very close to what Cilk does.
1) What is really important is to realize that dataflow variables (I-structures) are not in memory. So any language/library that gives dataflow semantics to programmers should not allow programmers to indirect through memory to get to the I-structures. This is the main requirement for an efficient projection to a hardware dataflow scheduler.
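For readers unfamiliar with the term, the semantics of a dataflow variable can be sketched in software: it is written at most once, and reads issued before the write are deferred until the value arrives. The class below is only a stand-in for those semantics. Being an ordinary Python object, it lives in memory and can be reached through references, which is exactly what a hardware mapping would have to forbid. The name `IVar` and its methods are made up for this sketch.

```python
class IVar:
    """Software stand-in for a single-assignment dataflow variable."""

    def __init__(self):
        self._full = False
        self._value = None
        self._waiters = []   # continuations deferred until the value is written

    def read(self, continuation):
        """Run continuation(value) now if written, otherwise once it is written."""
        if self._full:
            continuation(self._value)
        else:
            self._waiters.append(continuation)

    def write(self, value):
        """Fill the variable exactly once and wake all deferred readers."""
        if self._full:
            raise RuntimeError("dataflow variable written twice")
        self._full = True
        self._value = value
        waiters, self._waiters = self._waiters, []
        for continuation in waiters:
            continuation(value)

# Usage: the first read is deferred, the second sees the value immediately.
log = []
v = IVar()
v.read(log.append)
deferred_before_write = list(log)
v.write(7)
v.read(log.append)
```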
In practice, things like Occam, SISAL and most pure functional programming languages are OK-ish.
2) Any language should allow an (advanced) programmer (or compiler) to annotate the instructions to also suggest some ordering not related to data dependencies. As I explained before, dataflow scheduling tends to destroy order and break locality, and for some applications this is very bad. Unfortunately all existing dataflow-ish languages (including most functional languages) were designed with the outdated vision that all memory accesses have the same cost. We now know this is not true any more.
Other than using threads (as I explained before), a well-known theoretical way forward is to introduce optional control flow edges between instructions using "ghost" data dependencies, which impact scheduling but do not allocate registers/I-vars. However, I am not aware of languages where this is possible.
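The idea of ghost dependencies can be sketched as extra edges that constrain the firing order without carrying any value. The scheduler below is a toy: it deliberately picks the alphabetically last ready instruction as a stand-in for "any order the hardware likes", and the instruction names and the function `firing_order` are invented for the example.

```python
def firing_order(operands, ghost_edges=()):
    """Pick a firing order that honours data edges and ordering-only edges.

    operands:    dict instruction name -> list of producer instruction names
    ghost_edges: iterable of (before, after) pairs; they delay `after` until
                 `before` has fired, but no value is passed along them.
    """
    preds = {name: set(ops) for name, ops in operands.items()}
    for before, after in ghost_edges:
        preds[after].add(before)
    order, done = [], set()
    while len(order) < len(preds):
        ready = sorted(n for n in preds if n not in done and preds[n] <= done)
        if not ready:
            raise ValueError("dependency cycle")
        chosen = ready[-1]   # arbitrary but deterministic choice among ready ones
        order.append(chosen)
        done.add(chosen)
    return order

# Three independent loads feeding one sum.
operands = {
    "load_a0": [],
    "load_a1": [],
    "load_b0": [],
    "sum": ["load_a0", "load_a1", "load_b0"],
}
unordered = firing_order(operands)
ordered = firing_order(
    operands,
    ghost_edges=[("load_a0", "load_a1"), ("load_a1", "load_b0")],
)
```

Without ghost edges the loads fire in whatever order the scheduler prefers; with them, the loads to neighbouring addresses are forced back into sequence while the data dependencies of `sum` stay unchanged.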
In the case of Adapteva, I think it's helped along because their starting point is fairly standard: ARM, ANSI C, Ubuntu, etc. They have an FPGA on their board, and then their specialized parallel processing chip. They support OpenCL. I'm not a compiler guy, but it seems to me that various concepts can be injected at different levels of this hardware stack, slowly but surely, without having to create the entire thing from scratch.