| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by axman6 4468 days ago

How do you get ILP from this? Seems like it would make scheduling much more difficult because you now need to make sure that that the results you need for each instruction are in the order in the belt you need, and you have to be able to execute any order of instruction types. The mill can run fast partly because and instruction can use any result on the belt, and similar instructions are grouped together making decode simpler (from memory, they have two separate decoders and each instruction has all say arith instructions grouped and decoded by one decode, and all the others decoded by the other).

Basically, I don't see how you can use what you suggest to do, in one instruction:

    [ add positions 7 and 5 | multiply positions 7 and 2 | call f on 4  and 5 | branch to foo if position 3 was LT else bar ]

ending up with the belt

    [ [7]+[5], [7]*[2], f([4],[5]) ... ]

or whatever you like. All you need to do to schedule the mill is to perform as many operations in parallel as the hardware can do, and then find out where their results would be placed to create the next instruction.

1 comments

panduwana 4467 days ago

In the linked article it is said that "According to prior research, only some 13% of values are used more than once". So based on the research on which Mill is based on, your example and that "instruction can use any result on the belt" is actually minority case.

As for scheduling my proposed store-addressed belt: you perform as many operations in parallel, then for each operation find the other operation that depends on the former's operation result, calculate the distance between them and assign it as the former's store address. The compiler has more work to do yes, but not "much more difficult".

baking 4467 days ago

Offhand, I would say maybe one difference is that your model is trying to predict where the belt will be in the future while the Mill is looking backwards to find where the belt was in the past.

Another issue is that you would have to process the entire instruction in order to know where each operation gets its input. (How many operations in the instruction are taking things off the belt before I get my data?) In the Mill the operations are parsed in parallel and they have all the information they need to start processing as soon as the the instruction (block) is loaded in the buffer.

The size of the belt is a very finely tuned constraint (using simulations) that basically depends on how many cycles you have to save a value to the scratchpad memory (if needed) before it "drops off" the belt. There is a lecture that describes why it takes the number of cycles it does and if you watch it you will probably understand better why the Mill is not about what is easy or hard for the compiler but all about getting the silicon to jump through hoops fast and efficiently.