| There are some very interesting flags at the lowest level of NVidia SASS (machine code) that the PTX compiler adds. https://arxiv.org/pdf/1804.06826.pdf PDF page 14 (physical page 12) shows that every instruction in NVidia Volta SASS carries 4 bits of "Reuse flags", a 6-bit "wait barrier mask", a 3-bit "Read Barrier Index", a 3-bit "Write Barrier Index", a 1-bit "Yield Flag" and 4 bits of "Stall Cycles".

> Wait barrier mask; Read/Write barrier index. While most instructions
have fixed latency and can be statically scheduled by the assembler, instructions involving memory and shared resources typically have variable latency.
Volta, Pascal and Maxwell use dependency barriers to track the completion
of variable-latency instructions and resolve data hazards. When a variable-latency instruction writes to a register, the assembler associates it to one of
the 6 available barriers by setting the corresponding write barrier number field.
When a later instruction consumes that register, the assembler marks the instruction as waiting on that barrier by setting the bit corresponding to that
barrier in the wait barrier mask. The hardware will stall the later instruction
until the results of the earlier one are available. An instruction may wait on
multiple barriers, which explains why the wait barrier mask is a bitmask, not
an index.

--------

It seems like NVidia has invented something very interesting here. I don't quite understand it myself, but it's quite possibly what you're talking about.

This control information is attached to __EVERY__ instruction on NVidia machines Volta and newer (versus one control word every 3 instructions on Pascal, and one every 7 on older Kepler systems). So these "bundles" could be seen as a "better Itanium" back in Kepler/Pascal, but perhaps the modern NVidia GPU core is fast enough to do this kind of pre-compiled barrier interpretation on every instruction these days. It seems like a lot of instruction bandwidth to eat up though: each instruction on Volta is 128 bits (16 bytes) long because it carries so much control information. |
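
To make the field sizes and the wait-mask idea concrete, here is a rough Python sketch. It only models the six fields and their widths as listed above (4 + 6 + 3 + 3 + 1 + 4 = 21 bits). The packing order, the helper names (`pack`, `unpack`, `must_stall`) and the "unset fields default to zero" shortcut are my own assumptions for illustration; I don't know the real bit positions inside the 128-bit Volta instruction word.

```python
# Field widths in bits, in the order they are listed in the comment above.
# ASSUMPTION: the packing order (reuse flags in the high bits, stall cycles
# in the low bits) is illustrative only, not the real hardware layout.
FIELDS = [
    ("reuse", 4),
    ("wait_mask", 6),
    ("read_barrier", 3),
    ("write_barrier", 3),
    ("yield_flag", 1),
    ("stall", 4),
]
CONTROL_BITS = sum(width for _, width in FIELDS)  # 21


def pack(**values):
    """Pack the six control fields into one 21-bit integer.

    Fields that are not given default to 0 (a simplification for this sketch).
    """
    word = 0
    for name, width in FIELDS:
        value = values.get(name, 0)
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Inverse of pack(): split a 21-bit integer back into named fields."""
    fields = {}
    for name, width in reversed(FIELDS):
        fields[name] = word & ((1 << width) - 1)
        word >>= width
    return fields


def must_stall(wait_mask, pending_barriers):
    """True if the instruction waits on any barrier that is still pending.

    wait_mask is the 6-bit mask (bit b set = wait on barrier b);
    pending_barriers is a set of barrier indices not yet signalled.
    """
    return any(wait_mask & (1 << b) for b in pending_barriers)


if __name__ == "__main__":
    # A variable-latency load that signals barrier 2 when its result lands.
    load_ctrl = pack(write_barrier=2, stall=1)
    # A later consumer that waits on barrier 2 via the bitmask.
    use_ctrl = pack(wait_mask=1 << 2, stall=4)

    pending = {unpack(load_ctrl)["write_barrier"]}
    print("consumer stalls while load is in flight:",
          must_stall(unpack(use_ctrl)["wait_mask"], pending))
    pending.clear()  # the load completed and released its barrier
    print("consumer stalls after load completes:",
          must_stall(unpack(use_ctrl)["wait_mask"], pending))
    print(f"control bits per 128-bit instruction: {CONTROL_BITS}")
```

The mask-versus-index distinction from the quoted passage shows up in `must_stall`: a producer names a single barrier by index, while a consumer can set several bits at once and so wait on several producers.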