

by ribit 777 days ago

In an operand-collector architecture the threads are still executed in lockstep. I don't think this makes the basic architecture less "SIMD-y". Operand collectors are a smart way to avoid multi-ported register files, which enables a more compact implementation. Different vendors use different approaches to achieve a similar result: Nvidia uses operand collectors, Apple uses explicit cache control flags, etc.
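To make the multi-porting point concrete, here is a minimal Python sketch (not taken from any vendor documentation) of a register file split into single-ported banks. The modulo bank mapping, the bank count, and the names `bank_of` and `cycles_to_collect` are assumptions for illustration only.

```python
def bank_of(reg, num_banks=4):
    """Hypothetical mapping: register index modulo the bank count."""
    return reg % num_banks


def cycles_to_collect(src_regs, num_banks=4):
    """Cycles needed to read all source operands when every bank
    can serve only one read per cycle (single-ported banks)."""
    reads_per_bank = {}
    for reg in set(src_regs):  # the same register is read only once
        bank = bank_of(reg, num_banks)
        reads_per_bank[bank] = reads_per_bank.get(bank, 0) + 1
    # The most heavily loaded bank determines the collection time.
    return max(reads_per_bank.values(), default=0)
```

With 4 banks, `cycles_to_collect([0, 1, 2])` needs one cycle, while `cycles_to_collect([0, 4, 8])` needs three because all three registers land in bank 0. A collector that buffers operands over those cycles gets a similar effect to a multi-ported file without the extra ports.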

> This enables reading from the register file in an asynchronous fashion (by "asynchronous" here I mean not all in the same cycle) without introducing any stall.

You can still get stalls if an EU is available in a given cycle but not all operands have been collected yet. The way I understand the published patents is that operand collectors are a data gateway to the SIMD units. The instructions are already scheduled at this point and the job of the collector is to signal whether the data is ready. Do modern Nvidia implementations actually reorder instructions based on feedback from operand collectors?

> That's why (or one of the reasons) you need to sync your threads in the SIMT programming model and not in a SIMD programming model.

It is my understanding that you need to synchronize threads when accessing shared memory. Not only can different threads execute on different SIMD units, but threads on the same SIMD unit can also access shared memory over multiple cycles on some architectures. I do not see how thread synchronization relates to operand collectors.

1 comments

avianes 777 days ago

> In an operand-collector architecture the threads are still executed in lockstep.
> [...]
> It is my understanding that you need to synchronize threads when accessing shared memory.

Not sure what you mean by lockstep here. When an operand-collector entry is ready, it is dispatched for execution as soon as possible (write arbitration aside), even if other operand-collector entries from the same warp are not ready yet (so not really what I would call "thread lockstep"). But it's possible that Nvidia enforces that all threads of a warp must complete before the next warp instruction is sent (I would call that something like "instruction lockstep"). This can simplify data-dependency hazard checks. But that's an implementation detail; it's not required by the SIMT scheme.

And yes, it's hard to expose de-synchronization without memory operations, so you only need sync for memory operations. (Load/store units also have operand collectors.)

> You can still get stalls if an EU is available in a given cycle but not all operands have been collected yet

That's true, but you have multiple operand-collector entries to minimize the probability that no entry is ready. I should have said "to minimize bubbles".
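A toy simulation can illustrate the "minimize bubbles" argument. This is a sketch under loud assumptions, not a model of any real GPU: banks are single-ported and selected by register index modulo `num_banks`, older collector entries win bank arbitration, and at most one ready entry is dispatched per cycle. The function name `simulate` and all parameters are made up for this example.

```python
from collections import deque


def simulate(instrs, num_collectors, num_banks=4):
    """Return (total_cycles, bubble_cycles) for a toy operand-collector stage.

    instrs: list of tuples of source register indices.
    A bubble is a cycle in which no collector entry was ready to dispatch.
    """
    pending = deque(instrs)
    collectors = []  # each entry is the set of registers still to be read
    cycles = bubbles = 0
    while pending or collectors:
        cycles += 1
        # 1. Fill free collector entries with the next instructions.
        while pending and len(collectors) < num_collectors:
            collectors.append(set(pending.popleft()))
        # 2. Bank arbitration: each bank serves one read, oldest entry first.
        busy_banks = set()
        for entry in collectors:
            for reg in sorted(entry):
                bank = reg % num_banks
                if bank not in busy_banks:
                    busy_banks.add(bank)
                    entry.discard(reg)
        # 3. Dispatch the oldest entry whose operands are all collected.
        for i, entry in enumerate(collectors):
            if not entry:
                del collectors[i]
                break
        else:
            bubbles += 1  # nothing was ready: the execution unit idles
    return cycles, bubbles
```

With `instrs = [(0, 4), (1, 5), (2, 6), (3, 7)]`, where each instruction has a bank conflict with itself, `simulate(instrs, 1)` gives `(8, 4)` and `simulate(instrs, 4)` gives `(5, 1)`: the extra entries let reads for different instructions overlap. Step 3 can also pick a younger entry when an older one is still waiting, which is the limited reordering discussed in this thread.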

> The way I understand the published patents is that operand collectors are a data gateway to the SIMD units. The instructions are already scheduled at this point and the job of the collector is to signal whether the data is ready. Do modern Nvidia implementations actually reorder instructions based on feedback from operand collectors?

Calling the EU a "SIMD unit" in a SIMT uarch adds a lot of ambiguity, so I'm not sure I understand your point correctly. But yes, the (warp) instruction is already scheduled, but the (ALU) operations are re-scheduled by the operand collector and its dispatch logic. In the Nvidia patent they mention the possibility of dispatching operations in an order that prevents write collisions, for example.


ribit 777 days ago

> Not sure what you mean by lockstep here. When an operand-collector entry is ready, it is dispatched for execution as soon as possible (write arbitration aside), even if other operand-collector entries from the same warp are not ready yet (so not really what I would call "thread lockstep"). But it's possible that Nvidia enforces that all threads of a warp must complete before the next warp instruction is sent (I would call that something like "instruction lockstep"). This can simplify data-dependency hazard checks. But that's an implementation detail; it's not required by the SIMT scheme.

Hm, the way I understood it is that a single instruction is executed on a 16-wide SIMD unit, thus processing 16 elements/threads/lanes simultaneously (subject to execution mask of course). This is what I mean by "in lockstep". In my understanding the role of the operand collector was to make sure that all register arguments are available before the instruction starts executing. If the operand collector needs multiple cycles to fetch the arguments from the register file, the instruction execution would stall.

So you are saying that my understanding is incorrect and that the instruction can be executed in multiple passes with different masks depending on which arguments are available? What is the benefit as opposed to stalling and executing the instruction only when all arguments are available? To me it seems like the end result is the same, and stalling is simpler and probably more energy efficient (if EUs are power-gated).

> But yes, the (warp) instruction is already scheduled, but the (ALU) operations are re-scheduled by the operand collector and its dispatch logic. In the Nvidia patent they mention the possibility of dispatching operations in an order that prevents write collisions, for example.

Ah, that is interesting, so the operand collector provides a limited reordering capability to maximize hardware utilization, right? I must have missed that bit in the patent, that is a very smart idea.

> But it's possible that Nvidia enforces that all threads of a warp must complete before the next warp instruction is sent (I would call that something like "instruction lockstep"). This can simplify data-dependency hazard checks. But that's an implementation detail; it's not required by the SIMT scheme.

Is any existing GPU actually doing superscalar execution from the same software thread (I mean the program thread, i.e., warp, not a SIMT thread)? Many GPUs claim dual-issue capability, but that either refers to interleaved execution from different programs (Nvidia, Apple) or to SIMD-within-SIMT, or maybe even a form of long instruction word (AMD). If I remember correctly, Nvidia instructions contain some scheduling information that tells the scheduler when it is safe to issue the next instruction from the same warp after the previous one started execution. I don't know how others do it, probably via some static instruction timing information. Apple does have a very recent patent describing dependency detection in an in-order processor; no idea whether it is intended for the GPU or something else.
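As a rough picture of what compiler-provided scheduling information could look like, here is a small Python sketch. It assumes a hypothetical per-instruction "stall count" field that says how many cycles the scheduler must wait before issuing the next instruction from the same warp; the field name, its semantics, and the function `issue_cycles` are illustrative assumptions, not a documented encoding.

```python
def issue_cycles(stall_counts):
    """Given a hypothetical stall count attached to each instruction,
    return the cycle at which each instruction of one warp is issued.

    No dependency checking happens here: the scheduler simply trusts
    the statically computed counts.
    """
    cycle = 0
    issued_at = []
    for stall in stall_counts:
        issued_at.append(cycle)
        # Issue is in order, so at least one cycle separates instructions.
        cycle += max(1, stall)
    return issued_at
```

For example, `issue_cycles([1, 4, 1, 2])` returns `[0, 1, 5, 6]`: the second instruction asks for a four-cycle gap, so the third one is held back until cycle 5. In the meantime a real scheduler would be free to issue from other warps.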

> you have multiple operand-collector entries to minimize the probability that no entry is ready. I should have said "to minimize bubbles".

I think this is essentially what some architectures describe as the "register file cache". What is nice about Nvidia's approach is that it seems to be fully automatic and can really make the best use of a constrained register file.


avianes 777 days ago

> I understood it is that a single instruction is executed on a 16-wide SIMD unit, thus processing 16 elements/threads/lanes simultaneously (subject to execution mask of course). This is what I mean by "in lockstep".

OK, I see. That's definitely not what I understood from my study of the Nvidia SIMT uarch. And yes, I will claim that "the instruction can be executed in multiple passes with different masks depending on which arguments are available" (using your words).
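The "multiple passes with different masks" idea can be sketched as follows. This is only an illustration of the claim made in this thread, not confirmed hardware behavior: the lanes of a warp are assumed to be split into fixed groups, and `ready_cycle[g]` is an assumed cycle at which the operands of group `g` have been collected. The name `split_into_passes` and the group width are hypothetical.

```python
def split_into_passes(active_mask, ready_cycle, group_width=8):
    """Split one warp instruction into passes, one per ready cycle.

    active_mask: bitmask of active lanes (bit 0 is lane 0).
    ready_cycle: per-group cycle at which that group's operands are ready.
    Returns a list of (cycle, lane_mask) sorted by cycle.
    """
    passes = {}
    for group, cycle in enumerate(ready_cycle):
        group_mask = ((1 << group_width) - 1) << (group * group_width)
        lanes = active_mask & group_mask
        if lanes:  # fully masked-off groups need no pass at all
            passes[cycle] = passes.get(cycle, 0) | lanes
    return sorted(passes.items())
```

With a full 32-lane mask and `ready_cycle = [3, 1, 3, 2]`, the result is three passes at cycles 1, 2 and 3 with masks `0x0000FF00`, `0xFF000000` and `0x00FF00FF`. The stall-until-ready alternative would be a single pass with the full mask at cycle 3, so the last lanes finish at the same time either way; the difference is that the execution units are used earlier.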

> So the operand collector provides a limited reordering capability to maximize hardware utilization, right?

Yes, that's my understanding, and that's why I claim it's different from "classical" SIMD.

> What is the benefit as opposed to stalling and executing the instruction only when all arguments are available?

That's a good question. Note that I think the Apple GPU uarch does not work like the Nvidia one; my understanding is that the Apple uarch is way closer to a classical SIMD unit. So it's definitely not a killer to move away from Nvidia's original SIMT uarch.

That said, I think the SIMT uarch from Nvidia is way more flexible and better maximizes hardware utilization (executing instructions as soon as possible always helps utilization). And let's say you have 2 warps with complementary masking: with Nvidia's SIMT uarch it comes naturally to issue both warps simultaneously, and they can be executed in the same cycle on different ALUs/cores. With a classical SIMD uarch it may be possible, but you need extra hardware to handle overlapping warp execution, and even more hardware to enable overlapping more than 2 threads.
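The complementary-masking case is easy to express in code. The sketch below is one way to picture it, with hypothetical names (`can_co_issue`, `lane_owners`) and no claim about how any real scheduler makes this decision: two warps can share execution lanes in the same cycle only if no lane is active in both.

```python
def can_co_issue(mask_a, mask_b):
    """Two warps have complementary masks if no lane is active in both."""
    return (mask_a & mask_b) == 0


def lane_owners(mask_a, mask_b, warp_size=32):
    """For each lane, say which warp ("A", "B" or None) would use it."""
    if not can_co_issue(mask_a, mask_b):
        raise ValueError("masks overlap; the warps cannot share the lanes")
    owners = []
    for lane in range(warp_size):
        bit = 1 << lane
        if mask_a & bit:
            owners.append("A")
        elif mask_b & bit:
            owners.append("B")
        else:
            owners.append(None)  # lane idle in both warps
    return owners
```

For a 4-lane toy warp, `lane_owners(0b0101, 0b1010, warp_size=4)` yields `["A", "B", "A", "B"]`, so every lane does useful work, whereas issuing the two warps one after the other would leave half of the lanes idle each time.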

Also, Nvidia's operand collector makes it possible to emulate a multi-ported register file, which probably helps with register sharing. There are actually multiple patents from Nvidia about non-trivial register allocation within the register-file banks, depending on how the registers will be used, to minimize conflicts.
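As a rough idea of what conflict-aware register allocation could mean, here is a greedy Python sketch. It is not the method from any patent: it simply tries to place registers that are read by the same instruction into different banks. The function names, the bank count, and the fallback to bank 0 are assumptions for illustration.

```python
def assign_banks(instrs, num_banks=4):
    """Greedily map each register name to a bank so that operands of the
    same instruction land in different banks whenever possible."""
    bank = {}
    for operands in instrs:
        for reg in operands:
            if reg in bank:
                continue  # already placed by an earlier instruction
            used = {bank[o] for o in operands if o in bank}
            free = [b for b in range(num_banks) if b not in used]
            # Fall back to bank 0 if the instruction has more operands
            # than there are banks; a conflict is then unavoidable.
            bank[reg] = free[0] if free else 0
    return bank


def count_conflicts(instrs, bank):
    """Extra read cycles caused by operands sharing a bank."""
    total = 0
    for operands in instrs:
        regs = set(operands)
        total += len(regs) - len({bank[r] for r in regs})
    return total
```

For `instrs = [("a", "b", "c"), ("c", "d")]` the greedy pass places `a`, `b`, `c` in banks 0, 1, 2 and `d` in bank 0, and `count_conflicts` reports 0. A real allocator would have to weigh many instructions at once, so treat this as a picture of the goal rather than a usable algorithm.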

> Is any existing GPU actually doing superscalar execution from the same software thread (I mean the program thread, i.e., warp, not a SIMT thread)?

It's not obvious what "superscalar" would mean in a SIMT context. For me a superscalar core is a core that can extract instruction parallelism from sequential code (associated with a single thread) and therefore dispatch/issue/execute more than 1 instruction per cycle per thread. With SIMT most of the instruction parallelism is very explicit (as thread parallelism), so it's not really "extracted" (and not from the same thread). But anyway, if your question is whether multiple instructions from a single warp can be executed in parallel (across different threads), then I would say probably yes for Nvidia (not sure, there is very little information available). At least 2 instructions from the same thread block (from the same program, but different warps) should be able to execute in parallel.

> I think this is essentially what some architectures describe as the "register file cache"

I'm not sure about that. There are actually some published papers (and probably some patents) from Nvidia about register-file caches for SIMT uarchs, and those came after the operand-collector patent. But in the end it really depends on what concept you are referring to with "register-file cache".

In the Nvidia case a "register-file cache" is a cache placed between the register file and the operand collector. It makes sense in their case since the register file has variable latency (depending on collisions) and because it saves SRAM read power.
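A tiny model shows why such a cache saves register-file reads. The sketch below assumes an LRU replacement policy and a cache indexed by register number; both are my assumptions for illustration, and the class name `RegisterFileCache` and its counters are hypothetical rather than taken from any paper or patent.

```python
from collections import OrderedDict


class RegisterFileCache:
    """Toy cache sitting between the register file and the collectors."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # register -> placeholder, LRU order
        self.hits = 0       # reads served by the cache
        self.rf_reads = 0   # reads that had to go to the SRAM banks

    def read(self, reg):
        if reg in self.entries:
            self.entries.move_to_end(reg)  # mark as most recently used
            self.hits += 1
            return
        self.rf_reads += 1
        self.entries[reg] = True
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
```

With a capacity of 2 and the read sequence 1, 2, 1, 3, 2, the model records one hit and four register-file reads instead of five. Every hit is a read that avoids both the SRAM access energy and the bank arbitration that makes register-file latency variable.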


ribit 776 days ago

> Yes, that's my understanding, and that's why I claim it's different from "classical" SIMD.

I understand, yes, it makes sense. Of course, other architectures can make other optimizations, like selecting warps that are more likely to have data ready, but Nvidia's implementation does sound like a very smart approach.

> And let's say you have 2 warps with complementary masking: with Nvidia's SIMT uarch it comes naturally to issue both warps simultaneously, and they can be executed in the same cycle on different ALUs/cores

That is indeed a powerful technique

> It's not obvious what "superscalar" would mean in a SIMT context. For me a superscalar core is a core that can extract instruction parallelism from sequential code (associated with a single thread) and therefore dispatch/issue/execute more than 1 instruction per cycle per thread.

Yes, I meant executing multiple instructions from the same warp/thread concurrently, depending on the execution granularity of course. Executing instructions from different warps in the same block is slightly different, since warps don't need to be in the same execution state. Applying the CPU terminology, a warp is more like a "CPU thread". It does seem like Nvidia has indeed moved quite far in the SIMT direction and their threads/lanes can have independent program state. So I think I can see the validity of your argument that Nvidia can remap SIMD ALUs on the fly to suitable threads in order to achieve high hardware utilization.

> In the Nvidia case a "register-file cache" is a cache placed between the register file and the operand collector. It makes sense in their case since the register file has variable latency (depending on collisions) and because it saves SRAM read power.

Got it, thanks!

P.S. By the way, wanted to thank you for this very interesting conversation. I learned a lot.
