

by avianes 1480 days ago

I would be curious to know which RISC-V V implementation the author is talking about.

> If you imagine how a physical CPU or GPU has to be constructed in order to do large multi-input operations (...) You can imagine these inputs as being in "lanes" that are arranged across the chip such that the inputs to each lane are stored near the lane.

This is not how a GPU register file works at all. GPU register files are SRAM banks, and operand collectors are used to handle register-read latency. There is also a big crossbar between the register banks and the operand collectors.

As for CPU vector and SIMD units, I only know two implementations (the CVA6 ARA vector unit and an industrial closed-source one), but neither of them stores registers within or near the lanes.

The author's assumption about the microarchitecture seems questionable to me.

PS: The ARA RISC-V V implementation used a mask unit to handle masks, which makes the mentioned problem irrelevant.


obl 1480 days ago

Wait, I'm pretty sure operand collection logic and banking are there to keep the number of ports on the SRAM low. So basically you're arbitrating and buffering requests coming from instructions with a high register count (say, a 3-input fma) and potentially from multiple pipelined SMT threads (not threads in the Nvidia sense, threads in the "a whole wavefront/warp" sense).
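To make the port argument concrete, here is a toy model of the bank arbitration. It is only a sketch: the mapping of register `r` to bank `r % num_banks`, the single read port per bank, and the function name `collect_cycles` are all assumptions for illustration, not taken from any vendor documentation.

```python
from collections import defaultdict


def collect_cycles(operand_regs, num_banks=4):
    """Cycles needed to read all source operands of one instruction.

    Toy assumptions: register r lives in bank r % num_banks, each bank
    serves one read per cycle, and a register named twice is read once.
    """
    reads_per_bank = defaultdict(int)
    for reg in set(operand_regs):
        reads_per_bank[reg % num_banks] += 1
    # Banks work in parallel, so the busiest bank sets the latency.
    return max(reads_per_bank.values(), default=0)


# fma with sources in three different banks: all reads happen in one cycle.
print(collect_cycles([0, 1, 2]))  # 1
# Same fma with all three sources in bank 0: the reads serialize.
print(collect_cycles([0, 4, 8]))  # 3
```

In this model the collector's job is just to buffer the operands that arrive early while the conflicting reads drain, which is why single-ported banks are enough.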

However, to me that seems completely orthogonal to the vector lanes: I don't see why two parallel lanes in a single thread (e.g. a 64-element GCN wavefront) would need cross-connected logic at the register file, since almost all instructions _do not_ read/write data from another lane.

There are a few cross-lane shuffle / reduce instructions, but it seems to me that those would be handled in a dedicated execution unit. (They are not really the fast-path/common case.)


avianes 1480 days ago

> There are a few cross-lane shuffle / reduce instructions, but it seems to me that those would be handled in a dedicated execution unit. (They are not really the fast-path/common case.)

Yes, you essentially need a (kind of) crossbar for shuffle and value broadcast. But as far as I know, there is no unit dedicated to this on Nvidia GPUs. However, depending on the GPU microarchitecture, shuffle and broadcast may be implemented differently (e.g. through the load/store units).
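As a rough illustration of why only these operations need a crossbar-like path, here is a small Python model. The function names are hypothetical and this says nothing about how any real GPU wires it: an elementwise op lets lane `i` touch only index `i`, while a shuffle lets lane `i` read from an arbitrary source lane.

```python
def lane_local_add(a, b):
    """Elementwise op: lane i reads only a[i] and b[i]."""
    return [x + y for x, y in zip(a, b)]


def shuffle(values, src_lane):
    """Cross-lane shuffle: lane i receives values[src_lane[i]]."""
    return [values[s] for s in src_lane]


def broadcast(values, lane):
    """Broadcast is a shuffle where every lane reads the same source lane."""
    return shuffle(values, [lane] * len(values))


regs = [10, 20, 30, 40]
print(lane_local_add(regs, regs))     # [20, 40, 60, 80]
print(shuffle(regs, [3, 2, 1, 0]))    # [40, 30, 20, 10]
print(broadcast(regs, 1))             # [20, 20, 20, 20]
```

Since `src_lane` can be any mapping, a fully general shuffle needs any-to-any connectivity between lanes, whereas the elementwise case needs none.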

Note that I said "crossbar" for simplicity and because there is little information available; I doubt that all the paths really exist.
