by avianes
1433 days ago
I would be curious to know which RISC-V V implementation the author is talking about.

> If you imagine how a physical CPU or GPU has to be constructed in order to do large multi-input operations (...) You can imagine these inputs as being in "lanes" that are arranged across the chip such that the inputs to each lane are stored near the lane.

This is not how a GPU register bank works at all. GPU register files are SRAM banks, and operand collectors are used to handle register-read latency. There is a big crossbar between the register banks and the operand collectors.

As for CPU vector and SIMD units, I only know two implementations (the CVA6 ARA vector unit and an industrial closed-source one), but neither of them stores registers within or near the lane. The author's assumption about the microarchitecture seems questionable to me.

PS: The ARA RISC-V V implementation uses a mask unit to handle masks, which makes the mentioned problem irrelevant.
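To make the banked register file point concrete, here is a toy sketch of why operand reads have variable latency. Everything in it is an assumption for illustration and not taken from any real GPU: the bank count, the `reg % NUM_BANKS` mapping, one read port per bank, and the function names are all hypothetical.

```python
from collections import defaultdict

NUM_BANKS = 4  # assumed bank count, purely illustrative


def bank_of(reg):
    """Assumed mapping: register r lives in bank r mod NUM_BANKS."""
    return reg % NUM_BANKS


def cycles_to_collect(source_regs):
    """Cycles needed to read all source operands of one instruction,
    assuming each bank can serve one read per cycle and a register
    named twice is only read once."""
    reads_per_bank = defaultdict(int)
    for reg in set(source_regs):
        reads_per_bank[bank_of(reg)] += 1
    # The most heavily used bank determines how long collection takes.
    return max(reads_per_bank.values(), default=0)
```

With this mapping, `cycles_to_collect([0, 1, 2])` hits three different banks, while `cycles_to_collect([0, 4, 8])` hits the same bank three times. Something like an operand collector has to buffer the operands that arrive early until the last one is read.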
However, to me that seems completely orthogonal to the vector lanes: I don't see why two parallel lanes in a single thread (e.g. a 64-element GCN wavefront) would need cross-connected logic at the register file, since almost all instructions _do not_ read or write data from another lane.
There are a few cross-lane shuffle / reduce instructions, but it seems to me that those would be handled in a dedicated execution unit (they are not really the fast path / common case).
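A minimal sketch of the distinction, modelling a vector register as a plain Python list with one element per lane. The function names are made up for illustration and the example uses 4 lanes instead of 64 to keep it short; this says nothing about how any real hardware implements these operations.

```python
def lane_local_add(a, b):
    """Common case: output lane i reads only lane i of each input."""
    return [x + y for x, y in zip(a, b)]


def cross_lane_shuffle(src, lane_index):
    """Output lane i reads lane lane_index[i] of src, so data can
    come from any other lane."""
    return [src[i] for i in lane_index]


def reduce_sum(src):
    """The single result depends on every lane of src."""
    return sum(src)


a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
added = lane_local_add(a, b)                      # no lane reads a neighbour
reversed_a = cross_lane_shuffle(a, [3, 2, 1, 0])  # every lane reads another lane
total = reduce_sum(a)                             # all lanes feed one result
```

Only the last two operations need any data path between lanes, which is the case for handling them outside the common per-lane path.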