by ribit
747 days ago
I don't see how this would work out beneficially. Let's say your hardware can join 4x128b units into a virtual 512-bit SVE SIMD unit. For consistency, you then have to advertise VL as 512 bits. Yes, you will save some reorder buffer entries when you encounter a single SVE instruction, but if the code contains independent SVE streams, you will stall. Moreover, not all operations will use all 512 register bits, so your occupancy might suffer. The only scenario where I see this feature paying off is if you are limited by decode or by the reorder buffer. Neither is a problem for modern high-performance ARM cores. With x86 it might be a different story: from what I understand, AVX512 instructions can be quite large. Modern out-of-order cores are already good at superscalar execution, so why not let them do their job? 4x128b units give you much more flexibility and better execution granularity.
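To put rough numbers on the occupancy argument, here is a toy cycle-count model. Everything in it is an assumption made for illustration, not a description of any real core: four 128-bit lanes, single-cycle ops, each stream being a chain of dependent ops, and ideal scheduling. The function names are made up. It also ignores cross-lane operations entirely, so it says nothing about the "cmp" penalty discussed in the sibling thread.

```python
from math import ceil

LANES = 4        # assumed number of 128-bit execution units
LANE_BITS = 128  # assumed width of each unit


def cycles_fused(streams, ops_per_stream, used_bits):
    """Fused 512-bit unit: every op occupies all lanes for one cycle.

    used_bits is ignored on purpose: a fused op blocks the whole unit
    no matter how many register bits carry useful data.
    """
    return streams * ops_per_stream


def cycles_split(streams, ops_per_stream, used_bits):
    """Independent 128-bit units: an op takes only the lanes it needs.

    Ops inside one stream are assumed dependent, so at most one op per
    stream can issue in a given cycle.
    """
    lanes_per_op = ceil(used_bits / LANE_BITS)
    ops_per_cycle = min(streams, LANES // lanes_per_op)
    return ceil(streams * ops_per_stream / ops_per_cycle)


if __name__ == "__main__":
    for bits in (128, 256, 512):
        fused = cycles_fused(4, 10, bits)
        split = cycles_split(4, 10, bits)
        print(f"{bits}-bit ops, 4 streams x 10 ops: fused={fused}, split={split}")
```

Under these assumptions, four independent streams of ten 128-bit ops take 40 cycles on the fused unit and 10 on the split units. Once every op really uses all 512 bits, both designs take 40 cycles, which is the case where fusing costs nothing in this model.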
That aside, see the "cmp" sibling thread for a major (4x penalty) downside to 4x128.