| I downloaded the spec and took a look.
TL;DR: it's pretty much a traditional short-register SIMD, but with the addition of predication, including handling the tail of arbitrary-length loops in the vector loop body (as in RISC-V and Cray) rather than in an extra scalar loop as previously needed.
- provides 8 "Q" vector registers, always exactly 128 bits each
- overlays the FP register file (32 "S" registers of 32 bits each / 16 "D" registers of 64 bits each)
- MVE-I (8/16/32 bit integer) and MVE-F (16 and 32 bit FP) subsets
- architecturally defined to execute each vector instruction in 4 beats
- an implementation can execute 1, 2, or 4 beats per "architecture tick", and this can vary during execution. An "architecture tick" might or might not be 1 clock cycle.
- two forms of predication, each with its own mask: "loop tail predication", which is like RISC-V/Cray "vl" (but described as a mask), and "VPT predication" for data-dependent conditions. The two masks are ANDed together.
- a VPT block is defined as the n instructions following a VPT or VPST instruction, where n <= 4
- each instruction in the block can be predicated with the condition or the inverse of the condition. Similar to the existing If/Then/Else predicated execution, e.g. VPT, VPTT, VPTE, VPTTE, VPTEE, VPTEEE variants.
- "VPT can be considered as the vectorized combination of CMP and IT"
- predication is per-byte regardless of the element size
- loads set predicated-off bytes to 0, other instructions leave them untouched
- VLD2/VLD4 and VST2/VST4 are provided for interleaving/deinterleaving. Each instruction always loads/stores exactly 128 bits to/from 2 or 4 consecutive Q registers.
- there is also scatter/gather
- there are some fancy operations, e.g. VCADD: Vector Complex Add with Rotate. This instruction performs a complex addition of the first operand with the second operand rotated in the complex plane by the specified amount, either 90 or 270 degrees. Also VCMLA: Vector Complex Multiply Accumulate. |
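
To make the predication rules above concrete, here is a rough scalar model in Python. It is not MVE code and not taken from the spec: the function names (`tail_mask`, `vpt_mask`, `and_masks`, `predicated_load`, `predicated_write`) are made up for illustration, and a Q register is simply modelled as a list of 16 byte values. It only shows the behaviour described in the list: two per-byte masks ANDed together, loads zeroing predicated-off bytes, and other instructions leaving them untouched.

```python
VECTOR_BYTES = 16  # one 128-bit Q register, modelled as 16 byte lanes


def tail_mask(remaining_elems, elem_bytes):
    """Loop tail mask: enable the bytes of the elements still left to process."""
    active = min(max(remaining_elems, 0) * elem_bytes, VECTOR_BYTES)
    return [i < active for i in range(VECTOR_BYTES)]


def vpt_mask(lhs, rhs, elem_bytes, cond):
    """VPT-style mask: one compare per element, replicated to each of its bytes."""
    mask = []
    for a, b in zip(lhs, rhs):
        mask.extend([bool(cond(a, b))] * elem_bytes)
    return mask


def and_masks(m1, m2):
    """The effective predicate is the per-byte AND of the two masks."""
    return [a and b for a, b in zip(m1, m2)]


def predicated_load(memory_bytes, mask):
    """Loads write 0 into predicated-off bytes."""
    return [b if m else 0 for b, m in zip(memory_bytes, mask)]


def predicated_write(dest_bytes, result_bytes, mask):
    """Other instructions leave predicated-off destination bytes untouched."""
    return [r if m else d for d, r, m in zip(dest_bytes, result_bytes, mask)]


if __name__ == "__main__":
    # 32-bit elements, 3 elements left in the loop, condition "lhs > rhs".
    tail = tail_mask(3, 4)
    vpt = vpt_mask([5, -1, 7, 9], [0, 0, 0, 0], 4, lambda a, b: a > b)
    mask = and_masks(tail, vpt)
    print(predicated_load(list(range(1, 17)), tail))
    print(predicated_write([0xAA] * 16, [0x55] * 16, mask))
```

In the demo, the fourth 32-bit element is switched off by the tail mask even though its compare passes, and the second is switched off by the compare even though it is inside the tail.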
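
For the VCADD description, a small Python sketch of the arithmetic may help. It assumes complex numbers are stored as interleaved (real, imaginary) pairs in the vector, and the function name `vcadd_model` is hypothetical. Rotating by 90 degrees is multiplication by i, and by 270 degrees is multiplication by -i, so no actual multiply is needed, only swaps and sign changes.

```python
def vcadd_model(a, b, rotation):
    """Model of complex add with rotate on interleaved [re, im, re, im, ...] lists.

    Computes a + b*i for rotation == 90 and a - b*i for rotation == 270.
    The interleaved layout is an assumption of this sketch.
    """
    if rotation not in (90, 270):
        raise ValueError("rotation must be 90 or 270")
    if len(a) != len(b) or len(a) % 2 != 0:
        raise ValueError("operands must be equal-length lists of (re, im) pairs")
    out = []
    for i in range(0, len(a), 2):
        a_re, a_im = a[i], a[i + 1]
        b_re, b_im = b[i], b[i + 1]
        if rotation == 90:
            # (b_re + b_im*i) * i = -b_im + b_re*i
            out += [a_re - b_im, a_im + b_re]
        else:
            # (b_re + b_im*i) * (-i) = b_im - b_re*i
            out += [a_re + b_im, a_im - b_re]
    return out


if __name__ == "__main__":
    print(vcadd_model([1, 2, 3, 4], [10, 20, 30, 40], 90))
    print(vcadd_model([1, 2, 3, 4], [10, 20, 30, 40], 270))
```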
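
The deinterleaving done by the VLD2/VLD4 family can be pictured with the sketch below. It models only the net effect once all the registers have been filled, not which 128 bits each individual instruction transfers, and the helper names `deinterleave` and `interleave` are made up for this example.

```python
def deinterleave(memory_elems, n):
    """Split [a0, b0, a1, b1, ...] (n == 2) into one list per stream."""
    if n not in (2, 4):
        raise ValueError("n must be 2 or 4")
    if len(memory_elems) % n != 0:
        raise ValueError("element count must be a multiple of n")
    return [memory_elems[i::n] for i in range(n)]


def interleave(registers):
    """Inverse operation, as a VST2/VST4-style store would produce in memory."""
    out = []
    for group in zip(*registers):
        out.extend(group)
    return out


if __name__ == "__main__":
    # Stereo samples: left/right pairs in memory, one register per channel.
    left, right = deinterleave([1, 101, 2, 102, 3, 103, 4, 104], 2)
    print(left, right)
    print(interleave([left, right]))
```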