| I agree; and the article seems to have also quite a few technical flaws: - Register width: we somewhat maxed out at 512 bits, with Intel going back to 256 bits for non-server CPUs. I don't see larger widths on the horizon (even if SVE theoretically supports up to 2048 bits, I don't know any implementation with ~~>256~~ >512 bits). Larger bit widths are not beneficial for most applications and the few applications that are (e.g., some HPC codes) are nowadays served by GPUs. - The post mentions available opcode space: while opcode space is limited, a reasonably well-designed ISA (e.g., AArch64) has enough holes for extensions. Adding new instructions doesn't require ABI changes, and while adding new registers requires some kernel changes, this is well understood at this point. - "What is worse, software developers often have to target several SIMD generations" -- no way around this, though, unless auto-vectorization becomes substantially better. Adjusting the register width is not the big problem when porting code, making better use of available instructions is. - "The packed SIMD paradigm is that there is a 1:1 mapping between the register width and the execution unit width" -- no. E.g., AMD's Zen 4 does double pumping, and AVX was IIRC originally designed to support this as well (although Intel went directly for 256-bit units). - "At the same time many SIMD operations are pipelined and require several clock cycles to complete" -- well, they are pipelined, but many SIMD instructions have the same latency as their scalar counterpart. - "Consequently, loops have to be unrolled in order to avoid stalls and keep the pipeline busy." -- loop unroll has several benefits, mostly to reduce the overhead of the loop and to avoid data dependencies between loop iterations. Larger basic blocks are better for hardware as every branch, even if predicted correctly, has a small penalty. "Loop unrolling also increases register pressure" -- it does, but code that really requires >32 registers is extremely rare, so a good instruction scheduler in the compiler can avoid spilling. In my experience, dynamic vector sizes make code slower, because they inhibit optimizations. E.g., spilling a dynamically sized vector is like a dynamic stack allocation with a dynamic offset. I don't think SVE delivered any large benefits, both in terms of performance (there's not much hardware with SVE to begin with...) and compiler support. RISC-V pushes further into this direction, we'll see how this turns out. |
Which still means you have to write your code at least thrice, which is two times more than with a variable length SIMD ISA.
Also there are processors with larger vector length, e.g. 1024-bit: Andes AX45MPV, SiFive X380, 2048-bit: Akeana 1200, 16384-bit: NEC SX-Aurora, Ara, EPI
> no way around this
You rarely need to rewrite SIMD code to take advantage of new extensions, unless somebody decides to create a new one with a larger SIMD width. This mostly happens when very specialized instructions are added.
> In my experience, dynamic vector sizes make code slower, because they inhibit optimizations.
Do you have more examples of this?
I don't see spilling as much of a problem, because you want to avoid it regardless, and codegen for dynamic vector sizes is pretty good in my experience.
> I don't think SVE delivered any large benefits
Well, all Arm CPUs except for the A64FX were build to execute NEON as fast as possible. X86 CPUs aren't built to execute MMX or SSE or the latest, even AVX, as fast as possible.
Anyway, I know of one comparison between NEON and SVE: https://solidpixel.github.io/astcenc_meets_sve
> Performance was a lot better than I expected, giving between 14 and 63% uplift. Larger block sizes benefitted the most, as we get higher utilization of the wider vectors and fewer idle lanes.
> I found the scale of the uplift somewhat surprising as Neoverse V1 allows 4-wide NEON issue, or 2-wide SVE issue, so in terms of data-width the two should work out very similar.