|
|
|
|
|
by dzaima
677 days ago
|
|
I know the E-cores (gracemont, crestmont, skymont) have the multi-decoder setup; the first couple search results don't show Golden Cove being the same. Do you have some reference for that? 6. Ah yeah the funky SSE4a thing. RISC-V has its own similar but worse thing with RVV0.7.1 / xtheadvector already though, and it can be basically guaranteed that there will be tons of one-off vendor extensions, including vector ones, given that anyone can make such. 8. RVV's vrgather is extremely bad at this, but is very important for a bunch of non-trivial things; existing RVV1.0 hardware has it at O(LMUL^2), e.g. BPI-F3 takes 256 cycles for LMUL=8[1]. But some hypothetical future hardware could do it at O(LMUL) for non-worst-case indices, thus massively changing tradeoffs. So far the compiler approaches are to just not do high LMUL when vrgather is needed (potentially leaving free perf on the table), or using indexed loads (potentially significantly worse). Whereas x86 and ARM SIMD perf variance is very tiny; basically everything is pretty proportional everywhere, with maybe the exception of very old atom cores. There'll be some differences of 2x up or down of throughput of instruction classes, but it's generally not so bad as to make way for alternative approaches to be better. [1]: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.h... |
|
Using a non-frozen spec is at your own risk. There's nothing comparable to stuff like SSE4a or FMA4. The custom extension issue is vastly overstated. Anybody can make extensions, but nobody will use unratified extensions unless you are in a very niche industry. The P extension is a good example here. The current proposal is a copy/paste of a proprietary extension a company is using. There may be people in their niche using their extension, but I don't see people jumping to add support anywhere (outside their own engineers).
There's a LOT to unpack about RVV. Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as Packed SIMD, but can sometimes be better which isn't a terrible place to be.
Differing performance across different performance levels is to be expected when RVV must scale from tiny DSPs up to supercomputers. As you point out, old atom cores (about the same as the Spacemit CPU) would have a different performance profile from a larger core. Even larger AMD cores have different performance characteristics with their tendency to like double-pumping AVX2/512 instructions (but not all of them -- just some).
In any case, it's a matter of the wrong configuration unlike x86 where it is a matter of the wrong instruction (and the wrong configuration at times). It seems obvious to me that the compiler will ultimately need to generate a handful of different code variants (shouldn't be a code bloat issue because only a tiny fraction of all code is SIMD) the dynamically choose the best variant for the processor at runtime.