

by dzaima 677 days ago

I know the E-cores (gracemont, crestmont, skymont) have the multi-decoder setup; the first couple search results don't show Golden Cove being the same. Do you have some reference for that?

6. Ah yeah the funky SSE4a thing. RISC-V has its own similar but worse thing with RVV0.7.1 / xtheadvector already though, and it can be basically guaranteed that there will be tons of one-off vendor extensions, including vector ones, given that anyone can make such.

8. RVV's vrgather is extremely bad at this, but is very important for a bunch of non-trivial things; existing RVV1.0 hardware has it at O(LMUL^2), e.g. BPI-F3 takes 256 cycles for LMUL=8[1]. But some hypothetical future hardware could do it at O(LMUL) for non-worst-case indices, thus massively changing tradeoffs. So far the compiler approaches are to just not do high LMUL when vrgather is needed (potentially leaving free perf on the table), or using indexed loads (potentially significantly worse).
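
To make the scaling concrete, here is a rough cost model. It assumes an LMUL=n vrgather is split into LMUL=1 register-level gathers: one per (destination register, source register) pair in the quadratic case, and one per destination register in the hypothetical linear case. The 4-cycle cost per gather is an assumption, chosen only so that the model reproduces the 256-cycle LMUL=8 figure quoted above; it is not a measured value.

```python
def vrgather_cycles(lmul: int, cycles_per_gather: int = 4, quadratic: bool = True) -> int:
    """Estimate cycles for one vrgather at a given LMUL.

    cycles_per_gather is an ASSUMED cost of one LMUL=1 register-level gather.
    """
    if lmul not in (1, 2, 4, 8):
        raise ValueError("this sketch only models LMUL in {1, 2, 4, 8}")
    # Quadratic: every destination register may need data from every source register.
    # Linear: one gather per destination register (non-worst-case indices).
    gathers = lmul * lmul if quadratic else lmul
    return gathers * cycles_per_gather


for lmul in (1, 2, 4, 8):
    print(lmul, vrgather_cycles(lmul), vrgather_cycles(lmul, quadratic=False))
```

Under these assumptions LMUL=8 costs 256 cycles in the quadratic case and 32 in the linear one, which is the kind of gap that would change whether high LMUL is worth using around a vrgather.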

Whereas x86 and ARM SIMD perf variance is very tiny; basically everything is pretty proportional everywhere, with maybe the exception of very old atom cores. There'll be some differences of 2x up or down of throughput of instruction classes, but it's generally not so bad as to make way for alternative approaches to be better.

[1]: https://camel-cdr.github.io/rvv-bench-results/bpi_f3/index.h...


hajile 677 days ago

I think you may be correct about Gracemont vs. Golden Cove. Rumors/insiders say that Intel has supposedly decided to kill off either the P-core or E-core team, so I'd guess that the P-core team is getting laid off because the E-core IPC is basically the same, but the E-core is massively more efficient. Even if the P-core wins, I'd expect them to adopt the 3x3 decoder just as AMD adopted a 2x4 decoder for Zen 5.

Using a non-frozen spec is at your own risk. There's nothing comparable to stuff like SSE4a or FMA4. The custom extension issue is vastly overstated. Anybody can make extensions, but nobody will use unratified extensions unless you are in a very niche industry. The P extension is a good example here. The current proposal is a copy/paste of a proprietary extension a company is using. There may be people in their niche using their extension, but I don't see people jumping to add support anywhere (outside their own engineers).

There's a LOT to unpack about RVV. Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as Packed SIMD, but can sometimes be better which isn't a terrible place to be.

Differing performance across different performance levels is to be expected when RVV must scale from tiny DSPs up to supercomputers. As you point out, old atom cores (about the same as the Spacemit CPU) would have a different performance profile from a larger core. Even larger AMD cores have different performance characteristics with their tendency to like double-pumping AVX2/512 instructions (but not all of them -- just some).

In any case, it's a matter of the wrong configuration unlike x86 where it is a matter of the wrong instruction (and the wrong configuration at times). It seems obvious to me that the compiler will ultimately need to generate a handful of different code variants (shouldn't be a code bloat issue because only a tiny fraction of all code is SIMD), then dynamically choose the best variant for the processor at runtime.
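
A minimal sketch of what "several variants plus runtime choice" could look like, written in Python for brevity. Everything here is hypothetical: `CpuTraits`, its fields, and the two variant functions are illustrative names, not a real compiler, OS, or libc interface, and the variants are scalar stand-ins for differently tuned vector code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class CpuTraits:
    """Hypothetical runtime-detected properties; not a real detection API."""
    vlen_bits: int
    fast_vrgather: bool


def reverse_with_gather(xs: Sequence[int]) -> List[int]:
    # Stand-in for a variant tuned for cores where gathers are cheap.
    n = len(xs)
    return [xs[n - 1 - i] for i in range(n)]


def reverse_with_strided_access(xs: Sequence[int]) -> List[int]:
    # Stand-in for a variant tuned for cores where strided memory access is cheaper.
    return list(xs[::-1])


def pick_reverse(cpu: CpuTraits) -> Callable[[Sequence[int]], List[int]]:
    """Choose a variant once (e.g. at startup) from the detected traits."""
    return reverse_with_gather if cpu.fast_vrgather else reverse_with_strided_access


cpu = CpuTraits(vlen_bits=256, fast_vrgather=False)
reverse = pick_reverse(cpu)
print(reverse([1, 2, 3, 4]))  # [4, 3, 2, 1]
```

All variants must produce the same result; only the cost differs per core, so the selection logic is the part that has to know about hardware.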


dzaima 677 days ago

> Packed SIMD doesn't even have LMUL>1, so the comparison here is that you are usually the same as Packed SIMD, but can sometimes be better which isn't a terrible place to be.

Packed SIMD not having LMUL means that hardware can't rely on it being used for high performance; whereas some of the theadvector hardware (which could equally apply to rvv1.0) already had VLEN=128 with 256-bit ALUs, thus having LMUL=2 have twice the throughput of LMUL=1. And even above LMUL=2 various benchmarks have shown improvements.
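
The VLEN=128 / 256-bit ALU case can be worked through with a simple throughput model. This is a sketch under stated assumptions: one vector instruction occupies the ALU for ceil(LMUL * VLEN / ALU width) cycles, nothing else limits throughput, and LMUL is an integer. The function name and parameters are illustrative.

```python
import math


def bits_per_cycle(vlen_bits: int, lmul: int, alu_bits: int) -> float:
    """Useful bits processed per cycle for back-to-back vector instructions.

    ASSUMPTION: an instruction over a register group occupies the ALU for
    ceil(group_bits / alu_bits) cycles and there are no other bottlenecks.
    """
    group_bits = vlen_bits * lmul
    cycles = math.ceil(group_bits / alu_bits)
    return group_bits / cycles


for lmul in (1, 2, 4, 8):
    print(lmul, bits_per_cycle(128, lmul, 256))
```

In this model LMUL=1 only fills half of the 256-bit ALU (128 bits per cycle), while LMUL=2 fills it (256 bits per cycle). The model saturates from LMUL=2 upward, so it does not explain the gains benchmarks show at higher LMUL; those would have to come from effects it ignores, such as fewer loop iterations.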

Having a compiler output multiple versions is an interesting idea. Pretty sure it won't happen though; it'd be a rather difficult political mess of more and more "please add special-casing of my hardware", and would have the problem of it ceasing to reasonably function on hardware released after being compiled (unless like glibc or something gets some standard set of hardware performance properties that can be updated independently of precompiled software, which'd be extra hard to get through). Also P-cores vs E-cores would add an extra layer of mess. There might be some simpler version of just going by VLEN, which is always constant, but I don't see much use in that really.


janwas 676 days ago

> it's a matter of the wrong configuration unlike x86 where it is a matter of the wrong instruction

+1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.

I agree with your point that multiple code variants + runtime dispatch are helpful. We do this with Highway in particular for x86. Users only write code once with portable intrinsics, and the mess of instruction selection is taken care of.


camel-cdr 676 days ago

> +1 to dzaima's mention of vrgather. The lack of fixed-pattern shuffle instructions in RVV is absolutely a wrong-instruction issue.

What others would you want? Something like vzip1/2 would make sense, but that isn't much of a permutation, since the input elements are exactly next to the output elements.


janwas 676 days ago

Going through Highway's set of shuffle ops:

64-bit OddEven/Reverse2/ConcatOdd/ConcatEven, OddEvenBlocks, SwapAdjacentBlocks, 8-bit Reverse, CombineShiftRightBytes, TableLookupBytesOr0 (=PSHUFB) and Broadcast especially for 8-bit, TwoTablesLookupLanes, InsertBlock, InterleaveLower/InterleaveUpper (=vzip1/2).

All of these are considerably more expensive on RVV. SVE has a nice set, despite also being VL-agnostic.
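
To illustrate why fixed patterns tend to end up on the general path, here is a scalar model of two of the simpler ops expressed as a generic gather plus a computed index vector. It assumes Reverse2 swaps each adjacent pair of lanes and Reverse reverses all lanes; the helper names are illustrative, and this is not how Highway implements them on any target.

```python
from typing import List, Sequence


def gather(src: Sequence[int], idx: Sequence[int]) -> List[int]:
    """General permutation: out[i] = src[idx[i]], for in-range indices."""
    return [src[i] for i in idx]


def reverse2_indices(n: int) -> List[int]:
    # Swap each adjacent pair of lanes: 1, 0, 3, 2, ...
    return [i ^ 1 for i in range(n)]


def reverse_indices(n: int) -> List[int]:
    # Full reversal: n-1, n-2, ..., 0
    return [n - 1 - i for i in range(n)]


v = [10, 11, 12, 13]
print(gather(v, reverse2_indices(len(v))))  # [11, 10, 13, 12]
print(gather(v, reverse_indices(len(v))))   # [13, 12, 11, 10]
```

With a dedicated instruction the pattern is implicit in the opcode. With only a general gather, the index vector has to be built or loaded first, and the gather itself pays the general-permutation cost even though the pattern is known at compile time.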


dzaima 676 days ago

More RVV questionable optimization cases:

- broadcasting a loaded value: a stride-0 load can be used for this, and could be faster than going through a GPR load & vmv.v.x, but could also be much slower.

- reversing: could use vrgather (could do high LMUL everywhere and split into multiple LMUL=1 vrgathers), could use a stride -1 load or store.

- early-exit loops: It's feasible to vectorize such, even with loads via fault-only-first. But if vl=vlmax is used for it, it might end up doing a ton of unnecessary computation, esp. on high-VLEN hardware. Though there's the "fun" solution of hardware intentionally lowering vl on fault-only-first to what it considers reasonable, as there aren't strict requirements for it.
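
The early-exit case can be quantified with a scalar model that counts how many elements get processed before the loop exits. It is a sketch under a simplifying assumption: each iteration computes on a full chunk of up to `vl` elements, like one vector iteration, even when the match sits at the start of the chunk. `find_first` and its parameters are illustrative names.

```python
from typing import Sequence, Tuple


def find_first(xs: Sequence[int], target: int, vl: int) -> Tuple[int, int]:
    """Return (index of first match or -1, number of elements processed).

    ASSUMPTION: every chunk is fully computed before the exit check,
    which is what a vectorized loop body would do.
    """
    if vl <= 0:
        raise ValueError("vl must be positive")
    processed = 0
    for start in range(0, len(xs), vl):
        chunk = xs[start:start + vl]
        processed += len(chunk)
        for offset, value in enumerate(chunk):
            if value == target:
                return start + offset, processed
    return -1, processed


data = list(range(1000))
print(find_first(data, 3, vl=4))    # (3, 4)
print(find_first(data, 3, vl=256))  # (3, 256)
```

Both calls find the match at index 3, but the vl=256 run processes 256 elements to do so, against 4 for vl=4. That is the wasted work a large vl can cause when exits tend to happen early.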
