| > AMD implements some important scalar instruction set extensions as microcode, not in silicon Do you have any examples other than pdep and pext? Although these happen to be my two favorite scalar instructions, I would hesitate to call them important. Compilers won't just generate these from normal source [1], and I would call their use extremely niche at the moment (things like chess engines, I'm looking at you). They aren't even available on Intel Ivy Bridge and Sandy Bridge machines, which still make up a big enough fraction of data center machines. So I'm pretty sure the number of entities avoiding switching to AMD because of heavy pdep and pext use is pretty close to zero. Maybe you have some other instructions in mind though? > Highly optimized/efficient code for Intel microarchitectures become a lot less so on the significantly different AMD microarchitecture. This was somewhat true in the past, and probably hit its peak in the P4 vs Athlon/Opteron era. However, it is pretty much incorrect for Zen. Although the details of the hardware implementation might differ (and unless you are an insider you can mostly only guess at this), as an optimization target for software, Zen is very similar. It has a similar width, similar cache design both for data and instructions, similar instruction latencies and throughput, and so on. In fact something like Zen is as similar to Haswell as Haswell is to, say, Ivy Bridge. The primary exception is AVX/AVX2 code, where Zen implements everything internally as 128-bit operations. In this area you might make some different decisions if targeting Zen - but the gap is not huge. --- [1] What I mean is they won't generate them in any scenario other than directly calling the x86-specific builtin/intrinsic for that exact instruction. |
I lived in the HPC world prior to the existence of these instructions. I wouldn't want to go back. I used to design insanely complex and inscrutable bit-twiddling libraries to achieve the result of what is now a handful of instructions. They are among the very few intrinsics I can't live without for most of the high-performance codes I write. The only other non-standard instructions with similar value are the AES intrinsics (which are useful for more than encryption).
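For anyone who hasn't used them, here is a rough sketch of what pext and pdep compute, written as portable C loops with the BMI2 intrinsics (`_pext_u64` and `_pdep_u64` from `<immintrin.h>`) used when the compiler targets them. The names `pext64_soft`, `pdep64_soft`, `pext64` and `pdep64` are made up for illustration, and the loops are a plain reference version, not the kind of tuned bit-twiddling library described above.

```c
#include <stdint.h>

#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

/* Reference PEXT: gather the bits of src selected by mask and pack them
   into the low bits of the result, keeping their relative order. */
static inline uint64_t pext64_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t out_bit = 1; mask != 0; out_bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* lowest set bit of mask */
        if (src & lowest)
            result |= out_bit;
        mask &= mask - 1;                      /* clear that bit */
    }
    return result;
}

/* Reference PDEP: scatter the low bits of src to the positions of the
   set bits in mask, lowest mask bit first. */
static inline uint64_t pdep64_soft(uint64_t src, uint64_t mask)
{
    uint64_t result = 0;
    for (uint64_t in_bit = 1; mask != 0; in_bit <<= 1) {
        uint64_t lowest = mask & (~mask + 1);  /* lowest set bit of mask */
        if (src & in_bit)
            result |= lowest;
        mask &= mask - 1;                      /* clear that bit */
    }
    return result;
}

/* Hypothetical wrappers: one instruction with BMI2, the loop otherwise. */
static inline uint64_t pext64(uint64_t src, uint64_t mask)
{
#if defined(__BMI2__) && defined(__x86_64__)
    return _pext_u64(src, mask);
#else
    return pext64_soft(src, mask);
#endif
}

static inline uint64_t pdep64(uint64_t src, uint64_t mask)
{
#if defined(__BMI2__) && defined(__x86_64__)
    return _pdep_u64(src, mask);
#else
    return pdep64_soft(src, mask);
#endif
}
```

With mask `0x00FF00F0`, `pext64(0x12345678, mask)` packs bits 4-7 and 16-23 into `0x347`, and `pdep64(0x347, mask)` spreads them back out to `0x00340070`. The software loops run once per set mask bit, which is the cost the single instruction removes.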
Vector instruction support is important but more spotty in its value, at least in my case. I have applications where I expect the details of vector performance will matter but I have insufficient data thus far. Early AVX implementations were marginal but I could see use cases for AVX-512, though I have no anecdotal data to support that conjecture.