Hacker News
by jandrewrogers 2749 days ago
There are a couple more reasons, all related to loss of optimization that can offset any nominal price-performance gains. Counterintuitively, people who are the most sensitive to performance often have the most to lose.

AMD implements some important scalar instruction set extensions as microcode, not in silicon, so if you have an application that uses them heavily (and some of these instructions are significant optimizations over generic C code) you will see a drop-off in performance.

Highly optimized/efficient code for Intel microarchitectures becomes a lot less so on the significantly different AMD microarchitecture. The effects are not small, and re-optimizing for a different microarchitecture can be a lot of work depending on the application.


> AMD implements some important scalar instruction set extensions as microcode, not in silicon

Do you have any examples other than pdep and pext? Although these happen to be my two favorite scalar instructions, I would hesitate to call them important. Compilers won't just generate these from normal source [1], and I would call their use extremely niche at the moment (things like chess engines, I'm looking at you). They aren't even available on Intel Ivy Bridge and Sandy Bridge machines, which still make up a big enough fraction of data center machines.

So I'm pretty sure the number of entities avoiding switching to AMD because of heavy pdep and pext use is pretty close to zero.

Maybe you have some other instructions in mind though?

> Highly optimized/efficient code for Intel microarchitectures becomes a lot less so on the significantly different AMD microarchitecture.

This was somewhat true in the past, and probably hit its peak in the P4 vs Athlon/Opteron era. However, it is pretty much incorrect for Zen. Although the details of the hardware implementation might differ (and unless you are an insider you can mostly only guess at this), as an optimization target for software, Zen is very similar. It has a similar width, similar cache design both for data and instructions, similar instruction latencies and throughput, and so on. In fact something like Zen is as similar to Haswell as Haswell is to say Ivy Bridge.

The primary exception is AVX/AVX2 code, where Zen implements everything internally as 128-bit operations. In this area you might make some different decisions if targeting Zen - but the gap is not huge.

---

[1] What I mean is they won't generate them in any scenario other than directly calling the x86-specific builtin/intrinsic for that exact instruction.

PDEP/PEXT are the big ones; they are extremely important for real-time sensor and event processing (plus a few other things like join parallelization). Those instructions let you trivially compute ad hoc intersections between arbitrary and mixed dimensionality constraints in high dimensionality spaces that would lead to some very ugly and much slower code in pure C++. Also useful for massively parallel graph analytics. Ironically, the instructions were not designed for this purpose. We are talking about a 10x improvement in throughput, which isn't trivial.

I lived in the HPC world prior to the existence of these instructions. I wouldn't want to go back. I used to design insanely complex and inscrutable bit-twiddling libraries to achieve the result of what is a handful of instructions now. It is one of the very few intrinsics I can't live without for most of the high-performance codes I write. The only other non-standard instructions with similar value are the AES intrinsics (which are useful for more than encryption).

Vector instruction support is important but more spotty in its value, at least in my case. I have applications where I expect the details of vector performance will matter but I have insufficient data thus far. Early AVX implementations were marginal but I could see use cases for AVX-512, though I have no anecdotal data to support that conjecture.

Thanks, that is really interesting. It is hard to believe that pdep/pext alone could result in a 10x throughput improvement - but I acknowledge it is possible, since that is one instruction that is very slow to emulate in the general case, and if you needed exactly that...
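To make the "slow to emulate in the general case" point concrete, here is a minimal reference sketch of what the two instructions compute, written in Python purely for illustration. The function names `pext` and `pdep` mirror the instruction names; this is not anyone's production code, just the textbook semantics. Note that the emulation loops once per set mask bit, which is why a software fallback for an arbitrary, unpredictable mask costs so much more than a single hardware instruction.

```python
def pext(value: int, mask: int) -> int:
    """Parallel bit extract: gather the bits of `value` selected by `mask`
    and pack them contiguously into the low bits of the result."""
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask      # isolate the lowest set bit of the mask
        if value & lowest:
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1           # clear that mask bit and continue
    return result


def pdep(value: int, mask: int) -> int:
    """Parallel bit deposit: scatter the low bits of `value` to the
    positions of the set bits in `mask` (the inverse of pext on the mask)."""
    result = 0
    in_bit = 0
    while mask:
        lowest = mask & -mask
        if (value >> in_bit) & 1:
            result |= lowest
        in_bit += 1
        mask &= mask - 1
    return result
```

For example, `pext(0b10110110, 0b11110000)` pulls out the upper nibble, and `pdep` with the same mask puts it back.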

It actually isn't clear to me exactly what Intel was targeting with that pair of instructions, but they sure are useful in all sorts of scenarios.

> The only other non-standard instructions with similar value are the AES intrinsics

If I can ask, what are the interesting uses outside of encryption? The main use I am aware of is as a handy fast and high-quality hash function implemented in hardware (and you don't need all the rounds when you are just after quality, and not adversarial collision resistance).

For PDEP/PEXT it is the general case of ad hoc and unpredictable bit extract/deposit sequences. A decade ago, I spent a lot of time designing clever libraries that could dynamically effect this but even if you could amortize the overhead of setting up the machinery, it still was ~20 cycles. These instructions eliminated the need to code gen at all, and each instruction runs a lot faster than ~20 cycles. When those instructions showed up with Haswell, it wiped out a lot of code I had written, and in a good way. You can compose them to effect algorithms that would be very complicated (and slow) to implement otherwise.

I've read some things from Intel that suggest PDEP/PEXT were designed for cryptographic applications. However, they are a straightforward implementation of generalized shift networks (there is literature on this), so their potential applications are much broader.

For AES, those instructions have interesting properties for integer manipulation beyond encryption, and even beyond providing the basis for the fastest generic non-cryptographic hash functions currently available for both large and small keys. For example, you can compute a perfect hash (e.g. collision-free hashing from 32 bits to 32 bits) in a few clock cycles for scalar primitives using AES intrinsics. If you understand the construction, which superficially seems like it should not be possible, the result is virtually ideal statistically. Brilliant for hash tables, which still spend a lot of their time hashing, so I am surprised no one seems to be doing it (I figured it out myself, studying the statistical peculiarities of the AES instructions).

> Do you have any examples other than pdep and pext?

They're my favorite instructions too, and I really hope Zen2 fixes their performance problems. But as you say: they're not really used much in practice. I can only point to Stockfish, which uses pdep / pext to calculate where bishops and/or rooks can move on 64-bit (8x8) chess boards.

Side note: Figuring out where bishops / rooks can move is damn cool. https://www.chessprogramming.org/BMI2#PEXTBitboards

"occ" is the set of occupied squares. Remember that in chess, bishops and rooks are blocked by both allied and enemy pieces. EnumSquare is a value in the range [0, 64) that represents where the bishop (or rook) is located.
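For readers who don't want to click through, here is a rough sketch of the PEXT-bitboard idea for rooks, in Python for readability. It is not Stockfish's code: all names (`rook_mask`, `build_rook_table`, `rook_attacks`, and so on) are made up for illustration, squares are assumed to be numbered 0 = a1 through 63 = h8, and `pext`/`pdep` are software stand-ins for the hardware instructions. The idea is to precompute, per square, one attack set for every possible arrangement of blockers on the relevant squares, and then use pext on the occupancy to get a dense table index.

```python
def pext(value, mask):
    """Software stand-in for PEXT: pack the mask-selected bits of value."""
    result, out_bit = 0, 0
    while mask:
        if value & (mask & -mask):
            result |= 1 << out_bit
        out_bit += 1
        mask &= mask - 1
    return result


def pdep(value, mask):
    """Software stand-in for PDEP: scatter low bits of value onto the mask."""
    result, in_bit = 0, 0
    while mask:
        if (value >> in_bit) & 1:
            result |= mask & -mask
        in_bit += 1
        mask &= mask - 1
    return result


ROOK_DIRS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def rook_attacks_slow(square, occ):
    """Walk each ray until the board edge or the first blocker (inclusive)."""
    attacks = 0
    rank, file = divmod(square, 8)
    for dr, df in ROOK_DIRS:
        r, f = rank + dr, file + df
        while 0 <= r < 8 and 0 <= f < 8:
            bit = 1 << (r * 8 + f)
            attacks |= bit
            if occ & bit:
                break
            r, f = r + dr, f + df
    return attacks


def rook_mask(square):
    """Squares whose occupancy matters: each ray minus its final edge square
    (a blocker on the last square cannot hide anything behind it)."""
    mask = 0
    rank, file = divmod(square, 8)
    for dr, df in ROOK_DIRS:
        r, f = rank + dr, file + df
        while 0 <= r + dr < 8 and 0 <= f + df < 8:
            mask |= 1 << (r * 8 + f)
            r, f = r + dr, f + df
    return mask


def build_rook_table(square):
    """pdep turns each dense index into one concrete blocker arrangement."""
    mask = rook_mask(square)
    size = 1 << bin(mask).count("1")
    table = [rook_attacks_slow(square, pdep(i, mask)) for i in range(size)]
    return mask, table


def rook_attacks(square, occ, mask, table):
    """The fast path: one pext and one table load."""
    return table[pext(occ, mask)]
```

With real hardware the last function is the whole lookup, which is why slow pdep/pext hurts this particular workload so directly.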

----------

The other instructions I came across that are microcode-based were:

1. vgather -- I'm pretty sure Intel's is microcode-based as well, however.

2. PCLMULQDQ -- Carryless Multiply, used for GCM mode encryption. Intel's allegedly has a throughput of 1 clock per instruction, while I've measured AMD's at ~2 clocks per instruction, and AMD claims it's microcoded (it doesn't say which FP pipelines are used).

Neither is scalar code, though.
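In case "carryless multiply" is unfamiliar: it is ordinary shift-and-add multiplication with XOR in place of addition, i.e. polynomial multiplication over GF(2). A minimal Python sketch of the operation follows (the name `clmul` is just a label for illustration; the real instruction multiplies two 64-bit operands into a 128-bit product, and GCM follows it with a reduction step that is not shown here).

```python
def clmul(a: int, b: int) -> int:
    """Carryless multiply: XOR together copies of `a` shifted by the
    position of each set bit in `b` (no carries between bit positions)."""
    result = 0
    shift = 0
    while b:
        if b & 1:
            result ^= a << shift
        b >>= 1
        shift += 1
    return result
```

As a quick sanity check, `clmul(0b11, 0b11)` is `0b101`, since (x + 1)^2 = x^2 + 1 when coefficients are taken mod 2.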

------

Allegedly, Intel improved the integer-division instruction to ~20 clock cycles on the 9000-series, but that isn't implemented on servers yet. So I guess 64-bit division / 64-bit modulo is now a major advantage to Intel. But this is a very recent event and not widely deployed yet.

> The primary exception is AVX/AVX2 code, where Zen implements everything internally as 128-bit operations. In this area you might make some different decisions if targeting Zen - but the gap is not huge.

Even then, AVX2 code is more efficient to decode and run. So even if it's emulated on AMD's platform, there are benefits to writing AVX2 code.

Remember that it's not a pure win on Intel systems either: use of any YMM register begins to downclock the whole chip, since those registers draw significantly more power. There are also some vzeroupper issues (mostly used to avoid this downclocking problem).

In effect: you need to use AVX2 and AVX512 code with a degree of caution on Intel platforms. It's probably a win if you're reaching for the button, but for very small loops, the downclock may slow down the rest of your scalar code.

-------------

Otherwise, I think I agree with you fundamentally. Optimizing for Zen or Skylake is incredibly similar: use SIMD where possible and cut dependencies.

The branch predictor is different, but I don't think anyone (aside from Meltdown / Spectre code) relies on the details of either branch predictor. The number of execution pipes is different, but the programmer's focus should remain on cutting dependencies and maximizing ILP, regardless of the number of execution pipes that exist.

> They're my favorite instructions too, and I really hope Zen2 fixes their performance problems.

Me too. AFAIK their slowness is probably due to requiring a specialized functional unit to implement. Something like the unit described in this paper [1].

> Allegedly, Intel improved the integer-division instruction to ~20 clock cycles on the 9000-series

Do you have a reference?

That would be weird if it applied only to the 9000 series, and not other Coffee Lake cores. After all, it's the same core, reportedly unchanged all the way back to Skylake [2], so how could the divider be faster?

FWIW, even for Skylake, Agner reports 26 cycles for a 32-bit idiv, so the chip is already close (if you were talking 32-bit division).

> Even then, AVX2 code is more efficient to decode and run. So even if it's emulated on AMD's platform, there are benefits to writing AVX2 code.

Yes, that's why I said you _might_ make _some_ different decisions, such as in any algorithm that doesn't scale cleanly to 256 bits, but still ends up faster when the CPU offers full 256 bit ALUs (so 256 bit and 128 bit ops have the same performance).

One real-world example would be something that uses a vector-width lookup table, say for a shuffle mask. With 2 possibilities for each DWORD element, a 128-bit shuffle mask only needs 16 entries, but 256-bit masks need 256 and they are twice as large (8 KiB in total!). With fast 256-bit units you might suck up this penalty, since it might end up faster overall, but with 128-bit units you might be better off going with the much smaller table and 128-bit lookups, at the same total throughput.
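The table sizes above fall out of simple arithmetic, sketched here in Python. The helper name `shuffle_table_bytes` and its parameters are hypothetical, and it assumes one full vector-width mask is stored per entry.

```python
def shuffle_table_bytes(vector_bits: int, element_bits: int = 32, choices: int = 2):
    """Return (entries, total_bytes) for a table holding one full-width
    shuffle mask per combination of per-element choices."""
    lanes = vector_bits // element_bits     # DWORD elements per vector
    entries = choices ** lanes              # one entry per combination
    entry_bytes = vector_bits // 8          # each entry is a whole vector
    return entries, entries * entry_bytes
```

So the 128-bit table is 16 entries of 16 bytes (256 bytes), while the 256-bit table is 256 entries of 32 bytes (8192 bytes, i.e. 8 KiB): 16x the entries at twice the entry size.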

> Remember that it's not a pure win on Intel systems either: use of any YMM register begins to downclock the whole chip,

Well not really anymore. Most (all?) recent chips don't downclock for use of 256-bit registers (not counting "high lane powerup"). Only some server chips downclock for "heavy" AVX2 use, which really means a lot of back-to-back FMAs or other heavy FP operations. In general the penalty for 256-bit instructions is small on recent cores (a larger penalty is paid for AVX-512), and compilers generally use them freely (the same is not true for 512-bit) and effectively.

---

[1] https://github.com/tpn/pdfs/blob/master/Fast%20Bit%20Compres...

[2] I think there must be some small changes, since the LSD was re-enabled, implying that they fixed the bug where registers could be corrupted when using the high half of the GP byte registers.

> Do you have a reference?

Yes and no. Apparently, my mind messed up my memory. It was a leaked post on /r/intel. I thought it was for the 9000-series, but apparently it was a leak for Cannon Lake. So I was mistaken.

Second: the post has since been deleted. You can still see the claims in the comments, however.

https://old.reddit.com/r/intel/comments/9ol9is/instruction_t...

> FWIW, even for Skylake, Agner reports 26 cycles for a 32-bit idiv, so the chip is already close (if you were talking 32-bit division).

The post used to allege 20ish cycles for 64-bit division (!!). So I guess that's something to look forward to testing.

I just ran some tests on CNL and indeed the behavior is very different than earlier chips. I am seeing ~15 cycle divs with no pipelining (i.e., the latency and inverse throughput are both 15), versus 36+ cycles latency and 25+ cycles inv throughput on Skylake.

Interesting. I found only a few other changes beyond that, so far.

CNL also added another AES unit, so you can now dispatch aesenc and its ilk to ports 0 and 1.