
by BeeOnRope 2749 days ago

> They're my favorite instructions too, and I really hope Zen2 fixes their performance problems.

Me too. AFAIK their slowness is probably because a fast implementation requires a specialized functional unit, something like the one described in this paper [1].

> Allegedly, Intel improved the integer-division instruction to ~20 clock cycles on the 9000-series

Do you have a reference?

That would be weird if it applied only to the 9000 series, and not other Coffee Lake cores. After all, it's the same core, reportedly unchanged all the way back to Skylake [2], so how could the divider be faster?

FWIW, even for Skylake, Agner reports 26 cycles for a 32-bit idiv, so the chip is already close (if you were talking 32-bit division).

> Even then, AVX2 code is more efficient to decode and run. So even if its emulated on AMD's platform, there are benefits to writing AVX2 code.

Yes, that's why I said you _might_ make _some_ different decisions, such as in any algorithm that doesn't scale cleanly to 256 bits but still ends up faster when the CPU offers full 256-bit ALUs (so 256-bit and 128-bit ops have the same performance).

One real-world example would be something that uses a vector-width lookup table, say for a shuffle mask. With 2 possibilities for each DWORD element, a 128-bit shuffle mask table only needs 16 entries (256 bytes), but a 256-bit table needs 256 entries, each twice as large (8 KiB in total!). With fast 256-bit units you might suck up this penalty, since it might end up faster overall, but with 128-bit units you might be better off going with the much smaller table and 128-bit lookups, at the same total throughput.
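
To make the table-size arithmetic concrete, here is a minimal sketch in C. It assumes the masks are "left-pack" masks (lanes selected by a keep bitmap are moved to the front), which is only one possible use; the names `table128`, `table256`, and `build_tables` are made up for illustration, and the 128-bit entries are laid out as byte indices (pshufb-style) while the 256-bit entries are DWORD indices (vpermd-style).

```c
#include <stddef.h>
#include <stdint.h>

/* 128-bit case: 4 DWORD lanes, 2^4 = 16 masks of 16 byte indices each. */
uint8_t table128[16][16];
/* 256-bit case: 8 DWORD lanes, 2^8 = 256 masks of 8 DWORD indices each. */
uint32_t table256[256][8];

/* Fill both tables (hypothetical left-pack layout): lanes whose bit is set
   in `keep` are moved to the front, in order. Unused trailing slots stay
   zero; a real pshufb table might use 0x80 there to zero those bytes. */
void build_tables(void)
{
    for (unsigned keep = 0; keep < 16; keep++) {
        unsigned out = 0;
        for (unsigned lane = 0; lane < 4; lane++) {
            if (keep & (1u << lane)) {
                for (unsigned b = 0; b < 4; b++)
                    table128[keep][4 * out + b] = (uint8_t)(4 * lane + b);
                out++;
            }
        }
    }
    for (unsigned keep = 0; keep < 256; keep++) {
        unsigned out = 0;
        for (unsigned lane = 0; lane < 8; lane++) {
            if (keep & (1u << lane))
                table256[keep][out++] = lane;
        }
    }
}

size_t table128_bytes(void) { return sizeof table128; } /* 16 * 16 = 256 bytes */
size_t table256_bytes(void) { return sizeof table256; } /* 256 * 32 = 8192 bytes */
```

The 32x size difference is the point: the small table fits in four cache lines, while the large one takes a quarter of a typical 32 KiB L1D.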

> Remember that its not a pure win on Intel systems either: use of any YMM register begins to downclock the whole chip,

Well, not really anymore. Most (all?) recent chips don't downclock merely for using 256-bit registers (not counting "high lane powerup"). Only some server chips downclock for "heavy" AVX2 use, which really means a lot of back-to-back FMAs or other heavy FP operations. In general the penalty for 256-bit instructions is small on recent cores (AVX-512 pays a larger one), and compilers generally use them freely and effectively (the same is not true for 512-bit).

---

[1] https://github.com/tpn/pdfs/blob/master/Fast%20Bit%20Compres...

[2] I think there must be some small changes, since the LSD was re-enabled, implying that they fixed the bug where registers could be corrupted when using the high half of the GP byte registers.


dragontamer 2749 days ago

> Do you have a reference?

Yes and no. Apparently, my mind messed up my memory. It was a leaked post on /r/intel. I thought it was for the 9000-series, but apparently it was a leak for Cannon Lake. So I was mistaken.

Second: the post has since been deleted. You can still see the claims in the comments, however.

https://old.reddit.com/r/intel/comments/9ol9is/instruction_t...

> FWIW, even for Skylake, Agner reports 26 cycles for a 32-bit idiv, so the chip is already close (if you were talking 32-bit division).

The post used to allege 20ish cycles for 64-bit division (!!). So I guess that's something to look forward to testing.


BeeOnRope 2749 days ago

I just ran some tests on CNL and indeed the behavior is very different than earlier chips. I am seeing ~15 cycle divs with no pipelining (i.e., the latency and inverse throughput are both 15), versus 36+ cycles latency and 25+ cycles inverse throughput on Skylake.

Interesting. I found only a few other changes beyond that, so far.
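
For anyone wanting to reproduce the latency vs. inverse throughput distinction, a rough sketch in C follows. This is not the test that produced the numbers above; the function names and constants are made up for illustration, it reports nanoseconds rather than cycles, and it assumes the compiler emits a hardware 64-bit divide for each iteration (worth checking in the disassembly). Divider timing on some cores also depends on operand values, so the dividends are deliberately kept large.

```c
#include <stdint.h>
#include <time.h>

/* Written to after each run so the compiler cannot discard the loops. */
volatile uint64_t sink;

/* Latency-bound chain: every dividend depends on the previous quotient,
   so the next division cannot start until the current one finishes.
   The add contributes roughly one extra cycle per iteration. */
uint64_t dependent_divs(uint64_t x, uint64_t d, uint64_t c, long n)
{
    for (long i = 0; i < n; i++)
        x = x / d + c;
    return x;
}

/* Throughput-bound: dividends come from the loop counter, so divisions
   are independent and can overlap if the divider is pipelined. */
uint64_t independent_divs(uint64_t x0, uint64_t d, long n)
{
    uint64_t sum = 0;
    for (long i = 0; i < n; i++)
        sum += (x0 + (uint64_t)i) / d;
    return sum;
}

/* Average nanoseconds per division in the dependent chain. The volatile
   read keeps the divisor unknown at compile time, so the division is not
   replaced by a multiply with a precomputed reciprocal. */
double ns_per_dependent_div(long n)
{
    volatile uint64_t divisor = 3;
    uint64_t d = divisor;
    clock_t t0 = clock();
    sink = dependent_divs(UINT64_MAX, d, UINT64_C(1) << 62, n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)n;
}

/* Average nanoseconds per division when the divisions are independent. */
double ns_per_independent_div(long n)
{
    volatile uint64_t divisor = 3;
    uint64_t d = divisor;
    clock_t t0 = clock();
    sink = independent_divs(UINT64_MAX / 2, d, n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC * 1e9 / (double)n;
}
```

Calling both with a large `n` (say 100 million) and multiplying by the core clock in GHz gives approximate cycles per division. If the two numbers come out about equal, the divider is not pipelined; if the independent loop is faster, divisions are overlapping.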


pbsd 2748 days ago

CNL also added another AES unit, so you can now dispatch aesenc and its ilk to ports 0 and 1.
