| HN Mirror


by bayindirh 2751 days ago

From what I've read now [0], it looks like AMD still uses 2 x 128-bit AVX units to execute AVX2 instructions. Also, AMD is always a generation behind Intel in terms of FP instruction sets, so Zen doesn't support AVX512.

According to WikiChip [4], Zen 2 actually has 256-bit FPU paths. I was unable to find a credible benchmark for Zen 2, so I can't talk about its performance. However, from the perspective I give below, it's reasonable to assume that Zen 2 is a heavy hitter in terms of floating point performance.

However, the interesting part is that when you look at SpecCPU 2017 FP Rate [1], the AMD Epyc 7601 [2] system has similar per-core performance to a much bigger Intel Xeon Platinum 8180 [3] system.

Why interesting?

    * AMD's per core base (lowest) rate is 4.1875.
    * Intel's per core base (lowest) rate is 4.3482.
    * AMD is running GCC compiled code.
    * Intel is running Intel compiled code.
    * Intel has higher clock speed.
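
The gap between the two per-core figures in the list above can be checked with a few lines of arithmetic. This is only a sketch: the two rates are the ones quoted in the list, while the helper name `relative_gap` is made up for illustration.

```python
# Per-core base rates quoted in the list above (SPEC CPU2017 FP Rate).
amd_per_core = 4.1875    # AMD Epyc 7601 system, GCC-compiled code
intel_per_core = 4.3482  # Intel Xeon Platinum 8180 system, Intel-compiled code


def relative_gap(a, b):
    """Return how much larger b is than a, as a percentage of a."""
    return (b - a) / a * 100.0


gap = relative_gap(amd_per_core, intel_per_core)
print(f"Intel leads by {gap:.1f}% per core")  # prints "Intel leads by 3.8% per core"
```

So the lead is under 4% per core, despite the compiler and clock speed differences listed above.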

Intel has some CPUs (like the Gold 5118 and Gold 6148) which have a per-core base rate of ~5.125. These CPUs are considered HPC processors and are used by a lot of people.

As I said before, it looks like Zen 2 is going to be a better HPC processor than Zen. Zen looks like a very good Enterprise processor now.

So, with my HPC hat on, I can conclude that not having 512-bit hardware is not a crippling omission.

Addenda: I forgot to say that Intel has something called "AVX frequency". Since AVX, AVX2 and AVX512 have tremendous power requirements compared to other operations, Intel lowers the CPU clock to an undisclosed frequency while they run. When I last checked, the AVX frequencies of the Intel CPUs that we use weren't in the technical guides and were not public in any way. So, the peak SpecFP Rate is not very different from the base one.

Also, since the CPU's thermal budget is very constrained during AVXx operations, other ports' speed is also reduced. So at the end of the day, AVX512 is not a free turbo boost in HPC environments and under heavy/continuous loads.

[0]: https://en.wikichip.org/wiki/amd/microarchitectures/zen#Floa...
[1]: http://spec.org/cpu2017/results/rfp2017.html
[2]: http://spec.org/cpu2017/results/res2018q4/cpu2017-20180917-0...
[3]: http://spec.org/cpu2017/results/res2017q4/cpu2017-20171017-0...
[4]: https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Ke...


Tuna-Fish 2751 days ago

A large part of the reason why AMD can reach similar sustained real throughput to Intel despite having a fraction of the FPU throughput is that they run the FPU as a separate unit on different issue ports, and their core is slightly wider when you measure the number of instructions it can retire.

So even though the Intel CPU can in theory do 4x the computation AMD can in the vector units, in reality even the tightest real vector code does all kinds of things other than vector computation in the middle of that vector stuff, like computing addresses for loads and stores and managing loop variables. On AMD, those intermixed scalar instructions go into separate scalar ports; on Intel CPUs they take space in the same issue slots that the vector code uses.
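
A rough way to see why a 4x vector advantage rarely shows up as 4x in practice is Amdahl's law. The sketch below is a simplification I'm swapping in for illustration: it treats the non-vector work as a fixed fraction of run time and does not model issue ports at all. The function name and the sample fractions are hypothetical.

```python
def effective_speedup(vector_fraction, vector_speedup):
    """Amdahl's law: overall speedup when only part of the run time speeds up.

    vector_fraction: share of the original run time spent in vector work (0..1).
    vector_speedup:  how much faster the vector part becomes (e.g. 4.0 for 4x).
    """
    if not 0.0 <= vector_fraction <= 1.0:
        raise ValueError("vector_fraction must be between 0 and 1")
    if vector_speedup <= 0:
        raise ValueError("vector_speedup must be positive")
    return 1.0 / ((1.0 - vector_fraction) + vector_fraction / vector_speedup)


# Hypothetical fractions of time spent in pure vector computation.
for fraction in (0.5, 0.8, 1.0):
    print(fraction, effective_speedup(fraction, 4.0))
# 0.5 -> 1.6x, 0.8 -> 2.5x, and only 1.0 gives the full 4x
```

Even with 80% of the time in vector work, a 4x wider vector unit yields only 2.5x overall in this model.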

Then on top of that, the memory bandwidth is a great equalizer. Doesn't matter how many multiplies you can compute if you cannot load the operands, and the AMD systems are much closer there than they are in the pure computation, especially as they have a lot more L3 cache per core.
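
The "great equalizer" effect can be sketched with the standard roofline model, where attainable throughput is the lesser of peak compute and bandwidth times arithmetic intensity. All machine numbers below are invented for illustration and do not describe any real Epyc or Xeon part.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, flops_per_byte):
    """Roofline model: throughput is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * flops_per_byte)


# Hypothetical machines: B has 4x the peak compute of A, but only
# 1.2x the memory bandwidth.
machine_a = {"peak_gflops": 400.0, "bandwidth_gbs": 100.0}
machine_b = {"peak_gflops": 1600.0, "bandwidth_gbs": 120.0}

for intensity in (0.25, 16.0):  # FLOPs performed per byte loaded
    a = attainable_gflops(flops_per_byte=intensity, **machine_a)
    b = attainable_gflops(flops_per_byte=intensity, **machine_b)
    print(f"{intensity} FLOP/byte: A={a} B={b} ratio={b / a:.1f}")
# At 0.25 FLOP/byte both are bandwidth-bound and the ratio is 1.2;
# at 16 FLOP/byte both are compute-bound and the ratio is 4.0.
```

For low-intensity kernels, the 4x compute advantage collapses to the bandwidth ratio.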

On Zen 2, AMD does two big things that are going to really help them in HPC loads. They are doubling vector unit width, and they are doubling the amount of L3 per core. I honestly think the second change will help more than the first.


bayindirh 2751 days ago

You're right. Also Intel's AVX implementation is very power heavy, and they need to lower CPU frequency to fit into their thermal budget (see "Addenda:" in my previous comment).

Also yes, AMD's memory subsystem has much lower latency, and has higher bandwidth. Also their direct-attach approach is better than Intel's. I forgot that advantage TBH :)

However, I'd argue about L3's effect on speed. In some cases, the code and the data are so small, and the computation so heavy, that you can fit almost everything into the caches. I had a 2MB binary which required 200MB of memory at most, but it completely saturated the CPU in every way imaginable.

So, in some cases caches have a great effect on speed, especially if the data you're invalidating and pulling in is huge. However, if the circulation of data through the cache is slow, a faster FPU always trumps a bigger cache.


BeeOnRope 2751 days ago

> Also yes, AMD's memory subsystem has much lower latency

No, AMD's latency is generally worse than Intel's on Zen chips. Here's the first example I could Google [1], but the same trends repeat themselves across many benchmarks.

My overall impression is that the typical gap is 5-10 ns.

[1] https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd...


bayindirh 2751 days ago

Thanks for the link. I will take a look.

As I clarified below, in my other comment, we were unable to get new Zen systems. So I’m not knowledgeable about their behavior.

However, I need to run my own benchmarks to see how this increased latency affects the performance of different workloads and scenarios.


berkut 2751 days ago

> Also yes, AMD's memory subsystem has much lower latency

Out of interest, what do you mean by this? Are you talking about Zen1 or Zen2? In my experience playing with Zen1 EPYC, the memory latency was worse than on Broadwell Xeons. On top of that, you had worse NUMA issues affecting cores which weren't directly attached to the memory, and this added more latency than on the Xeons I was comparing against.


bayindirh 2751 days ago

> Are you talking Zen1 or Zen2...

Unfortunately, neither. The last AMDs I was able to play with were the Opteron 6xxx series. The later ones weren't as fast, and Zen 1 was not easy to obtain, so we were unable to acquire them.

The last ones I used were better than their competitors of the era. I also had a desktop system from that era which was way better, at least for my workloads.

I'd love to play with Zen 1/2 and compare "benchmarks" to "real workloads", because as I said before, in HPC, benchmarks are just numbers.

e.g. Your memory bandwidth may be low, but if it's low latency & you're hammering the bus, bandwidth may not be limiting. OTOH if you're streaming something continuously, your latency becomes moot, because the bus has already queued up everything you need and can keep piling up data until you have processed what is at hand. For the second scenario, I once listened to a talk about an embedded system in which the developers were able to accelerate the system 10x by using an in-CPU accelerator unit to copy the required memory segments to cache independently of the CPU.
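
The latency vs. streaming trade-off can be sketched with Little's law: sustained bandwidth is roughly the number of requests in flight times the request size, divided by latency. The function name, the 64-byte request size and the 100 ns latency below are assumptions picked for illustration, not measurements of any system discussed here.

```python
def sustained_bandwidth_gbs(outstanding_requests, bytes_per_request, latency_ns):
    """Little's law estimate; bytes per nanosecond is the same as GB/s."""
    if latency_ns <= 0:
        raise ValueError("latency_ns must be positive")
    return outstanding_requests * bytes_per_request / latency_ns


LINE_BYTES = 64     # assumed request (cache line) size
LATENCY_NS = 100.0  # assumed memory latency

# Dependent loads (e.g. pointer chasing): one request in flight,
# so latency directly limits throughput.
dependent = sustained_bandwidth_gbs(1, LINE_BYTES, LATENCY_NS)

# Streaming: many requests queued ahead of the consumer, so the
# same latency is hidden behind the requests already in flight.
streaming = sustained_bandwidth_gbs(10, LINE_BYTES, LATENCY_NS)

print(f"dependent: {dependent:.2f} GB/s, streaming: {streaming:.2f} GB/s")
```

With identical latency, the streaming case moves ten times the data in this model, purely because more requests are queued.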
