Hacker News new | ask | show | jobs
by Tuna-Fish 2749 days ago
A large part of the reason why AMD can reach similar sustained real throughput to Intel despite having a fraction of the FPU throughput is that they run the FPU as a separate unit on different issue ports, and their core is slightly wider when you measure the amount of instructions it can retire.

So even though the Intel CPU can in theory do 4x the computation AMD can in the vector units, in reality even the tightest real vector code does all kinds of things other than vector computation, in the middle of that vector stuff, like computing addresses for loads and stores and managing loop variables. On AMD, those intermixed scalar instructions go into separate scalar ports, on Intel CPUs they take space in the same issue slots that the vector code uses.

Then on top of that, the memory bandwidth is a great equalizer. Doesn't matter how many multiplies you can compute if you cannot load the operands, and the AMD systems are much closer there than they are in the pure computation, especially as they have a lot more L3 cache per core.

On Zen 2, AMD does two big things that are going to really help them in HPC loads. They are doubling vector unit width, and they are doubling the amount of L3 per core. I honestly think the second change will help more than the first.

1 comments

You're right. Also Intel's AVX implementation is very power heavy, and they need to lower CPU frequency to fit into their thermal budget (see "Addenda:" in my previous comment).

Also yes, AMD's memory subsystem has much lower latency, and has higher bandwidth. Also their direct-attach approach is better than Intel. I forgot that advantage TBH :)

However, I can argue about L3s effect on speed. In some cases, the code and the data is so small, but the computation is so heavy that, you can fit almost everything into the caches. I had a 2MB binary which required 200MBs of memory at most, but it completely saturated the CPU in every way imaginable.

So, in some cases caches have great affect on speed. Especially if the data you're invalidating and pulling in is huge. However, if the circulation is slow, a faster FPU always trumps a bigger cache.

> Also yes, AMD's memory subsystem has much lower latency

No, AMD's latency is generally worse than Intel's on Zen chips. Here's the first example I could Google [1], but the same trends repeat themselves across many benchmarks.

My overall impression is that the typical gap is 5-10 ns.

[1] https://www.anandtech.com/show/11544/intel-skylake-ep-vs-amd...

Thanks for the link. I will take a look.

As I clarified below, in my other comment, we were unable to get new Zen systems. So I’m not knowledgeable about their behavior.

However, I need to make my own benchmarks to see how this increased latency affects performance of different work loads and scenarios.

> Also yes, AMD's memory subsystem has much lower latency

Out of interest, what do you mean by this? Are you talking Zen1 or Zen2, because in my experience playing with Zen1 EPYC the memory latency was worse than Xeon Broadwells, and on top of that you had worse NUMA issues that could affect certain cores which weren't directly attached to the memory and this added additional latency more than on the Xeons I was comparing against.

> Are you talking Zen1 or Zen2...

Unfortunately, neither. The last AMDs I was able to play with Opteron 6xxx series. The later ones weren't as fast, and Zen 1 was not easy to obtain, so we were unable to acquire them.

The last ones I used were better from their competitors of the era. I also had a desktop system from that era which was way better, at least for my workloads.

I'd love to play with Zen 1/2 and compare "benchmarks" to "real workloads", because as I said before, in HPC, benchmarks are just numbers.

e.g. Your memory bandwidth may be low, but if it's low latency & you're hammering the bus, bandwidth may not be limiting. OTOH if you're streaming something continuously, your latency becomes moot, because the bus has already queued up everything you need and can continue piling up stuff you need until you process the ones at hand. For the second scenario, I have listened to a talk about an embedded system, which the developers were able to accelerate the system 10x by using an in-cpu accelerator unit to copy required memory segments to cache independently from the CPU.