by Tuna-Fish
2749 days ago
A large part of the reason AMD can reach sustained real-world throughput similar to Intel's, despite having a fraction of the FPU throughput, is that they run the FPU as a separate unit on different issue ports, and their core is slightly wider when you measure the number of instructions it can retire. So even though the Intel CPU can in theory do 4x the computation AMD can in the vector units, in reality even the tightest real vector code does all kinds of things other than vector computation in the middle of that vector work, like computing addresses for loads and stores and managing loop variables. On AMD, those intermixed scalar instructions go to separate scalar ports; on Intel CPUs, they take up the same issue slots that the vector code uses.
Then on top of that, memory bandwidth is a great equalizer. It doesn't matter how many multiplies you can compute if you cannot load the operands, and the AMD systems are much closer to Intel in memory bandwidth than in pure computation, especially as they have a lot more L3 cache per core.
On Zen 2, AMD is doing two big things that are going to really help them in HPC loads: they are doubling the vector unit width, and they are doubling the amount of L3 per core. I honestly think the second change will help more than the first.
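To make the issue-port argument concrete, here is a rough toy model in Python. Everything in it is an assumption for illustration: the function names, the lane counts, the port counts, and the loop mix are made up and are not measurements of any real AMD or Intel core. It only shows how a 4x peak advantage can shrink once scalar bookkeeping competes for the same issue slots.

```python
# Toy model only. All numbers are hypothetical, not real CPU specifications.

def elements_per_cycle_shared(lanes, vector_ops, scalar_ops, ports):
    """Vector and scalar instructions compete for the same issue ports."""
    cycles = (vector_ops + scalar_ops) / ports
    return lanes * vector_ops / cycles

def elements_per_cycle_split(lanes, vector_ops, scalar_ops,
                             vector_ports, scalar_ports):
    """Vector and scalar instructions issue on separate ports in parallel,
    so the slower of the two streams sets the pace."""
    cycles = max(vector_ops / vector_ports, scalar_ops / scalar_ports)
    return lanes * vector_ops / cycles

# Hypothetical loop body: 2 vector ops plus 2 scalar ops
# (address computation, loop counter) per iteration.
VECTOR_OPS, SCALAR_OPS = 2, 2

# "Wide" core: 16 lanes per vector op, 2 ports shared by vector and scalar.
wide = elements_per_cycle_shared(16, VECTOR_OPS, SCALAR_OPS, ports=2)

# "Narrow" core: 4 lanes per vector op, 2 vector ports, 4 scalar ports.
narrow = elements_per_cycle_split(4, VECTOR_OPS, SCALAR_OPS,
                                  vector_ports=2, scalar_ports=4)

peak_ratio = (16 * 2) / (4 * 2)   # pure vector peak: 4x
real_ratio = wide / narrow        # with scalar overhead mixed in

print(f"peak advantage: {peak_ratio:.1f}x, modeled advantage: {real_ratio:.1f}x")
```

With these made-up numbers the peak advantage is 4x but the modeled advantage is only 2x, because on the shared design half of the issue slots go to scalar work.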
|
Also yes, AMD's memory subsystem has much lower latency and higher bandwidth. Their direct-attach approach is also better than Intel's. I forgot that advantage TBH :)
However, I'd argue about L3's effect on speed. In some cases the code and the data are so small, and the computation so heavy, that you can fit almost everything into the caches. I had a 2 MB binary that needed 200 MB of memory at most, yet it completely saturated the CPU in every way imaginable.
So in some cases caches have a great effect on speed, especially if the data you're invalidating and pulling in is huge. However, if the data turns over slowly, a faster FPU always trumps a bigger cache.
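The trade-off both comments describe can be sketched with the standard roofline estimate: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity (FLOPs per byte moved). The machine figures below are hypothetical placeholders, not real AMD or Intel specifications, and the model does not represent caches directly; a bigger cache only shows up as fewer bytes fetched from memory per FLOP.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline estimate: limited by either compute or memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

# Hypothetical machines (made-up numbers for illustration only).
big_fpu = {"peak_gflops": 1000.0, "bandwidth_gb_s": 100.0}
small_fpu = {"peak_gflops": 250.0, "bandwidth_gb_s": 120.0}

# Data turns over quickly: few FLOPs per byte, so memory is the limit.
streaming_big = attainable_gflops(**big_fpu, flops_per_byte=0.5)
streaming_small = attainable_gflops(**small_fpu, flops_per_byte=0.5)

# Data turns over slowly: many FLOPs per byte, so the FPU is the limit.
heavy_big = attainable_gflops(**big_fpu, flops_per_byte=20.0)
heavy_small = attainable_gflops(**small_fpu, flops_per_byte=20.0)

print("streaming:", streaming_big, "vs", streaming_small)
print("compute-heavy:", heavy_big, "vs", heavy_small)
```

In the streaming case the machine with the smaller FPU comes out slightly ahead, and in the compute-heavy case the bigger FPU wins by its full 4x peak margin, which is the point about a faster FPU winning when the data turns over slowly.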