by yifanlu 2382 days ago
Assuming both Intel and AMD implement performance monitors the same way (i.e. the same notion of instructions executed, which may be hard to measure with speculative execution), the comparison is still flawed, because it doesn’t matter if Intel can do more instructions per cycle if AMD can produce more cycles in a given span of wall time.

> However, it is not clear whether these reports are genuinely based on measures of instruction per cycle. Rather it appears that they are measures of the amount of work done per unit of time normalized by processor frequency.

That’s precisely why nobody really uses IPC as a way to compare processors. “How much work is done per unit of time” is a much better measurement, and I guess that, for historical reasons, people conflate it with IPC.

But real textbook IPC is useless for comparison.
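
To make the point concrete, here is a minimal sketch of the usual relation wall time = instructions / (IPC × frequency). The two chips and all of their numbers are made up for illustration; they are not measurements of any Intel or AMD part, and the model assumes both chips execute the same instruction stream.

```python
def wall_time_seconds(instructions, ipc, freq_hz):
    """Time to retire `instructions` at a sustained IPC and clock frequency."""
    cycles = instructions / ipc
    return cycles / freq_hz

# Hypothetical chips (assumed numbers, not measurements).
work = 12_000_000_000  # instructions in the workload, same code path on both
chip_a = wall_time_seconds(work, ipc=2.0, freq_hz=3.0e9)  # higher IPC, lower clock
chip_b = wall_time_seconds(work, ipc=1.5, freq_hz=4.5e9)  # lower IPC, higher clock

print(f"chip A: {chip_a:.3f} s, chip B: {chip_b:.3f} s")
```

With these assumed numbers the lower-IPC chip finishes first, which is why IPC alone says little about which processor is faster.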


I think it would have been useful if the author had benchmarked the actual time taken to parse a large JSON file, and done a sanity check to make sure the time difference made sense with IPC and clock factored in.
> But real textbook IPC is useless for comparison.

It's useful for comparing architectures and the implementation thereof, to gauge the potential of one line of processors over the other.

I agree that for the customer it's not the right thing to be looking for.

It's really not useful for gauging potential. There are tradeoffs in how deeply you pipeline your architectures that'll tend to result in higher clock rates for shorter pipeline stages but higher IPC for longer pipeline stages, for instance. It's pretty easy to make a design with an IPC that'll blow everything else out of the water if it only needs to hit 100 MHz. For instance, the slower a clock cycle is, the larger you can make your caches and the fewer clock cycles it takes to read from them.

Also, on real-world benchmarks that don't fit neatly in cache, for a given chip IPC will tend to increase as you underclock it, because that will cause memory latency, measured in clock cycles, to go down.
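
A rough sketch of why that happens, under a deliberately crude stall model: DRAM latency is fixed in nanoseconds, so each miss costs more cycles at a higher clock. Every parameter value here is an assumption chosen for illustration, and the model ignores overlapping misses and prefetching.

```python
def effective_ipc(freq_ghz, core_cpi=0.5, misses_per_instr=0.01,
                  dram_latency_ns=80.0):
    """IPC when every cache miss stalls the core for the full DRAM latency."""
    miss_penalty_cycles = dram_latency_ns * freq_ghz  # ns * (cycles per ns)
    cpi = core_cpi + misses_per_instr * miss_penalty_cycles
    return 1.0 / cpi

for f in (2.0, 3.0, 4.0):
    ipc = effective_ipc(f)
    # ipc * f is instructions per nanosecond, i.e. actual throughput
    print(f"{f:.1f} GHz: IPC {ipc:.2f}, instructions/ns {ipc * f:.2f}")
```

In this model IPC falls as the clock rises even though throughput still improves, so a lower IPC at a higher clock is not by itself a sign of a worse design.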

Note that the chip with the higher specific frequency in this test, and the higher max frequency across the product line (Skylake+), gets a higher IPC here, so this kind of tradeoff isn't the obvious cause of the results here.
IPC is _usually_ a good measure for the last phase of optimization. But it is only the local Δ that is meaningful; comparing IPC across different vendors is only useful as a gross measure.
It's not even useful as a gross measure, unfortunately. Too many moving parts in the way.

Say, if you used IPC only, then you'd probably pick the latest Apple ARM CPU. Except it cannot clock as high in any of its subunits as the top AMD and Intel parts, its cache is slower, and its memory bandwidth is abysmal in comparison.

Performance in seconds or performance per watt (unit is 1/(W*s)) in the workload you want to run is useful.
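
As a small worked example of that unit: performance per watt is (1 / runtime) / power, which is the same as runs per joule. The runtimes and power draws below are invented for illustration only.

```python
def perf_per_watt(runtime_s, avg_power_w):
    """Runs per joule: (1 / runtime) / power, i.e. units of 1/(W*s)."""
    return 1.0 / (runtime_s * avg_power_w)

# Hypothetical measurements of the same workload on two machines.
fast = perf_per_watt(runtime_s=10.0, avg_power_w=120.0)    # quicker, hungrier
frugal = perf_per_watt(runtime_s=12.0, avg_power_w=65.0)   # slower, leaner

print(f"fast: {fast:.6f} /(W*s), frugal: {frugal:.6f} /(W*s)")
```

With these assumed numbers the slower machine wins on performance per watt, so the two metrics can rank the same pair of chips differently.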

You cannot even easily estimate anything using microbenchmarks anymore, since per-unit local clocking has been expanded in x86... (AMD in Zen+, expanded in Zen 2; most ARM mobile CPUs; Intel since Broadwell-E, expanded in Skylake.)

You get traps such as going for AVX and locally overheating the CPU where the SSE2 equivalent would go faster in real life. It's all funny business.

IPC also heavily favors RISC over SIMD, and is likewise biased against multicore (though not as much). What counts as an instruction anyway?

There's no "gauging potential". Would you suddenly go with OISC if it has extremely high IPC? How about old Core instead of new Skylake? Oh shoot, there is no potential in Core if it's not being made!

Even different Zen 2 CPUs have varied performance properties not just due to cores, but due to CCX count.

There is exactly one use for such a microbenchmark, and that's optimizing compilers.

Even if there were multiple implementations?

Also remember that x86-64 unlike x86 is not closed, and unlike POWER, RISC-V, ARM or MIPS is not actually well defined.

If AMD suddenly adds a new but useful instruction set like they did with 3DNow! in ancient times, or accelerates something reasonably common that way, say by adding a special SIMD conditional, where do you even start in a comparison? What if Intel actually does add a useful FPGA programmable computing capability as promised, or enhanced DMA?

I don’t see how it is flawed. The article doesn’t discuss whether the AMD CPU is faster than the Intel CPU; it discusses the claim “that the most recent AMD processors surpass Intel in terms of instructions per cycle” (https://www.guru3d.com/articles_pages/amd_ryzen_7_3800x_revi...)

And IPC, IMO, is a better measurement for a chip’s design than pure speed, as it removes the “but how good a process do you have access to” from the equation.

The article gives 2 benchmarks; I am pretty sure it is easy to mash up another benchmark with totally opposite results (e.g. a subset of SPECint). I found the author's inclusion of an obviously skewed example as proof a little bit disingenuous as well.

Having said that, in general Intel still holds a slight edge on pure IPC. However, considering the terrible track record of security issues and the abysmal price-performance ratio, a slight edge on IPC can be ignored, and I would not consider Intel for most workloads at the moment. Above all, an actual application benchmark trumps any IPC microbenchmark.

Some more realistic single core workloads at same frequencies: 3900x vs 9900k https://hothardware.com/reviews/amd-ryzen-9-3900x-vs-core-i9...
In this case the frequencies are similar and so wall clock time reflects the IPC difference (also, the two CPUs take the same code path, so the I is the same in this case, which isn't always true).
But on these processors, I believe the frequency is rarely sustained, right? Due to thermal throttling and other factors.
That only really happens on laptops, which can't dissipate as much heat as desktop systems due to size constraints. On a desktop, if you're using even AMD's stock cooler, you won't thermal throttle. That is, if you don't overclock.
Modern processors with boost configurations are rather complicated about "thermally throttling". These days with AMD's stock coolers you will be able to at least get the sticker speed on the CPU even at 100% load for a sustained time. Chances are, you'll actually get some % more speed than the sticker as it will usually continue to boost as long as power delivery and temperatures are stable. So even with an entirely stock configuration, a better motherboard and cooling system will overall net you more performance. This is without doing any traditional "overclocking" and just going with the settings designated with the CPU and motherboard. This same idea also applies to most of Intel's parts as well.
It's not about throttling. What will happen is that the CPU won't automatically clock up dynamically as much if you have worse cooling.

They behave like GPUs more and more with regards to clocks.

That's the same thing. Intel calls their stuff a dynamic boost so that some of their measurements like TDP are for lower clocks. Both CPUs end up scaling their clocks over a wide range.
>They behave like GPUs more and more with regards to clocks.

I think it's the other way around? CPUs had "boost" before GPUs.

Could be that one process spends much less time sleeping for IO thus still having the same wall clock time.

In this case, there's probably only memory IO which (afaik) cannot put a process to sleep.

There is no IO, and it is not memory intensive.
"the comparison is still flawed because it doesn’t matter if Intel can do more instruction per cycle if AMD can produce more cycles in a span of wall time."

The reason Intel had the "per core" superiority crown for years is that it had a better IPC performance due to design efficiency. Both manufacturers are pushing against the same frequency ceiling, so if you went AMD you had to significantly increase the core count to catch up, and could never match the still important single-thread performance.

We know from large scale, comprehensive benchmarks that AMD has massively picked up the pace and is neck and neck with Intel. At the same processor speed it matches the best Intel processors.

But yeah, this article is just terrible. Not just tiny, minuscule, extremely myopic benchmarks, but then a gross over-reach with conclusions. And in the way that ignorance begets ignorance, the fact that it's trending on a couple of social news sites means that now Google is surfacing it as canonical information when it's just a junk, extremely lazy analysis.

He ran a few basic tests, and showed the results. Where was the "gross over-reach"? The article ends with a "your mileage may vary" disclaimer.
"So AMD runs at 2/3 the IPC of an old Intel processor. That is quite poor!"

That is most certainly an overreach. An extraordinary overreach. Worse, it's absurdly using an AVX2 codebase, optimized for Westmere, as the baseline for "IPC" testing? The premise itself borders on gross negligence.

IPC as a generalized concept is a broad, general purpose set of instructions, not an absurdly narrow test.

Saying "Intel is faster at AVX512" is going to surprise exactly no one, and also happens to be irrelevant for the overwhelming majority of users and uses.

The microbenchmarking thing has gone on for years, and at this point anyone who has paid any attention is rightly cautious about stomping their feet and making declarations, because usually they're just pouring noise into the mix. Lazily running a couple of tiny tests does not provide the rigour needed to avoid deserved criticism.

I'm not sure if you were implying it or just using it as example of another type of unhelpful claim, but this test does not involve AVX-512.

I agree using Westmere isn't necessarily the best approach, but there is no difference in this case with either -march=native or -march=znver1.

The loop is small and simple, with only 9 instructions and compiles more or less the same regardless of march setting (I observed some basically no-op changes such as a mov and blsr swapping places). Here's the assembly (for the second test, with the bigger IPC gap):

    .top:
    tzcnt  r8,rcx
    add    r8d,edx
    mov    DWORD PTR [rdi+rax*4],r8d
    mov    eax,DWORD PTR [rsi]
    inc    eax
    blsr   rcx,rcx
    mov    DWORD PTR [rsi],eax
    jne    .top
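
For anyone who doesn't read assembly: the listing looks like the usual "decode the set bits of a word" pattern, where tzcnt finds the index of the lowest set bit, blsr clears that bit, and the loop exits once the word is zero. Below is a rough Python equivalent, offered as a sketch only. The function and parameter names are my own and are not taken from the benchmark source, and the output counter that the assembly loads and stores through rsi is replaced here by a plain list append.

```python
def decode_set_bits(word, base_index, out):
    """Append base_index + position of every set bit in `word`, lowest first."""
    while word != 0:
        lowest = word & -word                              # isolate lowest set bit
        out.append(base_index + lowest.bit_length() - 1)   # tzcnt, then add
        word &= word - 1                                   # blsr: clear lowest set bit
    return out

# Bits 0, 3 and 5 are set; positions are reported relative to base_index 64.
positions = decode_set_bits(0b101001, 64, [])
print(positions)
```

The number of loop iterations equals the number of set bits, so the work per input word depends on the data rather than on the word size.
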
"I'm not sure if you were implying it or just using it as example of another type of unhelpful claim, but this test does not involve AVX-512."

Even worse! Is this a defense? Because it's remarkably unhelpful as one.

The blog post was clearly a cry for attention for some project -- let's just use some clickbait IPC claims to gain it -- and continually alluded to a whole project -- an extreme niche project that still wouldn't have any relevance. But instead it's a meaningless, completely misrepresentative micro-loop.

My read is different than yours.

I think Daniel uses those examples because they are actual examples from projects that he is or has been working on, and he's familiar with them and actually cares about them, and because it's at least a notch more realistic than something totally synthetic.

It seems like a very roundabout thing to use as a cry for attention for SIMDjson (the project I assume you are talking about), and I don't believe that's the purpose. I see no problem in linking the project.

Picking two random benchmarks and trying to extract any kind of more general IPC claim is not on solid ground, but I'm pretty sure Daniel will say he's not doing that: he's only sharing these two specific results. That's a style that recurs across several entries in that blog, however, so if it triggers you (as it has me on occasion) you might want to look elsewhere.

Doesn't that sentence refer only to the table above, measuring "bitset decoding" with a basic decoder, comparing 1.4 to 2.1 IPC?

It would help if the blog post had some headings to separate the benchmarks and summary.

A plain reading indicates that yes, he's only referring to the last benchmark, which showed the 2/3 disparity.
A plain reading indicates that such is irrelevant, because these are the two tiny cases that he selectively chose to demonstrate the "IPC gap" of AMD. If some AMD booster posted hand-selected micro-benchmarks that gave AMD a lead, and boasted with exclamations and pejoratives how terrible the alternative is, we would rightly question it. This deserves no more.

And to the other defense of "Well there are AMD people claiming the same in reverse, so that legitimizes this", I've seen exactly zero of those posts on here. None. They would be laughed off the site.

What we do have is that traditionally, at a given frequency, per-core AMD has long trailed on major benchmarks of significant, user-realistic loads. This is the first generation in a long time where it actually doesn't, and where you don't need additional cores to make up the gap.

> surfacing it as canonical information when it's just a junk, extremely lazy analysis.

Isn't that what the Internet is for?