by jchw 2180 days ago
The problem is a lot of tasks that people want their CPU to be fast at are exactly stuff that parallelizes almost embarrassingly well. Compiling code, video rendering, compressing files. People buying CPUs for this are not as concerned about how many cycles it takes to jump through a vtable as long as it's not slow.

Meanwhile, pointing at memory latency as the flaw in Ryzen has been a popular misdirection for a while now. People have warned me about it being a performance pitfall since before I bought my first Ryzen processor. In practice it doesn’t show up as a serious issue in even the most complexity-intensive workloads. For example, Zen 2 performs very well on hardware emulation. This is possibly because what it loses in memory latency it makes up for in caching and prefetching, but honestly I don’t know and I am not sure how to measure it. In any case it certainly compares favorably to Intel’s best chips in single core workloads, even if it is not on top. Factor in price and multicore workloads and you have the exact reasons why people like me have been singing its praises... Intel’s single core lead may exist in some form, but it is not what it once was; it is not an unconditional lead where an Intel core beats an AMD core. Not even close.

None of this means Intel’s dead of course, but IMO that’s mostly because they have a lot more going on than just making the best CPU. They’ve got their dedicated GPU coming out, and plenty of ancillary technology as well. It does seem like having to take a backseat in CPUs for a while will be painful for a company like Intel; unlike for AMD, this is a new position for Intel and maybe not one they will handle well.


You can get an idea of how popular different processors are in the server space by looking at the AWS EC2 spot market. Top end Xeon server processors (C5 and Z1d) typically have much lower spot discounts than AMD EPYC based processors (r5ad), although ARM c6g instances have been pushed up in price significantly over the last few months, perhaps as people switch over to them for the per-computational-unit cost savings.

Of course, this is all a function of Amazon's supply of instances and their chosen on-demand pricing level, but the trends are certainly interesting, and show steady demand for fast Xeons and increasing demand for ARM instances. I have run some compute heavy workloads on the best AMD instances I could find on AWS and the speed difference per core for my particular workload was nearly 50%, which got worse as it scaled up to bigger instances because my workload uses a lot of L3 cache. I hear about EPYCs with 256MB of L3 cache but I can't seem to find those on AWS -- only ones with 8MB of cache.
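For anyone who wants to redo this comparison, here is a minimal sketch of what "spot discount" means as I read it: how far the spot price sits below the on-demand price. The instance names and prices are hypothetical placeholders, not real AWS figures.

```python
def spot_discount(on_demand_price: float, spot_price: float) -> float:
    """Return the spot discount as a fraction of the on-demand price."""
    if on_demand_price <= 0:
        raise ValueError("on_demand_price must be positive")
    return 1.0 - spot_price / on_demand_price


# Hypothetical hourly prices in USD, made up purely for illustration.
examples = {
    "instance-a": (0.100, 0.060),  # (on-demand, spot)
    "instance-b": (0.100, 0.030),
}

for name, (on_demand, spot) in examples.items():
    print(f"{name}: {spot_discount(on_demand, spot):.0%} discount")
```

A smaller discount suggests more competition for spare capacity, though as noted it also depends on how much supply Amazon has and where the on-demand price was set.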

Disclosure: I work at AWS on building cloud infrastructure

C6g instances only launched on June 11. I'm not sure what information can be gleaned from the spot prices regarding Arm demand at this time.

The C5a instances powered by AMD Rome processors have 192 MiB of L3 cache per socket total (16 MiB L3 slice per compute complex, 12 CCX per socket). You can observe this from the cpuid(1) output:

   L3 cache information (0x80000006/edx):
      line size (bytes)     = 0x40 (64)
      lines per tag         = 0x1 (1)
      associativity         = 0x9 (9)
      size (in 512KB units) = 0x180 (384)
384 * 512 KiB = 192 MiB

(you can download cpuid from http://www.etallen.com/cpuid.html)
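The arithmetic above can be reproduced with a small sketch. It assumes the AMD layout of CPUID leaf 0x80000006 EDX as I understand it (size in bits 31:18, associativity code in bits 15:12, lines per tag in bits 11:8, line size in bits 7:0). The EDX value is rebuilt from the fields in the quoted output rather than read from hardware, and the function name is just illustrative.

```python
def decode_amd_l3_edx(edx: int) -> dict:
    """Split CPUID leaf 0x80000006 EDX into its L3 cache fields (assumed AMD layout)."""
    size_units = (edx >> 18) & 0x3FFF  # bits 31:18, L3 size in 512 KiB units
    return {
        "line_size_bytes": edx & 0xFF,             # bits 7:0
        "lines_per_tag": (edx >> 8) & 0xF,         # bits 11:8
        "associativity_code": (edx >> 12) & 0xF,   # bits 15:12, an encoded value
        "size_512k_units": size_units,
        "size_mib": size_units * 512 // 1024,      # 512 KiB units -> MiB
    }


# Register value reconstructed from the fields shown in the cpuid output above.
edx = (0x180 << 18) | (0x9 << 12) | (0x1 << 8) | 0x40

info = decode_amd_l3_edx(edx)
print(info["size_mib"], "MiB of L3 in total")
print(info["size_mib"] // 16, "slices of 16 MiB each")
```

On a real machine the register would come from the cpuid tool or the CPUID instruction itself; this only checks that the fields and the 384 * 512 KiB figure agree.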

Thanks for the info -- I must have misinterpreted the spot pricing history chart for c6g. While you're here, does the AWS hypervisor have any means to dedicate a portion of the L3 cache to each virtualized core, or is it a free-for-all for all of the cache space (such that a noisy neighbor could potentially be evicting data held in your L2 cache or even L1 cache by thrashing the L3 cache)?
For instance families like C, M, and R, processor cores are dedicated to one instance, and the virtual processor is pinned 1:1 to the underlying logical processor. Therefore there is no neighbor that is able to use the L1 and L2 caches.

For L3 cache, we try to optimize for the best overall performance for the majority of the time. Smaller instance sizes share L3 cache with other instances. I wouldn't call it a "free for all" given some changes in how the cache hierarchy has been shifting over time (e.g., Skylake-SP L2 cache per core was increased, and the L3 cache is now 'non-inclusive').

I want my video games, email reader, word, youtube, IDE and general python code to run faster. None of those are parallelizing much of anything.
1. It is unlikely the CPU is a serious bottleneck in many of those circumstances. Even if it takes a measurable amount of time, that does not mean a faster CPU will make a meaningful improvement, or even a measurable one. If you think it will, try overclocking and measuring your gmail load times.

2. Like I said, in my experience Ryzen also competes just fine in single core. It just also decimates in multicore. I’d rather have some tasks run significantly faster than have some run very slightly faster. But that is disregarding the fact that not all tasks are the same and it does in fact win some categories. These CPU architectures are more divergent than has been usual lately.

3. Things you think aren’t parallel are. Video games using modern graphics APIs are in fact able to exploit multicore CPUs. Browsers absolutely exploit multicore CPUs. Your system in general will exploit multicore CPUs so during general usage when you are doing more things and have more software running, single core performance will be hurt less. And so on.
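A rough way to test the first point for your own code is to compare CPU time against wall-clock time. This is only a sketch: it measures the Python process it runs in (so it says nothing about a browser or an IDE), and the two task functions are stand-ins I made up for "waiting" and "computing".

```python
import time


def cpu_fraction(task) -> float:
    """Return CPU time divided by wall-clock time while running task()."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    task()
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def waiting_task() -> None:
    # Stand-in for disk or network waits: the process sleeps, the CPU is idle.
    time.sleep(0.2)


def busy_task() -> None:
    # Stand-in for real computation: spin for about 0.2 seconds.
    deadline = time.perf_counter() + 0.2
    total = 0
    while time.perf_counter() < deadline:
        total += 1


print(f"waiting task: {cpu_fraction(waiting_task):.0%} of wall time on the CPU")
print(f"busy task:    {cpu_fraction(busy_task):.0%} of wall time on the CPU")
```

If the fraction is low, the task spends most of its time waiting on something other than the CPU, and a faster core is unlikely to help much.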

Your email reader, word, youtube and IDE aren't likely to push the limits of any modern CPU, and your video game is increasingly optimized around multiple cores because modern consoles ship with multi core CPUs and they need all the performance they can get out of them. The only thing that might benefit from single core performance is probably your general python code.
Gmail and the IDE take ten seconds to load, while youtube is destroying any CPU to watch a 4k video (or 1080p on a battery saving laptop).

Youtube is possibly the single largest root cause for users upgrading laptops over the past 10 years. They made a silent transition to 60 FPS videos last year, which cut hundreds of millions of users off from watching HD.

Destroying CPU in some configurations....

https://www.youtube.com/watch?v=ef1wAfrMg5I is ~10% of 1 cpu on my desktop using chrome.

OTOH, I know what you're talking about, my linux machine hates youtube, but that's because even with the chromium freeworld fork with some codec acceleration it's still burning CPU like crazy.

So, a big part of this isn't a hardware problem so much as a software one combined with the constant fights over whose codec is the one true choice. AKA it's a youtube and !windows/android+chrome problem.

> Gmail and the IDE take ten seconds to load,

Those tasks are IO-bound, not CPU-bound.

Your concerns have no basis whatsoever.

> Youtube is possibly the single largest root cause for users upgrading laptops over the past 10 years.

No one in the whole world feels the need to upgrade to a high-end workstation because of YouTube videos.

> The problem is a lot of tasks that people want their CPU to be fast at are exactly stuff that parallelizes almost embarrassingly well. Compiling code, video rendering, compressing files.

Compiling code isn't embarrassingly parallel unless you're building some project with lots of files from scratch. Video rendering and compression also don't benefit as well as you may think:

https://www.phoronix.com/scan.php?page=article&item=3900x-39...

Meanwhile, single-threaded performance affects pretty much 100% of what you do.

In the end, I don't think there's a big difference either way.
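One common way to put numbers on "doesn't benefit as well as you may think" is Amdahl's law, which is my choice of model here and not something the benchmarks above use. The parallel fractions below are illustrative guesses, not measurements of any real compiler or encoder.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup when only parallel_fraction of the work scales with cores."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Illustrative parallel fractions; real workloads need to be measured.
for fraction in (0.50, 0.90, 0.99):
    speedups = [round(amdahl_speedup(fraction, n), 2) for n in (1, 4, 12, 64)]
    print(f"{fraction:.0%} parallel -> speedup at 1/4/12/64 cores: {speedups}")
```

The serial part caps the gain no matter how many cores are added, which is why an incremental build that touches a few files scales much worse than a full rebuild.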

> Meanwhile, pointing at memory latency as the flaw in Ryzen has been a popular misdirection for a while now.

How is it a misdirection? The data is accurate and memory latency scaling is a well-known issue for simulations such as games (which is a huge market for high end desktop CPUs and also the market 90% of reviews address), where you can't really explain the performance differences just by higher clocks. It's considered the main reason why much older Intel CPUs can still outperform Ryzen CPUs in games.

On the other hand, if you take something like Cinebench you can literally turn XMP off (thus using JEDEC timings and bus speed) and still get almost the same score (within, say, 2 %). That's because Cinebench is benchmarking pretty much only ALU throughput. That's obviously an important factor for performance, but just as obviously not the only one.
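A workload at the other extreme from Cinebench is pointer chasing, where each load depends on the previous one, so latency rather than ALU throughput sets the pace. Below is a minimal sketch of the idea. In Python the interpreter overhead dominates the timings, so treat the numbers as a rough illustration only; a serious measurement would use C or a dedicated latency tool. All function names are hypothetical.

```python
import random
import time


def single_cycle_permutation(n: int, seed: int = 0) -> list:
    """Sattolo's algorithm: a random permutation that forms one cycle of length n."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # 0 <= j < i, which guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt: list, hops: int) -> int:
    """Follow the chain for `hops` steps; each step depends on the previous one."""
    idx = 0
    for _ in range(hops):
        idx = nxt[idx]
    return idx


def ns_per_hop(n: int, hops: int = 200_000) -> float:
    """Average nanoseconds per dependent access over a working set of n entries."""
    nxt = single_cycle_permutation(n)
    start = time.perf_counter()
    chase(nxt, hops)
    return (time.perf_counter() - start) / hops * 1e9


for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} entries: {ns_per_hop(n):.1f} ns per hop")
```

The single-cycle permutation matters because it stops the chain from settling into a short loop that fits in cache, which would hide the effect being measured.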