| HN Mirror

by gameswithgo 2532 days ago

In practice, the large AMD L3s result in very good performance. The new Ryzen CPUs, for instance, absolutely crush Intel CPUs at GCC compile times because of them ( https://www.youtube.com/watch?v=CVAt4fz--bQ )

Are there workloads where the AMD suffers due to its l3 design? Maybe, but I've not seen one yet. I would imagine that for something special like that, you could try to arrange thread affinity to avoid non-local L3 accesses.
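
A minimal sketch of that thread-affinity idea, assuming Linux and Python. It groups CPUs by the L3 they share (read from sysfs) and pins the current process to one group. The helper names `parse_cpu_list`, `l3_groups` and `pin_to_one_l3` are made up for illustration, and treating `index3` as the L3 is an assumption that holds on typical x86 machines but is not guaranteed everywhere.

```python
import glob
import os


def parse_cpu_list(text):
    """Parse a Linux cpulist string such as "0-2,6-8" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def l3_groups(pattern="/sys/devices/system/cpu/cpu[0-9]*/cache/index3/shared_cpu_list"):
    """Return the distinct sets of CPUs that share a cache at index3 (assumed to be L3)."""
    groups = set()
    for path in glob.glob(pattern):
        with open(path) as f:
            cpus = parse_cpu_list(f.read())
        if cpus:
            groups.add(frozenset(cpus))
    return sorted(groups, key=min)


def pin_to_one_l3(group_index=0):
    """Pin the calling process to the CPUs of one L3 group (Linux only)."""
    groups = l3_groups()
    if not groups or not hasattr(os, "sched_setaffinity"):
        return None  # no sysfs cache info, or not a Linux build of Python
    cpus = groups[group_index % len(groups)]
    os.sched_setaffinity(0, cpus)  # 0 means "the calling process"
    return cpus
```

Calling `pin_to_one_l3(0)` before starting worker threads would keep them on cores that share one L3. Whether that helps depends on the working set fitting in that one slice.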

On my 3900x L3 latency is 10.4ns when local.

dragontamer 2532 days ago

> Are there workloads where the AMD suffers due to its l3 design?

Databases, particularly any database which benefits from more than 16MB of L3 cache.

> On my 3900x L3 latency is 10.4ns when local.

And L3 latency is >100ns when off-die. Remember, to keep memory coherent, only one L3 cache can "own" data. You gotta wait for the "other core" to give up the data before you can load it into YOUR L3 cache and start writing to it.

It's clear that AMD has a very good cache-coherence system to mitigate the problem (aka: Infinity Fabric), but you can't get around the fundamental fact that a core only really has 16MB of L3 cache.

Intel systems can have all of their L3 cache work for all of their cores, which greatly benefits database applications.

---------

AMD Zen (and Zen2) is designed for cloud-servers, where those "independent" bits of L3 cache are not really a big problem. Intel Xeon are designed for big servers which need to scale up.

With that being said, cloud-server VMs are the dominant architecture today, so AMD really did innovate here. But it doesn't change the fact that their systems have the "split L3" problem which affects databases and some other applications.

gameswithgo 2532 days ago

> Databases, particularly any database which benefits from more than 16MB of L3 cache.

Yes, but have you actually seen this measured as a net performance problem for AMD compared to Intel yet? I understand the theoretical concern.

dragontamer 2531 days ago

https://www.phoronix.com/scan.php?page=article&item=amd-epyc...

Older (Zen 1), but you can see how even an AMD EPYC 7601 (32-core) is far slower than an Intel Xeon Gold 6138 (20-core) in Postgres.

Apparently Java-benchmarks are also L3 cache heavy or something, because the Xeon Gold is faster in Java as well (at least, whatever Java benchmark Phoronix was running)

arantius 2531 days ago

What I see there is that the EPYC 7601 (first graph, second from the bottom) is much faster than the Xeon 6138 -- it's only slower than /two/ Xeons ("the much more expensive dual Xeon Gold 6138 configuration"). The 32-core EPYC scores 30% more than the 20-core Xeon.

dragontamer 2530 days ago

There are a lot of different benchmarks there.

Look at PostgreSQL, where the split-L3 cache hampers the EPYC 7601's design.

As I stated earlier: in many workloads, the split-cache of EPYC seems to be a benefit. But in DATABASES, which is one major workload for any modern business, EPYC loses to a much weaker system.

gameswithgo 2531 days ago

Thanks, perfect! I'll keep an eye on these to see how the new epycs do.

monocasa 2531 days ago

Are their L3 slices MOESI like their L2s are (or at least were)? That'd let you have multiple copies in different slices as long as you weren't mutating them.
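
To make the "multiple copies as long as you aren't mutating" point concrete, here is a toy sketch of textbook MOESI transitions for a single cache line, in Python. It is a generic simplification, not AMD's actual protocol. The function names `read` and `write` and the cache labels `ccx0`/`ccx1` are made up for illustration.

```python
def read(states, cache):
    """Apply a read by `cache`; `states` maps cache name -> MOESI state letter."""
    if states[cache] != "I":
        return dict(states)  # already holds a readable copy
    new = dict(states)
    holders = [c for c in new if c != cache and new[c] != "I"]
    for c in holders:
        if new[c] == "M":
            new[c] = "O"  # dirty holder keeps ownership but now shares the line
        elif new[c] == "E":
            new[c] = "S"  # clean exclusive copy becomes shared
    new[cache] = "S" if holders else "E"
    return new


def write(states, cache):
    """Apply a write by `cache`: every other copy is invalidated first."""
    new = {c: "I" for c in states}
    new[cache] = "M"
    return new
```

Starting from `{"ccx0": "I", "ccx1": "I"}`, two reads leave both slices in S (two copies coexist), while a write by either one forces the other back to I.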

dragontamer 2531 days ago

AMD is using MDOEFSI, according to page 15 of: https://www.hotchips.org/wp-content/uploads/hc_archives/hc29...

However, I can't find any information on what MDOEFSI is. I'm assuming:

* Modified
* Dirty
* Owned
* Exclusive
* Forwarding
* Shared
* Invalid

Anything I look up runs into an NDA firewall pretty quickly (be it in performance counters, or hardware level documentation). It seems like AMD is highly protective of their coherency algorithm.

> That'd let you have multiple copies in different slices as long as you weren't mutating them.

Seems like the D(irty) state allows multiple copies to be mutated actually. But it's still a "multiple copies" methodology. As any particular core comes up to the 8MB (Zen) or 16MB (Zen2) limit, that's all it gets. No way to have a singular dataset with 32MB of cache on Zen or Zen2.
