by dan-robertson 1368 days ago
It would be interesting to have a more detailed understanding of why these are the latencies, e.g. this repo has ‘clusters’ but there is surely some architectural reason for these clusters. Is it just physical distance on the chip or is there some other design constraint?

I find it pretty interesting where the interface that CPU makers present (e.g. a bunch of equal cores) breaks down.

4 comments

Rings are great for latency at low core counts. The LCC Intel chips all have a massive 512-bit ring bus (2x256-bit in each direction) internally, which delivers crazy fast core-to-core latency. However, this quickly starts to break down at higher core counts. Intel gets around this to some extent with its P and E cores, with 4 E cores occupying the same slot on the ring bus as a single P core. However, once you get past 14-16 slots on the ring bus, it starts to get overwhelmed.
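As a rough illustration of the slot arithmetic above, here is a minimal Python sketch. It assumes, as described, that each P core takes one ring slot and that up to 4 E cores share one slot. The function name `ring_slots` and the example core counts are hypothetical, and the sketch ignores any non-core agents that may also sit on the ring.

```python
import math


def ring_slots(p_cores: int, e_cores: int, e_per_slot: int = 4) -> int:
    """Estimate ring slots used by cores: one per P core, one per group of E cores."""
    if p_cores < 0 or e_cores < 0:
        raise ValueError("core counts must be non-negative")
    # A partially filled E-core group still occupies a whole slot.
    return p_cores + math.ceil(e_cores / e_per_slot)


# Hypothetical configuration: 8 P cores plus 16 E cores.
print(ring_slots(8, 16))  # 8 P slots + 4 E-cluster slots = 12
```

Under this model an 8P+16E layout stays below the 14-16 slot range mentioned above, while the same 24 cores as all P cores would not.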

When you switch to mesh buses, the interconnect takes up way more space, so you have to compromise between bus width and the die area spent on interconnect. Typically this means running reduced-width buses around the mesh, which limits core-to-core bandwidth. That's not a big deal if you're running a server, but it's more of a problem if you're running interactive workloads for a user. Unless of course you're Apple and just devote a truckload of die space to dump a fucking mammoth amount of interconnect between your dies.

There are also ancillary concerns like fabrication yield. For instance, AMD runs chiplets probably because they can mix and match yields, and they naturally segment the market. Get a CCX with 3 working cores? Pair it with another and you have a 6C/12T CPU. Get a CCX with 2 working cores? Pair it with another and you get a 4C/8T. Intel either gets a working die or they don't.
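The pairing described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the comment, not AMD's actual binning process. The function name `sku_from_ccx` is hypothetical, and it assumes two CCXs per CPU with the same number of working cores and 2 threads per core (SMT).

```python
def sku_from_ccx(working_cores_per_ccx: int,
                 ccx_count: int = 2,
                 threads_per_core: int = 2) -> str:
    """Return a 'cores/threads' label for CCXs paired by working core count."""
    cores = working_cores_per_ccx * ccx_count
    threads = cores * threads_per_core
    return f"{cores}C/{threads}T"


print(sku_from_ccx(3))  # 6C/12T
print(sku_from_ccx(2))  # 4C/8T
```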

The problem here is the interconnect between the CCXs is relatively slow. Dog slow compared to the ring bus. Even running the Infinity Fabric's fclock at 1.8GHz only nets you 57.6GB/sec between CCXs and five times the latency of the ring bus. When you look at a Ryzen 3300 (2x2 CCX) and a Ryzen 3300X (1x4 CCX) the difference in performance is non-trivial and that's the Infinity Fabric dragging performance down. In comparison an Intel core's L3 cache on a 3GHz ring bus (i.e. non-turbo) pulls down at 96GB/sec. Sure you're still ultimately limited by DRAM but if stuff is staying in LLC it's a hell of a performance boost. In Zen 3 AMD even went to 8 core CCXs which gave the whole thing a huge performance boost. Part of that was because the smaller lithography gave them more area to play with so they could fit everything plus the interconnects onto the chiplet size they needed.
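The two bandwidth figures above are consistent with a simple width-times-clock calculation. The sketch below assumes a transfer width of 32 bytes per cycle for both links. That width is an inference from the quoted numbers (57.6 / 1.8 = 32 and 96 / 3 = 32), not something stated in the comment, and the function name is hypothetical.

```python
def bandwidth_gb_per_s(clock_ghz: float, bytes_per_cycle: int = 32) -> float:
    """Peak bandwidth in GB/s for a link moving bytes_per_cycle bytes each clock."""
    return clock_ghz * bytes_per_cycle


# Infinity Fabric at 1.8 GHz, assuming 32 bytes per cycle.
print(round(bandwidth_gb_per_s(1.8), 1))  # 57.6
# Ring bus at 3 GHz, assuming 32 bytes per cycle.
print(round(bandwidth_gb_per_s(3.0), 1))  # 96.0
```

Under that assumption the bandwidth gap comes entirely from the clock difference. The five-times latency gap is a separate effect that this calculation does not capture.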

So yeah, I hope that little, greatly oversimplified, surface-level look was helpful.

I found this a very insightful overview of chip architectures today. Thank you for taking the time to spell this out!

I had no idea that there were 2x2 and 1x4 chips. Do you have a link that compares those in performance?

https://www.gamersnexus.net/hwreviews/3581-amd-ryzen-3-3300x...

The 3300X is consistently ahead of the 3100.

You have a typo: it's Ryzen 3100 (2x2 CCX). There is no such thing as Ryzen 3300.

Most of this variation in cross-core overhead is gone on Skylake server chips and newer because Intel moved from a ring topology to a mesh design for their L3 caches.

Just look at the processor architecture diagram.

But TL;DR modern big processors are not one big piece of silicon but basically "SMP in a box", a bunch of smaller chiplets interconnected with each other. That helps with yield (a "bad" chiplet costs you just 8 cores, not the whole 16/24/48/64-core chip). Those also usually come with their own memory controllers.

And so you basically have NUMA on a single processor, with all of the optimization challenges that come with it.

Some of it is simple distance. Some of it is architectural choices made because of the distance. A sharing domain that spans a large distance performs poorly because of the latency. Therefore domains are kept modest, but the consequence is that crossing domains carries an extra penalty.
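If you want to see these domains on your own machine, one option on Linux is to read the cache topology the kernel exposes under sysfs. The sketch below treats "CPUs sharing an L3 cache" as the sharing domain, which is an assumption on my part rather than something stated above. The function names are hypothetical, and it relies on the `level` and `shared_cpu_list` files under `/sys/devices/system/cpu/cpuN/cache/indexM/`.

```python
from pathlib import Path


def parse_cpu_list(text: str) -> list[int]:
    """Parse a Linux CPU list such as '0-3,8-11' into sorted CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def l3_domains(sysfs_root: str = "/sys/devices/system/cpu") -> list[list[int]]:
    """Group CPUs into sets that share the same L3 cache (empty if none found)."""
    domains = set()
    for level_file in Path(sysfs_root).glob("cpu[0-9]*/cache/index*/level"):
        # Only cache entries whose level is 3 are treated as L3 here.
        if level_file.read_text().strip() != "3":
            continue
        shared = level_file.with_name("shared_cpu_list").read_text()
        domains.add(tuple(parse_cpu_list(shared)))
    return sorted(list(d) for d in domains)
```

Calling `l3_domains()` on a Linux box returns one list of CPU numbers per L3 domain. Comparing latency between CPUs inside one list and CPUs in different lists should show the cross-domain penalty described above.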