by _urga 2025 days ago
The advertised bandwidth for RAM is not actually what you get per-core, which is what you care about in practice.

If you want to know the upper bound on your per-core RAM bandwidth:

64 bytes (the size of a cache line) * 10 slots (in a CPU core's LFB, or line fill buffer) / 100 ns (the typical cost of a cache miss) * 1,000,000 * 1,000 (to convert ns to ms to seconds) = 6,400,000,000 bytes per second = 5.96 GiB per second RAM bandwidth per core
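
A quick way to check the arithmetic above is to script it. This is only a sketch: the 64-byte line, 10 LFB slots and 100 ns miss cost are the round figures from this comment, not measurements, and the function name is made up for illustration.

```python
GIB = 2**30                 # bytes per GiB
NS_PER_SEC = 1_000_000_000  # nanoseconds per second

def per_core_bandwidth(line_bytes=64, lfb_slots=10, miss_latency_ns=100):
    """Upper bound in bytes/s: bytes in flight divided by the time one miss takes."""
    return line_bytes * lfb_slots * NS_PER_SEC / miss_latency_ns

bound = per_core_bandwidth()
print(f"{bound:,.0f} bytes/s = {bound / GIB:.2f} GiB/s per core")
```

Changing the assumed miss cost moves the bound in proportion, e.g. per_core_bandwidth(miss_latency_ns=200) gives half the figure.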

There's no escaping that upper bound per core.

Nanosecond RAM latencies don't help much when you're capped by the line fill buffer and queuing delay kicks in spiking your cache miss latencies. You can only fetch 10 lines at a time per core, and once you exceed your 5.96 GiB per second budget, your access times increase.

If you compare with NVMe SSD throughput plus Direct I/O plus io_uring, around 32 GiB per second, and divide that by 10 according to the difference in access latencies, then I think the author is about right on target. The point they are making is valid: it's the same order of magnitude.


While I was in the hospital ICU earlier this year, I promised myself I would build a zen 3 desktop when it came out despite my 10 year old desktop still working just fine.

I've since bought all the pieces but the CPU; they are all sold out. So I got a 6-core 3600XT in the interim. I bought fairly high binned RAM and overclocked it to 3600 MHz, and was surprised to cap out at about 36 GB/s throughput. Your 6 GiB/s per core explanation checks out for me!

Cool! I had a similar empirical experience working on a Cauchy Reed-Solomon encoder a while back, which is essentially measuring xor speed, but I just couldn't get it past 6 GiB/s per core either, until I guessed I was hitting memory bandwidth limits. Only a few weeks ago I stumbled on the actual formula to work it out!
> capped by the line fill buffer and queuing delay kicks in spiking your cache miss

Could you point me to a little reading material on this? I know what an LFB is, more or less, but what is queueing delay, and how does that relate to cache misses? Thanks.

Sure, I'm still pretty fuzzy on these things, but queueing delay is Little's law: https://en.wikipedia.org/wiki/Little's_law

It means if a system can only do X of something per second, then if you push the system past that, new arriving stuff has to wait on existing work in the queue, and things take longer than if the queue was empty. You can think of it like a traffic jam and it applies to most systems.
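
To tie that back to the per-core number, here is a rough sketch of Little's law applied to cache lines: lines in flight = lines per second * seconds per miss. The 64-byte line and 100 ns miss cost are the same round figures as in the calculation at the top of the thread, and the helper name is just for illustration.

```python
NS_PER_SEC = 1_000_000_000  # nanoseconds per second

def lines_in_flight(bandwidth_bytes_per_sec, latency_ns, line_bytes=64):
    """Little's law, L = arrival rate * time in system, counted in cache lines."""
    lines_per_sec = bandwidth_bytes_per_sec / line_bytes
    return lines_per_sec * latency_ns / NS_PER_SEC

# Sustaining 6.4e9 bytes/s at 100 ns per miss keeps about 10 lines in flight.
print(lines_in_flight(6_400_000_000, 100))
# Doubling the demand at the same latency would need about 20 in flight.
print(lines_in_flight(12_800_000_000, 100))
```

If the core really has only 10 slots, the second case can't happen at 100 ns: the extra requests wait for a free slot, and that wait is the queueing delay.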

For example, our local radio station here in Cape Town loves to talk about "queuing traffic" when they do the 8am traffic report, and I always think of Little's law.

Bufferbloat is another example of queueing delay, e.g. where you fill the buffer of your network router say with a large Gmail attachment upload and spike the network ping times for everyone else sharing the same WiFi.

Here is where I got the per-core bandwidth calculation from: https://www.eidos.ic.i.u-tokyo.ac.jp/~tau/lecture/parallel_d...

Appreciated, thanks
What about prefetching? Tiger Lake gets over 20 GB/s per core. https://www.anandtech.com/show/16084/intel-tiger-lake-review...
From your link

> In the DRAM region we’re actually seeing a large change in behaviour of the new microarchitecture, with vastly improved load bandwidth from a single core, increasing from 14.8GB/S to 21GB/s

Yeah, that's odd. But the article's really about cache, so maybe it's a mistake. Next para says

> More importantly, memory copies between cache lines and memory read-writes within a cache line have respectively improved from 14.8GB/s and 28GB/s to 20GB/s and 34.5GB/s.

so it looks like it's talking about cache not ram but... shrug

Beats me!