by E6300 3266 days ago
Are there any applications that are RAM bandwidth-bound, though? The main bottleneck is supposed to be RAM latency.

Going from single channel to dual channel offers like an 8% performance increase, IIRC. Is there any reason to expect any different with quad channel RAM?


I can chime in and say that memory bandwidth is our primary bottleneck on servers at my current company.
The only situation where I imagine that could happen is if you need to apply a small number of instructions to a massive data set that's fully loaded in memory. What sort of application are you running? If you can say, obviously.
For sufficiently large values of small. An 8-core CPU @ 3 GHz runs 24 billion clocks/sec, and in each of those a core can execute multiple instructions.

Dual-channel memory systems, assuming DDR4-2400, can do about 40 GB/sec. Threadripper is about double that (as is Skylake-X, but NOT Kaby Lake-X). The new Skylake Xeons are 6-channel (about 120 GB/sec) and the new AMD Epyc is 8-channel (about 160 GB/sec).

Assuming a perfectly sequential access pattern and something simple like a=b+c (which reads 16 bytes and writes 8 bytes) you can run 1.6 billion of those a second.

So to not be memory bound you need to run an extra 15 times more instructions... without adding any cache misses, just to execute one instruction per cycle (a fraction of the possible). If it's less you are memory bound.
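As a rough check of that arithmetic, here is a minimal sketch. The constant names are mine and the inputs are just the round figures quoted above, not measurements.

```python
# Back-of-the-envelope check of the sequential a=b+c case.
# All inputs are the round numbers from the comment, not measurements.
CORES = 8
CLOCK_HZ = 3e9                  # 3 GHz
BANDWIDTH_BYTES_PER_S = 40e9    # dual-channel DDR4-2400, roughly 40 GB/s
BYTES_PER_OP = 16 + 8           # a = b + c: read two 8-byte values, write one

cycles_per_s = CORES * CLOCK_HZ                     # 24 billion core cycles/s
ops_per_s = BANDWIDTH_BYTES_PER_S / BYTES_PER_OP    # about 1.67 billion a=b+c per second
cycles_per_op = cycles_per_s / ops_per_s            # about 14.4 cycles per a=b+c

print(f"{ops_per_s / 1e9:.2f} billion a=b+c per second")
print(f"{cycles_per_op:.1f} core cycles available per a=b+c")
```

With the unrounded 1.67 billion the ratio comes out near 14.4; rounding down to 1.6 billion gives the "15 times" above.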

Now imagine it's not perfectly sequential, and instead you have to retrieve something from memory before you know where to go next, as with a database index, binary tree, or linked list. Instead of getting 8 bytes @ 2400 MHz you get 8 bytes per 70 ns. Keep in mind that's 8 bytes per 1/2.4 ns vs. 8 bytes per 70 ns, or 168 times worse.

Suddenly, instead of needing 15 times more instructions, you need about 2500 instructions per memory load, all without an extra cache miss.
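The same kind of sketch for the dependent-load case, again using only the figures quoted in this comment (70 ns per load, DDR4-2400, and the factor of 15 from the sequential case). The names are illustrative.

```python
# Sequential vs. dependent (pointer-chasing) loads, using the comment's figures.
LATENCY_NS = 70.0             # one dependent load that misses every cache
TRANSFERS_PER_NS = 2.4        # DDR4-2400: 2.4 billion 8-byte transfers per second
EXTRA_INSTR_SEQUENTIAL = 15   # instructions per a=b+c in the sequential case

ns_per_transfer = 1 / TRANSFERS_PER_NS               # about 0.42 ns per 8 bytes
slowdown = LATENCY_NS / ns_per_transfer              # about 168
instr_per_load = EXTRA_INSTR_SEQUENTIAL * slowdown   # about 2520

print(f"dependent loads are {slowdown:.0f}x slower per 8 bytes")
print(f"about {instr_per_load:.0f} instructions needed per load to hide that")
```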

So as you can see it can be quite easy to be memory limited. Sure, some things do an amazing amount of calculation on very little data. But many things are data intensive, which justifies the large-RAM, high-memory-bandwidth machines that make up pretty much all servers shipped today. Memory bandwidth is expensive (CPU package, pins, sockets, motherboard traces, additional motherboard layers, power, etc.), but well justified in many cases.

You are again conflating high bandwidth and low latency.

> So to not be memory bound you need to run an extra 15 times more instructions... without adding any cache misses, just to execute one instruction per cycle (a fraction of the possible). If it's less you are memory bound.

Yes, if you do operations in a linear access pattern like this, the performance will be bound by the bandwidth. This is the situation I was referring to above.

> Now imagine it's not perfectly sequential, and instead you have to retrieve something from memory before you know where to go next. Like say a database index, binary tree, or linked list. Instead of getting 8 bytes @ 2400 MHz you get 8 bytes per 70 ns. Keep in mind that's 8 bytes per 1/2.4 ns vs 70 or 168 times worse.

> Suddenly instead of needing 15 times more instructions you need 2500 instructions per memory load, all without an extra cache miss.

> So as you can see it can be quite easy to be memory limited.

No, in this case you will not be limited by the bandwidth, but by the latency. Having more bandwidth will do nothing, because at 8 bytes per 70 ns you're only moving about 109 MiB/s. If 100% of the memory accesses are cache misses (they won't be) and the application uses all cores then yes, doubling the number of memory channels will double the multi-thread performance (unless channel count = core count), although the single-threaded performance will stay unchanged. Additionally, in this particular load you could get away with relatively low frequency RAM, which won't significantly affect the latency but will lower the total bandwidth (it will still be way higher than 109 MiB/s) and will be cheaper.
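For reference, the 109 MiB/s figure falls out of the numbers like this. It is a small sketch with names of my own choosing; the 40 GB/s peak is the dual-channel figure quoted earlier in the thread.

```python
# Data rate of a single chain of dependent loads: 8 bytes every 70 ns.
BYTES_PER_LOAD = 8
LATENCY_S = 70e-9
PEAK_BYTES_PER_S = 40e9     # dual-channel DDR4-2400 figure from upthread

bytes_per_s = BYTES_PER_LOAD / LATENCY_S           # about 114 million bytes/s
mib_per_s = bytes_per_s / 2**20                    # about 109 MiB/s
fraction_of_peak = bytes_per_s / PEAK_BYTES_PER_S  # roughly 0.3% of peak

print(f"{mib_per_s:.0f} MiB/s, {fraction_of_peak:.2%} of peak bandwidth")
```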

Er, when I say "memory limited" you say I'm wrong because it's latency limited not bandwidth limited. I think we are violently agreeing. Latency limited is just one specific form of memory limited.

In my testing (to my surprise) it turns out that throughput keeps increasing at up to 2 times the number of memory channels. So with 8 memory channels throughput keeps increasing at up to 16 threads, which upon reflection makes sense. Generally it takes 25-40ns to miss through L1, L2, and L3 -> memory controller. So with 16 misses and 8 channels you end up with all 8 channels busy, and 8 more misses queued and waiting in the memory controller. So your throughput approximately doubles from just 8 threads.
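One way to see why it would saturate at about twice the channel count is a toy model. This is my own simplification, not what I measured: each thread has exactly one miss outstanding, a miss spends `front_ns` getting through the caches to the memory controller with no contention, and then occupies one channel for `dram_ns`. The 35/35 split is an assumption picked from the 25-40 ns range above so the total is 70 ns.

```python
def misses_per_second(threads, channels, front_ns=35.0, dram_ns=35.0):
    """Upper bound on cache-miss throughput in a toy two-stage model.

    Assumptions (hypothetical, for illustration only):
    - each thread issues one dependent miss at a time
    - front_ns: time to miss through L1/L2/L3 to the memory controller
    - dram_ns: time a miss keeps one memory channel busy
    """
    # Each thread completes one miss per full round trip.
    thread_limit = threads / ((front_ns + dram_ns) * 1e-9)
    # Each channel completes one miss per dram_ns.
    channel_limit = channels / (dram_ns * 1e-9)
    return min(thread_limit, channel_limit)


for threads in (4, 8, 16, 32):
    rate = misses_per_second(threads, channels=8)
    print(f"{threads:2d} threads: {rate / 1e6:6.1f} million misses/s")
```

In this model throughput grows linearly up to 16 threads on 8 channels and is flat after that, which lines up with the doubling from 8 to 16 threads. Real memory controllers and per-core miss queues are more complicated than this.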

In any case, I agree that single thread performance isn't improved by multiple channels and that latency limited workloads get a small fraction of the potential memory bandwidth.

> The only situation

There are many more:

- Caches/databases that should keep as much stuff in memory as possible (e.g. if you have a 32 GB ES instance you actually gain a lot from "fast memory")

- In-Memory Databases

- In-memory RPC/message brokers, which of course have memory bandwidth as a limiting factor.

In all of these cases memory bandwidth might be an important factor.

For all the cases you mention, the critical factor is the product of the average transaction size and the transaction count per second. As long as this value is smaller than the RAM bandwidth, the application will not be RAM bandwidth-bound.
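As a sketch of that check (the function name and the example workload are hypothetical; the 40 GB/s peak is the dual-channel figure quoted elsewhere in this thread):

```python
def bandwidth_utilization(avg_transaction_bytes, transactions_per_s,
                          ram_bandwidth_bytes_per_s):
    """Fraction of RAM bandwidth consumed by the workload.

    Values well below 1.0 mean the application is not RAM bandwidth-bound.
    """
    return avg_transaction_bytes * transactions_per_s / ram_bandwidth_bytes_per_s


# Hypothetical workload: 1 million transactions/s touching 4 KiB each.
util = bandwidth_utilization(4096, 1_000_000, 40e9)
print(f"{util:.1%} of a 40 GB/s memory system")
```

Even this fairly busy made-up workload uses only about a tenth of a dual-channel system's bandwidth.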

Generally speaking, databases are kept in memory to minimize latency, not maximize throughput. Bandwidth is not really a problem. Having to update 10 GB/s of a database would be highly unusual. Having to get data from random positions in a disk or SSD is much more common.

As for the message broker, it's not clear to me why the bandwidth would "of course" be the limiting factor.

That's exactly the case.

I run dedicated gameservers. We preload "phases" which you can affect and transition to/from as a player.

Well what exactly do you mean by latency? Say 8 cores are randomly accessing memory. A quad channel system will have twice the throughput of a 2 channel because there can be twice as many cache misses being handled at once.

For this reason many generations of HEDT and server chips have had 4 channels for many years, and it's quite justified. Take a 5-year-old Opteron or Xeon for instance, or even a Sandy Bridge i7 (like the i7-3820). Sandy Bridge is 5 generations old; if 4 channels were justified then, they're definitely justified today with today's faster and more numerous cores.

The X-series is new branding, but Intel has been selling i7 chips with the LGA-2011 socket supporting 4 memory channels for years.

So sure if you are cache friendly, great, as many cores as you can fit in a socket. But many applications aren't that cache friendly.

> A quad channel system will have twice the throughput of a 2 channel because there can be twice as many cache misses being handled at once.

Sure, that's the theory, but in practice it doesn't seem to make much of a difference, at least not for dual vs. single.

There exist cache-friendly applications that see zero to minimal change with more bandwidth or more channels.

There also exist cache-unfriendly applications that see large changes with more bandwidth or more channels.

Games are generally cache friendly, and so are many easy benchmarks. But more aggressive use of a machine (which is presumably why you buy a top-spec CPU) is generally less cache friendly. Also, people notice worst-case performance much more than average or best case: audio skipping, user interface lag, etc.

You can see this effect in action when you compare single-thread performance to multithread performance using every CPU. L1 caches are generally not shared, so if it's less than N times faster for N CPUs you are seeing software overhead (the cost of synchronization), cache misses (in L1, L2, or L3), or of course main memory bottlenecks.
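A minimal sketch of that comparison, with made-up timings (the function name and the numbers are hypothetical):

```python
def parallel_efficiency(single_thread_s, n_thread_s, n):
    """Speedup on n CPUs divided by n; 1.0 means perfect scaling."""
    speedup = single_thread_s / n_thread_s
    return speedup / n


# Hypothetical run: 80 s on one CPU, 16 s on 8 CPUs.
eff = parallel_efficiency(80.0, 16.0, 8)
print(f"speedup {80.0 / 16.0:.1f}x on 8 CPUs, efficiency {eff:.1%}")
```

Anything noticeably below 100% is time going to synchronization, cache misses, or the memory system rather than to useful work.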

I've seen plenty of cases on older servers where running on all CPUs of a single socket was FASTER than all CPUs of two sockets, but that's much less common these days because each socket has its own memory system.

I can assure you that the entire server market and high-end desktop market aren't running 2 to 8 times the memory bandwidth just for fun. The bandwidth is expensive and justified.

An application being cache-unfriendly doesn't imply that it will be bandwidth-bound. If the application reads single words from random locations it will be cache-unfriendly and latency-bound. If it reads 1K contiguous bytes from random locations it will be cache-unfriendly and possibly bandwidth-bound. If it scans the entire memory space sufficiently quickly it may be both cache-friendly and still bandwidth-bound.
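To put rough numbers on the first two cases, here is a crude estimate. The model is my own assumption: each independent reader pays one full latency per block and then streams the block at peak rate. The 70 ns and 40 GB/s figures are the ones quoted upthread, and the reader count is hypothetical.

```python
def demand_bytes_per_s(block_bytes, readers, latency_ns=70.0, peak_bytes_per_s=40e9):
    """Bandwidth requested by `readers` independent random-block readers.

    Crude model (an assumption, not a measurement): each block costs one
    DRAM latency plus the time to stream it at the peak rate.
    """
    time_per_block_s = latency_ns * 1e-9 + block_bytes / peak_bytes_per_s
    return readers * block_bytes / time_per_block_s


def classify(block_bytes, readers, peak_bytes_per_s=40e9):
    """Label the workload by whether its demand reaches the peak bandwidth."""
    demand = demand_bytes_per_s(block_bytes, readers, peak_bytes_per_s=peak_bytes_per_s)
    return "bandwidth-bound" if demand >= peak_bytes_per_s else "latency-bound"


for block, readers in ((8, 8), (1024, 1), (1024, 8)):
    gb_s = demand_bytes_per_s(block, readers) / 1e9
    print(f"{block:5d}-byte reads x {readers} readers: {gb_s:5.1f} GB/s -> {classify(block, readers)}")
```

Under these assumptions single-word reads stay far below the peak even with 8 readers, while 1K reads only hit the bandwidth ceiling once several readers run at once, which is why "possibly" is the right word.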

I can't speak for the server market, but I'm certain that the high-end desktop market is composed primarily of people who do run top-of-the-line specs just for fun.

Correct, an application that reads single words from random locations will be cache unfriendly and latency bound. However, additional memory channels mean you can run more of them at once and get better throughput.

Personally I buy more cores when I can, and find that the average and best case are very similar to CPUs with fewer cores, but the worst-case performance is much better. With 8 CPUs I find that the browser, Plex, processing batches of photos, transcoding video, running a Minecraft server, and other random duties have much less of an impact on normal desktop use.

It used to be MUCH easier to be I/O bound with spinning disks, but with the new M.2 SSDs some pretty impressive I/O rates are possible (random or sequential), which makes it easier to be CPU limited.