| For sufficiently large values of small. An 8-core CPU at 3 GHz runs 24 billion clock cycles per second, and in each of them a core can execute multiple instructions. A two-channel memory system with DDR4-2400 can do about 40 GB/sec. Threadripper is about double that (as is Skylake-X, but NOT Kaby Lake-X). The new Skylake Xeons are 6-channel (about 120 GB/sec) and the new AMD Epyc is 8-channel (about 160 GB/sec). Assuming a perfectly sequential access pattern and something simple like a=b+c (which reads 16 bytes and writes 8 bytes), the two-channel system can run about 1.6 billion of those a second. So to not be memory bound you need to run an extra 15 times more instructions... without adding any cache misses, just to execute one instruction per cycle (a fraction of the possible). If it's less you are memory bound. Now imagine it's not perfectly sequential, and instead you have to retrieve something from memory before you know where to go next. Like, say, a database index, binary tree, or linked list. Instead of getting 8 bytes @ 2400 MHz you get 8 bytes per 70 ns. Keep in mind that's 8 bytes per 1/2.4 ns vs. 8 bytes per 70 ns, or 168 times worse. Suddenly instead of needing 15 times more instructions you need 2500 instructions per memory load, all without an extra cache miss. So as you can see it can be quite easy to be memory limited. Sure, some things do an amazing amount of calculation on very little data. But many things are data intensive, which justifies the large-RAM, high-memory-bandwidth machines that make up pretty much all servers shipped today. Memory bandwidth is expensive (CPU package, pins, sockets, motherboard traces, additional motherboard layers, power, etc.), but well justified in many cases. |
> So to not be memory bound you need to run an extra 15 times more instructions... without adding any cache misses, just to execute one instruction per cycle (a fraction of the possible). If it's less you are memory bound.
Yes, if you do operations in a linear access pattern like this, the performance will be bound by the bandwidth. This is the situation I was referring to above.
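For anyone who wants to check the quoted numbers, here is a rough sketch of the arithmetic. It assumes each DDR4 channel is 64 bits (8 bytes) wide and uses the exact dual-channel DDR4-2400 peak of 38.4 GB/s rather than the rounded 40 GB/s; the variable names are just mine, and real sustained bandwidth will be lower than the theoretical peak.

```python
# Back-of-the-envelope check of the sequential (bandwidth-bound) case.
# All inputs are theoretical peaks taken from the quoted post, not measurements.

CORES = 8
CLOCK_HZ = 3_000_000_000              # 3 GHz
TRANSFERS_PER_SECOND = 2_400_000_000  # DDR4-2400
BYTES_PER_TRANSFER = 8                # assumption: 64-bit wide channel
CHANNELS = 2
BYTES_PER_OP = 16 + 8                 # a = b + c: read two 8-byte values, write one

# Peak memory bandwidth across both channels (38.4 GB/s).
bandwidth_bytes_per_second = TRANSFERS_PER_SECOND * BYTES_PER_TRANSFER * CHANNELS

# How many a = b + c operations that bandwidth can feed per second.
ops_per_second = bandwidth_bytes_per_second / BYTES_PER_OP

# Total clock cycles available per second across all cores.
cycles_per_second = CORES * CLOCK_HZ

# Cycles that pass, summed over all cores, for every operation memory can feed.
cycles_per_op = cycles_per_second / ops_per_second

print(f"bandwidth:     {bandwidth_bytes_per_second / 1e9:.1f} GB/s")
print(f"ops/second:    {ops_per_second / 1e9:.2f} billion")
print(f"cycles per op: {cycles_per_op:.1f}")
```

So unless the code does roughly 15 cycles' worth of useful work per streamed operation, the cores sit waiting on memory, which is the bandwidth-bound case.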
> Now imagine it's not perfectly sequential, and instead you have to retrieve something from memory before you know where to go next. Like, say, a database index, binary tree, or linked list. Instead of getting 8 bytes @ 2400 MHz you get 8 bytes per 70 ns. Keep in mind that's 8 bytes per 1/2.4 ns vs. 8 bytes per 70 ns, or 168 times worse.
> Suddenly instead of needing 15 times more instructions you need 2500 instructions per memory load, all without an extra cache miss.
> So as you can see it can be quite easy to be memory limited.
No, in this case you will be limited not by the bandwidth but by the latency. Having more bandwidth will do nothing, because at 8 bytes per 70 ns you're only moving about 109 MiB/s. If 100% of the memory accesses are cache misses (they won't be) and the application uses all cores, then yes, doubling the number of memory channels will double the multi-threaded performance (unless channel count = core count), although the single-threaded performance will stay unchanged. Additionally, for this particular load you could get away with relatively low-frequency RAM, which won't significantly affect the latency but will lower the total bandwidth (it will still be way higher than 109 MiB/s) and will be cheaper.
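The same kind of rough sketch for the dependent-load case, to show where the 109 MiB/s and the 168x figures come from. The 70 ns latency and the factor of 15 are taken from the quoted post as given; the names are mine, and this models a single chain of dependent loads with every access missing the cache, which is a worst-case assumption.

```python
# Back-of-the-envelope check of the pointer-chasing (latency-bound) case.
# Assumes one outstanding load at a time and a cache miss on every access.

LATENCY_SECONDS = 70e-9               # one full trip to DRAM, from the quoted post
BYTES_PER_LOAD = 8                    # one pointer or one 64-bit value
TRANSFERS_PER_SECOND = 2_400_000_000  # DDR4-2400, one transfer every 1/2.4 ns
SEQUENTIAL_CYCLES_PER_OP = 15         # figure from the sequential case

# Effective throughput of a single chain of dependent loads.
chain_bytes_per_second = BYTES_PER_LOAD / LATENCY_SECONDS
mib_per_second = chain_bytes_per_second / 2**20

# How much slower one 8-byte load per 70 ns is than one per transfer slot.
slowdown = LATENCY_SECONDS * TRANSFERS_PER_SECOND

# The quoted post's "2500 instructions per memory load" is 15 scaled by that factor.
instructions_per_load = SEQUENTIAL_CYCLES_PER_OP * slowdown

print(f"dependent-load throughput: {mib_per_second:.0f} MiB/s")
print(f"slowdown vs. streaming:    {slowdown:.0f}x")
print(f"instructions per load:     {instructions_per_load:.0f}")
```

Note that the channel count never appears in this calculation. That is the point: a single dependent chain cannot use extra channels, so only lower latency or more independent chains in flight would help.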