| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by sliken 3265 days ago

For sufficiently large values of small. An 8 core CPU @ 3 GHz runs 24 Billion clocks/sec. In any of which a CPU can execute multiple instructions.

Two channel memory systems assuming DDR4@2400 can do about 40GB/sec. Thread ripper is about double that (of the skylake-x, but NOT the kabylake-x). The new skylake xeons are 6 channel (about 120GB/sec) and the new AMD Epyc is 8 channel (about 160GB/sec).

Assuming a perfectly sequential access pattern and something simple like a=b+c (which reads 16 bytes and writes 8 bytes) you can run 1.6 billion of those a second.

So to not be memory bound you need to run an extra 15 times more instructions... without adding any cache misses, just to execute one instruction per cycle (a fraction of the possible). If it's less you are memory bound.

Now imagine it's not perfectly sequential, and instead you have to retrieve something from memory before you know where to go next. Like say a database index, binary tree, or linked list. Instead of getting 8 bytes @ 2400 Mhz you get 8 bytes per 70 ns. Keep in mind that's 8 bytes per 1/2.4 ns vs 70 or 168 times worse.

Suddenly instead of needing 15 times more instructions you need 2500 instructions per memory load, all without a extra cache miss.

So as you can see it can be quite easy to be memory limited. Sure some things do an amazing amount of calculations on very little data. But many things are data intensive, which justifies the large ram and large memory bandwidth machines that make up pretty much all servers shipped today. Memory bandwidth is expensive (CPU package, pins, sockets, motherboard traces, additional motherboard layers, power, etc), but well justified in many cases.

1 comments

E6300 3265 days ago

You are again conflating high bandwidth and low latency.

> So to not be memory bound you need to run an extra 15 times more instructions... without adding any cache misses, just to execute one instruction per cycle (a fraction of the possible). If it's less you are memory bound.

Yes, if you do operations in a lineal access pattern like this the performance will be bound by the bandwidth. This is the situation I was referring to above.

> Now imagine it's not perfectly sequential, and instead you have to retrieve something from memory before you know where to go next. Like say a database index, binary tree, or linked list. Instead of getting 8 bytes @ 2400 Mhz you get 8 bytes per 70 ns. Keep in mind that's 8 bytes per 1/2.4 ns vs 70 or 168 times worse.

> Suddenly instead of needing 15 times more instructions you need 2500 instructions per memory load, all without a extra cache miss.

> So as you can see it can be quite easy to be memory limited.

No, in this case you will not be limited by the bandwidth, but by the latency. Having more bandwidth will do nothing, because at 8 bytes per 70 ns you're only moving about 109 MiB/s. If 100% of the memory accesses are cache misses (they won't be) and the application uses all cores then yes, doubling the number of memory channels will double the multi-thread performance (unless channel count = core count), although the single-threaded performance will stay unchanged. Additionally, in this particular load you could get away with relatively low frequency RAM, which won't significantly affect the latency but will lower the total bandwidth (it will still be way higher than 109 MiB/s) and will be cheaper.

sliken 3262 days ago

Er, when I say "memory limited" you say I'm wrong because it's latency limited not bandwidth limited. I think we are violently agreeing. Latency limited is just one specific form of memory limited.

In my testing (to my surprise) it turns out that throughput keeps increasing at up to 2 times the number of memory channels. So with 8 memory channels throughput keeps increasing at up to 16 threads, which upon reflection makes sense. Generally it takes 25-40ns to miss through L1, L2, and L3 -> memory controller. So with 16 misses and 8 channels you end up with all 8 channels busy, and 8 more misses queued and waiting in the memory controller. So your throughput approximately doubles from just 8 threads.

In any case, I agree that single thread performance isn't improved by multiple channels and that latency limited workloads get a small fraction of the potential memory bandwidth.