|
|
|
|
|
by hinkley
253 days ago
|
|
The operation GP is thinking of is a full scan, and that will always take n(n^(1/3)) lower bound time. Though if done right all of that latency will be occupied with computation and allow people to delude themselves into thinking it doesn’t matter. But when something is constrained two or three ways, it drastically reduces the incentive to prioritize tackling any one of the problems with anything but small incremental improvements. |
|
It doesn't. Full scans are faster than accessing each memory address in an unordered way.
Let's look at a Ryzen 2600X. You can sustain 32 bytes per second from L1, 32 bytes per second from L2, and 20 bytes per cycle from L3. That's 64KB, 512KB, and 16MB caches all having almost the same bandwidth despite very different latencies.
You can also imagine an infiniband network that fills 2 racks, and another one that fills 50000 racks. The bandwidth of a single node is the same in both situations, so even though latency gets worse as you add more nodes and hops, it's going to take exactly O(n) time for a single thread to scan the entire memory.
You can find correlations between memory size and bandwidth, but they're significantly weaker and less consistent than the correlations between memory size and latency.