| HN Mirror

It's worth noting that this GPU RAM advantage is usually coupled with a PCIe bus disadvantage, which means that you need to be able to hold a complete working set of data in the GPU long enough to really benefit from the extra bandwidth and horsepower.

If you don't have enough computations-per-byte to perform on the GPU, you will find your total job time starts to be dominated by the time it takes to stage data in and out of the GPU, without being able to keep the GPU cores busy. Even if the CPU is 5-10x slower according to issue rates and RAM bandwidth, it can keep calculating steadily with a higher duty cycle since system RAM can be much larger.

However, the CPU also benefits from locality, so you should still prefer to structure your work into block-decomposed work units if possible. A decomposition which allows you to work through a large problem as a series of sub-problems sized for a modest GPU RAM area will also let the sub-problem rise higher in the CPU cache hierarchy to get more effective throughput. However, if the decomposition adds too much sequential overhead for marshalling or final reduction of results, it may not help versus a monolithic algorithm with reasonably good vectorization/streaming access to the full data.