| HN Mirror

Actually, its pretty easy to get bandwidth-bottlenecked in GPU-compute.

I know video games don't really get bandwidth bottlenecked, but all you gotta do is perform a "Scan" or "Reduce" on the GPU and bam, you're PCIe bottlenecked. (I recommend NVidia CUB or AMD ROCprim for these kinds of operations)

CUB Device-reduce is extremely fast if the data is already on the GPU: https://nvlabs.github.io/cub/structcub_1_1_device_reduce.htm.... However, if the data is CPU / DDR4 RAM side, then the slow PCIe connection hampers you severely.

I pushed 1GB of data to device-side reduce the other day (just playing with ROCprim), and it took ~100ms to hipMemcpy the 1GB of data to the GPU, but only 5ms to actually execute the reduce. That's a PCIe-bottleneck for sure. (Numbers from memory... I don't quite remember them exactly but that was roughly the magnitudes we're talking about). That was over PCIe 3.0 x16, which seems to only push 10GBps one-way in practice. (15Gbps in theory, but practice is always lower than the specs)

Yeah, I know CPU / GPU have like 10us of latency, but you can easily write a "server" kind of CPU-master / GPU-slave scheduling algorithm to send these jobs down to the GPU. So you can write software to ignore the latency problem in many cases.

Software can't solve the bandwidth problem however. You gotta just buy a bigger pipe.