Hacker News
by daxfohl 1840 days ago
GPU L1 cache latency, IIUC, is ~20x higher than a CPU's. In fact, in this case I think the GPU would have to use its L2 cache, since the data is shared across so many cores, so now you're talking ~50x. So even if you get full parallelism of the cell computation, you can plug in the numbers and find it would be far slower than an FPGA (but still faster than a CPU).

I'm not an expert though. Maybe GPUs have some way of mitigating the high cache latencies.

1 comment

The main trick GPUs use is having a massive number of hardware threads per physical core. If an instruction in one thread stalls on a load, the core just switches to another thread. As long as there are always more runnable threads than the memory latency in cycles, that latency no longer limits your throughput.
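
To make the thread-switching argument concrete, here is a rough toy model in Python, not a description of any real GPU scheduler. It assumes a single-issue core where every instruction is a load that stalls its thread for `latency` cycles, and where switching threads is free. The function name `core_utilization` and the numbers are made up for illustration.

```python
def core_utilization(num_threads: int, latency: int, cycles: int = 10_000) -> float:
    """Fraction of cycles in which the toy core issues an instruction.

    Assumptions (hypothetical model): one issue slot per cycle, every
    instruction is a load that blocks its own thread for `latency` cycles,
    and switching to another ready thread costs nothing.
    """
    ready_at = [0] * num_threads  # first cycle at which each thread may issue again
    issued = 0
    for cycle in range(cycles):
        # Issue from the first thread that is not waiting on a load.
        for t in range(num_threads):
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + latency  # thread now stalls on its load
                issued += 1
                break
        # If no thread was ready, the issue slot is wasted this cycle.
    return issued / cycles


if __name__ == "__main__":
    for n in (1, 50, 100, 200):
        print(f"{n:>3} threads, 100-cycle latency: {core_utilization(n, 100):.2f}")
```

In this model, utilization grows linearly with the thread count until the number of threads reaches the latency in cycles, after which the core issues every cycle and the latency is fully hidden. Real workloads mix loads with arithmetic, so fewer threads may be enough in practice.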