
Yes, it really depends on your cache size.

And the problem with that is, you can't know the cache size in advance. Profiling helps, but it only tunes the code for the GPUs you profiled on.

If you want your code to run well on any GPU, the pixel-by-pixel approach usually works best. The GPU scheduler can then run as many neighboring threads as possible together on its subprocessors. Note that every subprocessor has its own local cache, which is really quick.
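
To make that concrete, here is a rough CUDA sketch of what I mean by pixel-by-pixel, assuming it means one thread per pixel on a row-major 8-bit grayscale image. The names `invert_pixels` and `invert_image`, the invert operation, and the 32x8 block shape are all made up for illustration and not tuned for any particular GPU.

```cuda
#include <cassert>
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// One thread per pixel. threadIdx.x runs along a row, so neighboring
// threads touch neighboring bytes in memory.
__global__ void invert_pixels(unsigned char* pixels, int width, int height) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // threads past the image edge do nothing
    int i = y * width + x;                  // row-major index (assumed layout)
    pixels[i] = 255 - pixels[i];
}

// Hypothetical host helper: copies the image to the GPU, inverts it there,
// and copies it back. Returns false if any CUDA call fails.
bool invert_image(std::vector<unsigned char>& image, int width, int height) {
    assert(width > 0 && height > 0);
    assert(image.size() == static_cast<std::size_t>(width) * height);

    std::size_t bytes = image.size();
    unsigned char* d_pixels = nullptr;
    if (cudaMalloc(&d_pixels, bytes) != cudaSuccess) return false;

    bool ok = cudaMemcpy(d_pixels, image.data(), bytes,
                         cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(32, 8);  // arbitrary shape, not derived from any cache size
        dim3 grid((width + block.x - 1) / block.x,    // round up so the grid
                  (height + block.y - 1) / block.y);  // covers every pixel
        invert_pixels<<<grid, block>>>(d_pixels, width, height);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(image.data(), d_pixels, bytes,
                        cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(d_pixels);
    return ok;
}
```

Nothing in there depends on a cache size. The kernel only says which pixel each thread owns, and the scheduler decides how the blocks get spread over the hardware.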