by sliverstorm
4616 days ago
I would think it depends on your cache size. Piecewise processing will be better if the whole image can't fit in the cache; if the image is small enough to fit entirely, you can process it in one pass. Or do you mean that memory is the bottleneck in the sense of shipping the image to the GPU's memory space?
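To make the cache-size point concrete, here is a rough sketch of the arithmetic. The function names and the 1 MiB cache figure are made up for illustration; a real cache size would have to be queried or measured on the target hardware.

```python
def fits_in_cache(width, height, bytes_per_pixel, cache_bytes):
    """Return True if the whole image fits in a cache of cache_bytes."""
    return width * height * bytes_per_pixel <= cache_bytes


def rows_per_tile(width, bytes_per_pixel, cache_bytes):
    """Largest number of full image rows that fits in the cache (at least 1)."""
    return max(1, cache_bytes // (width * bytes_per_pixel))


# Hypothetical numbers: a 1920x1080 RGBA image and an assumed 1 MiB cache.
CACHE = 1024 * 1024
print(fits_in_cache(1920, 1080, 4, CACHE))  # whole image: does it fit?
print(rows_per_tile(1920, 4, CACHE))        # rows per piece if it does not
```

If the first check fails, the image would be processed in horizontal strips of at most `rows_per_tile` rows each.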
The problem with that is that you can't know the cache size in advance. Profiling helps, but it only yields a local optimization for the particular GPUs you profiled.
If you want your code to run well on any GPU, the pixel-by-pixel approach usually works best: the GPU scheduler can then run as many neighboring threads as possible on each subprocessor. Note that every subprocessor also has its own local cache, which is very fast.
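As a rough illustration of the pixel-by-pixel idea, the sketch below shows in plain Python how a one-thread-per-pixel launch could map thread indices to pixels. The block/thread terminology is borrowed from CUDA-style APIs as an assumption, and the function names are hypothetical; on a real GPU this mapping would live inside a kernel.

```python
def grid_size(width, height, block_dim):
    """Number of blocks needed to cover the image (ceiling division)."""
    return ((width + block_dim[0] - 1) // block_dim[0],
            (height + block_dim[1] - 1) // block_dim[1])


def pixel_for_thread(block_idx, block_dim, thread_idx, width, height):
    """Map a (block, thread) pair to its pixel, or None if outside the image."""
    x = block_idx[0] * block_dim[0] + thread_idx[0]
    y = block_idx[1] * block_dim[1] + thread_idx[1]
    if x < width and y < height:
        return (x, y)
    return None  # padding thread: the image is not a multiple of the block size


# Neighboring threads in a block map to neighboring pixels.
BLOCK = (16, 16)  # assumed block shape, not tuned for any particular GPU
print(grid_size(1920, 1080, BLOCK))
print(pixel_for_thread((1, 0), BLOCK, (3, 2), 1920, 1080))
```

Because the code never refers to a cache size, the scheduler is free to spread the blocks over however many subprocessors the device has.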