Hacker News
by AnthonBerg 2822 days ago
An image blur is a good place to start! Read many pixels along a row in parallel, sum them with as much parallelism as you can, normalize, and write back. Repeat for the vertical blur, and here it may be best to rotate the image by a quarter turn so that vertical becomes horizontal, because memory access is usually faster that way.
1 comment

Did it a couple of times.

One trick is to write the output pixels transposed. That way both passes are identical, and both read the image linearly. The two transposes cancel each other, so the result comes out in the original orientation.
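To make the transposed-write idea concrete, here is a rough CPU-side sketch in C++. It is not GPU code and not necessarily how anyone in this thread implemented it: the `Image` struct, the function names, the box filter, and the edge clamping are all assumptions made for illustration. The one pass reads rows linearly and stores each result at the swapped coordinates, so calling it twice blurs both axes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major image of floats: width * height elements.
struct Image {
    std::size_t width;
    std::size_t height;
    std::vector<float> data;
};

// Box blur along each row (window shrinks at the edges), with the output
// written transposed: the blurred value for (x, y) is stored at row x,
// column y, so the result has width and height swapped.
Image blurRowsTransposed(const Image& in, std::size_t radius) {
    assert(in.data.size() == in.width * in.height);
    Image out{in.height, in.width, std::vector<float>(in.data.size())};
    for (std::size_t y = 0; y < in.height; ++y) {
        const float* row = in.data.data() + y * in.width;  // linear read
        for (std::size_t x = 0; x < in.width; ++x) {
            const std::size_t lo = x >= radius ? x - radius : 0;
            const std::size_t hi =
                x + radius < in.width ? x + radius : in.width - 1;
            float sum = 0.0f;
            for (std::size_t i = lo; i <= hi; ++i) sum += row[i];
            // Transposed write: row x, column y of the output.
            out.data[x * out.width + y] = sum / static_cast<float>(hi - lo + 1);
        }
    }
    return out;
}

// Two identical passes: the second one blurs what used to be the columns,
// and its transpose undoes the first, restoring the original orientation.
Image blur2D(const Image& in, std::size_t radius) {
    return blurRowsTransposed(blurRowsTransposed(in, radius), radius);
}
```

The reads stay sequential in both passes; the strided cost moves to the writes, which is the trade this trick makes.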

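Another one is to use local memory, so neighbouring work items share the pixels they load instead of each fetching them again from global memory.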

Finally, the right place for the kernel values is compiled into the code, as immediate values. Everything else is slower.
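A small sketch of what "compiled into the code" might look like, again in plain C++ rather than a GPU kernel language. The 5-tap binomial weights, the name `kWeights`, and the function `blurRow5` are assumptions for illustration. With constant weights and a fixed trip count, an optimizing compiler can unroll the inner loop and fold the weights into the instruction stream, though whether it actually does depends on the compiler and flags.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical 5-tap binomial weights (1 4 6 4 1) / 16, fixed at compile time
// instead of being read from a buffer at run time.
constexpr std::array<float, 5> kWeights = {0.0625f, 0.25f, 0.375f, 0.25f,
                                           0.0625f};

// Blur one row with the compile-time weights, clamping at the edges.
std::vector<float> blurRow5(const std::vector<float>& row) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(row.size());
    std::vector<float> out(row.size());
    for (std::ptrdiff_t x = 0; x < n; ++x) {
        float sum = 0.0f;
        // Fixed trip count of 5: a candidate for full unrolling.
        for (std::ptrdiff_t k = 0; k < 5; ++k) {
            std::ptrdiff_t i = x + k - 2;
            if (i < 0) i = 0;          // clamp to the first pixel
            if (i >= n) i = n - 1;     // clamp to the last pixel
            sum += kWeights[static_cast<std::size_t>(k)] *
                   row[static_cast<std::size_t>(i)];
        }
        out[static_cast<std::size_t>(x)] = sum;
    }
    return out;
}
```

In an OpenCL or CUDA kernel the analogous move would be writing the weights as literals in the kernel source (or generating the source with them baked in) rather than passing them in a buffer.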