by Const-me 2817 days ago
Did it a couple of times.

One trick is to write the output pixels transposed. This way both passes are identical, and both read the image linearly. The two transposes cancel each other.
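A rough CPU-side sketch of the transposed-write idea, in plain C++ rather than actual GPU code. The names `blurPassTransposed` and `blur2D` are made up for illustration, and the row-major float image, the odd-length kernel and the clamped edges are all assumptions, not something stated above.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper: one horizontal 1D convolution pass over a row-major
// width x height image, with the result stored transposed (height x width).
// Assumes an odd-length kernel and clamps reads at the image edges.
std::vector<float> blurPassTransposed(const std::vector<float>& src,
                                      int width, int height,
                                      const std::vector<float>& kernel)
{
    const int radius = static_cast<int>(kernel.size()) / 2;
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y) {
        // Each source row is read linearly.
        const float* row = src.data() + static_cast<std::size_t>(y) * width;
        for (int x = 0; x < width; ++x) {
            float sum = 0.0f;
            for (int k = -radius; k <= radius; ++k) {
                const int sx = std::min(std::max(x + k, 0), width - 1);
                sum += row[sx] * kernel[k + radius];
            }
            // Transposed store: pixel (x, y) lands at row x, column y.
            dst[static_cast<std::size_t>(x) * height + y] = sum;
        }
    }
    return dst;
}

// The same pass run twice. The second call sees a height x width image, so
// its "horizontal" filter runs along the original columns, and its
// transposed store puts the pixels back in width x height order.
std::vector<float> blur2D(const std::vector<float>& src, int width, int height,
                          const std::vector<float>& kernel)
{
    const std::vector<float> tmp = blurPassTransposed(src, width, height, kernel);
    return blurPassTransposed(tmp, height, width, kernel);
}
```

The only change between the two calls is that width and height swap places. On a GPU the transposed store is the scattered access, so this trades linear reads for strided writes.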

Another one is to use local memory.

Finally, the right place for the kernel values is compiled into the code, as immediate values. Everything else is slower.
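To illustrate what "compiled into the code" can look like, here is a minimal C++ sketch with the weights written as literals in a fully unrolled expression. The function name `blur5` and the 5-tap binomial kernel (1 4 6 4 1) / 16 are assumptions picked for the example. Whether the constants end up as true immediates depends on the compiler and the target, and in a shader the same idea would be literals in the shader source.

```cpp
// Hypothetical 5-tap filter with the weights baked in as literals.
// `center` must point at a sample that has two valid neighbours on each side.
inline float blur5(const float* center)
{
    // Symmetric taps are added first, so each weight is multiplied only once.
    return 0.0625f * (center[-2] + center[2])
         + 0.25f   * (center[-1] + center[1])
         + 0.375f  *  center[0];
}
```

There is no weight array to fetch and no loop over taps, so the compiler sees the whole expression with its constants at compile time.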