Hacker News new | ask | show | jobs
Optimizing Gaussian blurs on a mobile GPU (sunsetlakesoftware.com)
54 points by jcxplorer 4616 days ago
9 comments

The page isn't loading for me but I will chime in with an image smoothing optimization trick I've used. If you don't need an exact Gaussian blur you can approximate it very efficiently using integral images: http://en.wikipedia.org/wiki/Summed_area_table

The more "boxes" you use in your sliding window (analogous to a convolution kernel), the better you can approximate a Gaussian kernel. If you're using a big Gaussian kernel on a big image then integral images can result in a large reduction in the number of operations performed and thus potentially enable a big speed-up.

In addition to what was already there, I'd suggest looking at trying to operate on the whole image at a time rather than pixel-by-pixel. To take a simple blur example, a blur that samples the local pixel at 50%, and each of the top, bottom, left, and right by 12.5% is equivalent to drawing the entire image half darkened, then drawing it in each of one pixel to the left, right, up, and down at 12.5% of the original brightness, merging additively. I don't know if this would be faster or not, you'd have to profile it, but I'd consider it worth a try.
On the contrary. Consider that the memory is the bottleneck when performing the blur. I understand you would create five instances of the images (50%, 12.5% Left, R, Top, B). This would worsen the bottleneck even more.

Additionally, the advantage of computing pixel by pixel is that the shader can operate massively parallel.

I think I wasn't clear. I mean, send 5 textured polygons to the 3D hardware, with various darkenings and offsets on one texture (which if nothing else can be done via lighting, but there's probably other easier ways), and let it do the blending en masse. Instead of using shaders and blinding it to what you're doing, it may be able to render the polygons much faster, on an optimized path. And it may not. But it's worth a try.
I would think it is going to depend on your cache size. Piecewise will be better if the image can't all fit in the cache, but if the image is small enough you can fit everything in the cache.

Or, do you mean that memory is the bottleneck as in, shipping the image to the GPU's memory space?

Yes, it really depends on your cache size.

And the problem with that is, you can't guess the cache size. You can help yourself with profiling, but this leads to a local optimization for only some GPUs.

If you wish to run your code optimized for any GPU, the pixel-by-pixel approach usually works best. Then, the GPU scheduler can run as many neighboring threads as possible in subprocessors. Note that every subprocessor has another local cache which is really quick.

I'm surprised this isn't even mentioned in the article. I guess there are issues with boundary effects, but those can be overcome in a variety of ways.

Also, if your convolution kernel is short enough, you can beat an FFT. So, in 2D with an NxN image, an FFT filter will have a complexity of about (N^2 + N^2 Log N). If your kernel is KxK, then direct convolution will be (N^2 K^2), so you have to compare Log N with K^2. But keep in mind 1.) these are asymptotic complexities the coefficients matter 2.) K may or may not depend on N 3.) FFTs are may have larger memory requirements, although you may be able to get away with in-place FFTs, and you don't have to explicitly compute the FFT of the gaussian - you can work that out analytically (you should get another gaussian...).

Also, as someone pointed out, the gaussian is separable, so then the complexity of the direct convolution is (K N^2).

Anyway, what are FFT libraries like for iOS? I suppose with Apples policies, you can't compile FFTW for your App?

Last I played with FFT convolution it didn't end up being a win for any sane filter size, even compared to brute-force convolution with arbitrary non-separable kernels. Admittedly I was using off-the-shelf tools in Nuke, but I would imagine both their FFT and Convolve nodes are reasonably optimized.

There's an analysis of the 1D case here with some interesting cautions at the end: http://www.engineeringproductivitytools.com/stuff/T0001/PT15...

Yeah I imagine an fft solution is not the best due to non-mathmatical reasons. Kinda annoying as it was all worked out decades ago
I've done some research in doing live blurs and created my own hack solution [0]. The biggest bottleneck I've always seen people run into is doing the snapshot. Your solution looks really great, but I'm wondering how you would continuously snapshot the screen without running into issues? Or, have you found another solution.

Really great read!

[0]: http://whoisryannystrom.com/2013/09/17/Live-blur-in-iOS7/

There is also blur by fft. This way you get nlogn performance in the number of pixels. All unrelated to sigma and the size of your blur kernel.
Another opportunity for efficient blur would be to rescale the image if the blur radius is above a certain amount. This incurs overhead, so for smaller radii it might be more efficient to just blur the full-size image, but for larger radii you can first reduce the image, blur with a smaller radius, and then upscale the image.

You might need an Arbitrary Hack Constant (oh, sorry, "tuning parameter") to decide when to make that transition.

That's already in the article:

> Instead of filtering the image in its native resolution, I used OpenGL's native support for linear interpolation to downsample the source image by 4X in width and height, blurred that downsampled image, then linearly upsampled via OpenGL afterward.

Indeed it is, my bad for not RTFM
Are there other blurs that are cheaper than the Gaussian but look good enough to replace it for this sort of purpose?
I've used a triangular ("conical"?) kernel when in a hurry and it looked reasonable. Vastly better than a box blur, at least. They're actually not separable but one can pretend and just do horizontal and vertical passes without much damage.

Amusingly Autodesk ship a GLSL shader with Flame which uses a triangular kernel under the hood but labels the button in the interface "Gaussian", the cheeky swines...

The nicest option to imitate lens blur is a circular kernel but that's quite aggressively non-separable. The other thing that really helps is blurring linear-light values instead of the usual gamma-encoded images. The blurs in Windows 7's Aero themes very clearly don't do this, to my eye - dark lines spread out too much and look generally muddy.

VFX guy who loves to blur things here :)

The article mentions doing multiple passes with a box blur of different sizes. The author didn't see any performance improvement, though.
you can also downscale before blurring and upscale after blur. 4x without much noticeable difference in image quality. This gives a speed boost in two respects - less pixels and a smaller radius
This is already mentioned in the article.
You can do Gaussian filter in 2 passes: first along X, then along Y. Then you only grow linearly in size, rather than quadratically. This change should also improve your cache hits.
Mentioned in the article.