Might also want to give -Os (optimize for small code size) a go. On code that spends its time running the same small loop over and over again, this can be a big win.
[edit] Nope, definitely not better. I get O2 being a slight win over O3 and Os being much worse.
and the run time dropped from 17.5 seconds to 11.2 seconds. If I remove -funroll-all-loops, the run time jumps to 14.2 seconds. The original 17.5 seconds was measured with vanilla code using float and -O3. Interestingly enough, if you use the aforementioned flags with floats instead of doubles, the program executes in 15.01 seconds instead. Using floats is bad for performance in this case! Further, if you remove -funroll-all-loops when using floats, performance improves, but with doubles it gets worse.
So, when optimizing, play with compiler flags. Play with types. Play with whatever you have at your disposal and make no assumptions. This stuff is far more complex than believing that certain flags are better than others; it all depends on everything.
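If you want something to play with, here is a minimal sketch of a harness for this kind of experiment. It is not the program being timed above: the `kernel` function, the `REAL` and `BENCH_MAIN` macros, the file name and the problem sizes are all made up for illustration. The idea is just to pick the floating-point type at compile time, so the same source can be rebuilt with each flag/type combination and timed.

```c
/* bench.c: hypothetical micro-benchmark for comparing flags and types.
 * Example builds (file and binary names are assumptions):
 *   gcc -O2 -DBENCH_MAIN -DREAL=double bench.c -o bench_o2_double
 *   gcc -O3 -funroll-all-loops -DBENCH_MAIN -DREAL=float bench.c -o bench_o3_float
 *   gcc -Os -DBENCH_MAIN -DREAL=double bench.c -o bench_os_double
 */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#ifndef REAL
#define REAL double /* override with -DREAL=float */
#endif

/* Stand-in workload (an assumption, not the original code): a
 * multiply-accumulate over two arrays, repeated `reps` times. The single
 * accumulator gives the loop a carried dependency, so the repeated passes
 * cannot simply be skipped. */
REAL kernel(const REAL *a, const REAL *b, size_t n, int reps)
{
    REAL total = 0;
    for (int r = 0; r < reps; r++)
        for (size_t i = 0; i < n; i++)
            total += a[i] * b[i];
    return total;
}

/* Runs the kernel once on arrays of ones and returns the CPU time in
 * seconds, or -1.0 if allocation fails. The result is written to *out so
 * the computation is observable and cannot be optimized away. */
double time_kernel(size_t n, int reps, REAL *out)
{
    assert(out != NULL);
    REAL *a = malloc(n * sizeof *a);
    REAL *b = malloc(n * sizeof *b);
    if (a == NULL || b == NULL) {
        free(a);
        free(b);
        return -1.0;
    }
    for (size_t i = 0; i < n; i++) {
        a[i] = (REAL)1;
        b[i] = (REAL)1;
    }
    clock_t start = clock();
    *out = kernel(a, b, n, reps);
    clock_t end = clock();
    free(a);
    free(b);
    return (double)(end - start) / CLOCKS_PER_SEC;
}

#ifdef BENCH_MAIN /* standalone driver, enabled with -DBENCH_MAIN */
int main(void)
{
    REAL result;
    double secs = time_kernel((size_t)1 << 12, 4000, &result);
    if (secs < 0.0) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }
    printf("sizeof(REAL)=%zu result=%g time=%.3fs\n",
           sizeof(REAL), (double)result, secs);
    return 0;
}
#endif
```

Run each binary several times and compare. A toy loop like this will not necessarily rank the flags the same way a real program does, which is rather the point: measure the code you actually care about.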
It does a couple of other things, including choosing instruction sequences that are more compact, afaik. It also favours compactness over alignment and, obviously, jumps over unrolling. This isn't code that benefits terribly much from it, but it has been known to happen.