by MTGandP 4636 days ago
Can you elaborate on why -O3 often generates slower code? I've never heard that before.

It is common knowledge among GCC users. In the past using -O3 was rare because it often generated downright broken code. There used to be an official warning about that.

The situation is better nowadays but still, as far as I know, no major Linux distro uses -O3 as the default for binary packages.

-O3 can generate slower code because of the aggressive inlining and loop unrolling enabled. These optimizations are very tricky because of their effect on cache use. Basically all that extra code can push other needed code/data out of the cache, which can cause a noticeable decrease in performance.
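For anyone unfamiliar with the transformation being discussed, here is a hand-written sketch of what unrolling a loop by 4 looks like at the source level. The function names are made up for illustration, and a compiler does this on its internal representation rather than on source code, so treat it as a picture of the trade-off only: fewer branches per element, but a bigger loop body that takes up more instruction cache.

```cpp
#include <cstddef>
#include <vector>

// Straightforward loop: one add, one compare, and one branch per element.
long sum_rolled(const std::vector<int>& v) {
    long total = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}

// Same computation unrolled by 4 (hypothetical hand-written version):
// one compare and branch per four elements, at the cost of a larger body
// plus a cleanup loop for the leftover elements.
long sum_unrolled4(const std::vector<int>& v) {
    long total = 0;
    const std::size_t n = v.size();
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        total += v[i];
        total += v[i + 1];
        total += v[i + 2];
        total += v[i + 3];
    }
    for (; i < n; ++i)  // remainder when n is not a multiple of 4
        total += v[i];
    return total;
}
```

Both functions return the same sum; whether the unrolled form is actually faster depends on the surrounding code and the target CPU.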

I think it's 'common knowledge' which has outlived its relevance, as I can't recall the last time I found -O2 outperforming -O3.

Practically every performance oriented open source program I come across also defaults to -O3 these days, or sometimes -Ofast which also enables -ffast-math.

>-O3 can generate slower code because of the aggressive inlining and loop unrolling enabled

-O3 turns on vectorization and inlining optimizations but I can't recall any loop unrolling options which are turned on at -O3.

-funroll-loops is not turned on at any of the -O levels (including -O3), because it is one of the hardest optimizations to get right without any runtime data as a basis (which is why the only option that turns it on is PGO, profile-guided optimization).

Note that I'm talking about modern versions of GCC. If you are using GCC 4.2.1 on OS X, then this (-O2 > -O3) may still typically be the case.
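If you want a build to report how it was compiled, GCC's predefined macros give a rough answer. A minimal sketch, assuming GCC or a compatible compiler; the function name is made up for illustration. Note the limits: `__OPTIMIZE__` only says that some -O level above -O0 was used, so this cannot tell -O2 from -O3.

```cpp
#include <string>

// Summarize optimization-related predefined macros (GCC and compatible
// compilers). build_flags_summary is a hypothetical helper name.
std::string build_flags_summary() {
    std::string s;
#ifdef __OPTIMIZE__
    s += "optimize=on";       // defined whenever optimization is enabled
#else
    s += "optimize=off";
#endif
#ifdef __OPTIMIZE_SIZE__
    s += " size=on";          // defined when optimizing for size (-Os)
#else
    s += " size=off";
#endif
#ifdef __FAST_MATH__
    s += " fast-math=on";     // defined with -ffast-math, which -Ofast implies
#else
    s += " fast-math=off";
#endif
    return s;
}
```

For the full list of individual optimizations enabled at a given level, GCC can print it directly with `gcc -Q --help=optimizers -O3` (and the same with -O2 for comparison).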

>The situation is better nowadays but still, as far as I know, no major Linux distro uses -O3 as the default for binary packages.

I'd say they typically use the upstream optimization settings.

>I think it's 'common knowledge' which has outlived its relevance, as I can't recall the last time I found -O2 outperforming -O3.

I can, was about 4 months ago with GCC 4.8.0.

>practically every performance oriented open source program I come across also defaults to -O3 these days

How large is your sample size there? I have only seen -O3 in the default makefiles of audio/video encoders. Those tend to be a natural fit for -O3. In contrast, here is the current makefile of my favorite "performance oriented" FOSS program:

http://repo.or.cz/w/luajit-2.0.git/blob_plain/HEAD:/src/Make...

    CCOPT= -O2 -fomit-frame-pointer
    # Note: it's no longer recommended to use -O3 with GCC 4.x.
    # The I-Cache bloat usually outweighs the benefits from aggressive inlining.

>I can't recall any loop unrolling options which are turned on at -O3.

You are right (I just looked it up). Guess my memory failed me there.

>I'd say they typically use the upstream optimization settings

I wish! Packagers love to fool around with the upstream sources and makefiles to make them conform to whatever "standards" they have.

>How large is your sample size there? I have only seen -O3 in the default makefiles of audio/video encoders. Those tend to be a natural fit for -O3

Well, I very much implied 'performance-oriented' programs, as we were discussing 'performance' generated by compiler options, and those programs are indeed a natural fit for -O3.

For which my 'sample size' would be software like encoders, archivers, emulators, 3d renderers etc.

Obviously there's little point in using -O3 on your text editor (yes, extreme example). Basically, for any non-performance-oriented software, -O3 will likely only serve to increase the binary size, as any potential gains will be unnoticeable.

>I wish! Packagers love to fool around with the upstream sources and makefiles to make them conform to whatever "standards" they have.

Not really my experience with Arch packages, but of course I haven't looked at the PKGBUILDs for even 1% of all available packages, only those for the performance-oriented packages on which I rely.

-O3 inlines functions and unrolls loops more aggressively, so the increased code size might not fit in the CPU cache.
Fair enough... let me do a quick test with -O2 and see how that fares
Might want to also give a go at -Os (optimize for small code size). On code that spends its time iterating on the same code over and over again this can be a big win.

[edit] Nope, definitely not better. I get O2 being a slight win over O3 and Os being much worse.

Compiler optimization flags are very code and type specific.

(Note that I am comparing apples to oranges here, I used the C++ code used in Rust experiments found here: https://github.com/huonw/card-trace/blob/master/original.cpp )

I changed the C++ version's typedef float f to typedef double f, so using doubles instead of floats, and compiled with the following flags:

    -m64 -march=corei7-avx -mtune=corei7-avx -Ofast -funroll-all-loops
and the run time dropped from 17.5 seconds to 11.2 seconds. If I remove -funroll-all-loops, the run time jumps to 14.2 seconds. The original 17.5 seconds was measured with the vanilla code using float and -O3. Interestingly enough, if you use the aforementioned flags with floats instead of doubles, the program executes in 15.01 seconds instead. Using floats is bad for performance! Further, if you remove -funroll-all-loops when using floats, the performance increases, but with doubles it decreases.

So, when optimizing, play with compiler flags. Play with types. Play with whatever you have at your disposal and make no assumptions. This stuff is far more complex than believing that certain flags are better than others; it all depends on everything.
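A minimal sketch of how such a comparison could be set up, assuming a C++11 compiler. The kernel below is a made-up stand-in (a sum of square roots), not the card-trace code, and the names are hypothetical. The point is only that making the element type a template parameter lets one binary time both float and double under whatever flags it was built with.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical stand-in kernel: sum of sqrt(1..n), computed in type F.
template <typename F>
F kernel(std::size_t n) {
    F acc = 0;
    for (std::size_t i = 1; i <= n; ++i)
        acc += std::sqrt(static_cast<F>(i));
    return acc;
}

// Run kernel<F>(n) once and return the elapsed wall-clock time in seconds.
// The result is stored through 'out' so the caller can use or print it.
template <typename F>
double time_kernel(std::size_t n, F* out) {
    const auto start = std::chrono::steady_clock::now();
    *out = kernel<F>(n);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

Call `time_kernel<float>` and `time_kernel<double>` from a `main` with a large `n`, print both times and results, then rebuild with each flag set you want to compare (for example -O2, -O3, and the flags above) and run each build several times, since a single timing is noisy.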

So it totally disables all loop unrolling, inlining... hmm
Does a couple of other things, including choosing instruction sequences that are more compact afaik. But also favouring compactness over alignment and obviously jumps over unrolling. Obviously this isn't code that benefits terribly much from it, but it has been known to happen.