Hacker News
by infberg 1712 days ago
Care to explain? -O3 generates larger code than -O2?

Yes, -O3 tends to enable a lot of optimizations that increase code size, like aggressive loop unrolling. If you are jumping around a large amount of code, -O3 generally performs worse than -O2, but if you are running a tight loop (like HPC code), -O3 is better.
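
To make the size trade-off concrete, here is a minimal sketch in C of the same reduction written rolled and hand-unrolled by four. The function names and the unroll factor are made up for illustration, and this is not a claim about what any particular compiler emits at -O3; it only shows why an unrolled loop takes more instruction bytes for the same work.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one add and one branch per element, small loop body. */
long sum_rolled(const int *a, size_t n) {
    assert(n == 0 || a != NULL);
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Hand-unrolled by 4 (illustrative factor): one branch per four elements,
   but roughly four times the loop body plus a remainder loop. */
long sum_unrolled4(const int *a, size_t n) {
    assert(n == 0 || a != NULL);
    long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    long s = s0 + s1 + s2 + s3;
    /* Remainder: handles the last n % 4 elements. */
    for (; i < n; i++)
        s += a[i];
    return s;
}
```

Both functions return the same result; whether the second is faster depends on whether the extra code still fits comfortably in the instruction caches of the surrounding program.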

I once worked on a very performance-sensitive codebase that was also limited in scope. We compiled with -Os and did all the loop optimizations we wanted manually (and with pragmas). That produced faster code than -O2 or -O3.

Regarding unrolling, in GCC -O3 enables -floop-unroll-and-jam but not -funroll-loops. You may want one or the other, maybe both, depending on circumstances. I don't see much benefit from the available pragmas on HPC-type code except for OpenMP, and "omp simd" isn't necessary to get vectorization in the places I've seen people say it is. Mileage always varies somewhat, of course. (Before second-guessing anything, use -fopt-info.)
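
A rough sketch of what that looks like in practice, assuming GCC 8 or later: a plain loop with `restrict` and no pragma, which can be checked with -fopt-info, and a loop carrying GCC's per-loop unroll pragma. The function names, the unroll factor, and the file name in the comment are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* To see what the optimizer did, compile with something like:
     gcc -O3 -fopt-info-vec-optimized -c kernels.c
   and use -fopt-info-vec-missed to see loops it gave up on.
   (kernels.c is a placeholder file name.) */

/* y[i] += alpha * x[i]. No "omp simd" here: restrict tells the compiler
   the arrays do not overlap, which is often what the vectorizer needs. */
void axpy(size_t n, float alpha, const float *restrict x, float *restrict y) {
    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Per-loop unroll request, GCC-specific. Other compilers may ignore it
   or warn about an unknown pragma. */
float dot(size_t n, const float *x, const float *y) {
    assert(n == 0 || (x != NULL && y != NULL));
    float s = 0.0f;
    #pragma GCC unroll 4
    for (size_t i = 0; i < n; i++)
        s += x[i] * y[i];
    return s;
}
```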
Modern x86 CPUs have micro-op caches that hold small loops (about 50 instructions) and medium loops (~2k instructions). Also, the bottleneck is usually instruction decoding (Alder Lake made huge changes there, so this might change).

In other words, loop unrolling is, more often than not, harmful.

It’s a shame that -Os can sometimes produce truly awful code. There are a few optimisations in there that trade a byte for a massive slowdown.
You asked for minimum size, and that's what you got. I'd say that's working as it should.

More granular control over optimisation would be good, however.

Probably just some tweaks to -O2 would be enough. After all, people are selecting -Os over -O2 because they see better performance, and that should not be happening.
You can enable/disable individual optimizations. How much more granular do you need?
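
For what it's worth, a sketch of per-function control in GCC using the `optimize` function attribute. The macro and function names are invented for illustration, and the GCC manual describes this attribute as intended for debugging rather than production use, so treat it as an experiment and not a recommendation. On other compilers the macros below expand to nothing.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper macros; the attribute is only applied on GCC. */
#if defined(__GNUC__) && !defined(__clang__)
#define OPT_SIZE   __attribute__((optimize("Os")))
#define OPT_UNROLL __attribute__((optimize("unroll-loops")))
#else
#define OPT_SIZE
#define OPT_UNROLL
#endif

/* Branchy helper: ask for size optimization in this function only. */
OPT_SIZE int classify(int c) {
    if (c >= '0' && c <= '9') return 1;
    if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')) return 2;
    if (c == ' ' || c == '\t' || c == '\n') return 3;
    return 0;
}

/* Tight loop: ask for -funroll-loops in this function only. */
OPT_UNROLL void scale(int *a, size_t n, int k) {
    assert(n == 0 || a != NULL);
    for (size_t i = 0; i < n; i++)
        a[i] *= k;
}
```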
Surely a profile-guided build should be able to only apply -Os to those functions where it doesn't cause a lot of problems.
In the application I referred to, PGO was also used. However, PGO only applies -Os to cold code, and if what you're doing is very branchy, -Os can help even in the hot path.