| There are a lot of things modern compilers do that give bigger memory usage and larger code size, and that we could try to address.

* The stack is always kept aligned on a 16-byte boundary. The ABI requires this for external calls, but LTCG could drop it for internal calls and align the stack only when SSE is needed. That may be slightly more expensive than keeping the stack constantly 16-byte aligned, but it avoids wasting a lot of stack, so it may very well be faster overall simply because it uses less cache.
* No push and pop. The compiler reserves the needed stack space (even for function calls) in the prologue and accesses the stack with mov and lea instead. The full mov/lea instructions with mod/rm+sib take up far more bytes than simple push and pop, but apparently it's faster.
* Inefficient instructions are replaced with more efficient instructions. For example, for a simple x % 19 gcc will generate no fewer than 16 instructions instead of a single div/idiv. This is probably still faster, but it may still be detrimental if it's not in a hot path. It should be noted that gcc emits this even at -O0.
* Multiple versions of code for copying, scanning or comparing arrays, to handle different alignments. This seems quite stupid, as there isn't even any penalty for unaligned accesses on modern x86 CPUs except in some very specific circumstances[0].

These are all micro-optimizations for getting the absolute maximum performance out of tiny programs containing only hot code. However, in reality programs rarely look like that, and the increased code size and stack usage cost more than they give. Profile-guided optimization is probably the way to go here, but distributed binaries have rarely if ever been compiled with PGO. Also, I have no idea whether PGO on modern compilers actually drops these size-increasing optimizations on non-hot code paths.

[0]: http://lemire.me/blog/2012/05/31/data-alignment-for-speed-my... |
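To make the x % 19 point concrete, here is a rough C sketch of the kind of multiply-and-shift sequence a compiler substitutes for a divide. It assumes an unsigned 32-bit x (the signed case needs extra sign fix-ups, which is part of why the real output is longer), and `mod19` is a hypothetical name. The constant is ceil(2^37 / 19) - 2^32; this illustrates the general technique, not gcc's exact instruction sequence.

```c
#include <stdint.h>

/* Sketch: x % 19 without a div instruction, for unsigned 32-bit x.
 * The full multiplier ceil(2^37 / 19) = 7233629131 needs 33 bits,
 * so only its low 32 bits are used here and the missing 2^32 * x
 * term is added back with an overflow-free add-and-shift. */
static uint32_t mod19(uint32_t x)
{
    const uint32_t magic = 0xAF286BCBu;                    /* ceil(2^37 / 19) - 2^32 */
    uint32_t t = (uint32_t)(((uint64_t)x * magic) >> 32);  /* high half of the product */
    uint32_t q = (((x - t) >> 1) + t) >> 4;                /* q = floor(x / 19) */
    return x - q * 19u;                                    /* remainder = x - 19 * q */
}
```

Even this simplified version is a multiply, several shifts and adds, and a second multiply, which is where the code-size cost relative to a single div comes from.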
Does it emit it at -Os?