

by poizan42 3422 days ago

There are a lot of things modern compilers do that increase memory usage and code size, and that we could try to address.

* The stack is always kept aligned to a 16-byte boundary. The ABI requires this for external calls, but LTCG could drop it for internal calls and align the stack only when SSE is needed. That may be slightly more expensive than keeping the stack 16-byte aligned at all times, but it avoids wasting a lot of stack, so it may very well be faster overall simply because it uses less cache.

* No push and pop: the needed stack space (even for function calls) is reserved in the prologue, and the stack is accessed with mov and lea instead. The full mov/lea instructions with mod/rm+sib take up far more bytes than simple push and pop, but apparently it's faster.

* Inefficient instructions are replaced with more efficient instructions. For example gcc will for a simple x % 19 generate no less than 16 instructions instead of a single div/idiv. This is probably still faster, but it may still be detrimental if it's not in a hot path. It should be noted that gcc emits this even at -O0.

* Multiple versions of the code for copying, scanning or comparing arrays, to handle different alignments. This seems quite stupid, as there isn't even any penalty for unaligned accesses on modern x86 CPUs except in some very specific circumstances[0].

These are all micro-optimizations for getting the absolute maximum performance out of tiny programs containing only hot code. In reality, however, programs rarely look like that, and the increased code size and stack usage cost more than they gain. Profile-guided optimization is probably the way to go here, but distributed binaries have rarely if ever been compiled with PGO. Also, I have no idea whether PGO on modern compilers actually drops these size-increasing optimizations on non-hot code paths.

[0]: http://lemire.me/blog/2012/05/31/data-alignment-for-speed-my...
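Short of PGO, one way to approximate this by hand is to mark rarely executed functions explicitly. A minimal sketch, assuming GCC: its documentation describes the `cold` function attribute as optimizing the function for size rather than speed. Whether that actually turns the x % 19 expansion back into a single idiv depends on the compiler version and flags, so the generated assembly would need checking. The function names here are hypothetical.

```c
#include <assert.h>

/* Hypothetical names; the attribute syntax assumes GCC (Clang accepts it too). */

/* Hot path: left to the compiler's default, speed-oriented code generation. */
int mod19_hot(int x)
{
    return x % 19;
}

/* Cold path: GCC documents `cold` as optimizing the function for size. */
__attribute__((cold)) int mod19_cold(int x)
{
    return x % 19;
}

/* Both versions must return the same value; only the emitted code may differ. */
void check_same_result(int x)
{
    assert(mod19_hot(x) == mod19_cold(x));
}
```

Compiling this with `gcc -O2 -S` and comparing the two function bodies would show whether the attribute makes a difference on a given compiler.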


aaronmdjones 3422 days ago

> * Inefficient instructions are replaced with more efficient instructions. For example gcc will for a simple x % 19 generate no less than 16 instructions instead of a single div/idiv. This is probably still faster, but it may still be detrimental if it's not in a hot path. It should be noted that gcc emits this even at -O0.

Does it emit it at -Os ?


poizan42 3420 days ago

Curiously not.

-O0:

    main:
    .LFB0:
            .cfi_startproc
            pushq   %rbp
            .cfi_def_cfa_offset 16
            .cfi_offset 6, -16
            movq    %rsp, %rbp
            .cfi_def_cfa_register 6
            subq    $16, %rsp
            movl    %edi, -4(%rbp)
            movq    %rsi, -16(%rbp)
            movl    -4(%rbp), %ecx
            movl    $1808407283, %edx
            movl    %ecx, %eax
            imull   %edx
            sarl    $3, %edx
            movl    %ecx, %eax
            sarl    $31, %eax
            subl    %eax, %edx
            movl    %edx, %eax
            sall    $3, %eax
            addl    %edx, %eax
            addl    %eax, %eax
            addl    %edx, %eax
            subl    %eax, %ecx
            movl    %ecx, %edx
            movl    %edx, %esi
            movl    $.LC0, %edi
            movl    $0, %eax
            call    printf
            movl    $0, %eax
            leave
            .cfi_def_cfa 7, 8
            ret
            .cfi_endproc

-Os:

    main:
    .LFB13:
            .cfi_startproc
            pushq   %rax
            .cfi_def_cfa_offset 16
            movl    %edi, %eax
            movl    $19, %ecx
            cltd
            movl    $.LC0, %esi
            movl    $1, %edi
            idivl   %ecx
            xorl    %eax, %eax
            call    __printf_chk
            xorl    %eax, %eax
            popq    %rdx
            .cfi_def_cfa_offset 8
            ret
            .cfi_endproc

So with all optimizations disabled it kind of performs an optimization that it doesn't perform when optimizing for size. Or rather, the default codegen is the optimized version. Interesting.
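For anyone wondering what the -O0 listing actually computes, here is a rough C transcription of the multiply-and-shift sequence. It is a sketch, not gcc's internal algorithm: the function names are made up, and it assumes that right-shifting a negative signed value is an arithmetic shift (true for gcc on x86, but implementation-defined in C).

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Remainder of x / 19 without a div instruction, mirroring the -O0 listing. */
int mod19_mulshift(int x)
{
    /* 1808407283 == ceil(2^35 / 19), the constant loaded into %edx. */
    int64_t prod = (int64_t)x * 1808407283LL;

    /* imull leaves the high 32 bits in %edx; sarl $3 makes the total shift 35.
       Assumes an arithmetic right shift for negative values. */
    int q = (int)(prod >> 35);

    /* x >> 31 is -1 for negative x, so this adds 1 and makes the
       quotient truncate toward zero, as C division does. */
    q -= (x >> 31);

    /* The listing builds 19*q as ((q*8 + q) * 2) + q, then subtracts it. */
    return x - q * 19;
}

/* Compare against the compiler's own % over a sample of inputs. */
void check_mod19(void)
{
    for (int x = -100000; x <= 100000; ++x)
        assert(mod19_mulshift(x) == x % 19);
    assert(mod19_mulshift(INT_MAX) == INT_MAX % 19);
    assert(mod19_mulshift(INT_MIN) == INT_MIN % 19);
}
```

The multiply by a precomputed reciprocal stands in for the division, and the remainder falls out as x - 19*q, which is why the sequence is longer than the idivl version but avoids the slow divide.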
