by nailer 3422 days ago
Other tech folk have always talked about 32 bit support as a necessary evil since smaller types mean less memory.

The complexity of managing a secondary 32 bit environment has been worse than the memory usage of 64 bit apps for a very, very long time.


That need has been met by the x32 ABI for some time now. It combines some of the best parts of the x86_64 arch with the lower memory consumption of 32-bit pointers (still limited to 4gb max memory though)

https://en.wikipedia.org/wiki/X32_ABI

Does anyone use the x32 ABI though? I once asked and only crickets answered. I'm experimenting with my own Linux distribution and was wondering if it is worth the time investment.

> (still limited to 4gb max memory though)

4GB max address space per process. It should support much more memory through PAE [1], which makes things more reasonable.

[1]: https://en.wikipedia.org/wiki/Physical_Address_Extension

Modern compilers do a lot of things that increase memory usage and code size, and that we could try to address.

* The stack is always kept aligned to a 16-byte boundary. The ABI requires this for external calls, but LTCG could drop it for internal calls and align the stack only when SSE needs it. This may be slightly more expensive than keeping the stack constantly 16-byte aligned, but it avoids wasting a lot of stack, so it may very well be faster overall simply because of the smaller cache footprint.

* No push and pop: the compiler reserves the needed stack space (even for function calls) in the prologue and accesses the stack with mov and lea instead. The full mov/lea instructions with mod/rm+sib take up far more bytes than simple push and pop, but apparently it's faster.

* Inefficient instructions are replaced with more efficient instructions. For example gcc will for a simple x % 19 generate no less than 16 instructions instead of a single div/idiv. This is probably still faster, but it may still be detrimental if it's not in a hot path. It should be noted that gcc emits this even at -O0.

* Multiple versions of code for copying, scanning or comparing arrays, to handle different alignments. This seems quite stupid, as there isn't even any penalty for unaligned accesses on modern x86 CPUs except in some very specific circumstances [0].

These are all micro-optimizations for getting the absolute maximum performance out of tiny programs containing only hot code. In reality, however, programs rarely look like that, and the increased code size and stack usage cost more than they gain. Profile-guided optimization is probably the way to go here, but distributed binaries have rarely if ever been compiled with PGO. Also, I have no idea whether PGO actually drops these size-increasing optimizations on non-hot code paths in modern compilers.

[0]: http://lemire.me/blog/2012/05/31/data-alignment-for-speed-my...

> * Inefficient instructions are replaced with more efficient instructions. For example gcc will for a simple x % 19 generate no less than 16 instructions instead of a single div/idiv. This is probably still faster, but it may still be detrimental if it's not in a hot path. It should be noted that gcc emits this even at -O0.

Does it emit it at -Os ?

Curiously not.

-O0:

    main:
    .LFB0:
            .cfi_startproc
            pushq   %rbp
            .cfi_def_cfa_offset 16
            .cfi_offset 6, -16
            movq    %rsp, %rbp
            .cfi_def_cfa_register 6
            subq    $16, %rsp
            movl    %edi, -4(%rbp)
            movq    %rsi, -16(%rbp)
            movl    -4(%rbp), %ecx
            movl    $1808407283, %edx
            movl    %ecx, %eax
            imull   %edx
            sarl    $3, %edx
            movl    %ecx, %eax
            sarl    $31, %eax
            subl    %eax, %edx
            movl    %edx, %eax
            sall    $3, %eax
            addl    %edx, %eax
            addl    %eax, %eax
            addl    %edx, %eax
            subl    %eax, %ecx
            movl    %ecx, %edx
            movl    %edx, %esi
            movl    $.LC0, %edi
            movl    $0, %eax
            call    printf
            movl    $0, %eax
            leave
            .cfi_def_cfa 7, 8
            ret
            .cfi_endproc

-Os:

    main:
    .LFB13:
            .cfi_startproc
            pushq   %rax
            .cfi_def_cfa_offset 16
            movl    %edi, %eax
            movl    $19, %ecx
            cltd
            movl    $.LC0, %esi
            movl    $1, %edi
            idivl   %ecx
            xorl    %eax, %eax
            call    __printf_chk
            xorl    %eax, %eax
            popq    %rdx
            .cfi_def_cfa_offset 8
            ret
            .cfi_endproc

So it kinda performs an optimization when disabling all optimizations that it doesn't do when optimizing for size. Or well, the default codegen is the optimized version. Interesting.
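
For anyone wondering what the -O0 sequence is doing: it looks like the usual multiply-by-reciprocal trick, where 1808407283 is 2^35 / 19 rounded up. A rough C sketch of the same arithmetic follows. The function name `mod19_magic` is made up for illustration, and the code assumes that right-shifting a negative signed value is an arithmetic shift (true for gcc, but implementation-defined in C).

```c
#include <stdint.h>

/* Hypothetical helper mirroring the imull/sarl/subl sequence in the -O0 listing.
 * Assumption: >> on a negative int64_t is an arithmetic (sign-preserving) shift. */
static int32_t mod19_magic(int32_t x)
{
    /* 1808407283 == ceil(2^35 / 19); the 64-bit product plays the role of edx:eax. */
    int64_t prod = (int64_t)x * 1808407283LL;

    /* High 32 bits shifted right by 3 more, i.e. a total shift of 35: floor(x / 19)
     * for non-negative x, one too low for negative x. */
    int32_t q = (int32_t)(prod >> 35);

    /* Correct negative inputs so the quotient truncates toward zero,
     * matching the "sarl $31 / subl" pair. */
    q += (x < 0);

    /* remainder = x - 19 * q; the listing builds 19 * q from shifts and adds. */
    return x - q * 19;
}
```

Under that assumption, `mod19_magic(x)` should agree with `x % 19` for every 32-bit `x`, which is why the compiler is free to pick either form.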
x86 isn't just 32-bit. Linux distros have been reluctant to upgrade their notion of 32-bit x86 to require SSE2. This means that x86 isn't just 32-bit but involves weird 387 floating-point math when other mainstream architectures use IEEE floating-point math.

Interestingly however, Linux distros have been quick to adopt i686. I remember setting up a router on a small PC with a VIA processor several years ago. I inserted a Debian hard drive into that box and got "cmov instruction not supported" instead of a boot. The i586 version existed of course, but wasn't the default download IIRC.

Perhaps they're targeting people with old hardware?

Obviously, but everyone with more recent CPUs (as in only 14 years old) who runs x86 packages ends up running less efficient software.

If people want 32-bit pointers on 64-bit hardware they should pick the x32 ABI instead.

...which is not supported by Arch either.

Even though it is not supported officially, you can install the x32 libraries from AUR: https://aur.archlinux.org/packages/?K=libx32

What do you mean? The announcement specifically says multilib is unaffected. I run 32-bit programs all the time.

Multilib is a different concept from the x32 ABI. The typical use case for multilib is running i386 binaries that use the i386 ABI on an amd64 system (or sun4m binaries on sun4u, mips32 on mips64...). The x32 ABI is an alternate ABI for amd64 that uses 32-bit pointers but otherwise all the amd64 ISA extensions.

Today I learned something. Thanks!
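
If it helps to see the i386 / amd64 / x32 distinction from inside a program, here is a small C sketch that tells the three apart at compile time. It relies on the predefined macros `__x86_64__`, `__i386__` and `__ILP32__` as gcc and clang define them; the helper names `abi_name` and `pointer_bits` are made up for illustration.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical helper: names the ABI this translation unit was compiled for.
 * Assumption: gcc/clang predefined macros. x32 defines both __x86_64__ and
 * __ILP32__, plain amd64 defines only __x86_64__, and i386 defines __i386__. */
static const char *abi_name(void)
{
#if defined(__x86_64__) && defined(__ILP32__)
    return "x32";    /* amd64 instruction set, 32-bit pointers */
#elif defined(__x86_64__)
    return "amd64";  /* amd64 instruction set, 64-bit pointers */
#elif defined(__i386__)
    return "i386";   /* 32-bit instruction set and pointers */
#else
    return "other";
#endif
}

/* Pointer width in bits: 32 on i386 and x32, 64 on amd64. */
static unsigned pointer_bits(void)
{
    return (unsigned)(sizeof(void *) * CHAR_BIT);
}
```

Building the same file with gcc's `-m32`, `-m64` and `-mx32` should select the three branches in turn, provided the matching runtime libraries are installed, which for x32 is exactly the part many distros don't ship.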