by premchai21 5182 days ago
I wish I had time to write a more thorough response right now, but I just did a short test with Debian sid and its GCC 4.6.3 on a modern Xeon machine under Xen (not the best setup for performance testing, so take this with a grain of salt).

At -O9 (which GCC treats the same as -O3), the compiler optimizes a masks-and-shifts swap of a uint64_t into a bswapq instruction identical to the one emitted by the GCC-specific __builtin_bswap64; this can be coupled with an initial memcpy into a temporary uint64_t. Loading individual bytes and shifting them in emits a pile of instructions that takes up 16 times as much code space and incurs a ~35% runtime penalty (2.7 s versus 2 s). This was measured in a loop decoding a big-endian integer into a native uint64_t and writing it to a volatile extern uint64_t global, 2^30 iterations, with the function called through a function pointer.
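For anyone who wants to try this, here is a rough sketch of the three decoding variants described above. The function names are made up for illustration, and the exact masks-and-shifts form used in the test is not given, so the one here is only one common way to write it; whether a particular compiler turns it into bswapq is something to check in the assembly output. The two swapping variants assume a little-endian host such as x86-64.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Masks-and-shifts byte swap: exchange the 32-bit halves, then the
   16-bit pairs inside each half, then the bytes inside each pair. */
uint64_t swap64_masks(uint64_t x)
{
    x = ((x & 0x00000000FFFFFFFFULL) << 32) | ((x >> 32) & 0x00000000FFFFFFFFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    return x;
}

/* Variant 1: memcpy into a temporary uint64_t, then swap by hand.
   Assumes a little-endian host; a big-endian host needs no swap. */
uint64_t decode_be64_masks(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return swap64_masks(v);
}

/* Variant 2: same load, but with the GCC-specific builtin. */
uint64_t decode_be64_builtin(const uint8_t *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return __builtin_bswap64(v);
}

/* Variant 3: load individual bytes and shift them in.
   Independent of host byte order. */
uint64_t decode_be64_bytes(const uint8_t *p)
{
    return ((uint64_t)p[0] << 56) | ((uint64_t)p[1] << 48) |
           ((uint64_t)p[2] << 40) | ((uint64_t)p[3] << 32) |
           ((uint64_t)p[4] << 24) | ((uint64_t)p[5] << 16) |
           ((uint64_t)p[6] << 8)  |  (uint64_t)p[7];
}

/* Sanity check: the byte-wise decoder gives the big-endian value, and
   the two swapping variants agree with each other. */
void check_decoders(void)
{
    static const uint8_t in[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    assert(decode_be64_bytes(in) == 0x0102030405060708ULL);
    assert(decode_be64_masks(in) == decode_be64_builtin(in));
}
```

Compiling with gcc -O3 -S and comparing the bodies of the three functions is one way to see the code-size difference for a given compiler version.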

Aligned versus unaligned pointers seem to make no real difference on this CPU, using a static __attribute__((aligned(8))) uint8_t[16] and offsets of 0 (aligned) and 5 (unaligned) from the start of the array.
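A minimal sketch of what such a harness might look like, assuming a single translation unit (the real test used an extern global, and its actual source is not shown here). All names are hypothetical, and the iteration count is a parameter so the loop can be tried with something smaller than 2^30.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical harness; names and layout are assumptions. */
typedef uint64_t (*decode_fn)(const uint8_t *);

/* volatile, so the compiler has to perform every store in the loop. */
volatile uint64_t sink;

/* 8-byte-aligned backing store (GCC attribute syntax). */
static uint8_t buffer[16] __attribute__((aligned(8)));

/* One decoder to plug in: byte loads shifted into place. */
uint64_t decode_be64_bytes(const uint8_t *p)
{
    return ((uint64_t)p[0] << 56) | ((uint64_t)p[1] << 48) |
           ((uint64_t)p[2] << 40) | ((uint64_t)p[3] << 32) |
           ((uint64_t)p[4] << 24) | ((uint64_t)p[5] << 16) |
           ((uint64_t)p[6] << 8)  |  (uint64_t)p[7];
}

/* Fill the buffer with 1..16 so the decoded value is easy to recognize. */
void fill_buffer(void)
{
    for (size_t i = 0; i < sizeof buffer; i++)
        buffer[i] = (uint8_t)(i + 1);
}

/* Decode repeatedly through a function pointer. An offset of 0 gives an
   aligned source pointer, an offset of 5 an unaligned one. */
void run_decode_loop(decode_fn fn, size_t offset, uint64_t iterations)
{
    assert(offset + 8 <= sizeof buffer);
    const uint8_t *p = buffer + offset;
    for (uint64_t i = 0; i < iterations; i++)
        sink = fn(p);
}
```

Timing run_decode_loop(fn, 0, n) against run_decode_loop(fn, 5, n) for the same fn is the aligned-versus-unaligned comparison; the numbers will of course depend on the CPU and compiler.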

I also tried a function with the explicit cast-shift-or that uses an initial memcpy into a local uint8_t[8], in case the compiler was doing something strange with regard to memory read fault ordering as compared to the explicit memcpy in the two bswapq-generating versions. This resulted in some very "interesting" code that shoves the local array into a register and then very roughly masks and shifts all the bits around, at about a 100% penalty relative to the bswapq functions. :-(
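Roughly, that variant looks like the sketch below (the function name is hypothetical and the original source may differ in detail): copy the eight bytes into a local array first, then do the same cast-shift-or on the copy.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy into a local byte array, then cast, shift and OR each byte.
   The result does not depend on host byte order. */
uint64_t decode_be64_copy_then_shift(const uint8_t *p)
{
    uint8_t tmp[8];

    assert(p != NULL);
    memcpy(tmp, p, sizeof tmp);
    return ((uint64_t)tmp[0] << 56) | ((uint64_t)tmp[1] << 48) |
           ((uint64_t)tmp[2] << 40) | ((uint64_t)tmp[3] << 32) |
           ((uint64_t)tmp[4] << 24) | ((uint64_t)tmp[5] << 16) |
           ((uint64_t)tmp[6] << 8)  |  (uint64_t)tmp[7];
}
```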

If anyone's interested in the details, reply and I'll try to put them somewhere accessible, though it may take a little while.

1 comment

This isn't surprising. If you set the AC bit on x86, the CPU will disallow unaligned accesses and you'll be operating in an environment more similar to RISC machines. In order to allow such a thing to succeed, GCC can't produce a 32-bit read from a char* address, since the alignment is only guaranteed to be 1 (i.e. no alignment) and this would trigger SIGBUS. Thus, in order to get a 32-bit read, you must dereference a 32-bit variable, not 4x 8-bit ones. This makes even more sense on RISC systems, where this "optimization" would be a tragic bug you'd want to work around in your compiler. See my post with the x86 assembly output confirming your general results.
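To make the two source-level forms concrete, here is a small sketch with hypothetical function names. The first does four 8-bit reads, each with alignment 1; the second copies into a real uint32_t object and reads that, which is the form that gives the compiler a 32-bit variable to work with. What instructions a given compiler emits for each is not shown here and would need to be checked in its assembly output.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Four 8-bit reads combined in little-endian order. Each read only
   needs alignment 1, so this is valid at any address. */
uint32_t load_le32_bytes(const unsigned char *p)
{
    assert(p != NULL);
    return  (uint32_t)p[0]        | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Copy into a uint32_t object and read that. The result is in the
   host's native byte order, and the source pointer may be unaligned. */
uint32_t load_native32_memcpy(const unsigned char *p)
{
    uint32_t v;

    assert(p != NULL);
    memcpy(&v, p, sizeof v);
    return v;
}
```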