by premchai21
5182 days ago
I wish I had time to write a more thorough response right now, but I just did a short test with Debian sid and its GCC 4.6.3 on a modern Xeon machine under Xen (not the best performance-testing setup, so take this with some salt). At -O9 (which GCC treats the same as -O3), the compiler optimizes a masks-and-shifts swap of a uint64_t into a bswapq instruction identical to the one emitted by the GCC-specific __builtin_bswap64; either can be coupled with an initial memcpy into a temporary uint64_t. Loading individual bytes and shifting them in emits a pile of instructions that takes up 16 times as much code space and incurs a ~35% runtime penalty (2.7 s versus 2 s).

This was measured in a loop decoding a big-endian integer into a native uint64_t and writing it to a volatile extern uint64_t global, 2^30 iterations, with the function called through a function pointer. Aligned versus unaligned pointers seem to make no real difference on this CPU; I used a static __attribute__((aligned(8))) uint8_t[16] with offsets of 0 (aligned) and 5 (unaligned) from the start of the array.

I also tried a function with the explicit cast-shift-or that first does a memcpy into a local uint8_t[8], in case the compiler was doing something strange with regard to memory read fault ordering compared to the explicit memcpy in the two bswapq-generating versions. This resulted in some very "interesting" code that shoves the local array into a register and then very roughly masks and shifts all the bits around, at about a 100% penalty relative to the bswapq functions. :-( If anyone's interested in the details, reply and I'll try to put them somewhere accessible, though it may take a little while.
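
For reference, here is a minimal sketch of what the four decoders compared above might look like. This is not the original test code: the function names are hypothetical, the benchmark loop is omitted, and the two swap-based versions assume a little-endian host (as on the x86-64 machine described), since they byte-swap unconditionally.

```c
#include <stdint.h>
#include <string.h>

/* Masks-and-shifts byte swap of a 64-bit value (portable C). */
static uint64_t swap64_masks(uint64_t x)
{
    x = ((x & 0x00000000FFFFFFFFULL) << 32) | ((x & 0xFFFFFFFF00000000ULL) >> 32);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x & 0xFFFF0000FFFF0000ULL) >> 16);
    x = ((x & 0x00FF00FF00FF00FFULL) <<  8) | ((x & 0xFF00FF00FF00FF00ULL) >>  8);
    return x;
}

/* Variant 1: memcpy into a temporary, then masks-and-shifts swap.
 * Assumption: little-endian host. */
uint64_t decode_be64_masks(const uint8_t *p)
{
    uint64_t tmp;
    memcpy(&tmp, p, sizeof tmp);   /* safe for unaligned p */
    return swap64_masks(tmp);
}

/* Variant 2: memcpy into a temporary, then the GCC-specific builtin.
 * Assumption: little-endian host, GCC-compatible compiler. */
uint64_t decode_be64_builtin(const uint8_t *p)
{
    uint64_t tmp;
    memcpy(&tmp, p, sizeof tmp);
    return __builtin_bswap64(tmp);
}

/* Variant 3: load individual bytes and shift them in (cast-shift-or).
 * Independent of host byte order. */
uint64_t decode_be64_bytes(const uint8_t *p)
{
    return ((uint64_t)p[0] << 56) | ((uint64_t)p[1] << 48) |
           ((uint64_t)p[2] << 40) | ((uint64_t)p[3] << 32) |
           ((uint64_t)p[4] << 24) | ((uint64_t)p[5] << 16) |
           ((uint64_t)p[6] <<  8) |  (uint64_t)p[7];
}

/* Variant 4: memcpy into a local byte array first, then cast-shift-or. */
uint64_t decode_be64_bytes_copy(const uint8_t *p)
{
    uint8_t local[8];
    memcpy(local, p, sizeof local);
    return decode_be64_bytes(local);
}
```

All four should return the same value for the same input on a little-endian machine; the differences described above are in the code the compiler generates, which can be inspected with `gcc -O3 -S`.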