> Structs are aligned to the member with the strictest alignment requirement, so a struct of a `uint64_t` and a `uint32_t` will be aligned on an 8-byte boundary, meaning its size will be 128 bits.

Don't most C compilers support a pragma to control this? `#pragma pack(4)` for clang and gcc, I believe. Given this (where I've made it add two arrays of 96-bit integers to make it easier to figure out the sizes in the assembly):

#include <stdint.h>
#pragma pack(4)
struct block_addr {
    uint64_t low;
    uint32_t high;
};

int sum(struct block_addr *a, struct block_addr *b, struct block_addr *c)
{
    for (int i = 0; i < 8; ++i)
    {
        c->low = a->low + b->low;
        c++->high = a++->high + b++->high;
    }
    return 17;
}

Here is the code for the loop body, which the compiler unrolled to make it even easier to see how the structure is laid out:

movq (%rbx), %rax
addq (%r15), %rax
movq %rax, (%r14)
movl 8(%rbx), %eax
addl 8(%r15), %eax
movl %eax, 8(%r14)

movq 12(%rbx), %rax
addq 12(%r15), %rax
movq %rax, 12(%r14)
movl 20(%rbx), %eax
addl 20(%r15), %eax
movl %eax, 20(%r14)

movq 24(%rbx), %rax
addq 24(%r15), %rax
movq %rax, 24(%r14)
movl 32(%rbx), %eax
addl 32(%r15), %eax
movl %eax, 32(%r14)

...

movq 84(%rbx), %rax
addq 84(%r15), %rax
movq %rax, 84(%r14)
movl 92(%rbx), %eax
addl 92(%r15), %eax
movl %eax, 92(%r14)

(Some whitespace added, and the middle cut out.) The 96-bit integers now take up only 96 bits: consecutive elements sit 12 bytes apart (offsets 0, 12, 24, ..., 84).

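The layout claim can also be checked without reading assembly. Here is a minimal sketch, assuming GCC or Clang (both accept `#pragma pack(push, 4)` / `#pragma pack(pop)`). The struct names `addr_default` and `addr_packed` and the two helper functions are made up for illustration and are not from the code above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Default layout: the ABI decides alignment and trailing padding. */
struct addr_default {
    uint64_t low;
    uint32_t high;
};

/* Same members, but member alignment is capped at 4 bytes.
 * push/pop limits the cap to this one definition. */
#pragma pack(push, 4)
struct addr_packed {
    uint64_t low;
    uint32_t high;
};
#pragma pack(pop)

/* With an 8-byte and a 4-byte member and a 4-byte cap, no padding is needed. */
static_assert(sizeof(struct addr_packed) == 12, "packed struct should be 96 bits");
static_assert(offsetof(struct addr_packed, high) == 8, "high should follow low directly");

/* Byte distance between consecutive array elements for each layout. */
static inline size_t stride_default(void) { return sizeof(struct addr_default); }
static inline size_t stride_packed(void)  { return sizeof(struct addr_packed); }
```

On x86-64 the default stride is typically 16 bytes, which is the 128-bit size from the quoted comment. Other ABIs can differ: 32-bit x86 System V, for example, only requires 4-byte alignment for `uint64_t` members, so the two layouts would match there.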
Changing the loop to 4 iterations for compactness' sake, (aligned) structs of two u64s generate the following, vectorized code:
https://godbolt.org/g/jB4jki
And if the pointer arguments are declared `restrict`, the loop can be vectorized even more aggressively.

Either of these is much more efficient than the code generated for unaligned, packed 96-bit structs.

A smaller cost is that in non-vector code, using a 64-bit register (rax) in 32-bit mode (eax) wastes half of the register. IIRC, unaligned loads and stores can also, at the hardware level, stall the pipeline and inhibit out-of-order execution, most noticeably when an access straddles a cache-line boundary.
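For anyone who wants to try the comparison themselves, here is a sketch of what the aligned, two-u64 variant with `restrict` might look like. It is not the exact source behind the godbolt link. The names `block_addr128` and `sum128` are hypothetical, and the null-pointer `assert` is an addition for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit variant: two naturally aligned 64-bit halves. */
struct block_addr128 {
    uint64_t low;
    uint64_t high;
};

/*
 * Element-wise add of four structs, mirroring the loop in the parent comment:
 * each half is added independently, with no carry from low into high.
 * `restrict` promises the compiler that a, b and c never overlap.
 */
static inline void sum128(const struct block_addr128 *restrict a,
                          const struct block_addr128 *restrict b,
                          struct block_addr128 *restrict c)
{
    assert(a && b && c);
    for (int i = 0; i < 4; ++i) {
        c[i].low  = a[i].low  + b[i].low;
        c[i].high = a[i].high + b[i].high;
    }
}
```

Compiling this at `-O2` or `-O3` with and without the `restrict` qualifiers, then diffing the assembly, is one way to see how much the no-aliasing promise changes what the compiler emits. The result will depend on compiler version and target flags.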