|
I've been using a lot of inline assembly lately, and while Stockholm syndrome might be in effect, I'm coming to like the GCC syntax. For me, the main thing that has helped has been adopting a consistent style. Here are some examples of what I'm currently using for an AVX2 popcnt optimization, with some explanation.

#define ASM_VEC_BYTE_COUNT_SET(vec, sum, mask, shuf) \
__asm volatile ("vpsrld $4, %[VEC], %[SUM]\n" \
"vpand %[MASK], %[VEC], %[VEC]\n" \
"vpand %[MASK], %[SUM], %[SUM]\n" \
"vpshufb %[VEC], %[SHUF], %[VEC]\n" \
"vpshufb %[SUM], %[SHUF], %[SUM]\n" \
"vpaddb %[VEC], %[SUM], %[SUM]\n" : \
/* rd/wr ymm */ [VEC] "+&x" (vec), \
/* write ymm */ [SUM] "=&x" (sum) : \
/* read ymm */ [MASK] "x" (mask), \
/* read ymm */ [SHUF] "x" (shuf))
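
For anyone who hasn't seen the nibble-lookup trick, here is a rough scalar model of what the six instructions do to each byte. This is only a sketch: it assumes `mask` holds 0x0F in every byte and `shuf` holds the 16-entry nibble popcount table in each 128-bit lane, neither of which is shown in the post. The names `NIBBLE_POPCOUNT`, `byte_count_set` and `buffer_count_set` are made up for illustration. The macro itself stops at one count per byte; the summing loop is just there to make the model easy to check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed table contents: number of set bits in each nibble value 0..15. */
static const uint8_t NIBBLE_POPCOUNT[16] = {
    0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4
};

/* Scalar equivalent of the macro for a single byte. */
static inline uint8_t byte_count_set(uint8_t b)
{
    uint8_t lo = (uint8_t)(b & 0x0F);        /* vpand MASK, VEC        */
    uint8_t hi = (uint8_t)((b >> 4) & 0x0F); /* vpsrld $4 + vpand MASK */
    /* the two vpshufb lookups followed by vpaddb */
    return (uint8_t)(NIBBLE_POPCOUNT[lo] + NIBBLE_POPCOUNT[hi]);
}

/* Illustration only: add up the per-byte counts over a buffer. */
static inline size_t buffer_count_set(const uint8_t *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t total = 0;
    for (size_t i = 0; i < len; i++)
        total += byte_count_set(buf[i]);
    return total;
}
```

The vector version does the same thing to 32 bytes at once, with vpshufb acting as 32 parallel table lookups.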
1) Try to use the %[symbolic] syntax rather than the numeric %0, %1 form. It's slightly longer to write, but usually clearer to read. Use upper case for the symbolic name. Put your operands one per line, with a preceding comment.
2) If you are using the same assembly more than once in your program, declare your assembly within a #define macro, then use the macro in your code.
3) Use "__asm volatile". Declaring "volatile" is not required, but once you are writing inline assembly you usually know more than the compiler about where the block should go.
5) If you have multiple lines of assembly and output registers, you are almost always safer using "+&" and "=&" for your constraints rather than just "+" or "=". Search for "early clobber" for details.
6) Strongly prefer single type constraints. The more flexibility you give the compiler, the more likely it is to defeat your efforts at optimization. Use explicit memory addressing modes rather than "m". The "c" modifier is needed for the offset so that the constant is printed without the "$" prefix.

#define ASM_VEC_LOAD_OFFSET_MEM(off, mem, vec) \
__asm volatile ("vmovdqu %c[OFF](%[MEM]), %[VEC]\n" : \
/* destination */ [VEC] "=x" (vec) : \
/* byte offset */ [OFF] "i" (off), \
/* mem address */ [MEM] "r" (mem) : \
/* reads *mem */ "memory")
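
To try points 5 and 6 without AVX2 hardware, here is a minimal sketch using general-purpose registers instead of vectors. All names (`add_twice`, `ASM_LOAD_U32_OFFSET_MEM`, `load_second_word`) are hypothetical. The "memory" clobber is an addition that seems necessary when the address is passed as "r": the compiler otherwise has no way to know the block reads memory. The `#else` branch is only a plain C fallback so the file still builds on non-x86-64 targets.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__)

/* Returns a + 2*b. OUT is written by the first instruction while B is
 * still needed afterwards, so OUT must not share a register with B:
 * that is what the early clobber "&" guarantees. */
static inline uint64_t add_twice(uint64_t a, uint64_t b)
{
    uint64_t out;
    __asm volatile ("movq %[A], %[OUT]\n"
                    "addq %[B], %[OUT]\n"
                    "addq %[B], %[OUT]\n" :
        /* write gpr */ [OUT] "=&r" (out) :
        /* read gpr  */ [A] "r" (a),
        /* read gpr  */ [B] "r" (b) :
        /* flags     */ "cc");
    return out;
}

/* Explicit offset(base) addressing. %c prints the constant as "4"
 * rather than "$4", which is what the addressing mode requires. */
#define ASM_LOAD_U32_OFFSET_MEM(off, mem, val) \
    __asm volatile ("movl %c[OFF](%[MEM]), %[VAL]\n" : \
        /* destination */ [VAL] "=r" (val) : \
        /* byte offset */ [OFF] "i" (off), \
        /* mem address */ [MEM] "r" (mem) : \
        /* reads *mem  */ "memory")

static inline uint32_t load_second_word(const uint32_t *p)
{
    assert(p != 0);
    uint32_t val;
    ASM_LOAD_U32_OFFSET_MEM(4, p, val); /* offset must be a compile-time constant */
    return val;
}

#else /* portable fallback, same results without inline assembly */

static inline uint64_t add_twice(uint64_t a, uint64_t b) { return a + 2 * b; }

static inline uint32_t load_second_word(const uint32_t *p)
{
    assert(p != 0);
    return p[1];
}

#endif
```

With plain "=r" on OUT the compiler would be free to put OUT and B in the same register, and the first instruction would then overwrite B before it is read.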
7) The register constraints for vectors are tricky, because the "x" constraint is used for both XMM and YMM vectors. There is no way to specify in the constraint that you want only one or the other. This sort of makes sense, since in hardware they share the same register. When you need both forms of the same vector, you can use the "q" modifier to get XMM syntax in the output. |
5 - I can't think of any meaning early clobber has on an input+output constraint ("+")?
6 - there are many cases where you really do want to give the compiler flexibility in addressing modes. Unfortunately clang tends to ignore that and generate (reg) regardless.
7 - not really different than GPRs; you use "r" as the constraint then a modifier like "k" for the size.
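
A small sketch of the GPR case from point 7, assuming x86-64 and AT&T syntax. The operand is constrained with plain "r" and holds a 64-bit value, while the "k" modifier makes the template print the 32-bit name of the same register. The function names `inc_low32` and `inc_low32_selfcheck` are made up, and the `#else` branch is just a plain C equivalent for other targets.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__x86_64__)

/* Increments only the low 32 bits of v. Writing a 32-bit register on
 * x86-64 zeroes the upper half, so the result is (uint32_t)(v + 1). */
static inline uint64_t inc_low32(uint64_t v)
{
    __asm volatile ("addl $1, %k[V]\n" :
        /* rd/wr gpr */ [V] "+r" (v) :
        /* no inputs */ :
        /* flags     */ "cc");
    return v;
}

#else /* portable fallback with the same result */

static inline uint64_t inc_low32(uint64_t v)
{
    return (uint64_t)(uint32_t)(v + 1);
}

#endif

/* Tiny self-check: the low half wraps around and the upper half is cleared. */
static inline void inc_low32_selfcheck(void)
{
    assert(inc_low32(0x1FFFFFFFFull) == 0);
}
```

The constraint only picks the register; the modifier decides which of its names shows up in the emitted instruction.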
I guess the lesson is that yeah gcc inline asm is powerful, but they try to leave it undocumented for a reason. Also, who stole number 4?