Hacker News new | ask | show | jobs
by nkurz 4707 days ago
That's really interesting, and an approach to such problems that I'd never considered. I was excited that a "Code generator for bit permutations" (http://programming.sirrida.de/calcperm.php) exists, but using a theorem prover is really another level of possibility. Now I need to figure out how to apply it to the problem I'm currently thinking about: http://stackoverflow.com/questions/17880178/how-do-i-sum-the...
1 comments

We can use the exact same approach used in the bit reversal trick of the article:

  ((x * 0x01010101) & 0xC0300C03) % 1023
This is probably not gonna be faster than the naive approach, though.
Thinking a little further about this, I believe using PSHUFB is the way to go, at least for when the count is large. This is because we can do 2 iterations in essentially one go (haven't tested the code, it's mostly a sketch):

  vmovdqa xmm0, [0, 1, 2, 3, 1, 2, 3, 4, 2, 3, 4, 5, 3, 4, 5, 6]
  vmovdqa xmm15, [0x0f, 0x0f, ..., 0x0f]
  vmovdqu xmm7, [rdi]
  _loop_body:
  vpand   xmm8, xmm15, [rdi]
  vpsrlw  xmm9, xmm7, 4
  vpand   xmm9, xmm9, xmm15
  vpshufb xmm8, xmm0, xmm8
  vpshufb xmm9, xmm0, xmm9
  vpaddb  xmm8, xmm8, xmm9

  vpshufb xmm7, xmm8, xmm8 ; since sum <= 12, we already have the next sum in the vector!
                           ; xmm7[0] = xmm8[xmm8[0]]
  vpaddb  xmm8, xmm8, xmm7 ; add it

  vpextrb eax, xmm8, 0
  vmovdqu xmm7, [rdi + rax]
  add rdi, rax
  sub esi, 2
  jnz _loop_body
This is likely extendable to 32-byte vectors with AVX2; have not thought much about that case.
This is exciting, but I think I may have tricked you on a couple of details. The actual distance to the next 'key' is 'sum + 5': 00 represents a 1 byte encoding, not zero, so the minimum offset to the next 'key' is 5. Thus the maximum offset is actually 17, which means one can't depend on having two keys within a single 16 byte vector. I'm trying to figure out if there is some way to compensate for this without halving the performance.
Oh, I missed that. That makes things trickier, but I think we can still get away with something like

  vmovdqu xmm7, [rdi + rax + 5 - 1]
  vpinsrb xmm7, xmm7, [rdi + rax + 0], 0
without too much of a performance penalty. The adjustments to offsets then can be put into the shuffle tables, so there should be no further significant performance loss.
Yes, I think that should work to guarantee two per vector. I hadn't previously considered trying to do that, and appreciate the suggestion and the sketch. I think I have a slightly faster (7 cycle) approach doing one at at time using a 64-bit register as a lookup for the sum of the middle two fields, but this has good promise. Especially if we can get out one farther ahead, so instead of having the vector reload on the critical path, the unused portion of the current vector and a preload can be 'slid' into place. Do you know if there is a good way to simulate a PALIGN but with a non-immediate operand? This might get down to 9-10 cycles for two keys.
I have no idea how to simulate a variable PALIGNR on Intel chips without making the loop extremely slow.

On AMD (with XOP), it can be done using VPPERM, which can shuffle from 2 sources. We can do variable alignment like this:

  vpperm xmm0, xmm1, xmm2, [[0..31] + offset]
On second thought, we can possibly do something similar on Intel using 2 pshufb and a blend.
Thanks for looking at this! This is wonderful and succinct (presuming it works, I'm still staring at it), but the 20+ latency of DIV makes it impractical. I think I can do it with multiply, and, multiply, shift, but even 3 cycle latency for the multiplies is too much:

  key *= 0x04040400UL;
  key &= 0x30C30C00UL;
  key *= 0x00041041UL;
  key >>= 28;
Let me stare at your shuffling solution for a bit...
In the interest of achieving the simplest instruction sequence, I've also found (AVX2):

  key   = _pdep_u32(key, 0x03030303);
  key  *= 0x01010101;
  key >>= 24;