Hacker News new | ask | show | jobs
by davidm1729 1644 days ago
Hi, I'm one of the authors of the post

Thanks for pointing us to CLMUL, I'm not familiar with these kind of multiplications, but, converting the quote bitmask to a quoted bitmask would certainly make it faster. With this new bitmask, we could negate it and AND it with the newline mask, generating a mask of newlines that are not inside quotes. Getting the last newline then would be a simple CLZ of that mask. And there wouldn't be a need to resort to byte to byte processing.

In our tests, going byte to byte for more iterations to keep the alignment when hitting the "else case" performed worse than making the unaligned loads, but as you say "just use CLMUL" (as all loads will be aligned) :D

3 comments

PMOVMSKB/BSF/POPCNT takes serious wizardry, but instructions like PCLMULLQLQDQ make you feel like Gandalf. It's defined:

    pair clmul(uint64_t a, uint64_t b) {
      uint64_t t, x = 0, y = 0;
      if (a && b) {
        if (bsr(a) < bsr(b)) t = a, a = b, b = t; /* optional */
        for (t = 0; b; a <<= 1, b >>= 1) {
          if (b & 1) x ^= a, y ^= t;
          t = t << 1 | a >> 63;
        }
      }
      return (pair){x, y};
    }
There's a famous paper on how it can perform polynomial division at 40gbps. It's really cool that it has practical applications in things like CSV too. https://www.intel.com/content/dam/www/public/us/en/documents...
CLMUL in general is a bit weird to wrap your head around, but a CLMUL with -1 isn't too tricky: it's like a running 1-bit sum, or in other words, each bit in the result is the parity of all the bits up to that point in the multiplier.

> In our tests, going byte to byte for more iterations to keep the alignment when hitting the "else case" performed worse than making the unaligned loads, but as you say "just use CLMUL" (as all loads will be aligned) :D

I was talking about using bitwise operations with the quote/escape/newline masks already computed (like in the blog post I linked), rather than a byte-by-byte loop. But yeah, CLMUL is better anyways :)

CLMUL is quite interesting. I learned about it when going in depth on how multiplications help with hashing.

A multiplication is in practice: - a sum over - a series (i.e. one for each bit set in the multiplier) - of shifts (where the shift amount is the index of that bit in the multiplier)

The shifting and the combining are great for hashing as they "distribute" each bit around.

CLMUL simply replaces the addition in step one with xor (which can also be thought as the single bit carryless addition).