Hacker News new | ask | show | jobs
by zwegner 1646 days ago
Pretty similar article from very recently: https://nullprogram.com/blog/2021/12/04/

Discussion: https://news.ycombinator.com/item?id=29439403

The article mentions in an addendum (and BeeOnRope also pointed it out in the HN thread) a nice CLMUL trick for dealing with quotes originally discovered by Geoff Langdale. That should work here for a nice speedup.

But without the CLMUL trick, I'd guess that the unaligned loads that generally occur after a vector containing both quotes and newlines in this version (the "else" case on lines 34-40) would hamper the performance somewhat, since it would eat up twice as much L1 cache bandwidth. I'd suggest dealing with the masks using bitwise operations in a loop, and letting i stay divisible by 16. Or just use CLMUL :)

3 comments

Hi, I'm one of the authors of the post

Thanks for pointing us to CLMUL, I'm not familiar with these kind of multiplications, but, converting the quote bitmask to a quoted bitmask would certainly make it faster. With this new bitmask, we could negate it and AND it with the newline mask, generating a mask of newlines that are not inside quotes. Getting the last newline then would be a simple CLZ of that mask. And there wouldn't be a need to resort to byte to byte processing.

In our tests, going byte to byte for more iterations to keep the alignment when hitting the "else case" performed worse than making the unaligned loads, but as you say "just use CLMUL" (as all loads will be aligned) :D

PMOVMSKB/BSF/POPCNT takes serious wizardry, but instructions like PCLMULLQLQDQ make you feel like Gandalf. It's defined:

    pair clmul(uint64_t a, uint64_t b) {
      uint64_t t, x = 0, y = 0;
      if (a && b) {
        if (bsr(a) < bsr(b)) t = a, a = b, b = t; /* optional */
        for (t = 0; b; a <<= 1, b >>= 1) {
          if (b & 1) x ^= a, y ^= t;
          t = t << 1 | a >> 63;
        }
      }
      return (pair){x, y};
    }
There's a famous paper on how it can perform polynomial division at 40gbps. It's really cool that it has practical applications in things like CSV too. https://www.intel.com/content/dam/www/public/us/en/documents...
CLMUL in general is a bit weird to wrap your head around, but a CLMUL with -1 isn't too tricky: it's like a running 1-bit sum, or in other words, each bit in the result is the parity of all the bits up to that point in the multiplier.

> In our tests, going byte to byte for more iterations to keep the alignment when hitting the "else case" performed worse than making the unaligned loads, but as you say "just use CLMUL" (as all loads will be aligned) :D

I was talking about using bitwise operations with the quote/escape/newline masks already computed (like in the blog post I linked), rather than a byte-by-byte loop. But yeah, CLMUL is better anyways :)

CLMUL is quite interesting. I learned about it when going in depth on how multiplications help with hashing.

A multiplication is in practice: - a sum over - a series (i.e. one for each bit set in the multiplier) - of shifts (where the shift amount is the index of that bit in the multiplier)

The shifting and the combining are great for hashing as they "distribute" each bit around.

CLMUL simply replaces the addition in step one with xor (which can also be thought as the single bit carryless addition).

Even with the CLMUL trick, CSV parsing does not play nice. It can be made to work for JSON parsing because you can make more assumptions. With CSV, it only works smoothly if you are willing to accept a subset of what most spreadsheet programs accept i.e. to assume your CSV is "well-formed". Considering for example that the following three cells:

AA"A,"AA""A","A"A"A

when opened in Excel will all give you the same value, using CLMUL to normalize will require many repeated additional SIMD operations-- probably at least 8 if not more. At some vector size it will be worth it, but not clear at 256. The irony is, if you are stuck with CSV input, then the fact that you couldn't get a better format/encoding also suggests that you can't assume your CSV is "well-formed"

That's not what Python and Google Sheets do.

    >>> list(csv.reader(['''AA"A,"AA""A","A"A"A'''], dialect='excel'))
    [['AA"A', 'AA"A', 'AA"A']]
Has the CSV format been standardized somewhere?
What I can tell you is that if you save a file with 'AA"A,"AA""A","A"A"A' (excluding the surrounding single-quotes) and then double-click to open in Excel, you get 3 cells with the exact same values. Furthermore, if you run `echo 'AA"A,"AA""A","A"A"A' | xsv select 1,2,3` you again get the same 3 values. For people working with CSV, it's far more likely that the user cares more about consistency with Excel than consistency with some python lib or with Google sheets-- neither of which are used much, compared to Excel, in the worlds where CSV tends to reside (at least in my experience)
Huh? your comment proves exactly what my prior post said, which is that you end up with 3 equal values that were each represented, in the input, in different ways. Looks like csv.reader + dialect=excel is doing exactly that.
Good points all around. Not sure what the OP's requirements are, but judging by their current code, CLMUL should do nicely (or they have a bug).

And also, thanks for that example. Clearly I don't know CSV well enough--are quotes in fields that don't start with a quote not special?

If their data comes from a controlled bubble and they need not assume "real-world" data, then CLMUL might do nicely but best case it would likely only be a marginal improvement (and even then I would be willing to bet, not at anything less than 512 bit vector sizes). Best case, it still needs additional vector calls to support quoting. Obviously, if no quoting will be supported, it's even simpler, but then you also cannot support commas or newlines inside of cell values and are getting so far from "CSV" that you might as well just say you have pipe-delimited data which happens to use comma instead of pipe in which case you don't need CLMUL. If quoting needs to be supported, CLMUL will still require a number of repeated passes, shifts etc to deal with the various cases including an escaped first quote char, last quote char, non-first-or-last quote char and embedded commas/newlines.
Thread's getting a bit old, hope you see this :)

Referring to the OP, I'm not sure what exact dialect of csv they're using, but its quoting rules are nothing like Excel's: quotes are escaped with a separate character (possibly backslash), and quoted regions can start anywhere, not just the start of a field. This makes CLMUL much easier to apply, quite similarly to how it's used in simdjson. Finding escaped characters can be done branchlessly with a handful of scalar operations on the masks (this code is in simdjson too).

For proper Excel-style parsing, you're quite possibly right (and looking at ZSV, I trust that you've thought about this problem a lot more than me). I'm not certain that it can't all be done efficiently with very few branches, though, using mostly SIMD and some scalar operations on masks. CLMUL might not be useful here, since quotes are treated completely differently depending on whether the field started with a quote. Instead you'd have a first phase basically looking for a comma or newline followed by a quote, then looking for the next unescaped quote. Escaped quotes within quoted fields can be found with a similar sequence to simdjson's backslash support (and removed with something like vcompressb on newer AVX-512 chips).

Saw it :)

> done efficiently with very few branches, though, using mostly SIMD and some scalar operations on masks [...] first phase basically looking for a comma or newline followed by a quote, then looking for the next unescaped quote

Yes that's almost exactly what ZSV does, except it just looks for any interesting char (i.e. delim or dbl-quote), irrespective of position relative to other interesting char, and then let's the non-SIMD loop figure out to do with the chunks in between. We tried pretty hard to use CLMUL-- ZSV was designed after a decent-enough digestion of Lemire and Langdale's work to try those techniques-- but the json vs CSV differences were too much, at least for us, to overcome due to the extra 8+ vector calls, without giving up more than the benefit. Furthermore, if you are going to "normalize" the value, even more of that benefit gets lost (in other words, if you read 'aa"aa,"aa""aa"', which are equivalent, and you want to spit out two equivalent values, then you will have to modify the second value before printing non-CSV, or modify the first value to print well-formed CSV, and either way you can't do that more efficiently with vector operations). Since we were not willing to give up the CSV edge cases, we were also not able to find any benefit from CMUL. That's not to say we turned over every stone-- maybe there is a way and we just missed it.

> are quotes in fields that don't start with a quote not special?

Sorry missed that q before. Yes, that is correct, at least according to how Excel parses with dbl-click-to-open (as opposed to doing an import operation from a menu). They only serve as a delimiter when they are strictly the first char. So: 'first cell, " second cell ,third cell' gives you 3 cells, the second of which starts and ends with a space and has a double-quote as its second char (i.e. the equivalent of '" "" second cell "')

As an aside, we made a deliberate decision to be consistent with Excel, for reason other than that in our experience, the users who are least able to cope with edge cases on their own, and therefore rely the most on the software to do what they want/expect-- are also most likely to accept Excel as the "standard", whereas users who want/need a different standard than Excel are also most likely to be able to be able to come up with some solution (e.g. python or other) if their need isn't met by the default behavior of a given software tool.

Yeah, I see how the quoting complications make CLMUL pretty hard to use (or just useless). I have a few ideas about how to deal with the quoting/escaping, and if I get some free time in the near future I'll try and code something up.

I've also pinged Geoff and Daniel on Twitter, linking your library: https://mobile.twitter.com/zwegner/status/147340073352554497... ...maybe they'll have some ideas. (Geoff has mentioned that he had done some initial work on a simdcsv)

I'll also drop some more links that you might've already seen, but could be useful otherwise. For removing double quotes, rather than memmove, you could use the "pack left" operation on vectors. There's a handful of ways to do this pre-AVX-512-VBMI2 (where the magical vcompressb instruction was added), like in these SO questions:

https://stackoverflow.com/questions/36932240/avx2-what-is-th...

https://stackoverflow.com/questions/45506309/efficient-sse-s...

And for keeping the main lexer structure that ZSV uses now, but using a small DFA to maintain the lexer state, this other trick from Geoff might work: https://branchfree.org/2018/05/25/say-hello-to-my-little-fri...

The carryless multiplication instructions are amazing and people should use them more often. They are just so poorly explained that they feel like magic.