Hacker News new | ask | show | jobs
by mattewong 1641 days ago
Saw it :)

> done efficiently with very few branches, though, using mostly SIMD and some scalar operations on masks [...] first phase basically looking for a comma or newline followed by a quote, then looking for the next unescaped quote

Yes that's almost exactly what ZSV does, except it just looks for any interesting char (i.e. delim or dbl-quote), irrespective of position relative to other interesting char, and then let's the non-SIMD loop figure out to do with the chunks in between. We tried pretty hard to use CLMUL-- ZSV was designed after a decent-enough digestion of Lemire and Langdale's work to try those techniques-- but the json vs CSV differences were too much, at least for us, to overcome due to the extra 8+ vector calls, without giving up more than the benefit. Furthermore, if you are going to "normalize" the value, even more of that benefit gets lost (in other words, if you read 'aa"aa,"aa""aa"', which are equivalent, and you want to spit out two equivalent values, then you will have to modify the second value before printing non-CSV, or modify the first value to print well-formed CSV, and either way you can't do that more efficiently with vector operations). Since we were not willing to give up the CSV edge cases, we were also not able to find any benefit from CMUL. That's not to say we turned over every stone-- maybe there is a way and we just missed it.

> are quotes in fields that don't start with a quote not special?

Sorry missed that q before. Yes, that is correct, at least according to how Excel parses with dbl-click-to-open (as opposed to doing an import operation from a menu). They only serve as a delimiter when they are strictly the first char. So: 'first cell, " second cell ,third cell' gives you 3 cells, the second of which starts and ends with a space and has a double-quote as its second char (i.e. the equivalent of '" "" second cell "')

As an aside, we made a deliberate decision to be consistent with Excel, for reason other than that in our experience, the users who are least able to cope with edge cases on their own, and therefore rely the most on the software to do what they want/expect-- are also most likely to accept Excel as the "standard", whereas users who want/need a different standard than Excel are also most likely to be able to be able to come up with some solution (e.g. python or other) if their need isn't met by the default behavior of a given software tool.

1 comments

Yeah, I see how the quoting complications make CLMUL pretty hard to use (or just useless). I have a few ideas about how to deal with the quoting/escaping, and if I get some free time in the near future I'll try and code something up.

I've also pinged Geoff and Daniel on Twitter, linking your library: https://mobile.twitter.com/zwegner/status/147340073352554497... ...maybe they'll have some ideas. (Geoff has mentioned that he had done some initial work on a simdcsv)

I'll also drop some more links that you might've already seen, but could be useful otherwise. For removing double quotes, rather than memmove, you could use the "pack left" operation on vectors. There's a handful of ways to do this pre-AVX-512-VBMI2 (where the magical vcompressb instruction was added), like in these SO questions:

https://stackoverflow.com/questions/36932240/avx2-what-is-th...

https://stackoverflow.com/questions/45506309/efficient-sse-s...

And for keeping the main lexer structure that ZSV uses now, but using a small DFA to maintain the lexer state, this other trick from Geoff might work: https://branchfree.org/2018/05/25/say-hello-to-my-little-fri...

Thank you for the ideas / links / mentions! I had not seen them and will check them out. Always looking for improvements!