by cb321 1495 days ago
It handles most cases, though maybe not the arbitrary garbage that humans might be able to guess at, and I don't think RFC 4180 covers all of those anyway. c2tsv is UTF-8/binary agnostic: it just keys off ASCII commas, newlines, etc. Beats me how you ensure you handle anything the "same" way Excel does without actually running Excel's code somehow. { Maybe for today's Excel, but next year's, or the one from 10 years ago? } The little state machine could be extended, but it's hard to guess what the speed impact might be until you actually write said extensions.
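To make "little state machine" concrete, here is a minimal sketch of a byte-at-a-time converter from quoted CSV to tab-delimited output. It is not c2tsv's actual code. The function name `csv_to_tsv` and the escaping choice (tabs, newlines, and backslashes inside fields become two-byte backslash escapes) are assumptions made for illustration.

```python
def csv_to_tsv(data: bytes) -> bytes:
    """Byte-at-a-time state machine: RFC 4180-style quoted CSV -> tab-delimited."""
    COMMA, QUOTE, NL, CR, TAB, BSL = b',"\n\r\t\\'  # unpacks to integer byte values
    # ASSUMPTION: in-field tab, newline, and backslash are written as two-byte
    # backslash escapes so the output never contains a raw delimiter in a field.
    escapes = {TAB: b"\\t", NL: b"\\n", BSL: b"\\\\"}
    out = bytearray()
    in_quotes = False
    i, n = 0, len(data)
    while i < n:
        c = data[i]
        if in_quotes:
            if c == QUOTE:
                if i + 1 < n and data[i + 1] == QUOTE:
                    out.append(QUOTE)  # a doubled quote is one literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                out += escapes.get(c, bytes([c]))
        elif c == QUOTE:
            in_quotes = True
        elif c == COMMA:
            out.append(TAB)  # field separator
        elif c == CR:
            pass  # simplification: drop unquoted CR (e.g. from CRLF row ends)
        elif c == NL:
            out.append(NL)  # row separator
        else:
            out += escapes.get(c, bytes([c]))
        i += 1
    return bytes(out)
```

Every byte goes through the same branch ladder, which is why this side of the pipeline cannot lean on SIMD the way a plain delimiter search can.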

From a performance perspective, strictly delimiter-separated values { again, ironically redundant ;-) } can be parsed with memchr. On Linux, memchr should be SIMD-vectorized, at least on x86_64 glibc, via ELF 'i' (GNU IFUNC) symbols that select an implementation at load time. So, while you give up SIMD on the "messy part" with a byte-at-a-time DFA, you regain it on the other side. (I have no idea if Apple gives you SIMD-vectorized memchr.)
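A rough sketch of what "parsed with memchr" means, written in Python with `bytes.find` standing in for the C call. The names `split_fields` and `parse_rows` and the tab delimiter are assumptions for illustration. The point is that once quoting is gone, the only operation needed is "find the next occurrence of one byte".

```python
def split_fields(line: bytes, delim: bytes = b"\t") -> list:
    """Split one strictly delimited record (no quoting, no escapes to interpret)."""
    fields = []
    start = 0
    while True:
        end = line.find(delim, start)  # single-byte search, the memchr role
        if end == -1:
            fields.append(line[start:])  # last field runs to end of record
            return fields
        fields.append(line[start:end])
        start = end + 1

def parse_rows(buf: bytes, delim: bytes = b"\t", newline: bytes = b"\n") -> list:
    """Split a buffer into rows, then each row into fields."""
    rows = []
    pos, n = 0, len(buf)
    while pos < n:
        eol = buf.find(newline, pos)
        if eol == -1:
            eol = n  # final row without a trailing newline
        rows.append(split_fields(buf[pos:eol], delim))
        pos = eol + 1
    return rows
```

For example, `parse_rows(b"a\tb\tc\n1\t\t3\n")` yields two rows of three fields each, with the empty middle field of the second row preserved.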

If you send the output to a file, segmentation (for parallel handling of segments) is also a simple application of memchr rather than needing an index of where rows start: you just split the file evenly by bytes and find the next newline char after each split point (roughly). This can get you 16..128X speed-ups (today, anyway, on just one host), depending upon what you do.
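The split-by-bytes idea can be sketched like this, again in Python with `bytes.find` in place of memchr. The function name `segment_bounds` is hypothetical, and a real version would work on an mmap'd file rather than an in-memory `bytes` object.

```python
def segment_bounds(buf: bytes, nseg: int, newline: bytes = b"\n") -> list:
    """Return (start, end) byte ranges that each end just after a newline or at EOF."""
    n = len(buf)
    bounds = []
    start = 0
    for i in range(1, nseg + 1):
        if start >= n:
            break  # earlier segments already swallowed the rest of the data
        if i == nseg:
            end = n  # last segment takes whatever is left
        else:
            guess = max(start, i * n // nseg)  # even split by byte count
            nl = buf.find(newline, guess)      # next newline at or after the guess
            end = n if nl == -1 else nl + 1
        if end > start:
            bounds.append((start, end))
        start = end
    return bounds
```

Each range can then go to its own worker. This only works because a newline in strictly delimited data always ends a row; in quoted CSV a newline can sit inside a field, so a blind search from the middle of the file cannot tell row ends from field content.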

Conversion to something properly byte-delimited basically restores whatever charm you might have thought ?SV had. I can only imagine a few corner cases where running directly off a complex format like quoted CSV makes sense ("tiny" data, "cannot/will not spend 2X space+must save input", "cannot/will not spend time to recompress", "running on a network filesystem shared with those who refuse simplicity"). These cases are not common (for me). When they do happen, perf is usually limited by other things like network IO, start-up overheads, etc. Usually that little extra bit to write buffers out to a pipeline will either not matter or be outright immediately repaid in parallelism, parsing simplicity, or both.

Converting from any ASCII to even faster binary formats has a similar story, but usually with even more perf improvement (depending..) and more "choices" like how to represent strings [1]. Once data is fully pre-parsed, the performance of conversion matters much less. (How much less depends on the ratio of processing passes to initial parses.) Between both parallelism and ASCII->binary, however fast you make your serial zsv parser/ETL stuff, actual data analysis may still run 10,000 times slower than it could on just 1 CPU (depending upon what throttles your workloads..you may only get 10,000x for CPU-local, L1-resident nested loop stuff). { But we veer now toward trying to cram a database course into an HN comment. :) And I'm probably repeating myself/others. Direct email from here may work better. }

[1] https://github.com/c-blake/nio
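A toy version of the ASCII->binary step, to show why the parse cost gets paid only once. This is not the nio format from [1]. It is just a column of native-endian 8-byte floats built with Python's standard `array` module, and the function names are made up for the sketch.

```python
from array import array

def ascii_to_binary(column_text: bytes) -> bytes:
    """Parse a newline-separated column of ASCII numbers once; emit raw doubles."""
    tokens = (tok for tok in column_text.split(b"\n") if tok)
    values = array("d", (float(tok.decode("ascii")) for tok in tokens))
    return values.tobytes()  # native byte order, 8 bytes per value

def load_binary(raw: bytes) -> array:
    """Later passes: no text parsing at all, just reinterpret the bytes."""
    values = array("d")
    values.frombytes(raw)
    return values
```

Every pass after the first one skips `float()` entirely, so the more times the data is processed per conversion, the less the conversion speed matters.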

1 comment

> strictly delimiter-separated values

Sigh... If only everyone had used ASCII (and Unicode!) characters 30 and 31 for delimiters, since they are actual delimiter characters: https://en.wikipedia.org/wiki/Delimiter#ASCII_delimited_text

I don't think I've ever seen them in the wild. :-(
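For what it's worth, a minimal sketch of ASCII-delimited text using those two characters (31 is the unit separator for fields, 30 is the record separator for rows). The helper names `encode` and `decode` are made up, and the sketch assumes the data itself never contains bytes 0x1E or 0x1F.

```python
US = b"\x1f"  # ASCII 31, unit separator: goes between fields
RS = b"\x1e"  # ASCII 30, record separator: goes after each record

def encode(rows: list) -> bytes:
    """Join fields with US and terminate every record with RS."""
    return b"".join(US.join(fields) + RS for fields in rows)

def decode(buf: bytes) -> list:
    """Inverse of encode; assumes fields never contain US or RS."""
    records = buf.split(RS)
    if records and records[-1] == b"":
        records.pop()  # drop the empty piece after the final RS
    return [rec.split(US) for rec in records]
```

Commas, tabs, quotes, and newlines inside fields need no quoting or escaping at all here, which is the whole appeal.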