
That's right - pure splitting is much more SIMD-friendly than the whole syntax melange. This is another charm of the "partitioned" design. The conversion to split-parseable TSV can run on its own CPU core and the SIMD splitting on another. As long as pipe bandwidth suffices, you get very easy parallelism. This kind of design/intent was popular on Unix at the dawn of multiprocessing, when there were still Giant Kernel Locks. But it still has merits, even on Windows.
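To make "split-parseable" concrete, here is a minimal sketch of the converter stage, assuming an escaping scheme of my own choosing (backslash, tab, newline and carriage return inside a field become `\\`, `\t`, `\n`, `\r`); the names `escape_field`, `csv_to_tsv` and `split_tsv_line` are hypothetical. Once the converter has run, the downstream stage needs nothing but a split on tab and newline.

```python
import csv
import io

# Assumed escaping scheme (an illustration, not a standard): after escaping,
# a raw tab or newline can only ever be a field or record delimiter.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r"}


def escape_field(field):
    """Replace delimiter characters inside a field with backslash escapes."""
    return "".join(_ESCAPES.get(ch, ch) for ch in field)


def csv_to_tsv(src, dst):
    """Converter stage: quote-aware CSV parsing in, split-parseable TSV out."""
    for row in csv.reader(src):
        dst.write("\t".join(escape_field(f) for f in row) + "\n")


def split_tsv_line(line):
    """Splitter stage: no quoting state, just a split on tabs."""
    return line.rstrip("\n").split("\t")


if __name__ == "__main__":
    src = io.StringIO('a,"b,c","line1\nline2"\r\n')
    dst = io.StringIO()
    csv_to_tsv(src, dst)
    for line in dst.getvalue().split("\n")[:-1]:
        print(split_tsv_line(line))
```

In a real pipeline the two stages would be separate processes joined by a pipe, so each gets its own core; they share one file here only to keep the sketch short.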

If you have a lot of data (and space for it, e.g. in /dev/shm) you can also save all the converted data to a TSV file. That file is now soundly "partitionable" at the "nearest ASCII newline to i/N of the bytes", so you can go core-parallel as well as SIMD within cores (doing that N-way pass over a memory-mapped file or memory views). Admittedly this is probably more helpful when you are doing more computation than just splits, like ASCII-to-binary conversion of fields or such.
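A rough sketch of that partitioning, assuming the file is already in the escaped TSV form so every raw newline byte ends a record. The helper names (`map_file`, `chunk_bounds`, `split_chunk`) are hypothetical, and for simplicity each cut goes at the first newline at or after i/N of the bytes rather than the strictly nearest one.

```python
import mmap


def map_file(path):
    """Memory-map a non-empty file read-only (mmap rejects empty files)."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)


def chunk_bounds(buf, n):
    """Return up to n (start, end) byte ranges that each end on a newline.

    buf is anything with len() and .find(), e.g. bytes or an mmap object.
    """
    size = len(buf)
    cuts = [0]
    for i in range(1, n):
        target = i * size // n
        # Simplification: take the next newline at or after the target offset.
        nl = buf.find(b"\n", max(target, cuts[-1]))
        cut = size if nl == -1 else nl + 1
        if cut > cuts[-1]:
            cuts.append(cut)
    if cuts[-1] != size:
        cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))


def split_chunk(buf, start, end):
    """Per-worker pass: split one byte range into rows of fields."""
    lines = buf[start:end].split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final newline
    return [line.split(b"\t") for line in lines]
```

Each `(start, end)` pair can be handed to its own worker process, which maps the same file and touches only its range; no worker needs to know anything about the rows before `start`.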

Plus, someone might have actually exported from Excel (or whatever) into some sound TSV instead of the weird quote-escaped CSV that some think is standardized by RFC 4180 (which itself disavows being a "standard"). In that case, at least, you needn't convert at all.

So, I see at least 3 reasons to layer this part of a system as a convert-then-split: pipeline parallelism, file parallelism, and entire pass elision.