by theendisney 805 days ago
Fun! Convert to js:

csv.split("\n").join('".split(\',\'));a.push("');

So that each line becomes:

a.push("foo,bar,baz".split(','));

And then we have an array of arrays.
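For comparison, the same array of arrays can be had without generating code at all. This is only a sketch: `naiveRows` is a made-up name, and it assumes no field contains a comma, quote, or newline.

```javascript
// Naive CSV-to-rows: only valid when no field contains a comma, quote, or newline.
function naiveRows(csv) {
  return csv
    .split('\n')
    .filter((line) => line.length > 0) // skip the empty piece after a trailing newline
    .map((line) => line.split(','));
}

const rows = naiveRows('foo,bar,baz\n1,2,3\n');
console.log(rows); // [['foo', 'bar', 'baz'], ['1', '2', '3']]
```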

2 comments

Of course the “hard part” of CSV parsing is dealing with escapes, which break simple splits.

But now I’m wondering if a good approach might be to split on the escape character and then reassemble / parse from there, safe in the knowledge that every character has exactly one interpretation.
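One way that idea might look, as a rough sketch only. It assumes the "escape character" is the double quote, that embedded quotes are doubled, and that the input is a single well-formed record. `parseRecordByQuoteSplit` is a hypothetical name.

```javascript
// Split one CSV record on the quote character, then reassemble.
// Pieces at odd indices were inside quotes; pieces at even indices were outside.
function parseRecordByQuoteSplit(line) {
  const pieces = line.split('"');
  const fields = [];
  let current = '';
  pieces.forEach((piece, i) => {
    const insideQuotes = i % 2 === 1;
    if (insideQuotes) {
      current += piece; // commas in here are literal text
    } else if (piece === '' && i > 0 && i < pieces.length - 1) {
      current += '"'; // empty gap between two quoted runs = a doubled quote
    } else {
      const parts = piece.split(',');
      current += parts[0];
      for (const part of parts.slice(1)) {
        fields.push(current); // a comma outside quotes ends the field
        current = part;
      }
    }
  });
  fields.push(current);
  return fields;
}

console.log(parseRecordByQuoteSplit('x,"say ""hi""",y')); // ['x', 'say "hi"', 'y']
```

Malformed input (a stray quote in an unquoted field, say) is not detected here; it just produces odd results.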

"Normal" CSV doesn't have escape characters. Quotes in quoted strings are escaped by doubling then, and everything else (including newlines) is interpreted as is inside quoted strings.
There is no spec or standard or consensus on “Normal” CSV.

I like it when CSV follows RFC 4180 too - but it’s descriptive not prescriptive.

There is no normal csv! I always used Excel as the “standard” when writing a CSV parser.

If every field is quoted you can indeed remove the first and last “, then split on “,“ and then replace “” with “ in the fields. Excuse my phone converting the quotes!
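A minimal sketch of that recipe with plain ASCII quotes. It assumes every field on the line is quoted and that no field contains a newline; `splitAllQuoted` is a made-up name. A field whose data has a quote right next to a comma can still fool the split, so treat it as an illustration rather than a parser.

```javascript
// Assumes EVERY field is wrapped in double quotes and embedded quotes are doubled.
function splitAllQuoted(line) {
  return line
    .slice(1, -1) // remove the first and last quote
    .split('","') // split on quote-comma-quote
    .map((field) => field.replace(/""/g, '"')); // un-double embedded quotes
}

console.log(splitAllQuoted('", is a cat",", is my boyfriend","say ""hi"""'));
// [', is a cat', ', is my boyfriend', 'say "hi"']
```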

That is precisely why I put "normal" in quotes.

Nevertheless, if there is a way to escape anything at all, usually it is the quotation mark, and usually it is escaped by doubling. Pretty much any other scheme is very unlikely to be properly interpreted in this context.

Yes indeed. To make it easy to parse, everything has to be quoted. If only some things are quoted, then you can’t just split on commas because, for example:

    “, is a cat”,”, is my boyfriend”,123
etc.

I think of what you say as really just the first step on the path to the parsing state machine in the c2tsv.nim (or /c2dsv.c in the same folder) thing I mentioned above, both of which have comments in their source code.

I think it helps to think of the problem more like "How do I translate a buffered input stream with complex syntax, where 'most' of the time the job is just turning ',' into '\t', into a buffered output stream that is 'almost' as fast as a Unix `tr , \\t`?" If there were no escaping/quoting, the output buffer could literally be the same memory as the input, just with the delimiter bytes changed.

The next step is realizing that you can still just do this byte translation if you "flush" the IO buffer opportunistically at syntactically relevant times. That gets you the "almost" performance. (Scare quotes on "almost" since you might do a few more IO-system calls with certain kinds of dense syntax, but unlike your "reassemble" there won't be any allocations. Various trade-offs, but a nifty design.)
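To make the state machine concrete, here is a toy in-memory version in JS. It is not the c2tsv.nim code and has none of its buffer-flushing tricks; it only shows the quote-tracking logic. The output convention (writing tab, newline and backslash inside a field as \t, \n and \\) is my assumption, `csvToTsv` and `escapeForTsv` are made-up names, and CRLF line endings are not handled.

```javascript
// Toy CSV -> TSV converter: a two-state machine (inside quotes / outside quotes).
// Assumed convention: tab, newline and backslash inside a field are emitted
// as the two-character sequences \t, \n and \\ so a plain split is safe later.
function csvToTsv(csv) {
  let out = '';
  let inQuotes = false;
  for (let i = 0; i < csv.length; i++) {
    const c = csv[i];
    if (inQuotes) {
      if (c === '"') {
        if (csv[i + 1] === '"') {
          out += '"'; // doubled quote = one literal quote
          i++;
        } else {
          inQuotes = false; // closing quote
        }
      } else {
        out += escapeForTsv(c);
      }
    } else if (c === '"') {
      inQuotes = true; // opening quote
    } else if (c === ',') {
      out += '\t'; // field delimiter
    } else if (c === '\n') {
      out += '\n'; // record delimiter
    } else {
      out += escapeForTsv(c);
    }
  }
  return out;
}

function escapeForTsv(c) {
  if (c === '\t') return '\\t';
  if (c === '\n') return '\\n';
  if (c === '\\') return '\\\\';
  return c;
}

const tsv = csvToTsv('", is a cat",", is my boyfriend",123\n');
console.log(tsv.split('\n')[0].split('\t')); // [', is a cat', ', is my boyfriend', '123']
```

After conversion, the downstream consumer only ever needs split('\n') and split('\t').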

There are other nice aspects to the "partitioned program design" mentioned in a sibling-ish comment, but, all together, I think it is a pretty tidy solution.

In my silliness I forgot to brag about the output still being CSV.

If you want to enjoy all the strange escapery, extra commas, line breaks and wrapping quotes, you may describe it in code in the first and last 2 columns.

Is also expressible (and is vectorized) in C#. But that's the author's code, not mine :)

That's right - pure splitting is much more SIMD-friendly than the whole syntax melange. This is another charm to the "partitioned" design. The conversion to split-parseable TSV can run on its own CPU core and the SIMD splitting on its own core. As long as pipe bandwidth suffices, you have very easy parallelism. This kind of design/intent was popular on Unix at the dawn of multiprocessing when there were still Giant Kernel Locks. But it still has merits, even on Windows.

If you have a lot of data (and space for it, e.g. in /dev/shm) you can also save all the converted data to a TSV file. That's now soundly "partitionable" at the "nearest ASCII newline to 1/N bytes" and you can then go core-parallel as well as SIMD within cores (but that N-wise pass with memory mapping or mem.views). Admittedly this is probably more helpful when you are doing more computation than just splits, like ASCII-to-binary conversion of fields or such.
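A small sketch of the partitioning step, under a couple of assumptions: the TSV is already in memory as bytes, newlines inside fields have been escaped away by the conversion, and each cut goes to the next newline at or after k/N of the length (simpler than the true nearest one). `partitionAtNewlines` is a hypothetical helper.

```javascript
// Cut a TSV byte buffer into up to n [start, end) ranges that each end right
// after a newline, so every range holds only whole records.
function partitionAtNewlines(bytes, n) {
  const NEWLINE = 0x0a;
  const ranges = [];
  let start = 0;
  for (let k = 1; k <= n && start < bytes.length; k++) {
    let end = bytes.length; // the last range takes whatever is left
    if (k < n) {
      const target = Math.max(start, Math.floor((k * bytes.length) / n));
      const nl = bytes.indexOf(NEWLINE, target);
      end = nl === -1 ? bytes.length : nl + 1;
    }
    ranges.push([start, end]);
    start = end;
  }
  return ranges;
}

const sample = Uint8Array.from('a\tb\ncc\tdd\ne\tf\n', (ch) => ch.charCodeAt(0));
console.log(partitionAtNewlines(sample, 2)); // [[0, 10], [10, 14]]
```

Each range can then go to its own worker, which does nothing but split on tabs and newlines.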

Plus, someone might have actually exported from Excel (or whatever) into some sound TSV instead of the weird quote-escaped CSV that some think is standardized by RFC 4180 (which itself disavows being a "standard"). In that case, at least, you needn't convert at all.

So, I see at least 3 reasons to layer this part of a system as a convert-then-split: pipeline parallelism, file parallelism, and entire pass elision.