Hacker News
by twoodfin 805 days ago
Of course the “hard part” of CSV parsing is dealing with escapes, which break simple splits.

But now I’m wondering if a good approach might be to split on the escape character and then reassemble / parse from there, safe in the knowledge that every character has exactly one interpretation.
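For what it's worth, here is a rough sketch of how that split-then-reassemble idea might look. It assumes the "escape character" is the double quote and that quotes are escaped by doubling. The function name `parse_by_quote_split` is made up for illustration, and it handles one record at a time, not a whole file.

```python
def parse_by_quote_split(record, delim=",", quote='"'):
    """Parse one CSV record by splitting on the quote character first.

    Hypothetical helper: assumes doubled-quote escaping and a single record.
    """
    pieces = record.split(quote)
    if len(pieces) % 2 == 0:
        raise ValueError("unbalanced quotes")
    fields = [""]
    for i, piece in enumerate(pieces):
        if i % 2 == 1:
            # Odd pieces were inside quotes: every character is literal.
            fields[-1] += piece
        elif piece == "" and 0 < i < len(pieces) - 1:
            # An empty gap between two quoted pieces came from a doubled quote.
            fields[-1] += quote
        else:
            # Even pieces are outside quotes: delimiters here are real.
            parts = piece.split(delim)
            fields[-1] += parts[0]
            fields.extend(parts[1:])
    return fields


print(parse_by_quote_split('a,"b ""c"" d",e'))
# ['a', 'b "c" d', 'e']
```

After the split, the parity of a piece's index says whether it was inside or outside quotes, so each character really does have only one interpretation. The cost is the string reassembly, which allocates.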

3 comments

"Normal" CSV doesn't have escape characters. Quotes in quoted strings are escaped by doubling them, and everything else (including newlines) is interpreted as is inside quoted strings.
There is no spec or standard or consensus on “Normal” CSV.

I like it when CSV follows RFC 4180 too - but it’s descriptive not prescriptive.

There is no normal csv! I always used Excel as the “standard” when writing a CSV parser.

If every field is quoted you can indeed remove the first and last “, then split on “,“ and then replace “” with “ in the fields. Excuse my phone converting the quotes!
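A minimal sketch of that recipe, using straight quotes and assuming every field on the line is quoted and no field contains a newline. The name `split_fully_quoted` is just for illustration.

```python
def split_fully_quoted(line):
    """Split a line in which every field is wrapped in double quotes.

    Hypothetical helper following the recipe above: strip the outer quotes,
    split on the three characters quote-comma-quote, then undouble quotes.
    """
    if len(line) < 2 or line[0] != '"' or line[-1] != '"':
        raise ValueError("expected a fully quoted line")
    inner = line[1:-1]
    return [field.replace('""', '"') for field in inner.split('","')]


print(split_fully_quoted('"a","b ""c""","d,e"'))
# ['a', 'b "c"', 'd,e']
```

One caveat: a field whose value itself contains quote-comma-quote is encoded as `"",""`, which still contains the split pattern, so `"a"",""b"` comes back as two fields instead of one. For data without that pattern the shortcut holds up.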

That is precisely why I put "normal" in quotes.

Nevertheless, if there is a way to escape anything at all, usually it is the quotation mark, and usually it is escaped by doubling. Pretty much any other scheme is very unlikely to be properly interpreted in this context.

Yes indeed. To make it easy to parse, everything has to be quoted. If only some things are quoted, then you can’t just split on comma because, for example:

    “, is a cat”,”, is my boyfriend”,123
etc.
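
To make that concrete, here is a quick check with straight quotes. It compares a plain split against Python's standard `csv` module, which is used only as a convenient reference parser and not as a claim about what anyone in the thread uses.

```python
import csv

# The example line above, typed with straight quotes.
line = '", is a cat",", is my boyfriend",123'

# A plain split cuts at the commas that sit inside the quoted fields too.
naive = line.split(",")

# A quote-aware parser keeps those commas as field content.
parsed = next(csv.reader([line]))

print(len(naive), naive)
print(len(parsed), parsed)
```

The plain split yields five pieces with stray quote characters in them, while the quote-aware parse yields the three intended fields.
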
I think of what you say as really just the first step on the path to the parsing state machine in the c2tsv.nim (or /c2dsv.c in the same folder) program I mentioned above; both have comments in their source code.

I think it helps to think of the problem more like "How do I translate a complex syntax buffered input stream which 'most' of the time just translates ',' to '\t' into a buffered output stream that is "almost" as fast as a Unix `tr , \\t`?" If there were no escaping/quoting the output buffer could literally be the same memory as the input, just with the delimiter bytes changed.

The next step is realizing that you can still just do this byte translation if you "flush" the IO buffer opportunistically at syntactically relevant times. That gets you the "almost" performance. (Scare quotes on "almost" since you might do a few more IO-system calls with certain kinds of dense syntax, but unlike your "reassemble" there won't be any allocations. Various trade-offs, but a nifty design.)
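
Here is a rough Python sketch of that flush-at-syntax idea. It is not the c2tsv.nim or c2dsv.c code, just an illustration under some assumptions: quotes are escaped by doubling, the names (`csv_to_tsv`, the state constants) are made up, and tabs or newlines inside quoted fields are passed through unescaped, which a real tool would have to deal with.

```python
import io

COMMA, TAB, QUOTE = ord(","), ord("\t"), ord('"')
OUTSIDE, INSIDE, QUOTE_IN_QUOTED = range(3)


def csv_to_tsv(src, dst, bufsize=65536):
    """Translate CSV bytes to TSV bytes in place, flushing around quotes."""
    state = OUTSIDE
    buf = bytearray(bufsize)
    view = memoryview(buf)          # slices of a memoryview do not copy
    while True:
        n = src.readinto(buf)
        if not n:
            break
        start = 0                   # first byte not yet written out
        for i in range(n):
            b = buf[i]
            if state == QUOTE_IN_QUOTED:
                if b == QUOTE:
                    state = INSIDE  # doubled quote: keep this second one
                    continue
                state = OUTSIDE     # previous quote closed the field
            if state == OUTSIDE:
                if b == COMMA:
                    buf[i] = TAB    # the common case: translate in place
                elif b == QUOTE:
                    dst.write(view[start:i])   # flush, dropping the quote
                    start = i + 1
                    state = INSIDE
            elif b == QUOTE:        # state is INSIDE here
                dst.write(view[start:i])       # flush, dropping the quote
                start = i + 1
                state = QUOTE_IN_QUOTED
        dst.write(view[start:n])


out = io.BytesIO()
csv_to_tsv(io.BytesIO(b'a,"b,""c""",d\n'), out)
print(out.getvalue())
# b'a\tb,"c"\td\n'
```

Input with no quotes at all takes one write per buffer; each quote character costs an extra write, but nothing is ever reassembled into new strings. The state carries over between buffers, so a doubled quote that straddles a buffer boundary still comes out right.
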

There are other nice aspects to the "partitioned program design" mentioned in a sibling-ish comment, but, all together, I think it is a pretty tidy solution.

In my silliness I forgot to brag about the output still being CSV.

If you want to enjoy all the strange escapery, extra commas, line breaks and wrapping quotes, you may describe it in code in the first and last 2 columns.