Hacker News new | ask | show | jobs
by dkarl 446 days ago
> I know I can quickly write a parser, not by reading some spec, but by looking at the actual CSV file

This is fine if you can hand-check all the data, or if you are okay if two offsetting errors happen to corrupt a portion of the data without affecting all of it.

Also I find it odd that you call it "easy" to write custom code to parse CSV files and translate between CSV formats. If somebody give you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy." When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.

Like, I get that a farmer a couple hundred years ago would describe plowing a field with a horse as "easy," but given the emergence of alternatives, you wouldn't use the word in that context anymore.

2 comments

> If somebody give you a JSON file that isn't valid JSON, you tell them it isn't valid, and they say "oh, sorry" and give you a new one. That's the standard for "easy."

But it isn't that reliably easy with JSON. Sometimes I have clients give me data that I just have to work with, as-is. Maybe it was invalid JSON spat out by some programmer or tool long ago. Maybe it's just from a different department than my contact, which might delay things for days before the bureaucracy gets me a (hopefully) valid JSON.

I consider CSV's level of "easy" more reliable.

And even valid JSON can be less easy. I've had experiences where writing the high-level parsing for some JSON file, in terms of a JSON library, was less easy and more time-consuming than writing a custom CSV parser.

Subjectively, I think programming a CSV parser from basic programming primitives is just more fun and appealing than programming in terms of a JSON library or XML library. And I find the CSV code is often simpler and quicker to write.

> When there are many and diverse data formats that meet that standard, it seems perverse to use the word "easy" to talk about empirically discovering the quirks in various undocumented dialects and writing custom logic to accommodate them.

But the premise of CSV is so simple, that there are only four quirks to empirically discover: cell delimiter, row delimiter, quote, escaped-quote.

I think it's "easy" to peek at the file and say, "Oh, they use semicolon cell delimiters."

And it's likewise "easy" to write the "custom logic", which is about as simple as parsing something directly from a text stream gets. I typically have to stop and think a minute about the quoting, but it's not that bad.

If a programmer is practiced at parsing from a text stream (a powerful, general skill that is worth exercising), than I think it is reasonable to think they might find parsing CSV by hand to be easier and quicker than parsing JSON (etc.) with a library.